The size and scope of this project require careful alignment of team members and project tasks. Dawn Knight, Tess Fitzpatrick and Steve Morris, who have developed the vision for CorCenCC, recruited the larger project team and completed preparatory groundwork, constitute the CorCenCC Management Team, coordinating work packages and aligning activities with project goals. Academic collaborators on the project represent expert knowledge and experience in crucial areas of the corpus design and application, and include the project Co-Investigators, Research Associates/Assistants (4.4) and PhD students (2). Consultants associated with specific aspects of the project will join the team at relevant time points. The Project Advisory Group (PAG) ensures that key decisions benefit from the input of experienced academics and representatives of crucial stakeholder groups.
Management team
Dawn Knight (PI), Cardiff University
Dawn Knight is a Reader in Applied Linguistics at Cardiff University. Her current research interests lie predominantly in the areas of corpus linguistics, discourse analysis, digital interaction, non-verbal communication and the socio-linguistic contexts of communication. Dawn is the project lead, and is responsible for directing the CIs and RAs, managing the input from consultants and the advisory group, coordinating the work packages, and overseeing academic project outputs. She is also providing expertise on the design, construction and querying of corpora.
Tess Fitzpatrick (CI), Swansea University
Professor Tess Fitzpatrick is an applied linguist with a particular interest in the investigation of lexical acquisition and attrition; the investigation of lexical retrieval processes (with a focus on word association behaviour); the creation and evaluation of vocabulary knowledge assessment tools; and the design and application of innovative language learning techniques. Tess is leading the design of the pedagogical tools, including the construction and application of frequency-based word lists. She is also supporting Dawn (with Steve Morris) in directing the CorCenCC project and ensuring targets are met.
Steve Morris (CI), Swansea University
Steve has worked in the field of Welsh for Adults for over thirty years. He joined the Department of Continuing Education at Swansea University in 1991 as a lecturer and transferred to Academi Hywel Teifi in 2010, where he is an Associate Professor in the Department of Welsh, lecturing in the field of language and supervising research in this area. Steve is managing the data collection and transcription for CorCenCC, and is coordinating non-academic project outputs, including transfer of knowledge to and from user groups.
Alex Lovell (CI), Swansea University
Alex Lovell is a lecturer in the Welsh Department at Swansea University. He is currently finishing his PhD, also in the Welsh Department at Swansea, in which he looks at how best Welsh as a second language can be successfully presented in comparatively non-Welsh speaking areas of Wales. Prior to his PhD, Alex completed a BA in Welsh as a second language speaker. He has previously contributed to the CorCenCC project as a transcriber during his postgraduate studies. In addition to language planning, Alex’s research interests include: Welsh Second Language; second language acquisition; language and education policy in Wales; second language testing and assessment; bilingualism and bilingual education.
Irena Spasic (CI), Cardiff University
Dr Irena Spasić is a Senior Lecturer in Cardiff’s School of Computer Science and Informatics, where she is also the Director of Research and the leader of the Text & Data Mining theme. She has a long track record of active interdisciplinary collaboration. Her research interests include text mining, knowledge representation, machine learning and their applications in social sciences, social media, life sciences and healthcare. Her team was ranked first in a 2008 NIH-funded challenge for disease status classification from hospital discharge summaries (https://www.i2b2.org/NLP/). She also led a team that was ranked third on information extraction from discharge summaries (2009), and ranked first in extracting types of information that proved the most difficult to model. Irena is overseeing the design and construction of the web-based infrastructure that brings together the corpus enquiry tools, taggers and pedagogic toolkit into a single web-based tool for the CorCenCC corpus.
Paul Rayson (CI), Lancaster University
Dr Paul Rayson is a Reader in Computer Science at Lancaster University, UK. He is director of the UCREL interdisciplinary research centre, which carries out research in corpus linguistics and natural language processing (NLP). A long-term focus of his work is the application of semantic-based NLP in extreme circumstances where language is noisy, e.g. in historical, learner, speech, email, txt and other CMC varieties. His applied research is in the areas of online child protection, learner dictionaries, and text mining of historical corpora and annual financial reports. He is a co-investigator of the five-year ESRC Centre for Corpus Approaches to Social Science (CASS), which is designed to bring the corpus approach to bear on a range of social sciences. Paul is responsible for constructing the computational analysis framework and tools required for the semantic field annotation system and for integrating this into the CorCenCC interface.
Professor Enlli Thomas is currently Head of the School of Education and has a background in Psychology. Her main research interests span psycholinguistic studies of child language acquisition in relation to Welsh and bilingual Welsh-English development under conditions of minority language input; the relationship between minority language use and proficiency; the development of minority language and bilingual language assessment tools; and educational approaches to increasing language transmission, acquisition and use within families in disadvantaged areas. Enlli is responsible for guiding the construction and evaluation of the Welsh pedagogic toolkit for CorCenCC.
Dr Jonathan Morris is the Coleg Cymraeg Cenedlaethol Lecturer in Linguistics and Applied Linguistics at the School of Welsh, Cardiff University. His research focuses on sociolinguistic and phonetic aspects of bilingualism and second language acquisition. Specifically, he is interested in how social factors influence the Welsh and English of bilingual speakers. These social factors might influence how bilinguals produce their languages or how they use and feel about them. Jonathan is assisting with the coordination of data collection (WP1).
Dr Scott Piao has extensive experience in corpus tool development and natural language processing. He has worked on seven projects funded by EPSRC, AHRC, EU and JISC, covering research topics of corpus construction and annotation, text mining, and the application of corpus and natural language processing techniques in social computing. In particular, he has been involved for many years in the development of corpus semantic annotation systems for multiple languages. Scott is contributing to the development of the CorCenCC semantic tagger and crowdsourcing utilities.
Steven Neale (RA), Cardiff University
Dr Steven Neale joined Cardiff University in March 2016 as a Research Associate on the CorCenCC project. Before coming to Cardiff he was a Postdoctoral Researcher with the NLX – Natural Language and Speech group at the University of Lisbon in Portugal, where he had been working since 2014 after completing his PhD in Computing at the University of Tasmania in Australia. Before deciding to pursue an academic career, Steven spent his first years after university working in various film, TV and video production jobs. Steven is working with Irena and Dawn to build and operationalize the part-of-speech tagger, crowdsourcing applications, pedagogic toolkit and corpus infrastructure.
Laura Arman (RA), Cardiff University
Dr Laura Arman joined the CorCenCC project in January 2019 following two and a half years at Bangor University in Welsh-medium lecturing and pedagogical project work. Her background in linguistics and her work on Welsh contribute language-specific expertise to data collection, quality assurance and data processing on the project (for WP1). Laura’s research interests include the syntax–semantics interface, verb classes, computational linguistics, multilingualism and multilingual speech communities, especially within the minoritized language context.
Jennifer Needs (RA), Swansea University
Dr Jennifer Needs has recently completed a PhD in the School of Welsh at Cardiff University, in which she looked at the principles of online language learning materials development. She used these principles to inform the development of her own e-learning materials for adults learning Welsh, working in partnership with the Nant Gwrtheyrn Welsh Language and Heritage Centre to create bespoke online materials for their learners. Prior to her PhD, Jennifer worked as a Research Assistant at Cardiff University in the field of Welsh for Adults, and completed a BA in Linguistics and Spanish at Leeds University and an MA in Endangered Language Documentation and Revitalisation at SOAS, London. Jennifer is contributing to the collection of the Welsh language data for CorCenCC, developing the pedagogic toolkit and providing language-specific expertise throughout.
Mair Rees (RA), Swansea University
Following a 15-year career as an art therapist working mainly with adults who have a learning disability, Dr Mair Rees returned to full-time education to study for a BA Welsh at Cardiff University in 2004. Subsequently, she won a scholarship which enabled her to undertake a PhD in Welsh literature. Since graduating in 2012 Mair has worked as a creative editor with Gomer Press, Llandysul. She regularly contributes reviews and articles to Welsh journals and also has a small business making quirky Welsh-language gifts and cards. Mair is contributing to the collection of the Welsh language data and providing language-specific expertise throughout the project.
Ignatius Ezeani (RA), Lancaster University
Ignatius Ezeani is a Research Associate at the UCREL Research Centre, Lancaster University. His current research interests include, but are not limited to, developing robust frameworks for adapting existing NLP models and techniques for low-resource language research. He is particularly interested in the meaning abstractions and semantic relationships captured by deep embedding models, which are often trained with huge amounts of data from highly resourced languages, and in how to project these onto low-resource languages. He is also generally interested in the design and development of machine learning and deep neural models, as well as their applications not just to NLP but to the broader field of data science. Ignatius is currently looking at efficient methods for improving the accuracy and reliability of the Welsh Semantic Tagger.
Former CorCenCC Team Members
Gareth Watkins (RA): 2016-2017
Jeremy Evas (CI): 2016-2018
Mark Stonelake (CI): 2016-2018
Lowri Williams: 2017-2019
Consultants
Laurence Anthony is Professor of Educational Technology and Applied Linguistics and former Director of the Center for English Language Education (CELESE), Faculty of Science and Engineering, Waseda University, Japan. For 25 years, he has worked extensively in the area of technical writing, editing, and translation, offering training seminars to some of Japan’s biggest international companies. He has also worked extensively in the area of corpus linguistics, receiving the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work on AntConc, and the development of various other corpus analysis tools. Laurence is providing strategic advice and practical guidance on designing and developing the corpus infrastructure, adapting and integrating the pedagogic toolkit and building corpus query facilities.
Kevin Scannell is Professor of Mathematics and Computer Science at Saint Louis University in the USA. He has collaborated with dozens of language communities around the world to create freely available and reusable language resources, with a particular focus on Irish, Scottish, and Manx Gaelic. In 2011, he founded the Indigenous Tweets project to promote the use of indigenous and minority languages in social media. Kevin is providing guidance on constructing the online CorCenCC corpus infrastructure.
Tom Cobb is the developer of the Compleat Lexical Tutor website and, on the CorCenCC project, will act as a consultant for the development of corpus-informed tutorial activities, in line with his work in English and French on LEXTUTOR.CA. Tom is providing access to and consultative advice on extending and adapting Lextutor for CorCenCC.
Professor Michael McCarthy is a corpus consultant on the CorCenCC project, with 25 years’ experience of compiling, working with and publishing on spoken corpora. He has knowledge of Welsh to O-level. Michael is contributing to the establishment of a bespoke framework for collecting large scale spoken and/or Welsh corpora.
Professor Margaret Deuchar’s current research interests are in code-switching. With a research team at Bangor University she collected bilingual conversations from three groups of bilinguals: Welsh-English, Spanish-English and Welsh-Spanish. The conversations were transcribed and have been made available in the public domain (www.bangortalk.org.uk). Data from the corpora have been used by Margaret and her collaborators to evaluate competing linguistic theories of code-switching (e.g. Herring et al 2010), to compare code-switching patterns in three bilingual communities (Carter et al 2011), to challenge influential views of the boundary between code-switching and borrowing (Stammers & Deuchar 2012) and to establish the key extralinguistic factors influencing the production of code-switching. Margaret is providing advice on developing a framework for data collection and on tagging Welsh corpus data.
Kevin Donnelly, Freelance
After a degree and doctorate in Bantu Languages at SOAS, London, Kevin worked for the Northern Ireland Civil Service in Belfast and then as a community activist and software developer in Anglesey. In 2003 he began localising free software in Welsh, and came up against the problem that few language resources for Welsh are available under a free license. This led to Eurfa, apertium-cy, corpora, and a POS-tagger for the three languages in the ESRC Bangor corpora. Kevin has worked with a variety of languages, and is currently working on Swahili and Māori. Kevin is assisting with the development of tools for digitising the mark-up and tagging of the CorCenCC data.
Project Advisory Group
Emyr Davies, CBAC-WJEC
Colin Williams, Fellow, St Edmund’s College, University of Cambridge
Gareth Morlais, Welsh-language technology and digital media specialist in the Welsh Language Unit, Welsh Government
Aran Jones, course author and company CEO for SaySomethingin.com Ltd
Owain Roberts, Acting Head of Research, National Library of Wales
Andrew Hawke, Managing Director of the Dictionary of the Welsh Language
Llion Jones, Director of Canolfan Bedwyr, Bangor
Karen Corrigan, Professor of Linguistics and English Language, Newcastle University
Maggie Tallerman, Professor of Linguistics, Newcastle University
Gwen Awbery, Aberystwyth University / University of Wales Trinity St David
Mair Parry-Jones, National Assembly for Wales