Plan and context


CorCenCC will forge transformative methods for corpus creation, impact and sustainability. It is community-driven, harnessing opportunities afforded by mobile technologies, specifically crowdsourcing and community collaboration. Impact is generated through a user-informed design, so that basic corpus functionalities for the querying of language use can be integrated into a bespoke toolkit for teachers and learners (within this project) and interface specifications for other user groups (e.g. translators, publishers, policy-makers, language technology developers, academics and others) beyond the project.


Who is involved?

The project engages and works closely with Welsh language users. It uses new technologies, such as crowdsourcing, to collect data, and draws contributors from the 562,000 Welsh speakers in the UK.

Welsh language users are recruited via social and broadcast media, roadshows and existing networks. Contributors are invited to record and upload their own data via a mobile app, and even contribute to data coding.

This approach promises representative language across genres, language varieties (regional and social) and contexts. Traditional data collection is used to supplement the crowdsourcing, ensuring a representative balance of data.



The completed corpus will be open-source and freely available for use by professional communities and anyone with an interest in language. Bespoke applications and instructions will be provided for different user groups.

It will enable, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; language learners to learn from real-life models of Welsh; and researchers to investigate patterns of language use and change.

CorCenCC will be a sustainable, permanent and user-oriented record of language. An in-built facility will allow data to be added and moderated beyond the life of the project.


What data is included?

Examples of the spoken, written and e-language data being collected include:

  • Spoken: conversations with friends and family; televised interviews and TV chat shows; workplace Welsh; radio shows; service encounters; phone calls; primary, secondary, tertiary and adult classroom interaction; political speeches; formal and informal interaction at social and cultural events
  • Written: Welsh learner writing; books; papurau bro; political documents; stories; letters and diaries; academic essays; academic textbooks; magazines; adverts; flyers/information leaflets; formal letters; signs
  • E-language: discussion boards; emails; blogs; websites; tweets; text messages; social media statuses