Creole NLP

This page aims to centralize NLP resources for Creoles. This is an open community effort, and we welcome updates via pull requests on this website’s GitHub repository.

Below is a table summarizing resources and their availability. The Resource column links to the official URL, where applicable. The Status column includes links that have been verified, where applicable.
Papers on Creoles are collected here.

Language | Resource | Description | Status
Antillean Creole | CREOLORAL | Audio, Transcriptions, and Translations | Not open source
Bastimentos Creole English | Endangered Language Archive | Audio, Video, Transcriptions, Translations | Not open source; Membership required
Gulf of Guinea Creoles | The Gulf of Guinea Creole Corpora (Hagemeijer et al., 2014) | Document Scans and Transcriptions | Limited Verifiability
Haitian Kreyol | Haitian Disaster Response Corpus (Munro, 2010) | SMS | Verified; E-mail authors for access.
Haitian Kreyol | CMU Haitian Corpus | Speech and Text Corpora | Verified
Haitian Kreyol | Corpus of Northern Haitian Creole | Audio and Transcription | Not open source
Hawaiian Pidgin | Multilingual Hawai’i Linguistic Landscape Corpus (Purschke, 2021) | Image Repository with Annotations | Verified but currently unavailable, check back later
Malaccan Portuguese Creole | Endangered Language Archive | Audio, Video, Transcriptions, Translations | Not open source; Membership required
Mauritian Creole | ALLEX Project | Concordance of 200k Words | Not open source
Nigerian Pidgin | NaijaSynCor (Bigi et al., 2017) | Speech Recognition | Verified
Nigerian Pidgin | JW300 Corpus (Agić and Vulić, 2019) | Parallel Texts for Machine Translation | Verified
Nigerian Pidgin | Pidgin UNMT (Ogueji and Ahia, 2019) | Monolingual Texts for Machine Translation | Verified
Nigerian Pidgin | Naija-English Codeswitching Corpus (Ndubuisi-Obi et al., 2019) | News Articles with Comments; Annotated for code switching | Verified
Nigerian Pidgin | Surface-Syntactic UD Treebank for Naija (Caron et al., 2019) | Universal Dependencies | Verified
Nigerian Pidgin | Speech-to-Text Nigerian Pidgin Dataset (Ajisafe et al., 2020) | Speech Recognition | Verified
Nigerian Pidgin | NaijaNER (Oyewusi et al., 2021) | Named Entity Recognition | Verified
Nigerian Pidgin | MasakhaNER (Adelani et al., 2021) | Named Entity Recognition | Verified
Nigerian Pidgin | Nigerian Pidgin Tweets (Oyewusi et al., 2020) | Sentiment Analysis | Not open source
Portuguese Creole | CreolData (Schang et al., 2005) | Lexical Database | Not open source
Reunionese Creole & Seychellois Creole | Creolica | Text and Short Stories in HTML or PDFs | Verified
Singlish | National University of Singapore SMS Corpus (Chen and Min-Yen, 2015) | SMS | Verified
Singlish | Universal Dependencies for Colloquial Singaporean English (Wang et al., 2017) | UD Treebank | Verified
Singlish | Webcrawler for Singaporean Hardware Forum (Tan et al., 2020) | Webcrawler | Verified
Singlish | Singlish Sentiment Lexicon (Bajpai et al., 2017) | Knowledge Base | Not open source
Singlish | Singlish SenticNet (Ho et al., 2018) | Sentiment Resource | Not open source
Sri Lankan Malay (Endangered) | The Language Archive | Audio and XML | Verified
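
As a rough illustration of how the Status column might be used when choosing datasets, the sketch below hard-codes a handful of rows from the table as Python tuples and groups the resources whose status is exactly "Verified" by language. The tuple layout, the `ROWS` constant, and the `verified_by_language` helper are assumptions made for this example; the website itself does not provide them.

```python
from collections import defaultdict

# A few rows copied from the table, as (language, resource, description, status).
# This tuple layout is a hypothetical representation chosen for illustration.
ROWS = [
    ("Haitian Kreyol", "CMU Haitian Corpus",
     "Speech and Text Corpora", "Verified"),
    ("Haitian Kreyol", "Haitian Disaster Response Corpus (Munro, 2010)",
     "SMS", "Verified; E-mail authors for access."),
    ("Nigerian Pidgin", "JW300 Corpus (Agić and Vulić, 2019)",
     "Parallel Texts for Machine Translation", "Verified"),
    ("Nigerian Pidgin", "Nigerian Pidgin Tweets (Oyewusi et al., 2020)",
     "Sentiment Analysis", "Not open source"),
    ("Singlish", "Singlish SenticNet (Ho et al., 2018)",
     "Sentiment Resource", "Not open source"),
]


def verified_by_language(rows):
    """Group resources whose status is exactly 'Verified' by language."""
    grouped = defaultdict(list)
    for language, resource, _description, status in rows:
        # Statuses with extra conditions (e.g. "Verified; E-mail authors
        # for access.") are deliberately excluded by the exact match.
        if status.strip() == "Verified":
            grouped[language].append(resource)
    return dict(grouped)


if __name__ == "__main__":
    for language, resources in verified_by_language(ROWS).items():
        print(f"{language}: {', '.join(resources)}")
```

The exact match on "Verified" is a deliberate simplification: rows that are verified but need an e-mail request or are temporarily unavailable would need their own handling.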




If you use this website, we would appreciate it if you would cite our LREC 2022 paper:

@inproceedings{lent22creole,
  title = "What a Creole Wants, What a Creole Needs",
  author = "Lent, Heather and Ogueji, Kelechi and de Lhoneux, Miryam and
            Ahia, Orevaoghene and S{\o}gaard, Anders",
  year = "2022",
  booktitle = "LREC"
}