Creole NLP
This page aims to centralize NLP resources for Creoles. This is an open community effort, and we welcome updates via pull requests on this website’s GitHub repository.
Below is a table summarizing resources and their availability. The Resource column links to the official URL, where applicable. The Status column includes links that have been verified, where applicable.
Papers on Creoles are collected here.
Language | Resource | Description | Status |
Antillean Creole | CREOLORAL | Audio, Transcriptions, and Translations | Not open source |
Bastimentos Creole English | Endangered Language Archive | Audio, Video, Transcriptions, Translations | Not open source; Membership required |
Gulf of Guinea Creoles | The Gulf of Guinea Creole Corpora (Hagemeijer et al., 2014) | Document Scans and Transcriptions | Limited Verifiability |
Haitian Kreyol | Haitian Disaster Response Corpus (Munro, 2010) | SMS | Verified; E-mail authors for access. |
Haitian Kreyol | CMU Haitian Corpus | Speech and Text Corpora | Verified |
Haitian Kreyol | Corpus of Northern Haitian Creole | Audio and Transcription | Not open source |
Hawaiian Pidgin | Multilingual Hawai’i Linguistic Landscape Corpus (Purschke, 2021) | Image Repository with Annotations | Verified, but currently unavailable; check back later |
Malaccan Portuguese Creole | Endangered Language Archive | Audio, Video, Transcriptions, Translations | Not open source; Membership required |
Mauritian Creole | ALLEX Project | Concordance of 200k Words | Not open source |
Nigerian Pidgin | NaijaSynCor (Bigi et al., 2017) | Speech Recognition | Verified |
Nigerian Pidgin | JW300 Corpus (Agić and Vulić, 2019) | Parallel Texts for Machine Translation | Verified |
Nigerian Pidgin | Pidgin UNMT (Ogueji and Ahia, 2019) | Monolingual Texts for Machine Translation | Verified |
Nigerian Pidgin | Naija-English Codeswitching Corpus (Ndubuisi-Obi et al., 2019) | News Articles with Comments; Annotated for code switching | Verified |
Nigerian Pidgin | Surface-Syntactic UD Treebank for Naija (Caron et al., 2019) | Universal Dependencies | Verified |
Nigerian Pidgin | Speech-to-Text Nigerian Pidgin Dataset (Ajisafe et al., 2020) | Speech Recognition | Verified |
Nigerian Pidgin | NaijaNER (Oyewusi et al., 2021) | Named Entity Recognition | Verified |
Nigerian Pidgin | MasakhaNER (Adelani et al., 2021) | Named Entity Recognition | Verified |
Nigerian Pidgin | Nigerian Pidgin Tweets (Oyewusi et al., 2020) | Sentiment Analysis | Not open source |
Portuguese Creole | CreolData (Schang et al., 2005) | Lexical Database | Not open source |
Reunionese Creole & Seychellois Creole | Creolica | Text and Short Stories in HTML or PDFs | Verified |
Singlish | National University of Singapore SMS Corpus (Chen and Kan, 2015) | SMS | Verified |
Singlish | Universal Dependencies for Colloquial Singaporean English (Wang et al., 2017) | UD Treebank | Verified |
Singlish | Webcrawler for Singaporean Hardware Forum (Tan et al., 2020) | Webcrawler | Verified |
Singlish | Singlish Sentiment Lexicon (Bajpai et al., 2017) | Knowledge Base | Not open source |
Singlish | Singlish SenticNet (Ho et al., 2018) | Sentiment Resource | Not open source |
Sri Lankan Malay (Endangered) | The Language Archive | Audio and XML | Verified |
If you use this website, we would appreciate it if you cited our LREC 2022 paper:
@inproceedings{lent22creole,
title = "What a Creole Wants, What a Creole Needs",
author = "Lent, Heather and Ogueji, Kelechi and de Lhoneux,
Miryam and Ahia, Orevaoghene and S{\o}gaard, Anders",
year = "2022",
booktitle = "LREC"
}