Abstract:
Large pre-trained Automatic Speech Recognition (ASR) models have begun to
perform better on low-resource languages due to the availability of data and transfer learning. However, only a few languages have sufficient resources to benefit from
transfer learning. This paper contributes to expanding speech recognition resources for under-represented languages. We release two new datasets to the
research community: Lingala Read Speech Corpus consisting of 4 hours of labelled audio clips and Congolese Speech Radio Corpus containing 741 hours of
unlabeled audio in 4 major spoken languages in the Democratic Republic of the
Congo. Additionally, we obtain benchmark results for Congolese wav2vec2. We
observe an average decrease of 2% in WER when a Congolese multilingual pre-trained model is used for fine-tuning on Lingala. Importantly, our study is the
first attempt at benchmarking speech recognition systems for Lingala and
the first-ever multilingual model for 4 Congolese languages spoken by 65 million
people. Our data and models will be publicly available, and we hope they help
advance research in ASR for low-resource languages.