Abstract:
Large pre-trained Automatic Speech Recognition (ASR) models have begun to
perform better on low-resource languages due to the availability of data and transfer learning. However, only a few languages have sufficient resources to benefit from
transfer learning. This paper contributes to expanding speech recognition resources for under-represented languages. We release two new datasets to the
research community: Lingala Read Speech Corpus consisting of 4 hours of labelled audio clips and Congolese Speech Radio Corpus containing 741 hours of
unlabeled audio in 4 major spoken languages in the Democratic Republic of the
Congo. Additionally, we obtain benchmark results for Congolese wav2vec2. We
observe an average decrease of 2% in WER when a Congolese multilingual pre-trained model is used for fine-tuning on Lingala. Importantly, our study is the
first attempt at benchmarking speech recognition systems for Lingala and
the first-ever multilingual model for 4 Congolese languages spoken by 65 million
people. Our data and models will be publicly available, and we hope they help
advance research in ASR for low-resource languages.