JUCS - Journal of Universal Computer Science 30(13): 1849-1871, doi: 10.3897/jucs.118889
Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages
Ruan Visser, Trieko Grobler, Marcel Dunaiski
Stellenbosch University, Stellenbosch, South Africa
Abstract
To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of language model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we first pretrain language models from scratch on individual low-resource languages using a variety of model configurations, and then incrementally add related languages to explore their effect on model performance. We demonstrate that small data volumes can be leveraged effectively, and that the choice of pretraining objective and multilingual dataset composition significantly influences model performance. Our monolingual and multilingual models exhibit competitive, and in some cases superior, performance compared to established multilingual models such as XLM-R-base and AfroXLM-R-base.
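As a rough illustration of the setup described above, and not the authors' exact pipeline, the sketch below pretrains a small RoBERTa-style model from scratch with a masked-language-modelling objective on a Zulu corpus and then extends the training corpus with a related language (Xhosa). The corpus file names, the tokenizer identifier, and all hyperparameters are illustrative assumptions.

```python
# A minimal sketch, assuming hypothetical corpus files ("zu_corpus.txt",
# "xh_corpus.txt") and a previously trained tokenizer ("zu-xh-tokenizer");
# it is not the authors' exact pipeline or hyperparameters.
from datasets import load_dataset, concatenate_datasets
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Plain-text corpora, one paragraph per line (file names are assumptions).
zulu = load_dataset("text", data_files="zu_corpus.txt")["train"]
xhosa = load_dataset("text", data_files="xh_corpus.txt")["train"]

# Monolingual run: Zulu only. Multilingual run: add the related language.
corpus = concatenate_datasets([zulu, xhosa])

# Tokenizer assumed to have been trained beforehand on the same corpus.
tokenizer = RobertaTokenizerFast.from_pretrained("zu-xh-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# A deliberately small configuration for the resource-constrained setting.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Masked-language-modelling objective with the standard 15% masking rate.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="zu-xh-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=10,
    learning_rate=1e-4,
)
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

Swapping the pretraining objective (e.g. to a causal or whole-word-masking variant) or changing which related languages are concatenated into the corpus corresponds to the configuration choices the paper studies.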
Keywords
Language Modelling, Low-Resource Languages, Transformers, Multilingual, Pretraining