JUCS - Journal of Universal Computer Science 30(13): 1849-1871, doi: 10.3897/jucs.118889
Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages
Ruan Visser, Trieko Grobler, Marcel Dunaiski
Stellenbosch University, Stellenbosch, South Africa
Abstract
To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of language model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we first pretrain language models from scratch on individual low-resource languages using a variety of model configurations, and then incrementally add related languages to explore their effect on model performance. We demonstrate that small data volumes can be leveraged effectively, and that the choice of pretraining objective and multilingual dataset composition significantly influences model performance. Our monolingual and multilingual models exhibit competitive, and in some cases superior, performance compared to established multilingual models such as XLM-R-base and AfroXLM-R-base.
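As a rough illustration of the setup described above, and not the authors' exact pipeline, the sketch below pretrains a small RoBERTa-style model from scratch with a masked-language-modelling objective on a Zulu corpus and then extends the training corpus with a related language (Xhosa). The corpus file names, the tokenizer identifier, and all hyperparameters are illustrative assumptions.

```python
# A minimal sketch, assuming hypothetical corpus files ("zu_corpus.txt",
# "xh_corpus.txt") and a previously trained tokenizer ("zu-xh-tokenizer");
# it is not the authors' exact pipeline or hyperparameters.
from datasets import load_dataset, concatenate_datasets
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Plain-text corpora, one paragraph per line (file names are assumptions).
zulu = load_dataset("text", data_files="zu_corpus.txt")["train"]
xhosa = load_dataset("text", data_files="xh_corpus.txt")["train"]

# Monolingual run: Zulu only. Multilingual run: add the related language.
corpus = concatenate_datasets([zulu, xhosa])

# Tokenizer assumed to have been trained beforehand on the same corpus.
tokenizer = RobertaTokenizerFast.from_pretrained("zu-xh-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# A deliberately small configuration for the resource-constrained setting.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Masked-language-modelling objective with the standard 15% masking rate.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="zu-xh-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=10,
    learning_rate=1e-4,
)
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

Swapping the pretraining objective (e.g. to a causal or whole-word-masking variant) or changing which related languages are concatenated into the corpus corresponds to the configuration choices the paper studies.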
Keywords
Language Modelling, Low-Resource Languages, Transformers, Multilingual, Pretraining