Is Small Language Model the Silver Bullet to Low-Resource Languages Machine Translation?
Summary
This study investigates whether small language models (SLMs) can effectively address the translation challenges of low-resource languages (LRLs). Using the FLORES-200 benchmark, it evaluates SLMs across 200 languages, revealing significant performance gaps for LRLs compared to high-resource languages. The research explores knowledge distillation from large teacher models to SLMs via supervised fine-tuning. For example, the LLM-as-a-Judge score for English→Luxembourgish translation improved from 0.36 to 0.89 for Llama-3.2-3B. The study examines various fine-tuning configurations, confirming that SLMs can retain general capabilities without catastrophic forgetting. It also tests the approach on other LRLs like Khasi, Assamese, and Ukrainian, showing consistent improvements. Results indicate that distillation from large models significantly enhances SLM performance in LRL translation, even with limited data. However, the study finds that LoRA fine-tuning is less effective and may increase hallucinations. The work highlights the limitations of current SLMs for LRLs and offers practical recommendations for improving translation systems through distillation, emphasizing the need for equitable progress in multilingual NLP.
Is Small Language Model the Silver Bullet to Low-Resource Languages Machine Translation?

Yewei Song†1, Lujun Li†1, Cedric Lothritz2, Saad Ezzini3, Lama Sleem1, Niccolò Gentile4, Radu State1, Tegawendé F. Bissyandé1, Jacques Klein1
1University of Luxembourg, 2Luxembourg Institute of Science and Technology, 3King Fahd University of Petroleum and Minerals, 4Foyer S.A.
Correspondence: yewei.song@uni.lu

Abstract
Low-resource languages (LRLs) lack sufficient linguistic resources and are underrepresented in benchmark datasets, resulting in persistently lower translation quality than high-resource languages, especially in privacy-sensitive and resource-limited contexts. Firstly, this study systematically evaluates state-of-the-art smaller Large Language Models in 200 languages using the FLORES-200 benchmark, highlighting persistent deficiencies and disparities in the translation of LRLs. To mitigate these limitations, we investigate knowledge distillation from large pre-trained teacher models to Small Language Models (SLMs) through supervised fine-tuning. The results show substantial improvements; for example, the translation performance of English→Luxembourgish (EN→LB), measured by the LLM-as-a-Judge score, increases from 0.36 to 0.89 on the validation set for Llama-3.2-3B. We further investigate various fine-tuning configurations and tasks to clarify the trade-offs between data scale and training efficiency, verify that the model retains its general capabilities without significant catastrophic forgetting after training, and explore the distillation benefits to other LRLs on SLMs (Khasi, Assamese, and Ukrainian). In general, this work exposes the limitations and fairness issues of current SLMs in LRL translation and systematically explores the potential of distilling knowledge from large to small models, offering practical, empirically grounded recommendations to improve LRL translation systems.¹

† These authors contributed equally to this work.
¹ https://anonymous.4open.science/r/mt_luxembourgish-408D

1 Introduction
Low-resource languages (LRLs) suffer from a lack of critical linguistic resources, and this scarcity is often rooted in socioeconomic, geographical, and political factors, which contribute to their poor support in academic research and industrial applications (Nigatu et al., 2024).
While recent progress has greatly improved translation for High-Resource Languages (HRLs), large performance gaps remain for LRLs, especially in areas like finance and government, where privacy is crucial and models often run on low-power, offline devices (Zhong et al., 2024). Recent multilingual transfer and pretraining methods (Conneau et al., 2020; Artetxe and Schwenk, 2019), exemplified by initiatives such as No Language Left Behind (NLLB; Costa-jussa et al., 2022), have greatly improved cross-lingual representation and translation quality. However, these approaches typically rely on substantial high-quality parallel datasets, which are rarely available for LRLs, especially in formal domains such as news and official communications. Translation from LRLs into HRLs is typically more straightforward because of the greater abundance of target-side resources, while the reverse direction remains considerably challenging. Furthermore, portable Small Language Models (SLMs), i.e., models with fewer than 4B parameters that are critical for mobile devices, exhibit weaker performance on LRL tasks, exacerbating the existing translation gap. To explore the current "landscape" and the applicability of transformer-based models for LRLs, this article makes the following three key contributions.
First, we quantitatively analyze LLM performance in 200 languages using the FLORES-200 benchmark (Section 4), highlighting disparities that affect underrepresented languages. Our analysis underscores the concerning state of LLMs for LRLs and reveals that SLMs exhibit even more pronounced deficiencies in these languages. Second, we demonstrate that alternative single-sided data sources (e.g., news articles and monolingual resources) can be used to distill knowledge from teacher models (Section 5), improving translation quality in both directions (LRL↔HRL) for SLMs and helping address the scarcity of parallel data. Third, we investigate a range of fine-tuning configurations and methodologies (Section 6), offering practical guidelines for Supervised Fine-Tuning (SFT) utilizing distilled data generated by teacher models.

2 Related Work

2.1 Generative Models
Transformer-based architectures have significantly advanced machine translation through multilingual embeddings and nuanced language generation (Zhao et al., 2023; Zhao et al.). Current translation models typically employ encoder-decoder architectures with attention mechanisms (Bahdanau et al., 2015; Vaswani et al., 2017; Naveed et al., 2024), or decoder-only frameworks exemplified by the GPT series, recognized for computational efficiency and ease of fine-tuning (Gao et al., 2022; Hendy et al., 2023). Recent methods such as back-translation (Sennrich et al., 2016), unsupervised translation (Lample et al., 2018), and multilingual systems such as OPUS-MT (Tiedemann and Thottingal, 2020) further enhance translation quality.
However, decoder-only models often face limitations for LRLs due to predominantly English-centric training data (Brown et al., 2020; Hasan et al., 2024), leading to translation inaccuracies and hallucinations (Benkirane et al., 2024). Despite these challenges, recent findings suggest that decoder-only architectures may outperform encoder-decoder models in certain translation tasks (Gao et al., 2022; Silva et al., 2024), motivating our investigation into their application for improving LRL translations.

2.2 Limited Support for LRLs
Despite considerable advances, current LLMs offer insufficient support for low-resource languages. Research consistently demonstrates substantial performance degradation in LRL translation tasks compared to high-resource languages (Robinson et al., 2023). This performance gap arises primarily from unbalanced training datasets that overwhelmingly favor high-resource languages (Blasi et al., 2022; Lankford et al., 2021). Furthermore, tokenization biases and uneven data exposure hinder the ability of models to accurately capture linguistic nuances unique to LRLs (Shen et al., 2024). Addressing these shortcomings requires targeted data enhancement techniques and customized fine-tuning methods to significantly enhance LLM capabilities for low-resource language tasks (Elsner et al., 2024; Li et al., 2025b).
3 Research Questions

RQ1: How effectively do large language models (LLMs) address low-resource machine translation, and what are the comparative performance gaps in translation quality among different model sizes and languages?
RQ2: How much can distillation from monolingual LRL-side data improve the translation performance of smaller LLMs in low-resource languages?
RQ3: How do different SFT settings affect model performance on low-resource language translation tasks, and do they risk compromising the model's general capabilities? Are similar improvements observed in diverse LRLs?

4 Investigation of LRLs

4.1 Situation of Language Support
Recent investigations have revealed that although LLMs are increasingly advertised as multilingual, their effective language support is often limited to a subset of HRLs. Moreover, systematic evaluations of language-specific performance remain scarce (see, for example, Lai et al., 2024; Marchisio et al., 2024; Lifewire, 2024; Ahuja et al., 2024). Table 1 summarizes several models included in our experiments, their approximate parameter sizes, and the estimated number of languages they reportedly support. These figures are derived from official model documentation, benchmarking reports, and recent academic studies.

Model | Size | Languages | Date
GPT-4o-mini | - | ~25 | Jul. 2024
Llama-3.1-8B-it | 8B/3B | ~30 | Jul. 2024
Llama-3.2-3B-it | 3B | ~20 | Sept. 2024
Mistral-8B-Instruct-2410 | 8B | ~25 | Oct. 2024
Phi-3-mini-4k-instruct | 4B | ~20 | Apr. 2024
Phi-3.5-mini-instruct | 4B | ~20 | Aug. 2024
Qwen2.5 Instruct | 1.5B/3B | ~25 | Sept. 2024
Gemma-2 Instruct | 2B/9B | ~20 | Jul. 2024

Table 1: Overview of Multilingual Support in LLMs
Despite these encouraging multilingual claims, the existing literature reveals that rigorous language-specific performance evaluations, especially for low-resource languages, are insufficient. Most current research focuses on high-resource benchmarks, leaving open critical questions about fairness and the accessibility of LLMs for diverse linguistic communities.

4.2 Evaluating Language Ability

Figure 1: Evaluation pipeline on FLORES-200 (generated translations are compared against the golden translations using LLM-as-a-Judge and semantic metrics)

We use the FLORES-200 benchmark to systematically assess the performance of LLMs in multilingual machine translation tasks (Costa-jussa et al., 2022; Goyal et al., 2021a; Guzmán et al., 2019). FLORES-200 offers rigorously curated, human-validated translation datasets across 200 languages that span diverse linguistic families and writing systems, making it highly effective for evaluating translation quality in high-resource and low-resource linguistic contexts. Our experiments leverage the full FLORES-200 dataset to comprehensively evaluate translation quality across as many languages as possible, emphasizing translations from various source languages into English.
In addition to traditional metrics, we evaluate translation quality using the LLM-as-a-Judge (LLMaaJ) score (Niklaus et al., 2025), which uses a large LLM to score translations from 0 to 1 based on semantic equivalence and naturalness. A score of 1.0 denotes a perfect translation and 0.0 a totally incorrect one. In practice, we consider a score ≥ 0.8 as indicative of a good translation. Research has shown that LLMaaJ tolerates synonyms, paraphrases, and cross-linguistic structural variations, enabling it to better assess translation quality when there are multiple valid phrasings or when grammatical and typological differences (e.g., omitted pronouns) are acceptable (Zheng et al., 2023; Piergentili et al., 2025).
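To make the scoring protocol concrete, the sketch below shows one way such an LLM-as-a-Judge call can be implemented. The prompt wording, the `ask_judge` callable, and the parsing logic are illustrative assumptions rather than the paper's exact setup; only the 0-1 scale and the 0.8 "good translation" threshold come from the text above.

```python
import re
from typing import Callable

JUDGE_PROMPT = (
    "You are a translation quality judge. Given a source sentence, a candidate "
    "translation, and a reference translation, reply with only a score between "
    "0.0 and 1.0 (step 0.1), where 1.0 means semantically equivalent and natural.\n"
    "Source: {src}\nCandidate: {hyp}\nReference: {ref}\nScore:"
)

def llm_as_a_judge(src: str, hyp: str, ref: str, ask_judge: Callable[[str], str]) -> float:
    """Score one translation; `ask_judge` wraps whichever judge LLM is used."""
    reply = ask_judge(JUDGE_PROMPT.format(src=src, hyp=hyp, ref=ref))
    match = re.search(r"\d(?:\.\d+)?", reply)  # pull the first number out of the reply
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)

# A translation is treated as "good" when the judge score is >= 0.8.
```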
Regarding the LLMs investigated, as shown in Figure 1, we systematically traversed prominent proprietary APIs and open-source models (refer to Table 1), presenting results using the LLMaaJ metric alongside quantitative semantic evaluations. Detailed LLMaaJ and BLEU scores for all source-to-English translations are provided in Appendix Table 6 and Appendix Table 7.

4.3 LLM Performance on FLORES-200
In this paper, we replace the regional map with Figure 2, which more precisely visualizes the distribution of languages in our evaluation set by linguistic family and script, thus answering RQ1. Each bar length is calculated from the average score, explicitly excluding the GPT-4o-mini model's score. This family-level visualization makes it easier to identify which LRLs are included in our experiments and how they are situated in the broader typological space.
Each bar in Figure 2 represents one language, grouped by its primary family, with bar length corresponding to the average LLMaaJ score. The figure reveals that LRLs are not evenly distributed across families: many under-resourced African, Austronesian, and Indigenous American languages cluster toward the lower end of the performance spectrum, while certain Indo-European LRLs (e.g., Luxembourgish, Maltese) perform moderately better, likely due to greater data availability or proximity to high-resource relatives.
The circular layout also highlights structural gaps in the evaluation set. Languages absent from FLORES-200, such as many North American Indigenous languages, do not appear here, not because models perform well on them, but because no evaluation data exists. This is particularly relevant for languages with small speaker populations or those concentrated in politically marginalized communities, which remain invisible in current multilingual benchmarks.
Consistent with previous work (Nekoto et al., 2020; Joshi et al., 2020), the lowest scores are observed for many Niger-Congo, Austronesian, and smaller Afro-Asiatic languages, reflecting severe data scarcity. In contrast, LRLs in Eastern Europe and South/Southeast Asia, such as Macedonian or Sinhala, achieve slightly higher average scores, possibly benefiting from historical ties to better-supported high-resource languages. However, the overall pattern remains unchanged: LRLs across all families systematically lag behind high-resource languages, underscoring the need for targeted data collection, typologically diverse benchmarks, and bias mitigation strategies to ensure equitable progress in multilingual NLP.

Figure 2: "Low-resource" linguistic results grouped by language families

4.4 Gap between Dwarf (Smaller) and Giant LLMs
Across the Indo-Aryan, Germanic, and Slavic branches in Figure 3 (panels (a)-(c)), we observe a consistent pattern: smaller LLMs suffer a substantially larger performance drop on low-resource languages (LRLs) than on high-resource ones, while larger LLMs degrade far less. Concretely, LRLs such as Sinhala (Indo-Aryan), Luxembourgish (Germanic), and Silesian (Slavic) exhibit steep declines in smaller models but remain comparatively competitive in larger models, as visualized in Figure 3.
This disparity indicates a systematic bias in current systems, particularly pronounced in smaller models, toward high-resource languages. Addressing this gap calls for improved LRL data curation, transfer via knowledge distillation from larger LLMs, and more inclusive evaluation suites that foreground LRLs, alongside bias-mitigation strategies, so that advances in NLP benefit all language communities. In principle, according to the Universal Approximation Theorem (Hornik, 1991), if we assume that the translation task can be modeled as a linear mapping from one semantic space to another, then a small-capacity network is limited in its ability to capture complex patterns and is therefore more susceptible to being perturbed or contaminated by the large-volume data typical of HRLs. Consequently, fine-tuning becomes intuitively all the more important for smaller models.

5 Fine-tuning on LRLs

5.1 Background and language selection
As highlighted in the previous section, several low-resource languages, such as Luxembourgish and Assamese (Figure 3), show a substantial translation quality gap between large and small models. In this article, Luxembourgish serves as a representative case. Although officially recognized, it lacks sufficient high-quality corpus resources, leading to poor performance in SLMs. Its blend of Germanic roots and French influence adds complexity to NLP tasks. While larger LLMs handle Germanic languages reasonably well, they struggle with LRLs like Luxembourgish. Previous efforts to address this include LuxemBERT (Lothritz et al., 2022), LuxT5 (Plum et al., 2024), and LetzTranslate (Song et al., 2023), a low-resource translation system based on OPUS-MT.
Figure 3: LLMaaJ scores of smaller LLMs on (a) Indo-Aryan, (b) Germanic, and (c) Slavic languages for to-English translation

To examine generalizability, we additionally include Ukrainian, Assamese, and Khasi (an endangered language), all exhibiting similar linguistic and resource profiles, as supplementary tasks to broaden the scope of the analysis. Furthermore, generating LRL text from English is more challenging for LLMs than the reverse direction, as noted in previous research (Howcroft and Gkatzia, 2022). Regarding translation performance, LLMs exhibit a certain degree of fluency when translating from an LRL into English, but not vice versa (Gao et al., 2020).
This asymmetry is also reflected to some extent in the hallucination issues observed when generating Luxembourgish; more details can be found in Appendix F.2.

5.2 Distillation and Soft-Target Quality
In our scenario, having only a Luxembourgish corpus without English translations rules out conventional parallel-corpus training approaches, accurately reflecting the typical data situation and model generation of LRLs. To bridge the gap between comprehension and generation in this low-resource scenario, we propose a distillation-based approach. Using a teacher model that demonstrates a robust understanding of Luxembourgish, we distill its knowledge into a student model using the available single-sided LRL corpus. This process is expected to enhance the generation capabilities of the student model, enabling it to produce high-quality Luxembourgish output despite the limited data, and thus address the core challenge of low-resource language translation. According to further human labeling of our GPT-4o distillation dataset for Luxembourgish-to-English translation, 92% of our samples were marked as fully correct.
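The sketch below illustrates how such soft targets (pseudo-parallel pairs) can be produced from the monolingual corpus, using GPT-4o-mini, one of the teacher models employed later in the paper. The exact prompt, decoding settings, and file layout are our own assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def distill_soft_targets(lb_sentences, out_path="distill_gpt4o_mini.jsonl"):
    """Turn a monolingual Luxembourgish corpus into pseudo-parallel LB-EN pairs."""
    with open(out_path, "w", encoding="utf-8") as f:
        for lb in lb_sentences:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system",
                     "content": "You are a professional Luxembourgish-English translator."},
                    {"role": "user",
                     "content": f"Translate into English. Output only the translation:\n{lb}"},
                ],
                temperature=0.0,
            )
            en = resp.choices[0].message.content.strip()
            # One pseudo-parallel pair per line: Luxembourgish source, English soft target.
            f.write(json.dumps({"lb": lb, "en": en}, ensure_ascii=False) + "\n")
```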
5.3 Data Collection and Augmentation
For the training dataset, we constructed a Luxembourgish dataset from multiple sources, including the LuxemBERT corpus, example sentences from the Luxembourg Online Dictionary (LOD) dataset², and additional news articles collected from previously published research data on RTL Lëtzebuerg³, following the LuxemBERT work.
Previous research has demonstrated that integrating dictionary entries can effectively enrich low-resource translation systems by providing explicit lexical alignments and clarifying semantic nuances. For example, Ghazvininejad's work improved translation fidelity in settings where parallel data is scarce (Ghazvininejad et al., 2023). Inspired by these findings, we also explore how an additional group of datasets with dictionary checks using LOD can complement our distillation approach, as shown in Figure 4. Details of the dictionary usage are given in Appendix C.

Figure 4: Pipeline of data augmentation (example input for fine-tuning: "Wéi vill Sprooche schwätzt Dir?"; soft target: "How many languages do you speak?"; dictionary check: "Sprooche = language")

² https://data.public.lu/en/datasets/letzebuerger-online-dictionnaire-lod-linguistesch-daten/
³ https://www.rtl.lu/
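As a rough illustration of what the dictionary check described above could look like, the sketch below flags distilled pairs in which none of the LOD-listed English equivalents of a source word appear in the soft target. The paper's actual procedure is detailed in its Appendix C and is not reproduced here; this function and the `lod_dict` mapping are purely hypothetical.

```python
import re

def dictionary_check(lb_sentence, en_soft_target, lod_dict):
    """Return source words whose dictionary translations are all missing from the soft target."""
    target_words = set(re.findall(r"\w+", en_soft_target.lower()))
    suspicious = []
    for word in re.findall(r"\w+", lb_sentence.lower()):
        expected = lod_dict.get(word)          # set of plausible English equivalents, if any
        if expected and not (expected & target_words):
            suspicious.append(word)
    return suspicious

# Example with the pair from Figure 4: the check passes because "languages"
# is one of the expected equivalents of "Sprooche".
flags = dictionary_check(
    "Wéi vill Sprooche schwätzt Dir?",
    "How many languages do you speak?",
    {"sprooche": {"language", "languages"}},   # hypothetical LOD-derived mapping
)
assert flags == []
```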
5.4 Model SFT
State-of-the-art decoder-only models are typically trained in three stages: pre-training, Supervised Fine-Tuning (SFT), and further tuning using Reinforcement Learning from Human Feedback (Ouyang et al., 2022). When combined with techniques such as Low-Rank Adaptation (LoRA; Hu et al., 2022), SFT can also significantly improve performance on tasks with fewer resources. In this study, in order to validate our distillation strategy, we primarily adopt model distillation from LRL-side corpora and then incorporate SFT to equip the model to generate and reverse-translate the target language. We use the classical approach of supervised instruction fine-tuning for two different models. The fundamental logic is to provide the model with input prompts and corresponding responses, optimizing the model to minimize prediction loss within this fixed framework. In decoder-only models, text generation is performed recursively by predicting the probability distribution over the vocabulary for the next token. SFT aims primarily to maximize the probability of the correct next token, thereby teaching the trained model the relationships between semantics, vocabulary, and syntax in the LRL, as well as their correspondence with the HRL.
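A minimal sketch of this objective is shown below: the prompt tokens are masked out of the loss so that cross-entropy is computed only on the response, i.e., the model is trained to maximize the probability of the correct next target-side token. This masking scheme is standard SFT practice and an assumption on our part; the paper does not spell out its implementation.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def build_sft_example(prompt: str, response: str, max_len: int = 512):
    """Tokenize one (prompt, response) pair; the loss is only taken on the response."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_len]
    # Prompt positions get label -100 so cross-entropy ignores them; the model
    # only learns to predict the target-side tokens.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}
```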
6 Experiments

6.1 Models and Datasets
The latest open-source models are used as benchmark models, and their instruction-tuned versions are utilized to leverage their general capabilities in generating dialogues and answering questions. Based on the current leaderboard for Luxembourgish proficiency in LLMs (Lothritz and Cabot, 2025), combined with the experimental results for the Germanic language group in Section 4, we select the top two tiny base models: Llama-3.2-3B-Instruct from Meta and Gemma-2-2b-it from Google.
The design of the input templates is considered crucial. In order to prevent the model from losing its general communication and generalization abilities after instruction tuning, prompts need to be designed in alignment with chat templates that the model can understand. Based on this, basic prompt testing is conducted to identify the most suitable prompt for the model. Chat-based models have been observed to be prone to losing their communication capabilities after SFT, leading to the generation of endless content and a significant increase in the likelihood of hallucinations. Therefore, in the design of the questions, corresponding starting prompts are set at the beginning of the model responses, such as "Here is the translation: ". Through this linguistic guidance, the probability of hallucination is reduced and the model also learns when to stop.
For the training dataset, the LRL monolingual corpus is used primarily as the base material, from which the LRL-to-English mapping capability is distilled from larger models. As described in Section 5.3, publicly available press datasets and dictionary example sentences are utilized as the monolingual corpus, and distillation is performed using various teacher models. Finally, the correct word-to-word mapping capability is reinforced through lemma search to verify the dictionary content. We classify the distilled fake targets into four categories: fake targets obtained by distilling facebook/nllb-200-3.3B (Distill-NLLB, DN), fake targets obtained by distilling meta-llama/Llama-3.3-70B-Instruct (Distill-Llama, DL), fake targets obtained by distilling GPT-4o-mini (Distill-GPT4O, DG), and fake targets obtained after performing dictionary checking (Distill-GPT-Dict-Checking, DGDC). Each category contains 621,033 data samples used for model training, all having the same LRL-side texts, while the corresponding fake targets are generated by different teacher models.
For the validation set, the latest 300 press entries (Val 300) from 2024 are used as monolingual corpus data, and the corresponding English mappings are identified for the LRL entries, thus preventing biases that may arise from the model having been trained on the validation dataset. We also perform a manual check of the English translations. Furthermore, we utilize the FLORES-200 benchmark as an additional validation test set.
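The sketch below shows how one distilled pair might be rendered into a training string with the model's own chat template and the fixed response prefix "Here is the translation: " described above. The instruction wording is an illustrative assumption, not the paper's exact prompt.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def format_sample(lb_sentence: str, en_soft_target: str) -> str:
    """Render one distilled pair as a chat-formatted training string."""
    messages = [
        {"role": "user",
         "content": f"Translate the following Luxembourgish sentence into English:\n{lb_sentence}"},
        {"role": "assistant",
         "content": f"Here is the translation: {en_soft_target}"},
    ]
    # The model's own chat template keeps the sample consistent with how the
    # instruction-tuned model expects conversations to be formatted.
    return tokenizer.apply_chat_template(messages, tokenize=False)

print(format_sample("Wéi vill Sprooche schwätzt Dir?", "How many languages do you speak?"))
```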
6.2 Metrics
There are multiple metrics available for MT tasks (Lo et al., 2023), and we mainly use the following three for performance evaluation in our experiments: SPBLEU (SentencePiece BLEU), CharF++, and the Jaccard index. SPBLEU measures the similarity between machine translation outputs and reference translations using n-gram precision, employing a standardized SentencePiece model for subword tokenization and allowing effective differentiation between the performance of high-resource and low-resource languages, which makes it very valuable for comparative evaluation of multilingual models. CharF++ extends the character-level F-score (Popović, 2015) metric used for machine translation evaluation, incorporating both character and word n-grams, and shows a strong correlation with human judgments at both the system and segment levels. The Jaccard index (da F. Costa, 2021) is a fundamental statistical method for measuring the similarity between sample sets, offering mathematical simplicity and interpretability, which makes it widely applicable across scientific disciplines. For LLMaaJ, we use google/gemma-3-27b-it as the judge throughout the entire paper.
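For reference, the sketch below computes the three surface metrics with sacrebleu plus a simple set-based Jaccard index. The "flores200" SentencePiece tokenizer option assumes a recent sacrebleu release; treat the exact option name as an assumption.

```python
from sacrebleu.metrics import BLEU, CHRF

spbleu = BLEU(tokenize="flores200")  # SentencePiece-based BLEU (assumes a recent sacrebleu)
charfpp = CHRF(word_order=2)         # word_order=2 turns chrF into chrF++ (CharF++)

def jaccard(hyp: str, ref: str) -> float:
    """Token-level Jaccard index between a hypothesis and a reference."""
    a, b = set(hyp.lower().split()), set(ref.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

hyps = ["How many languages do you speak?"]
refs = [["How many languages do you speak?"]]  # one reference stream

print(spbleu.corpus_score(hyps, refs).score)
print(charfpp.corpus_score(hyps, refs).score)
print(jaccard(hyps[0], refs[0][0]))
```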
MT Direction | Model | Method | Val 300 (SPBLEU / CharF++ / Jaccard / LLMaaJ) | FLORES-200 (SPBLEU / CharF++ / Jaccard / LLMaaJ)
EN-LB | Nllb-200-3.3B | BM | 19.97 / 37.03 / 0.27 / 0.75 | 31.14 / 49.62 / 0.35 / 0.85
EN-LB | Llama-3.3-70B-Instruct | - | 24.35 / 46.58 / 0.27 / 0.87 | 22.55 / 43.08 / 0.26 / 0.83
EN-LB | Llama-3.2-3B-Instruct | BM | 6.46 / 26.78 / 0.12 / 0.36 | 4.80 / 22.10 / 0.09 / 0.36
EN-LB | Llama-3.2-3B-Instruct | DN | 37.98 / 55.41 / 0.37 / 0.82 | 14.61 / 38.04 / 0.19 / 0.51
EN-LB | Llama-3.2-3B-Instruct | DL | 40.71 / 57.37 / 0.40 / 0.79 | 20.93 / 41.51 / 0.22 / 0.52
EN-LB | Llama-3.2-3B-Instruct | DG | 42.01 / 57.89 / 0.41 / 0.88 | 22.80 / 42.26 / 0.25 / 0.70
EN-LB | Llama-3.2-3B-Instruct | DGDC | 42.16 / 57.87 / 0.42 / 0.89 | 23.40 / 42.90 / 0.26 / 0.83
EN-LB | Gemma-2-2b-it | BM | 5.82 / 22.71 / 0.10 / 0.50 | 4.61 / 20.78 / 0.07 / 0.51
EN-LB | Gemma-2-2b-it | DN | 41.77 / 57.71 / 0.42 / 0.89 | 20.41 / 41.21 / 0.25 / 0.78
EN-LB | Gemma-2-2b-it | DL | 43.78 / 59.02 / 0.44 / 0.87 | 24.03 / 42.95 / 0.28 / 0.79
EN-LB | Gemma-2-2b-it | DG | 44.58 / 59.73 / 0.45 / 0.87 | 23.47 / 42.72 / 0.28 / 0.76
EN-LB | Gemma-2-2b-it | DGDC | 44.12 / 59.10 / 0.45 / 0.90 | 23.50 / 42.49 / 0.28 / 0.82
LB-EN | Nllb-200-3.3B | BM | 40.51 / 56.81 / 0.48 / 0.81 | 48.45 / 65.03 / 0.56 / 0.85
LB-EN | Llama-3.3-70B-Instruct | - | 54.14 / 74.24 / 0.57 / 0.89 | 33.96 / 58.02 / 0.41 / 0.86
LB-EN | Llama-3.2-3B-Instruct | BM | 26.31 / 45.98 / 0.33 / 0.58 | 17.62 / 36.79 / 0.26 / 0.46
LB-EN | Llama-3.2-3B-Instruct | DN | 42.78 / 59.33 / 0.48 / 0.82 | 29.37 / 53.88 / 0.38 / 0.79
LB-EN | Llama-3.2-3B-Instruct | DL | 54.64 / 70.98 / 0.57 / 0.82 | 31.72 / 56.50 / 0.41 / 0.79
LB-EN | Llama-3.2-3B-Instruct | DG | 59.88 / 74.97 / 0.63 / 0.90 | 32.78 / 57.69 / 0.42 / 0.81
LB-EN | Llama-3.2-3B-Instruct | DGDC | 57.88 / 73.46 / 0.60 / 0.89 | 32.56 / 57.60 / 0.41 / 0.85
LB-EN | Gemma-2-2b-it | BM | 27.11 / 47.44 / 0.34 / 0.60 | 14.99 / 37.77 / 0.26 / 0.45
LB-EN | Gemma-2-2b-it | DN | 41.58 / 57.63 / 0.49 / 0.83 | 42.46 / 60.55 / 0.51 / 0.83
LB-EN | Gemma-2-2b-it | DL | 58.95 / 72.15 / 0.62 / 0.83 | 41.47 / 60.33 / 0.50 / 0.82
LB-EN | Gemma-2-2b-it | DG | 65.44 / 76.96 / 0.68 / 0.86 | 42.67 / 61.30 / 0.51 / 0.86
LB-EN | Gemma-2-2b-it | DGDC | 62.75 / 75.13 / 0.65 / 0.89 | 42.73 / 61.25 / 0.51 / 0.85

Table 2: Performance results obtained from training on datasets generated with different distillation models and methods, reported on two evaluation sets, Val 300 and FLORES-200. We also evaluate Nllb-200-3.3B and Llama-3.3-70B-Instruct on the same sets, which strongly validates the effectiveness of our training approach. BM refers to the Base Model without any SFT. LLMaaJ refers to LLM-as-a-Judge, which gives a score from 0.0 to 1.0 with a granularity of 0.1.
6.3 Results

6.3.1 Performance gain
The results in Table 2 clearly demonstrate that fine-tuning in both translation directions is highly effective. For example, the baseline EN→LB models exhibit SPBLEU scores around 30, but after fine-tuning these scores increase to nearly 38-40, values approaching our threshold for high-quality translations (SPBLEU > 40). In contrast, LB→EN translations consistently score above 40, yet generating fluent Luxembourgish in the EN→LB direction remains a significant challenge. Furthermore, our experiments indicate that even a 3B model, when effectively distilled, can rival or even surpass larger models in low-resource language translation tasks. Our results indicate that GPT-4o-based distillation methods, in particular, produce substantial improvements in translation quality, confirming that LLM-generated parallel corpora represent a viable and promising strategy for supporting LRL translation tasks. To validate the model translation performance, we also extracted a portion of the data and asked Luxembourgers who are at least bilingual in Luxembourgish and English to label it as ground truth for data quality validation. The SPBLEU score achieved with these labeled data was 51.08 on our fine-tuned Gemma-2-2b, comparable to the score calculated using GPT-generated data as ground truth. Regarding the LLMaaJ score of the model, we obtained performance evaluation results and trends that are largely consistent with those of SPBLEU, further cross-validating the feasibility of LLMaaJ. However, since LLMs are black-box models with limited interpretability, the scores produced by LLMaaJ can only serve as a reference and do not guarantee accuracy or validity.
To address RQ2, the performance improvement of the model after fine-tuning on distillation-enhanced data is highly significant. For the two tested models, the performance gains are reflected in SPBLEU scores that surpass those of certain expert translation models. Furthermore, the enhancement observed in the EN→LB direction is greater than that in the reverse direction, further strengthening the model's ability to generate Luxembourgish. Therefore, distillation from LRL-side data can substantially improve the translation capacity of the model for low-resource languages, and even smaller models can achieve promising results.

MT Direction | Model | BOOLQ | CB | COPA | MULTIRC | RECORD | RTE | WIC | WSC | AVG
BM (Base Model) | Llama-3.2-3B-Instruct | 0.62 | 0.55 | 0.71 | 0.52 | 0.41 | 0.64 | 0.51 | 0.28 | 0.53
BM (Base Model) | Gemma-2-2b-it | 0.73 | 0.55 | 0.86 | 0.81 | 0.56 | 0.82 | 0.49 | 0.56 | 0.67
EN-LB | Llama-3.2-3B-Instruct-FT | 0.64 | 0.39 | 0.60 | 0.52 | 0.39 | 0.60 | 0.48 | 0.11 | 0.47
EN-LB | Gemma-2-2b-it-FT | 0.71 | 0.52 | 0.89 | 0.75 | 0.41 | 0.72 | 0.51 | 0.49 | 0.62
LB-EN | Llama-3.2-3B-Instruct-FT | 0.64 | 0.30 | 0.69 | 0.51 | 0.46 | 0.62 | 0.52 | 0.24 | 0.50
LB-EN | Gemma-2-2b-it-FT | 0.69 | 0.25 | 0.90 | 0.76 | 0.45 | 0.73 | 0.51 | 0.43 | 0.59

Table 3: Variations in overall performance on the SuperGLUE benchmark before and after distillation training, evaluating whether fine-tuning on LRLs induces catastrophic forgetting. Model names with the suffix "-FT" denote the models after applying the proposed distillation fine-tuning method.
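A before/after comparison like the one in Table 3 can be run with EleutherAI's lm-evaluation-harness; the sketch below uses its Python entry point. The task names and the simple_evaluate signature follow one recent harness version and are assumptions, since the paper does not state which evaluation tooling was used.

```python
import lm_eval

# Compare the base and fine-tuned checkpoints by swapping the `pretrained=` path.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2-2b-it,dtype=bfloat16",
    tasks=["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```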
6.3.2 Does data size really matter?

Figure 5: Performance as a function of training data size ratio. The dashed, transparent lines indicate the trend of CharF++ scores, while the solid lines represent SPBLEU scores. The x-axis denotes the proportion of the training data relative to the full original training set.

Figure 5 illustrates the strong influence of dataset size on translation quality in both directions (English↔Luxembourgish); more detailed data are given in Appendix Table 9. Even using as little as 1% of the available data yields modest improvements over the base model, yet the most substantial gains emerge only at higher data ratios. For example, increasing the data from 25% to 100% nearly doubles SPBLEU in the EN→LB direction for both Llama-3.2-3B and Gemma-2-2b. Notably, Gemma-2-2b seems to learn faster in the lower data regimes, but shows some performance attenuation beyond the 50% threshold.

6.3.3 Catastrophic forgetting?
As general-purpose models, the base LLMs are capable of not only performing translation but also handling multiple other tasks such as planning, solving mathematical problems, and coding. However, after training a model specifically for translation, a critical question arises: does the model suffer catastrophic forgetting? This issue is of urgent concern and has significant implications for the potential of the model for generalized usage. To investigate this, we compared model performance on the SuperGLUE benchmark (Sarlin et al., 2020), a widely adopted benchmark suite for evaluating LLM general performance, before and after training. Table 3 presents the performance results, indicating that fine-tuning, while enhancing translation capabilities, has minimal impact on the model's proficiency in other tasks, demonstrating its robustness and adaptability.
The analysis confirms that distillation can enhance translation performance while preserving the overall aptitude of the model across various tasks.

6.3.4 Can we do LoRA?
We also carried out experiments using the same data to assess how the LoRA (Low-Rank Adaptation; Hu et al., 2022) rank parameter influences training performance in translation tasks involving Luxembourgish and English. Specifically, we evaluated ranks 8, 16, 32, and 64 in our models. The results, presented in Tables 4 and 8, indicate that variations in the LoRA rank parameter have a minimal influence on overall translation performance, with differences typically within 1 to 2 SPBLEU points. More importantly, models fine-tuned using LoRA consistently underperformed compared to their fully fine-tuned counterparts, achieving notably lower BLEU scores than those in Table 2. Moreover, after LoRA-based SFT, we also observed an increased tendency toward hallucination. Due to the consistently lower performance and negligible differences observed among the varying LoRA ranks, we opted not to use LoRA fine-tuning in machine translation tasks. Instead, we focused on full-model fine-tuning, which demonstrated significantly better results. These findings suggest that, while LoRA provides computational efficiency, its limited parameter updates are insufficient to capture the nuanced linguistic features required for effective translation of LRLs and may even be harmful.
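For context, the sketch below shows a PEFT LoraConfig whose rank can be varied as in the experiment; apart from the rank values taken from Table 4, the target modules, alpha, and dropout are illustrative assumptions rather than the paper's settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

lora_cfg = LoraConfig(
    r=32,                                                      # rank varied in Table 4 (32/64/128)
    lora_alpha=64,                                             # assumption
    lora_dropout=0.05,                                         # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```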
Table 4: Impact of LoRA rank on SPBLEU during fine-tuning (EN-LB), evaluated across three rank values.

Model | Rank (LoRA) | Val 300 SPBLEU | FLORES-200 SPBLEU
Llama-3.2-3B | Base Model | 6.46 | 4.80
Llama-3.2-3B | 32 | 12.95 | 9.46
Llama-3.2-3B | 64 | 13.05 | 9.23
Llama-3.2-3B | 128 | 13.32 | 9.27
Gemma-2-2B | Base Model | 5.82 | 4.61
Gemma-2-2B | 32 | 13.07 | 8.88
Gemma-2-2B | 64 | 13.17 | 9.12
Gemma-2-2B | 128 | 13.31 | 9.21

6.3.5 How about other LRLs?
We demonstrate that distillation from various large teacher models can elevate the low-resource translation performance of smaller models to a level comparable to that of expert systems, thereby confirming the potential of small models in translation tasks. To further verify the generality of our findings, we additionally extracted 10,000 sentences from WMT 2025 in Khasi, Assamese, and Ukrainian (the Facebook-WikiMatrix-1-eng-ukr subset filtered for sentence lengths between 200 and 299 tokens), along with 1,000 pairs of corresponding sentences as a validation set.
Using the same methodology, we performed data distillation for one-sided sentences with three different models: the previously mentioned NLLB model, the Llama-3.3-70B model, and GPT-4o-mini. We then trained Llama-3.2-3B with identical prompts and evaluated performance on the validation set using the corresponding ground-truth annotations provided by the dataset.
As shown in Figure 6 and Table 10, when model performance is already high, such as in the As→En direction, where the base model reaches a score of 0.64, the effect of distillation is not pronounced. In contrast, for the En→As, En→Kh, En→Lb, and Lb→En directions, the results reveal that distillation from the teacher model is critical, leading to substantial improvements in translation quality. This suggests that distilled data can effectively impart knowledge of resource-scarce languages to small models, with minimal degradation in their general performance.
7 Conclusion
In this study, drawing on evaluations conducted with the FLORES-200 dataset within the scope of RQ1, we quantitatively assessed both open-source and closed-source state-of-the-art models and examined the performance distribution of LLM support across different language families. Our findings provide quantitative evidence of the current state of technological inequity. Unfortunately, LLMs are far from a universal solution for certain endangered LRLs at present.
To address RQ2, we used Luxembourgish as a representative case and primarily leveraged a monolingual Luxembourgish corpus, combined with knowledge distillation and data augmentation techniques, to fine-tune compact 2B/3B-scale LLMs for bidirectional translation between Luxembourgish and English. Our findings reveal that, although smaller models inherently face greater challenges in processing LRLs compared to larger architectures, targeted distillation markedly enhances their performance. The fine-tuned models deliver usable and reliable translations, validated not only through conventional semantic evaluation metrics but also via LLM-as-a-Judge, an LLM-based evaluation framework.
Finally, to provide further insights into distillation within RQ3, our analysis highlights the influence of dataset size, revealing that even minimal data augmentation can substantially improve translation performance, while marginal gains from additional data tend to diminish as the dataset grows. We also verified that this approach generalizes to various low-resource languages, with only a minor impact on their overall capabilities after training.
Lastly, in the context of translation tasks, the use of LoRA is not recommended, as it yields limited performance improvements.
In general, the rapid advancement of LLMs has not been fully extended to LRLs and remains geographically correlated with regional development levels. Although linguistic proximity to an HRL can partially alleviate this gap, the majority of LRLs are still not supported by current LLMs. This underscores both the inherent limitations of LLMs and the urgency of promoting greater technological equity. Moreover, the findings of this study offer practical guidance for developing portable and cost-efficient translation models that effectively support selected LRLs while preserving the overall capabilities of the models, thereby pointing toward a promising direction for advancing LRL translation through transformer-based approaches.

Figure 6: Comparison of the performance of four LRL pairs under the base model (Llama-3.2-3B) and under knowledge distillation from different teacher models, evaluated using the LLMaaJ metric. "As" denotes Assamese, "Kh" denotes Khasi, and "Uk" denotes Ukrainian. Notably, the Kh→En and En→Kh directions lack results for the DN setting (i.e., using NLLB-200-3.3B as the teacher model), as NLLB does not provide support for Khasi.
Limitations
This study has several limitations that should be considered. Firstly, despite efforts to gather diverse data sources, the dataset size and diversity for Luxembourgish remain constrained compared to high-resource languages. As a result, the generalizability of our findings might be limited. Additionally, our reliance on knowledge distillation from large pre-trained models assumes access to high-quality pretrained models, which may not be feasible in all low-resource contexts. Lastly, translation performance metrics such as BLEU scores may not fully capture the nuanced linguistic accuracy or cultural appropriateness of translations, necessitating complementary qualitative assessments in future studies. Moreover, future work may explore validating fine-tuning and distillation performance on more LRLs, as well as on artificially constructed languages such as Elvish. For LRL analysis, many LRLs remain unexamined, such as the Khoe language branch in Namibia.

Ethics Statement
All models and resources developed in this work are strictly intended for research and educational purposes in accordance with OpenAI usage guidelines; no model weights or derivatives are used, or will be used, for any commercial application. We exclusively utilize publicly available corpora or datasets for which explicit authorization has been obtained from the original data providers. All license terms have been reviewed to ensure full compliance with copyright, attribution, and sharing requirements.
No personally identifiable information (PII) is collected during this research. All data processing, storage, and retention policies are fully aligned with the EU General Data Protection Regulation (GDPR). The dataset of LOD.lu is under the CC0 license. As most of the RTL datasets are based on articles from RTL, we cannot publish them, but we make them available to researchers on request.
All code, models, and processed data artifacts will be released under an open-source, research-oriented license (e.g., CC BY-NC), accompanied by comprehensive documentation and bias-analysis methodology to promote transparency and reproducibility. We commit to ongoing ethical oversight through periodic reevaluation of datasets and model outputs, prompt updates in response to emerging concerns, and consultation with interdisciplinary advisory boards to ensure adherence to the highest ethical standards.

References
Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika Bali, et al. 2024. Megaverse: Benchmarking large language models across languages, modalities, models and tasks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2598-2637.
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597-610.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Kenza Benkirane, Laura Gongas, Shahar Pelles, Naomi Fuchs, Joshua Darmon, Pontus Stenetorp, David Adelani, and Eduardo Sánchez. 2024. Machine translation hallucination detection for low and high resource languages using large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9647-9665.
Damián Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. Systematic inequalities in language technology performance across the world's languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486-5505.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451.
Marta Costa-jussa, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Gonzalez, Prangthip Hansanti, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
Luciano da F. Costa. 2021. Further generalizations of the Jaccard index. CoRR, abs/2110.09619.
Micha Elsner et al. 2024. Shortcomings of LLMs for low-resource translation: Retrieval and understanding are both the problem. In Proceedings of the Ninth Conference on Machine Translation, pages 1332-1354.
Luyu Gao, Xinyi Wang, and Graham Neubig. 2020. Improving target-side lexical transfer in multilingual neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3560-3566.
Yingbo Gao, Christian Herold, Zijian Yang, and Hermann Ney. 2022. Is encoder-decoder redundant for neural machine translation? In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 562-574.
Marjan Ghazvininejad, Hila Gonen, and Luke Zettlemoyer. 2023. Dictionary-based phrase-level prompting of large language models for machine translation. arXiv preprint arXiv:2302.07856.
Google. 2024. Gemma-2-2b-it model card.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021a. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021b. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. CoRR, abs/2106.03193.
Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English.
Md. Arid Hasan, Prerona Tarannum, Krishno Dey, Imran Razzak, and Usman Naseem. 2024. Do large language models speak all languages equally? A comparative study in low-resource settings. Preprint, arXiv:2408.02237.
Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are GPT models at machine translation? A comprehensive evaluation. Preprint, arXiv:2302.09210.
Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.
David M. Howcroft and Dimitra Gkatzia. 2022. Most NLG is low-resource: here's what we can do about it. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 336-350.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282-6293. Association for Computational Linguistics.
Wen Lai, Mohsen Mesgar, and Alexander Fraser. 2024. LLMs beyond English: Scaling the multilingual capability of LLMs with cross-lingual feedback. arXiv preprint arXiv:2406.01771.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR).

Séamus Lankford, Haithem Alfi, and Andy Way. 2021. Transformers for low-resource languages: Is féidir linn! In Proceedings of Machine Translation Summit XVIII: Research Track, pages 48–60.

Lujun Li, Lama Sleem, Niccolò Gentile, Geoffrey Nichil, and Radu State. 2025a. Exploring the impact of temperature on large language models: Hot or cold? Procedia Computer Science, 264:242–251. International Neural Network Society Workshop on Deep Learning Innovations and Applications 2025.

Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani, Ninghao Liu, and Mengnan Du. 2025b. Language ranker: A metric for quantifying llm performance across high and low-resource languages. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28186–28194.

Lifewire. 2024. Llama 3 vs. llama 2: Why the newest model leaves its predecessor in the dust. Accessed: 2025-02-03.

Meta Llama. 2024. Llama-3.2-3b-instruct model card.

Chi-kiu Lo, Rebecca Knowles, and Cyril Goutte. 2023. Beyond correlation: Making sense of the score differences of new MT evaluation metrics. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 186–199, Macau SAR, China. Asia-Pacific Association for Machine Translation.

Cedric Lothritz and Jordi Cabot. 2025. Testing low-resource language support in llms using language proficiency exams: the case of luxembourgish. arXiv preprint arXiv:2504.01667.
Cedric Lothritz, Bertrand Lebichot, Kevin Allix, Lisa Veiber, Tegawendé François D Assise Bissyande, Jacques Klein, Andrey Boytsov, Anne Goujon, and Clément Lefebvre. 2022. Luxembert: Simple and practical data augmentation in language model pre-training for luxembourgish. In 13th Language Resources and Evaluation Conference (LREC 2022).

Kelly Marchisio, Wei-Yin Ko, Alexandre Bérard, Théo Dehaze, and Sebastian Ruder. 2024. Understanding and mitigating language confusion in llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6653–6677.

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2024. A comprehensive overview of large language models. Preprint, arXiv:2307.06435.

Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, et al. 2020. Participatory research for low-resourced machine translation: A case study in african languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160. Association for Computational Linguistics.

Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Benjamin Rosman, Thamar Solorio, and Monojit Choudhury. 2024. The Zeno's paradox of "low-resource" languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17753–17774. Association for Computational Linguistics.
Joel Niklaus, Jakob Merane, Luka Nenadic, Sina Ahmadi, Yingqiang Gao, Cyrill A. H. Chevalley, Claude Humbel, Christophe Gösken, Lorenzo Tanzi, Thomas Lüthi, Stefan Palombo, Spencer Poff, Boling Yang, Nan Wu, Matthew Guillod, Robin Mamié, Daniel Brunner, Julio Pereyra, and Niko Grupen. 2025. Swiltra-bench: The swiss legal translation benchmark. Preprint, arXiv:2503.01372.

Chinasa T Okolo and Marie Tano. 2024. Closing the gap: A call for more inclusive language technologies.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.

Andrea Piergentili, Beatrice Savoldi, Matteo Negri, and Luisa Bentivogli. 2025. An llm-as-a-judge approach for scalable gender-neutral translation evaluation. arXiv preprint arXiv:2504.11934.

Alistair Plum, Tharindu Ranasinghe, and Christoph Purschke. 2024. Text generation models for luxembourgish with limited data: A balanced multilingual strategy. arXiv preprint arXiv:2412.09415.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Nathaniel Robinson, Perez Ogayo, David R Mortensen, and Graham Neubig. 2023. Chatgpt mt: Competitive for high- (but not low-) resource languages. In Proceedings of the Eighth Conference on Machine Translation, pages 392–418.

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In ACL, pages 86–96. Association for Computational Linguistics.

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. 2024. The language barrier: Dissecting safety challenges of llms in multilingual contexts. In Findings of the Association for Computational Linguistics ACL 2024, pages 2668–2680.

Ana Silva, Nikit Srivastava, Tatiana Moteu Ngoli, Michael Röder, Diego Moussallem, and Axel-Cyrille Ngonga Ngomo. 2024. Benchmarking low-resource machine translation systems. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), pages 175–185, Bangkok, Thailand. Association for Computational Linguistics.

Yewei Song, Saad Ezzini, Jacques Klein, Tegawende Bissyande, Clément Lefebvre, and Anne Goujon. 2023. Letz translate: Low-resource machine translation for luxembourgish. In 2023 5th International Conference on Natural Language Processing (ICNLP), pages 165–170. IEEE.

Jörg Tiedemann and Santhosh Thottingal. 2020. Opus-mt – building open translation services for the world. In Annual Conference of the European Association for Machine Translation, pages 479–480. European Association for Machine Translation.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2023. Transformer: A general framework from machine translation to others. Machine Intelligence Research, 20(4):514–538.

Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. How do large language models handle multilingualism? In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.

Tianyang Zhong, Zhenyuan Yang, Zhengliang Liu, Ruidong Zhang, Yiheng Liu, Haiyang Sun, Yi Pan, Yiwei Li, Yifan Zhou, Hanqi Jiang, Junhao Chen, and Tianming Liu. 2024. Opportunities and challenges of large language models for low-resource languages in humanities research. Preprint, arXiv:2412.04497.

Appendix

A Data Processing

Dataset selection directly impacts the reliability and generalizability of experimental results. Our criteria include having enough test samples, providing reference responses, and minimizing potential biases from overlap with pre-training data.

FLORES-200 (Costa-jussa et al., 2022) is a benchmark dataset specifically designed for low-resource and multilingual machine translation, serving as an extended version of FLORES-101 (Goyal et al., 2021b). It covers 200 languages and consists of sentences extracted from 842 web articles, with an average length of approximately 21 words. These sentences are divided into three splits: dev, devtest, and a hidden test set. Since we require additional evaluation metrics, we use devtest as our test set in this study.
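As a minimal sketch of this data-loading step, the snippet below assembles Luxembourgish-English devtest pairs from a local copy of the released FLORES-200 archive; the folder layout and file names (e.g., ltz_Latn.devtest) are assumptions for illustration rather than a description of our exact tooling.

# Sketch: assemble LB->EN devtest pairs from a local FLORES-200 download.
# The directory layout and file names below are assumptions; adjust them
# to match the actual copy of the archive.
from pathlib import Path

flores_dir = Path("flores200_dataset/devtest")  # assumed extraction path

def read_sentences(path: Path) -> list[str]:
    # FLORES splits are plain text, one sentence per line.
    return path.read_text(encoding="utf-8").splitlines()

lb_sentences = read_sentences(flores_dir / "ltz_Latn.devtest")
en_sentences = read_sentences(flores_dir / "eng_Latn.devtest")
assert len(lb_sentences) == len(en_sentences)

devtest_pairs = list(zip(lb_sentences, en_sentences))
print(f"{len(devtest_pairs)} LB-EN devtest pairs loaded")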
In our paper, we primarily evaluate the translation performance of all 200 languages into English. However, in the subsequent model training, we focus solely on the Luxembourgish-English language pair for training and testing.

The VAL 300 validation set was constructed using 300 pieces of official news content from July 2024 as the source data. The corresponding Luxembourgish ground truth was generated using ChatGPT, followed by dictionary-based verification to ensure validity. Furthermore, we extracted 30 samples from the dataset and engaged Luxembourgish-English bilingual speakers to perform a quality assessment.
B Experiment Settings

In our experiments, we primarily used two distinct models for supervised fine-tuning (SFT) to evaluate performance and optimization strategies. To ensure an effective training process, several hyperparameters and model configurations were meticulously selected. Specifically, the warm-up ratio was set to 0.5, facilitating a gradual increase in the learning rate during the initial training phase for improved convergence stability. The maximum gradient norm was restricted to 0.3, preventing excessively large parameter updates and promoting stable optimization dynamics. Furthermore, the input sequence length was capped at 512 tokens, ensuring that all processed data adhered to this fixed-length constraint. A weight decay of 0.01 was applied to regularize the model parameters and mitigate the risk of overfitting.

It is worth noting that all of our models were trained for only one epoch. This decision was based on our observation that evaluation metrics reached their optimal performance after a single epoch, while additional epochs exacerbated the impact of noisy data without yielding performance improvements. To ensure reproducibility across experiments, a fixed random seed of 3407 was used.

For model architecture selection, two distinct approaches were considered: standard full-size fine-tuning and LoRA. In cases where LoRA was employed, specific layers were targeted for adaptation, including "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", and "down_proj". The LoRA alpha parameter was set to 8, while the dropout rate for LoRA layers was set to 0, indicating that no dropout-based regularization was applied to these low-rank adaptation layers.

For tokenization and input preparation, a standardized procedure was adopted to ensure consistency in sequence length across examples. The tokenizer processed each input field by truncating sequences exceeding the maximum length of 512 tokens and padding shorter sequences to this fixed length. This was achieved using the padding="max_length" option, thereby guaranteeing uniformity in input representation prior to model training.

During the inference stage, we set the temperature parameter to 0.1 (close to 0), which has been shown to help achieve optimal machine translation performance (Li et al., 2025a). In addition, we set max_new_tokens to 512, enable do_sample = True, and set top_p = 0.9.
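For illustration, the sketch below shows how the settings listed above could be expressed with the Hugging Face Transformers and PEFT libraries. It is a sketch rather than our training script: the LoRA rank and the learning rate are placeholders that the text above does not specify, and the full-size SFT variant simply skips the PEFT wrapping step.

# Illustrative mapping of the reported hyperparameters onto Transformers/PEFT
# objects. LoRA rank and learning rate are placeholders (not reported above).
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA variant: alpha = 8, dropout = 0, projection layers as listed above.
lora_cfg = LoraConfig(
    r=16,  # placeholder rank; not reported in the text above
    lora_alpha=8,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # skip this step for full-size SFT

training_args = TrainingArguments(
    output_dir="sft-lb-en",
    num_train_epochs=1,   # single epoch, as discussed above
    warmup_ratio=0.5,
    max_grad_norm=0.3,
    weight_decay=0.01,
    seed=3407,
    learning_rate=2e-4,   # placeholder; not reported above
)

def preprocess(example):
    # Fixed-length inputs: truncate and pad everything to 512 tokens.
    return tokenizer(example["text"], max_length=512,
                     truncation=True, padding="max_length")

# Decoding settings used at inference time:
generation_kwargs = dict(max_new_tokens=512, do_sample=True,
                         temperature=0.1, top_p=0.9)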
Model                  Reference      SFT Methods
Llama-3.2-3B-Instruct  Llama, 2024    FS / LoRA SFT
gemma-2-2b-it          Google, 2024   FS / LoRA SFT

Table 5: Various models and their SFT methods. "FS" refers to full-size SFT and "LoRA SFT" denotes Low-Rank Adaptation SFT.

C Dictionary Processing

In our approach to enhancing translation accuracy, particularly for Luxembourgish, we developed a retrieval pipeline using Haystack 2.0. The pipeline utilizes a BM25 retriever to identify relevant dictionary entries that align closely with the input text. The retrieved dictionary entries are then incorporated directly into the prompt provided to GPT-4o, offering multiple lexical choices that help clarify ambiguous terms.

This method operates as follows: first, the BM25 retriever ranks and returns the most relevant dictionary entries based on the Luxembourgish input. These entries serve as additional context within the prompt, guiding GPT-4o toward more accurate translations. Subsequently, the original Luxembourgish sentence and the relevant dictionary context are submitted to GPT-4o for translation. By explicitly integrating these dictionary options into the prompt, GPT-4o is better equipped to resolve lexical ambiguities and correct potential translation errors, enhancing translation accuracy and coherence.
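A minimal sketch of this retrieval-augmented prompting idea is shown below. Our implementation uses Haystack 2.0; the sketch substitutes the lightweight rank_bm25 package, and the dictionary entry format and prompt wording are illustrative assumptions.

# Sketch of dictionary-augmented prompting: retrieve the dictionary entries
# most relevant to a Luxembourgish sentence with BM25, then place them in the
# translation prompt. Uses rank_bm25 for brevity instead of Haystack 2.0; the
# entry format and prompt wording are illustrative assumptions.
from rank_bm25 import BM25Okapi

# Toy dictionary entries of the form "lb_term :: en_gloss" (assumed format).
dictionary_entries = [
    "Haus :: house",
    "Sprooch :: language",
    "iwwersetzen :: to translate",
]

bm25 = BM25Okapi([entry.lower().split() for entry in dictionary_entries])

def build_prompt(lb_sentence: str, top_k: int = 3) -> str:
    query = lb_sentence.lower().split()
    retrieved = bm25.get_top_n(query, dictionary_entries, n=top_k)
    context = "\n".join(retrieved)
    return (
        "You are translating Luxembourgish into English.\n"
        f"Relevant dictionary entries:\n{context}\n\n"
        f"Luxembourgish sentence: {lb_sentence}\n"
        "English translation:"
    )

# The resulting prompt is then sent to GPT-4o for translation.
print(build_prompt("Ech wëll dëse Saz iwwersetzen."))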
D Language Ability On LLMs

D.1 Translation Performance and Human Development Disparities

In this analysis, LRLs are operationally defined as those that comprise less than 0.1% of web content (according to W3Techs statistics4). The average LLMaaJ scores were calculated exclusively for the selected LRLs that also exist in the FLORES-200 dataset. Country–LRL pairs were identified based on a mapping that utilizes Wikipedia-derived estimates of language speaker distribution.

4 https://w3techs.com/technologies/overview/content_language

Figure 7: Scatter Plot of LLMaaJ Score and HDI Relation for LRLs (Average LLMaaJ Score vs. Human Development Index, by country and continent).

Figure 7 reveals a clear positive correlation between a country's human development level (HDI) and the translation quality of its low-resource languages as judged by LLMs. Each point in the scatter represents a FLORES-200 language linked to a country's HDI, and the overall trend slopes upward: higher-HDI countries tend to have languages with higher LLMaaJ translation scores. This suggests that socioeconomic factors underpin disparities in LLM translation coverage, echoing the "digital language divide" observed in AI research (Okolo and Tano, 2024). In other words, languages from more developed regions generally receive far better support in large multilingual models than those from less developed regions.

When grouping languages by development tiers, the performance gap is stark. Languages from Very High HDI countries (HDI ≥ 0.80) achieve an average LLMaaJ score of around 0.54, more than double the 0.22 average for languages from Low HDI countries (HDI < 0.55). Median scores likewise jump from only 0.15 in low-HDI settings to 0.53 in very-high-HDI settings. This means a typical low-resource language in a highly developed society enjoys significantly better machine translation quality than one in a low-development context.
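As a worked example of this aggregation, the snippet below computes the HDI-score correlation and per-tier averages from a toy table; the column names and values are illustrative assumptions, not the underlying study data.

# Sketch of the tier aggregation and correlation reported in this section.
# The dataframe columns and toy values are assumptions for illustration.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "language": ["Luxembourgish", "Occitan", "Dzongkha", "Tigrinya"],
    "hdi": [0.93, 0.91, 0.68, 0.49],          # assumed country HDI values
    "avg_llmaaj": [0.55, 0.76, 0.03, 0.07],   # assumed average LLMaaJ scores
})

# Pearson correlation between HDI and the average LLMaaJ score.
r, p_value = pearsonr(df["hdi"], df["avg_llmaaj"])
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")

# Group languages into UNDP-style HDI tiers and compare mean/median scores.
tiers = pd.cut(df["hdi"], bins=[0, 0.55, 0.70, 0.80, 1.0],
               labels=["Low", "Medium", "High", "Very High"])
print(df.groupby(tiers)["avg_llmaaj"].agg(["mean", "median"]))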
Crucially, it is not simply the number of speakers but the socioeconomic context and digital resources that dictate how well a language is served by AI. For instance, Hindi (with over 500 million speakers) has historically been treated as "low-resource" for NLP, whereas a smaller language like Dutch (with a fraction of the speakers, but backed by a high-HDI country) is well-supported. The greater availability of data and funding in high-HDI environments allows LLMs to achieve markedly better translations for those languages.

Geographic disparities are especially pronounced. Nearly all African languages in the study cluster toward the lower-left of Figure 7, indicating both low HDI and poor translation performance. In fact, none of the African languages evaluated approach the top tier of LLMaaJ scores, a finding consistent with reports that even state-of-the-art multilingual models still lag on African languages due to limited training data and quality. By contrast, European languages (from countries with generally high HDI) occupy the upper range of the plot; these languages achieve some of the highest scores (e.g., minority languages like Occitan in France reach LLMaaJ ≈ 0.76). Several Asian languages spoken in high-HDI regions likewise perform strongly; for example, Standard Malay (Malaysia/Brunei) attains average scores above 0.80 in our data. Meanwhile, many languages of low-HDI countries remain at the bottom: Dzongkha of Bhutan (medium HDI) has one of the lowest scores (LLMaaJ ≈ 0.03), and numerous Sub-Saharan African languages (e.g., Tigrinya of Eritrea) register below 0.10.
These patterns suggest that languages benefiting from a robust digital infrastructure or from close linguistic ties to well-resourced tongues (as Occitan does to French) see far better outcomes, whereas languages in impoverished or isolated settings are left behind.

Overall, the strong HDI-performance correlation highlights a systemic inequality in LLM coverage. The correlation coefficient between HDI and the average LLMaaJ score is 0.566, indicating a medium-high correlation. Communities in low-development regions face a double disadvantage: they are underserved by technology on top of existing socio-economic challenges. Indeed, globally fewer than 1% of languages have sufficient data to be considered high-resource, leaving speakers of the other 99% "essentially cut off from global technological progress". This lack of access to quality translation and language tools can hinder information access, education, and opportunities, thereby exacerbating the digital divide and reinforcing global inequalities. Our findings underscore that current multilingual AI models, despite their broad reach, de facto offer far stronger support for languages of wealthy, high-HDI communities than for those of poorer regions. Addressing this gap will require concerted efforts to bring truly inclusive language coverage to the forefront, rather than merely adding more languages without improving quality for the most disadvantaged.

D.2 Result Tables

Table 6: The LL
MaaJ results on the FLORES-200 dataset are derived from evaluations of 10 distinct large language models. Population estimates are based on heterogeneous sources, and the reported population are not guaranteed to be accurate. Therefore, they should be interpreted with appropriate caution. Language Name Language Branch Population GPT4o mini Llama- 3.1-8B Llama- 3.2-3B Ministral -8B Phi-3 Phi-3.5 Qwen2.5 -1.5B Qwen2.5 -3B gemma2 -2b gemma2 -9b Central Atlas Tamazight Berber 3-4 million 0.017 0.008 0.006 0.008 0.007 0.014 0.006 0.01 0.011 0.014 Kabyle 5 million 0.078 0.054 0.027 0.025 0.02 0.038 0.02 0.042 0.028 0.08 Tamasheq (Latin script) 500,000 0.143 0.101 0.067 0.082 0.088 0.093 0.061 0.09 0.096 0.142 Tamasheq (Tifinagh script) 500,000 0.021 0.009 0.007 0.009 0.008 0.022 0.005 0.013 0.016 0.018 Hausa Chadic 40 million 0.774 0.534 0.166 0.132 0.089 0.101 0.082 0.11 0.228 0.656 Somali Cushitic 20 million 0.735 0.257 0.112 0.143 0.077 0.121 0.063 0.107 0.112 0.5 West Central Oromo 10 million 0.617 0.079 0.067 0.047 0.028 0.051 0.023 0.07 0.035 0.121 Amharic Semitic 32 million 0.627 0.254 0.015 0.024 0.008 0.013 0.018 0.054 0.148 0.59 Hebrew 9 million 0.892 0.859 0.587 0.853 0.464 0.599 0.578 0.757 0.802 0.874 Maltese 520,000 0.892 0.793 0.551 0.428 0.237 0.261 0.202 0.311 0.627 0.855 Modern Standard Arabic 335 millions 0.881 0.858 0.792 0.847 0.573 0.799 0.771 0.832 0.814 0.863 Tigrinya 9 million 0.209 0.066 0.006 0.02 0.016 0.017 0.007 0.026 0.041 0.211 Egyptian Arabic 60 million 0.851 0.807 0.701 0.776 0.451 0.68 0.658 0.753 0.718 0.815 Mesopotamian Arabic 15 million 0.862 0.839 0.715 0.794 0.497 0.713 0.686 0.774 0.751 0.83 Moroccan Arabic 30 million 0.816 0.659 0.529 0.596 0.316 0.508 0.491 0.58 0.555 0.736 Najdi Arabic 10 million 0.861 0.868 0.772 0.826 0.542 0.775 0.751 0.817 0.788 0.842 North Levantine Arabic 20 million 0.869 0.813 0.706 0.774 0.461 0.677 0.654 0.757 0.735 0.823 South Levantine Arabic 24 million 0.875 0.824 0.714 0.788 0.485 0.715 0.673 0.767
0.816 0.659 0.529 0.596 0.316 0.508 0.491 0.58 0.555 0.736 Najdi Arabic 10 million 0.861 0.868 0.772 0.826 0.542 0.775 0.751 0.817 0.788 0.842 North Levantine Arabic 20 million 0.869 0.813 0.706 0.774 0.461 0.677 0.654 0.757 0.735 0.823 South Levantine Arabic 24 million 0.875 0.824 0.714 0.788 0.485 0.715 0.673 0.767 0.743 0.831 TaâĂĂŽizzi-Adeni Arabic 11 million 0.869 0.857 0.748 0.816 0.525 0.75 0.725 0.802 0.783 0.842 Tunisian Arabic 11 million 0.837 0.724 0.611 0.686 0.418 0.611 0.57 0.667 0.631 0.773 Khmer Khmer 16 million 0.797 0.718 0.415 0.08 0.061 0.082 0.117 0.259 0.233 0.699 Santali Munda 7.5 million 0.018 0.073 0.007 0.002 0.004 0.005 0.001 0.01 0.052 0.387 Vietnamese Vietic 76 million 0.881 0.867 0.839 0.856 0.623 0.676 0.833 0.854 0.849 0.875 Acehnese (Arabic script) Malayo-Polynesian 3.5 million 0.141 0.054 0.025 0.042 0.005 0.03 0.014 0.049 0.021 0.097 Acehnese (Latin script) 3.5 million 0.394 0.309 0.195 0.213 0.169 0.219 0.157 0.235 0.209 0.385 Balinese 3.3 million 0.652 0.542 0.375 0.322 0.274 0.298 0.249 0.35 0.383 0.624 Banjar (Arabic script) 4 million 0.179 0.083 0.039 0.054 0.008 0.045 0.019 0.05 0.021 0.093 Banjar (Latin script) 4 million 0.688 0.604 0.459 0.436 0.282 0.297 0.302 0.422 0.47 0.69 Buginese 4 million 0.346 0.228 0.161 0.172 0.161 0.188 0.133 0.194 0.198 0.296 Cebuano 21 million 0.877 0.743 0.496 0.538 0.379 0.38 0.287 0.414 0.614 0.819 Ilocano 8 million 0.765 0.526 0.33 0.265 0.239 0.245 0.162 0.255 0.372 0.672 Indonesian 43 million L1 0.894 0.883 0.859 0.871 0.814 0.815 0.841 0.869 0.869 0.889 Javanese 82 million 0.837 0.7 0.489 0.376 0.256 0.308 0.286 0.436 0.527 0.767 Minangkabau (Arabic script) 6.5 million 0.157 0.057 0.03 0.037 0.006 0.044 0.012 0.038 0.018 0.081 Minangkabau (Latin script) 6.5 million 0.671 0.618 0.422 0.365 0.251 0.265 0.26 0.383 0.416 0.704 Pangasinan 1.5 million 0.487 0.38 0.282 0.291 0.292 0.298 0.206 0.269 0.319 0.492 Plateau Malagasy 5 million 0.813 0.313 0.126 0.289 0.069 0.098 0.074 0.129 0.13
0.157 0.057 0.03 0.037 0.006 0.044 0.012 0.038 0.018 0.081 Minangkabau (Latin script) 6.5 million 0.671 0.618 0.422 0.365 0.251 0.265 0.26 0.383 0.416 0.704 Pangasinan 1.5 million 0.487 0.38 0.282 0.291 0.292 0.298 0.206 0.269 0.319 0.492 Plateau Malagasy 5 million 0.813 0.313 0.126 0.289 0.069 0.098 0.074 0.129 0.13 0.504 Standard Malay 18 million L1 0.889 0.872 0.829 0.858 0.742 0.728 0.769 0.83 0.853 0.881 Sundanese 42 million 0.854 0.687 0.464 0.414 0.286 0.325 0.324 0.45 0.47 0.748 Tagalog 28 million 0.889 0.846 0.751 0.798 0.667 0.621 0.428 0.624 0.816 0.876 Waray 3.7 million 0.856 0.679 0.447 0.552 0.386 0.408 0.297 0.403 0.553 0.79 Fijian 330,000 0.501 0.146 0.072 0.094 0.084 0.108 0.057 0.097 0.103 0.226 Maori 185,000 (L2) 0.689 0.412 0.176 0.295 0.166 0.192 0.102 0.2 0.183 0.471 Samoan 500,000 0.728 0.313 0.117 0.118 0.09 0.121 0.076 0.121 0.126 0.4 Central Aymara Aymara 2 million 0.168 0.085 0.074 0.083 0.072 0.092 0.061 0.093 0.087 0.126 Esperanto Constructed 2 million (est.) 0.89 0.869 0.798 0.865 0.714 0.707 0.574 0.708 0.807 0.878 Tok Pisin (English-based) 4 million 0.739 0.529 0.279 0.356 0.299 0.306 0.163 0.249 0.369 0.721 Haitian Creole (French-based) 10 million 0.839 0.615 0.381 0.443 0.24 0.281 0.169 0.304 0.406 0.739 Papiamento (Iberian-based) 340,000 0.831 0.702 0.505 0.536 0.426 0.439 0.352 0.504 0.499 0.783 Kabuverdianu (Portuguese-based) 1.2 million 0.786 0.587 0.436 0.496 0.38 0.412 0.319 0.459 0.454 0.672 Kannada Dravidian 44 million 0.825 0.77 0.663 0.775 0.016 0.026 0.081 0.314 0.624 0.816 Malayalam 38 million 0.845 0.797 0.664 0.777 0.015 0.027 0.102 0.341 0.663 0.844 Tamil 75 million 0.821 0.799 0.675 0.739 0.053 0.093 0.061 0.19 0.669 0.814 Telugu 81 million 0.846 0.802 0.731 0.772 0.031 0.045 0.108 0.337 0.667 0.831 Tosk Albanian Albanian 3 million 0.884 0.828 0.655 0.806 0.263 0.288 0.213 0.365 0.622 0.836 Armenian Armenian 6.7 million 0.867 0.835 0.569 0.838 0.086 0.124 0.078 0.22 0.634 0.841 Latgalian Baltic 150,000 0.581 0.361
0.061 0.19 0.669 0.814 Telugu 81 million 0.846 0.802 0.731 0.772 0.031 0.045 0.108 0.337 0.667 0.831 Tosk Albanian Albanian 3 million 0.884 0.828 0.655 0.806 0.263 0.288 0.213 0.365 0.622 0.836 Armenian Armenian 6.7 million 0.867 0.835 0.569 0.838 0.086 0.124 0.078 0.22 0.634 0.841 Latgalian Baltic 150,000 0.581 0.361 0.182 0.276 0.138 0.173 0.115 0.218 0.233 0.442 Lithuanian 3 million 0.877 0.815 0.668 0.801 0.297 0.292 0.326 0.541 0.787 0.864 Standard Latvian 1.75 million 0.886 0.822 0.665 0.812 0.322 0.35 0.353 0.59 0.785 0.872 Welsh Celtic 875,000 (L2) 0.896 0.816 0.577 0.749 0.136 0.183 0.118 0.285 0.419 0.813 Irish 1.2 million (L2) 0.86 0.731 0.428 0.58 0.107 0.137 0.082 0.21 0.249 0.72 Scottish Gaelic 60,000 0.8 0.567 0.276 0.249 0.098 0.134 0.073 0.174 0.144 0.564 Afrikaans Germanic 7 million 0.901 0.878 0.82 0.855 0.684 0.72 0.687 0.786 0.847 0.89 Danish 5.8 million 0.901 0.884 0.855 0.879 0.767 0.81 0.756 0.838 0.873 0.891 German 95 million (L1) 0.898 0.89 0.88 0.891 0.887 0.884 0.863 0.881 0.885 0.894 Limburgish 1.3 million 0.784 0.719 0.535 0.533 0.381 0.418 0.354 0.492 0.601 0.796 Eastern Yiddish 1 million 0.834 0.618 0.1 0.166 0.039 0.053 0.017 0.117 0.261 0.78 Faroese 70,000 0.845 0.639 0.417 0.491 0.254 0.279 0.183 0.317 0.375 0.709 Icelandic 350,000 0.876 0.768 0.526 0.714 0.241 0.252 0.173 0.315 0.476 0.789 Norwegian Bokmal 4 million 0.888 0.87 0.84 0.865 0.748 0.784 0.726 0.814 0.858 0.881 Norwegian Nynorsk 750,000 0.89 0.864 0.816 0.86 0.65 0.687 0.637 0.756 0.838 0.88 Swedish 10 million 0.899 0.892 0.875 0.879 0.791 0.822 0.777 0.841 0.874 0.893 Dutch 24 million 0.883 0.874 0.859 0.873 0.81 0.86 0.828 0.856 0.864 0.878 Luxembourgish 400,000 0.874 0.767 0.565 0.557 0.396 0.404 0.281 0.41 0.493 0.792 Greek Greek 13 million 0.88 0.854 0.791 0.852 0.604 0.635 0.475 0.672 0.82 0.868 Assamese Indo-Aryan 15 million 0.785 0.666 0.467 0.32 0.035 0.067 0.167 0.396 0.464 0.719 Awadhi 38 million 0.841 0.769 0.655 0.696 0.243 0.519 0.313 0.53 0.689
embourgish 400,000 0.874 0.767 0.565 0.557 0.396 0.404 0.281 0.41 0.493 0.792 Greek Greek 13 million 0.88 0.854 0.791 0.852 0.604 0.635 0.475 0.672 0.82 0.868 Assamese Indo-Aryan 15 million 0.785 0.666 0.467 0.32 0.035 0.067 0.167 0.396 0.464 0.719 Awadhi 38 million 0.841 0.769 0.655 0.696 0.243 0.519 0.313 0.53 0.689 0.796 Bengali 265 million 0.855 0.81 0.742 0.791 0.097 0.14 0.392 0.644 0.728 0.831 Bhojpuri 50 million 0.834 0.702 0.56 0.596 0.191 0.444 0.239 0.418 0.602 0.768 Chhattisgarhi 16 million 0.821 0.672 0.541 0.605 0.191 0.471 0.256 0.445 0.589 0.735 Eastern Panjabi 33 million 0.848 0.831 0.686 0.733 0.017 0.037 0.103 0.417 0.587 0.824 Gujarati 55 million 0.853 0.807 0.693 0.725 0.012 0.024 0.197 0.497 0.649 0.838 Hindi 600 million (L2) 0.871 0.841 0.806 0.832 0.408 0.727 0.49 0.705 0.822 0.862 Magahi 14 million 0.843 0.741 0.634 0.667 0.242 0.497 0.293 0.509 0.682 0.801 Maithili 35 million 0.855 0.722 0.589 0.57 0.191 0.454 0.245 0.422 0.624 0.788 Marathi 83 million 0.864 0.809 0.716 0.726 0.131 0.253 0.227 0.464 0.69 0.831 Nepali 25 million 0.851 0.75 0.576 0.717 0.205 0.375 0.233 0.465 0.688 0.825 Odia 37 million 0.796 0.692 0.242 0.027 0.014 0.025 0.055 0.365 0.041 0.637 Sanskrit 14000+ 0.624 0.536 0.389 0.41 0.18 0.31 0.165 0.327 0.341 0.596 Sindhi 32 million 0.824 0.721 0.346 0.126 0.042 0.081 0.064 0.167 0.214 0.625 -- 17 of 25 -- Sinhala 17 million 0.793 0.703 0.026 0.019 0.011 0.016 0.017 0.118 0.233 0.729 Urdu 100+ million L2 0.855 0.828 0.701 0.736 0.188 0.215 0.276 0.505 0.674 0.822 Kashmiri (Arabic script) 7 million 0.497 0.315 0.17 0.221 0.051 0.089 0.062 0.145 0.202 0.383 Kashmiri (Devanagari script) 7 million 0.411 0.213 0.146 0.191 0.069 0.132 0.073 0.144 0.16 0.299 Central Kurdish Iranian 6 million 0.594 0.763 0.224 0.071 0.014 0.026 0.033 0.099 0.127 0.574 Dari 10-12 million 0.86 0.873 0.745 0.793 0.405 0.415 0.561 0.684 0.775 0.84 Northern Kurdish 15 million 0.615 0.454 0.187 0.455 0.078 0.114 0.1 0.16 0.131 0.447 Southern Pashto 20
146 0.191 0.069 0.132 0.073 0.144 0.16 0.299 Central Kurdish Iranian 6 million 0.594 0.763 0.224 0.071 0.014 0.026 0.033 0.099 0.127 0.574 Dari 10-12 million 0.86 0.873 0.745 0.793 0.405 0.415 0.561 0.684 0.775 0.84 Northern Kurdish 15 million 0.615 0.454 0.187 0.455 0.078 0.114 0.1 0.16 0.131 0.447 Southern Pashto 20 million 0.792 0.725 0.395 0.601 0.077 0.12 0.127 0.241 0.234 0.588 Tajik 8-9 million 0.848 0.766 0.212 0.178 0.05 0.1 0.075 0.193 0.141 0.682 Western Persian 55 million 0.873 0.894 0.804 0.839 0.438 0.463 0.601 0.741 0.822 0.864 Catalan Romance 4 million 0.895 0.885 0.851 0.88 0.781 0.792 0.785 0.843 0.859 0.886 French 80+ million (L1) 0.896 0.891 0.885 0.892 0.892 0.889 0.881 0.887 0.886 0.894 Friulian 600,000 0.796 0.689 0.501 0.577 0.45 0.46 0.376 0.504 0.492 0.751 Galician 2.4 million 0.893 0.869 0.84 0.875 0.832 0.827 0.804 0.85 0.853 0.883 Italian 65 million 0.891 0.882 0.872 0.887 0.884 0.879 0.863 0.875 0.878 0.889 Ligurian 500,000 0.759 0.65 0.493 0.581 0.499 0.498 0.394 0.538 0.522 0.731 Lombard 3.5 million (est.) 0.817 0.663 0.49 0.597 0.447 0.458 0.348 0.503 0.504 0.747 Occitan 2 million 0.889 0.847 0.765 0.806 0.698 0.692 0.622 0.731 0.73 0.858 Portuguese 230 million 0.899 0.891 0.879 0.892 0.888 0.884 0.873 0.883 0.886 0.892 Romanian 24 million 0.898 0.889 0.867 0.873 0.729 0.77 0.754 0.829 0.867 0.893 Sardinian 1 million 0.758 0.68 0.505 0.538 0.426 0.426 0.34 0.476 0.51 0.746 Spanish 483 million L1 0.887 0.877 0.866 0.883 0.877 0.876 0.863 0.875 0.877 0.885 Venetian 2 million 0.858 0.792 0.677 0.772 0.614 0.612 0.542 0.695 0.703 0.842 Asturian 400,000 0.864 0.844 0.78 0.814 0.727 0.73 0.677 0.749 0.797 0.861 Sicilian 4.7 million 0.829 0.704 0.537 0.628 0.419 0.454 0.343 0.509 0.544 0.782 Belarusian Slavic 6.5 million 0.865 0.815 0.651 0.812 0.171 0.223 0.333 0.567 0.744 0.846 Russian 150 million (L1) 0.889 0.883 0.86 0.884 0.791 0.846 0.855 0.872 0.867 0.888 Ukrainian 35 million 0.892 0.875 0.822 0.873 0.616 0.762 0.729 0.818 0.858
.7 million 0.829 0.704 0.537 0.628 0.419 0.454 0.343 0.509 0.544 0.782 Belarusian Slavic 6.5 million 0.865 0.815 0.651 0.812 0.171 0.223 0.333 0.567 0.744 0.846 Russian 150 million (L1) 0.889 0.883 0.86 0.884 0.791 0.846 0.855 0.872 0.867 0.888 Ukrainian 35 million 0.892 0.875 0.822 0.873 0.616 0.762 0.729 0.818 0.858 0.885 Bosnian 3 million 0.895 0.869 0.804 0.871 0.612 0.576 0.644 0.788 0.823 0.883 Bulgarian 8 million 0.891 0.869 0.821 0.865 0.624 0.635 0.728 0.812 0.856 0.883 Croatian 5.6 million 0.891 0.87 0.826 0.866 0.595 0.563 0.646 0.781 0.828 0.88 Macedonian 2 million 0.89 0.858 0.762 0.858 0.432 0.45 0.592 0.742 0.797 0.872 Serbian 6.5 million 0.893 0.875 0.801 0.86 0.423 0.456 0.585 0.753 0.825 0.884 Slovenian 2.1 million 0.889 0.85 0.767 0.839 0.531 0.518 0.578 0.727 0.819 0.878 Czech 10.5 million 0.892 0.882 0.856 0.87 0.697 0.771 0.779 0.847 0.862 0.887 Polish 38 million 0.885 0.873 0.846 0.867 0.714 0.763 0.777 0.847 0.861 0.881 Silesian 1 million 0.808 0.698 0.557 0.592 0.362 0.401 0.38 0.541 0.587 0.784 Slovak 5.2 million 0.892 0.864 0.802 0.862 0.602 0.693 0.689 0.807 0.852 0.882 Japanese Japonic 125 million 0.878 0.858 0.825 0.851 0.761 0.819 0.799 0.846 0.833 0.869 Georgian South Caucasian 4 million 0.856 0.776 0.449 0.801 0.104 0.138 0.137 0.273 0.541 0.794 Korean Koreanic 81 million 0.875 0.843 0.786 0.842 0.573 0.766 0.76 0.823 0.792 0.861 Basque Isolate 750,000 0.865 0.79 0.563 0.786 0.184 0.233 0.128 0.24 0.558 0.832 Halh Mongolian Eastern Mongolic 3 million 0.834 0.699 0.151 0.514 0.042 0.084 0.065 0.136 0.147 0.613 Wolof Atlantic 10 million 0.3 0.141 0.088 0.109 0.107 0.147 0.08 0.12 0.11 0.173 Nigerian Fulfulde 14 million 0.191 0.105 0.061 0.072 0.075 0.092 0.05 0.085 0.081 0.128 Bemba Bantu 4 million 0.302 0.13 0.092 0.107 0.098 0.11 0.068 0.103 0.124 0.249 Chokwe 1.3 million 0.147 0.096 0.071 0.077 0.075 0.117 0.062 0.092 0.098 0.136 Ganda 7 million 0.45 0.156 0.091 0.107 0.08 0.092 0.065 0.097 0.099 0.247 Kamba 4 million 0.202 0.126
0.191 0.105 0.061 0.072 0.075 0.092 0.05 0.085 0.081 0.128 Bemba Bantu 4 million 0.302 0.13 0.092 0.107 0.098 0.11 0.068 0.103 0.124 0.249 Chokwe 1.3 million 0.147 0.096 0.071 0.077 0.075 0.117 0.062 0.092 0.098 0.136 Ganda 7 million 0.45 0.156 0.091 0.107 0.08 0.092 0.065 0.097 0.099 0.247 Kamba 4 million 0.202 0.126 0.087 0.095 0.098 0.118 0.068 0.108 0.101 0.171 Kikongo 7 million 0.267 0.118 0.074 0.103 0.101 0.11 0.076 0.12 0.112 0.189 Kikuyu 8 million 0.239 0.158 0.095 0.116 0.112 0.139 0.085 0.119 0.122 0.199 Kimbundu 3 million 0.133 0.077 0.056 0.075 0.071 0.087 0.054 0.077 0.082 0.125 Kinyarwanda 12 million 0.788 0.296 0.096 0.098 0.071 0.091 0.068 0.115 0.114 0.494 Lingala 8-10 million 0.554 0.156 0.095 0.134 0.117 0.135 0.094 0.141 0.118 0.225 Luba-Kasai 6.5 million 0.201 0.1 0.083 0.115 0.104 0.125 0.087 0.112 0.121 0.188 Northern Sotho 5 million 0.632 0.205 0.104 0.117 0.103 0.124 0.092 0.148 0.118 0.38 Nyanja 12 million 0.7 0.215 0.11 0.129 0.101 0.127 0.086 0.133 0.166 0.436 Rundi 9 million 0.679 0.194 0.083 0.083 0.07 0.086 0.062 0.113 0.101 0.322 Shona 11 million 0.764 0.208 0.103 0.149 0.095 0.124 0.086 0.123 0.143 0.531 Southern Sotho 5.6 million 0.744 0.196 0.095 0.1 0.089 0.111 0.087 0.136 0.125 0.461 Swahili 100+ million L2 0.857 0.768 0.665 0.602 0.212 0.233 0.09 0.188 0.736 0.839 Swati 2.5 million 0.55 0.168 0.111 0.112 0.081 0.103 0.073 0.122 0.116 0.382 Tsonga 3 million 0.525 0.15 0.081 0.095 0.082 0.108 0.057 0.092 0.096 0.242 Tswana 5 million 0.624 0.193 0.092 0.104 0.088 0.111 0.075 0.122 0.113 0.377 Tumbuka 2 million 0.504 0.166 0.094 0.105 0.089 0.114 0.069 0.114 0.125 0.284 Umbundu 6 million 0.135 0.076 0.063 0.069 0.064 0.086 0.045 0.078 0.087 0.122 Xhosa 8.2 million 0.776 0.248 0.124 0.154 0.103 0.132 0.077 0.139 0.192 0.612 Zulu 12 million 0.799 0.264 0.101 0.111 0.082 0.107 0.095 0.127 0.168 0.619 Fon Gbe 1.7 million 0.108 0.075 0.054 0.065 0.068 0.079 0.041 0.062 0.075 0.107 Ewe 7 million 0.138 0.097 0.071 0.08 0.068 0.083 0.054
0.086 0.045 0.078 0.087 0.122 Xhosa 8.2 million 0.776 0.248 0.124 0.154 0.103 0.132 0.077 0.139 0.192 0.612 Zulu 12 million 0.799 0.264 0.101 0.111 0.082 0.107 0.095 0.127 0.168 0.619 Fon Gbe 1.7 million 0.108 0.075 0.054 0.065 0.068 0.079 0.041 0.062 0.075 0.107 Ewe 7 million 0.138 0.097 0.071 0.08 0.068 0.083 0.054 0.074 0.077 0.124 Kabiye Gur 1.2 million 0.099 0.101 0.065 0.072 0.051 0.074 0.035 0.061 0.078 0.138 Mossi 7.5 million 0.124 0.076 0.064 0.077 0.066 0.081 0.057 0.076 0.077 0.117 Akan Kwa 11 million 0.511 0.201 0.109 0.127 0.128 0.148 0.088 0.135 0.147 0.306 Twi 17 million 0.504 0.226 0.133 0.14 0.129 0.161 0.09 0.143 0.158 0.341 Bambara Mande 14 million 0.119 0.086 0.067 0.076 0.069 0.094 0.051 0.077 0.084 0.12 Dyula 3 million 0.12 0.066 0.054 0.073 0.076 0.097 0.051 0.074 0.073 0.105 Igbo Volta 27 million 0.691 0.397 0.137 0.091 0.074 0.092 0.063 0.078 0.148 0.483 Yoruba 28 million 0.579 0.216 0.087 0.081 0.068 0.097 0.059 0.077 0.088 0.311 Sango Ubangian 5 million (L2) 0.154 0.101 0.076 0.091 0.098 0.113 0.074 0.096 0.108 0.145 Luo Nilotic 4.2 million 0.169 0.087 0.068 0.08 0.094 0.1 0.066 0.078 0.086 0.139 Nuer 1.4 million 0.065 0.038 0.033 0.036 0.023 0.037 0.02 0.05 0.038 0.065 Southwestern Dinka 2 million 0.134 0.111 0.089 0.096 0.096 0.11 0.072 0.098 0.107 0.136 Central Kanuri (Arabic script) Saharan 4 million 0.043 0.02 0.01 0.019 0.017 0.027 0.011 0.017 0.015 0.026 Central Kanuri (Latin script) 4 million 0.153 0.1 0.073 0.092 0.112 0.12 0.074 0.104 0.087 0.143 Ayacucho Quechua Quechua II 1 million 0.232 0.182 0.109 0.112 0.113 0.139 0.084 0.129 0.126 0.194 Chinese (Simplified) Sinitic 920 million (L1) 0.884 0.872 0.847 0.871 0.775 0.829 0.859 0.868 0.855 0.878 Chinese (Traditional) 31 million 0.881 0.861 0.825 0.857 0.714 0.807 0.847 0.855 0.842 0.871 Yue Chinese 60 million 0.884 0.896 0.828 0.858 0.724 0.8 0.84 0.862 0.846 0.873 Burmese Tibeto-Burman 33 million 0.748 0.672 0.075 0.616 0.021 0.033 0.033 0.094 0.178 0.638 Dzongkha 700,000
0.775 0.829 0.859 0.868 0.855 0.878 Chinese (Traditional) 31 million 0.881 0.861 0.825 0.857 0.714 0.807 0.847 0.855 0.842 0.871 Yue Chinese 60 million 0.884 0.896 0.828 0.858 0.724 0.8 0.84 0.862 0.846 0.873 Burmese Tibeto-Burman 33 million 0.748 0.672 0.075 0.616 0.021 0.033 0.033 0.094 0.178 0.638 Dzongkha 700,000 0.068 0.11 0.004 0.007 0.004 0.008 0.001 0.005 0.006 0.119 Jingpho 900,000 0.131 0.093 0.075 0.08 0.084 0.106 0.065 0.097 0.072 0.111 Meitei (Bengali script) 1.8 million 0.155 0.065 0.046 0.061 0.012 0.031 0.02 0.052 0.043 0.129 Mizo 900,000 0.334 0.325 0.203 0.185 0.189 0.217 0.158 0.219 0.328 0.593 Standard Tibetan 1.2 million 0.103 0.185 0.011 0.007 0.012 0.014 0.01 0.015 0.018 0.191 Shan Tai 3 million 0.128 0.417 0.085 0.092 0.107 0.132 0.08 0.1 0.118 0.191 Lao 7.5 million 0.658 0.384 0.073 0.081 0.069 0.093 0.071 0.132 0.125 0.521 Thai 36 million 0.879 0.868 0.819 0.828 0.451 0.591 0.773 0.831 0.818 0.872 Guarani Tupi 6-7 million 0.547 0.269 0.186 0.181 0.182 0.221 0.14 0.198 0.207 0.331 -- 18 of 25 -- Northern Uzbek Karluk 27 million 0.866 0.765 0.539 0.733 0.115 0.151 0.168 0.349 0.501 0.787 Uyghur 10 million 0.773 0.674 0.157 0.12 0.011 0.032 0.023 0.11 0.026 0.44 Bashkir Kipchak 1.2 million 0.837 0.762 0.311 0.463 0.128 0.192 0.143 0.243 0.384 0.746 Crimean Tatar 300,000 0.765 0.609 0.42 0.518 0.175 0.257 0.215 0.366 0.418 0.705 Kazakh 13 million 0.868 0.788 0.399 0.755 0.102 0.149 0.187 0.325 0.498 0.808 Kyrgyz 4.5 million 0.827 0.731 0.333 0.655 0.086 0.15 0.162 0.278 0.308 0.709 Tatar 5 million 0.863 0.776 0.376 0.715 0.112 0.177 0.158 0.266 0.375 0.739 North Azerbaijani Oghuz 9-10 million 0.837 0.776 0.618 0.749 0.21 0.262 0.267 0.491 0.636 0.804 South Azerbaijani 15-20 million 0.572 0.437 0.236 0.413 0.065 0.117 0.094 0.146 0.273 0.546 Turkish 75 million 0.884 0.857 0.809 0.82 0.497 0.614 0.625 0.775 0.825 0.878 Turkmen 7 million 0.834 0.538 0.289 0.287 0.102 0.153 0.115 0.211 0.257 0.656 Estonian Finnic 1.1 million 0.89 0.838 0.708
62 0.267 0.491 0.636 0.804 South Azerbaijani 15-20 million 0.572 0.437 0.236 0.413 0.065 0.117 0.094 0.146 0.273 0.546 Turkish 75 million 0.884 0.857 0.809 0.82 0.497 0.614 0.625 0.775 0.825 0.878 Turkmen 7 million 0.834 0.538 0.289 0.287 0.102 0.153 0.115 0.211 0.257 0.656 Estonian Finnic 1.1 million 0.89 0.838 0.708 0.811 0.175 0.222 0.314 0.531 0.777 0.869 Finnish 5.4 million 0.89 0.867 0.805 0.843 0.453 0.606 0.42 0.61 0.821 0.881 Hungarian Ugric 13 million 0.887 0.871 0.839 0.852 0.486 0.641 0.399 0.61 0.829 0.879 Table 7: The Corpus BLEU results on the FLORES-200 dataset are derived from evaluations of 10 distinct large language models. Population estimates are based on heterogeneous sources, and the reported population are not guaranteed to be accurate. Therefore, they should be interpreted with appropriate caution. Language Name Language Branch Population GPT4o Mini Llama 3.1 8B Llama 3.2 3B Ministral 8B Phi-3 Phi-3.5 Qwen2.5 1.5B Qwen2.5 3B gemma-2 2B gemma-2 9B Central Atlas Tamazight Berber 3-4 million 1.4 0.4 0.4 0.2 1.0 0.8 0.2 0.8 0.4 1.4 Kabyle 5 million 4.0 3.3 1.4 0.9 1.7 0.7 0.5 1.5 1.4 4.3 Tamasheq (Latin script) 500,000 5.2 3.9 2.7 1.9 4.3 1.7 1.0 3.4 3.3 4.9 Tamasheq (Tifinagh script) 500,000 1.3 0.4 0.3 0.2 1.0 0.7 0.1 0.5 0.6 1.1 Hausa Chadic 40 million 30.4 20.0 7.5 2.9 3.9 1.6 1.5 4.5 8.9 25.9 Somali Cushitic 20 million 26.6 10.8 5.3 3.2 4.0 1.3 1.9 4.0 4.2 19.1 West Central Oromo 10 million 17.2 3.5 1.9 0.9 1.7 0.7 0.3 1.5 1.1 4.2 Amharic Semitic 32 million 18.0 8.4 1.1 0.4 1.0 0.8 0.6 2.7 4.8 19.1 Hebrew 9 million 43.6 36.4 21.2 36.9 18.1 9.3 22.3 31.7 33.1 42.6 Maltese 520,000 51.8 41.1 26.1 16.8 9.1 3.6 4.4 12.2 28.3 49.4 Modern Standard Arabic 330 million 39.2 30.1 29.5 33.9 19.0 16.0 27.2 32.6 31.3 38.6 Modern Standard Arabic (Romanized) 330 million 25.1 10.1 4.5 4.8 2.9 1.3 1.3 6.3 2.2 14.2 Tigrinya 9 million 4.7 1.8 0.7 0.3 0.7 0.7 0.2 1.3 1.1 5.5 Egyptian Arabic 60 million 30.9 11.6 21.6 24.9 13.0 10.5 18.4 23.6 21.7
2.2 28.3 49.4 Modern Standard Arabic 330 million 39.2 30.1 29.5 33.9 19.0 16.0 27.2 32.6 31.3 38.6 Modern Standard Arabic (Romanized) 330 million 25.1 10.1 4.5 4.8 2.9 1.3 1.3 6.3 2.2 14.2 Tigrinya 9 million 4.7 1.8 0.7 0.3 0.7 0.7 0.2 1.3 1.1 5.5 Egyptian Arabic 60 million 30.9 11.6 21.6 24.9 13.0 10.5 18.4 23.6 21.7 29.5 Mesopotamian Arabic 15 million 33.8 12.2 23.0 26.7 14.9 12.5 20.8 25.9 24.7 31.9 Moroccan Arabic 30 million 29.1 13.7 17.0 18.1 9.9 7.3 13.2 18.4 16.3 25.7 Najdi Arabic 10 million 38.5 19.3 29.0 32.5 17.8 19.6 25.7 31.1 30.1 37.4 North Levantine Arabic 20 million 37.5 15.9 25.0 27.8 15.1 12.5 21.2 27.4 25.0 34.4 South Levantine Arabic 24 million 40.5 15.5 27.1 31.3 17.3 12.7 23.7 30.3 28.1 37.3 Taâizzi-Adeni Arabic 11 million 35.6 11.2 25.6 29.2 16.3 15.7 23.3 28.0 27.3 33.9 Tunisian Arabic 11 million 30.7 15.3 19.9 22.2 12.8 10.0 17.5 21.8 19.9 28.1 Khmer Khmer 16 million 25.3 17.4 12.5 2.0 3.1 1.7 3.5 9.2 6.3 22.3 Santali Munda 7.5 million 0.7 3.9 0.5 0.1 0.4 0.3 0.1 0.1 2.1 12.7 Vietnamese Vietic 76 million 35.8 33.4 30.0 31.4 19.7 12.5 28.6 32.1 29.7 36.6 Acehnese (Arabic script) Malayo-Polynesian 3.5 million 4.8 1.5 1.0 0.9 0.6 0.5 0.4 1.6 0.5 3.1 Acehnese (Latin script) 3.5 million 12.7 10.7 6.9 5.4 6.1 2.8 2.7 6.2 6.2 13.5 Balinese 3.3 million 22.9 17.9 12.4 8.0 8.5 3.6 4.9 10.1 11.9 22.4 Banjar (Arabic script) 4 million 6.2 1.4 1.2 0.8 0.6 0.5 0.4 1.9 0.5 3.1 Banjar (Latin script) 4 million 24.9 22.4 15.9 12.7 10.0 4.7 7.3 14.4 15.8 27.1 Buginese 4 million 10.2 6.7 5.2 4.5 5.1 2.6 2.7 5.9 6.0 9.4 Cebuano 21 million 42.8 32.6 20.7 19.4 14.3 5.6 9.3 16.3 24.1 39.2 Ilocano 8 million 29.2 20.5 13.6 7.2 8.4 3.8 4.1 9.3 12.6 26.5 Indonesian 43 million L1 44.4 40.9 37.0 38.0 32.4 22.9 33.5 37.3 38.0 44.9 Javanese 82 million 37.7 27.2 18.1 10.3 8.3 3.0 6.7 14.2 18.1 33.4 Minangkabau (Arabic script) 6.5 million 5.7 1.3 0.8 0.7 0.6 0.5 0.3 1.3 0.3 2.9 Minangkabau (Latin script) 6.5 million 24.9 23.1 16.0 9.8 8.9 4.3 6.9 12.4 13.4 27.8 Pangasinan
5 Indonesian 43 million L1 44.4 40.9 37.0 38.0 32.4 22.9 33.5 37.3 38.0 44.9 Javanese 82 million 37.7 27.2 18.1 10.3 8.3 3.0 6.7 14.2 18.1 33.4 Minangkabau (Arabic script) 6.5 million 5.7 1.3 0.8 0.7 0.6 0.5 0.3 1.3 0.3 2.9 Minangkabau (Latin script) 6.5 million 24.9 23.1 16.0 9.8 8.9 4.3 6.9 12.4 13.4 27.8 Pangasinan 1.5 million 17.8 14.7 11.7 9.7 10.6 5.4 5.8 10.3 11.0 18.1 Plateau Malagasy 5 million 27.4 11.0 5.2 9.5 3.7 1.5 1.5 3.9 4.5 17.1 Standard Malay 18 million L1 44.5 38.6 34.9 37.7 28.4 17.1 30.1 35.3 36.7 44.5 Sundanese 42 million 35.7 23.5 15.0 10.2 8.0 3.0 6.8 13.6 14.6 29.2 Tagalog 28 million 45.4 40.2 32.5 32.7 24.9 17.8 14.6 26.1 34.7 44.9 Waray 3.7 million 43.3 30.2 18.8 21.4 13.0 6.0 8.5 17.1 21.4 38.1 Fijian 330,000 13.3 5.9 3.5 3.0 3.7 1.5 1.5 3.7 3.6 8.9 Maori 50,000 L1 23.1 14.5 7.8 9.5 7.5 1.4 3.8 8.2 7.1 16.8 Samoan 500,000 26.2 12.5 5.9 3.9 4.5 1.3 1.9 4.6 4.4 16.0 Central Aymara Aymara 2 million 5.7 2.8 2.8 2.3 3.5 1.5 1.0 2.8 2.6 4.8 Esperanto N/A 45.1 40.3 35.2 40.6 30.2 14.0 23.7 30.5 35.1 44.3 Tok Pisin (English-based) 120,000 L1 19.8 15.2 9.9 11.4 10.4 2.9 3.7 8.0 11.2 22.6 Haitian Creole (French-based) 10 million 37.8 24.7 15.3 15.7 8.5 1.9 4.2 11.3 14.9 32.2 Papiamento (Iberian-based) 340,000 42.1 32.1 21.1 19.2 15.7 5.0 10.3 19.2 18.0 38.9 Kabuverdianu (Portuguese-based) 1.2 million 39.6 24.2 17.3 18.1 14.8 5.9 9.3 17.7 16.4 31.1 Kannada South Dravidian 44 million 29.1 17.8 19.2 23.0 1.2 1.3 2.1 8.6 16.3 28.8 Malayalam 38 million 30.8 21.6 18.6 22.7 1.4 0.9 2.3 8.8 18.1 31.4 Tamil 75 million 27.7 16.0 19.3 21.3 2.5 1.8 1.9 6.8 17.4 29.0 Telugu South-Central Dravidian 81 million 34.8 25.0 23.9 25.0 2.2 1.9 3.0 9.5 19.5 33.5 Tosk Albanian Albanian 3 million 39.1 28.9 22.8 31.5 8.7 3.0 5.6 12.1 21.1 36.3 Armenian Armenian 6.7 million 37.6 28.7 18.6 31.9 3.1 1.3 2.8 8.2 20.9 35.3 Latgalian Baltic 150,000 19.5 11.3 6.3 6.9 3.9 1.4 2.1 5.9 5.5 14.7 Lithuanian 3 million 33.7 28.0 20.2 26.1 8.6 3.9 8.7 16.7 25.7 33.9 Standard Latvian 1.75
33.5 Tosk Albanian Albanian 3 million 39.1 28.9 22.8 31.5 8.7 3.0 5.6 12.1 21.1 36.3 Armenian Armenian 6.7 million 37.6 28.7 18.6 31.9 3.1 1.3 2.8 8.2 20.9 35.3 Latgalian Baltic 150,000 19.5 11.3 6.3 6.9 3.9 1.4 2.1 5.9 5.5 14.7 Lithuanian 3 million 33.7 28.0 20.2 26.1 8.6 3.9 8.7 16.7 25.7 33.9 Standard Latvian 1.75 million 36.1 28.0 20.1 27.8 8.5 3.0 9.2 18.3 27.0 35.0 Welsh Celtic 875,000 55.0 45.4 29.5 37.8 7.4 2.2 5.5 14.7 19.5 47.0 Irish Celtic (Goidelic) 170k L1 37.1 27.8 16.0 20.9 5.6 2.0 3.5 10.2 10.0 30.2 Scottish Gaelic 60,000 30.6 19.6 10.5 8.6 4.4 1.2 2.8 7.1 5.8 21.0 Afrikaans Germanic 7 million 56.7 52.7 47.2 50.4 36.0 18.6 36.1 45.0 48.7 56.5 Danish 5.8 million 48.3 45.0 40.3 44.1 35.0 30.4 34.2 40.7 43.6 48.5 German 95 million (L1) 44.0 41.3 38.7 41.3 40.0 34.9 35.6 38.4 40.4 44.1 Limburgish 1.3 million 36.4 32.9 23.2 21.6 14.8 6.3 13.1 20.7 25.5 38.2 Eastern Yiddish 1 million 49.5 25.9 7.5 9.1 3.8 1.0 0.5 7.0 14.0 45.9 Faroese 70,000 36.9 25.8 16.5 17.9 10.4 3.9 5.9 12.5 14.0 29.9 Icelandic 350,000 35.2 27.0 17.5 24.4 9.6 4.0 6.9 12.4 16.5 30.0 -- 19 of 25 -- Norwegian BokmÄl 4 million 43.5 40.6 36.8 40.1 30.6 23.8 30.3 36.3 39.2 44.0 Norwegian Nynorsk 750,000 45.0 41.1 37.2 40.5 26.4 14.4 26.4 34.0 39.4 45.0 Swedish 10 million 48.1 46.0 42.9 43.5 35.6 31.2 36.1 40.9 43.0 48.6 Dutch 24 million 31.6 29.7 28.5 29.7 25.8 25.0 25.6 28.6 29.9 32.1 Luxembourgish 400,000 46.6 34.4 23.7 22.5 14.0 5.7 7.0 15.4 19.0 38.6 Greek Greek 13 million 35.5 32.4 28.2 31.2 19.3 13.9 15.5 23.7 29.8 35.8 Assamese Indo-Aryan 15 million 26.3 15.5 12.7 6.9 1.8 1.2 4.2 9.2 11.6 23.3 Awadhi 38 million 33.0 6.0 18.6 19.0 6.8 6.1 7.7 13.7 19.1 29.3 Bengali 265 million 33.0 22.6 24.0 24.3 3.8 2.0 10.8 19.1 21.8 31.7 Bhojpuri 50 million 26.5 13.8 14.0 13.4 5.6 3.8 5.0 9.7 14.1 22.7 Chhattisgarhi 16 million 36.6 12.7 17.0 16.7 5.7 5.1 5.5 13.2 17.6 29.3 Eastern Panjabi 33 million 34.8 12.2 23.7 23.9 1.3 0.7 2.9 12.6 18.0 34.5 Gujarati 55 million 36.0 18.8 23.5 22.6 1.3 1.0
265 million 33.0 22.6 24.0 24.3 3.8 2.0 10.8 19.1 21.8 31.7 Bhojpuri 50 million 26.5 13.8 14.0 13.4 5.6 3.8 5.0 9.7 14.1 22.7 Chhattisgarhi 16 million 36.6 12.7 17.0 16.7 5.7 5.1 5.5 13.2 17.6 29.3 Eastern Panjabi 33 million 34.8 12.2 23.7 23.9 1.3 0.7 2.9 12.6 18.0 34.5 Gujarati 55 million 36.0 18.8 23.5 22.6 1.3 1.0 5.1 15.2 19.9 35.0 Hindi 600 million 38.8 33.2 29.9 30.6 12.5 16.3 13.8 23.2 30.1 39.1 Magahi 14 million 38.2 14.1 20.9 19.7 7.0 6.1 7.2 13.9 22.1 33.7 Maithili 35 million 36.9 12.0 16.1 12.6 5.1 3.3 4.9 9.4 15.3 28.4 Marathi 83 million 34.1 21.0 21.9 20.1 3.7 2.2 4.9 12.7 19.9 33.3 Nepali 25 million 37.6 24.0 17.1 22.4 5.8 4.6 5.3 13.3 20.4 34.9 Odia 37 million 27.3 21.2 5.7 0.6 1.4 1.1 1.9 9.5 1.1 18.9 Sanskrit Few thousand L1 15.7 12.7 8.6 7.3 4.3 1.9 2.8 6.7 6.5 15.4 Sindhi 32 million 35.9 8.2 11.6 2.8 1.9 0.9 1.6 4.9 5.7 24.4 Sinhala 17 million 25.8 20.0 1.0 0.4 1.0 0.6 0.6 3.7 5.3 23.1 Urdu 70 million L1 33.3 8.8 22.7 22.7 5.5 2.6 7.4 14.9 20.4 32.2 Kashmiri (Arabic script) 7 million 14.2 6.4 4.9 3.0 2.3 1.1 1.2 3.8 4.3 10.3 Kashmiri (Devanagari script) 7 million 11.3 5.1 3.9 3.0 3.4 2.0 1.2 4.0 3.5 8.1 Central Kurdish Iranian 6 million 19.3 5.9 8.1 2.2 1.1 0.6 1.1 3.3 4.1 19.7 Dari 10-12 million 37.0 10.1 27.7 29.7 12.6 4.6 17.5 24.2 28.4 36.8 Northern Kurdish 15 million 19.3 14.5 6.3 13.2 3.2 1.2 1.4 3.9 4.0 15.5 Southern Pashto 20 million 29.0 9.0 12.2 17.3 2.9 1.1 3.6 7.0 5.8 19.9 Tajik 8-9 million 30.9 11.4 6.1 4.2 2.2 1.0 1.7 5.4 3.7 23.1 Western Persian 55 million 34.8 15.6 27.8 29.7 12.6 3.7 17.5 24.6 28.4 35.8 Catalan Romance 4 million 46.4 43.2 39.6 42.3 33.1 25.0 32.8 38.9 40.6 46.6 French 80+ million (L1) 45.2 42.9 39.9 42.6 41.6 37.3 38.2 41.3 42.1 45.5 Friulian 600,000 33.7 28.2 19.3 20.1 14.8 5.0 12.2 17.5 16.9 31.8 Galician 2.4 million 41.4 37.0 33.5 36.7 33.8 24.2 30.9 34.6 36.0 40.5 Italian 65 million 32.9 31.2 29.8 31.8 30.6 27.4 27.6 30.5 31.4 34.2 Ligurian 500,000 35.1 28.3 20.3 22.6 19.2 7.0 13.1 21.0 20.7 33.7 Lombard 3.5
42.6 41.6 37.3 38.2 41.3 42.1 45.5 Friulian 600,000 33.7 28.2 19.3 20.1 14.8 5.0 12.2 17.5 16.9 31.8 Galician 2.4 million 41.4 37.0 33.5 36.7 33.8 24.2 30.9 34.6 36.0 40.5 Italian 65 million 32.9 31.2 29.8 31.8 30.6 27.4 27.6 30.5 31.4 34.2 Ligurian 500,000 35.1 28.3 20.3 22.6 19.2 7.0 13.1 21.0 20.7 33.7 Lombard 3.5 million (est.) 35.8 25.9 19.6 22.4 16.1 5.9 10.4 18.8 19.2 32.2 Occitan 2 million 52.1 46.1 38.5 40.5 31.6 11.3 25.8 35.9 34.4 47.7 Portuguese 230 million 49.8 47.3 44.1 46.7 45.0 41.5 42.0 45.1 46.1 49.9 Romanian 24 million 43.1 40.0 36.9 37.9 27.5 15.9 29.4 34.8 38.6 43.9 Sardinian 1 million 34.4 31.2 22.0 21.6 15.6 6.1 11.8 19.1 20.7 35.7 Spanish 483 million L1 30.9 28.4 27.0 29.7 28.5 23.8 26.2 27.9 29.3 31.1 Venetian 2 million 40.0 34.7 27.0 31.7 23.9 6.6 18.6 28.3 28.8 40.5 Asturian 400,000 39.8 37.5 32.9 34.7 29.2 14.9 26.0 29.7 33.1 40.1 Sicilian 4.7 million 35.5 28.9 21.7 24.4 15.3 3.8 11.4 19.1 20.1 34.4 Belarusian Slavic (East) 6.5 million 20.8 16.5 13.1 17.4 4.7 2.6 6.3 11.7 15.3 20.2 Russian 150 million (L1) 35.9 33.0 30.5 33.2 26.6 24.3 28.7 31.5 32.4 35.9 Ukrainian 35 million 39.7 36.2 31.2 35.3 22.1 21.6 24.7 31.1 34.3 39.9 Bosnian Slavic (South) 3 million 42.5 38.1 32.0 37.1 22.5 12.2 23.9 31.9 33.6 42.2 Bulgarian 8 million 40.9 37.3 33.2 35.6 22.2 17.9 25.5 31.9 35.2 41.3 Croatian 5.6 million 37.7 34.9 31.3 33.4 20.4 12.0 22.3 29.0 30.7 37.8 Macedonian 2 million 42.0 37.7 30.7 36.1 16.0 7.9 21.3 30.3 32.0 41.7 Serbian 6.5 million 43.3 39.7 33.0 36.9 15.7 7.7 21.1 30.6 34.4 42.8 Slovenian 2.1 million 35.9 30.9 26.5 29.2 17.0 9.3 17.2 24.5 28.4 35.4 Czech Slavic (West) 10.5 million 40.2 37.8 34.2 35.5 24.6 23.1 27.2 33.8 35.1 40.4 Polish 38 million 30.1 27.5 25.3 26.6 19.9 14.1 21.9 25.2 27.0 30.5 Silesian <1 million 36.1 27.4 22.5 21.9 13.0 6.0 13.5 20.7 21.7 35.2 Slovak 5.2 million 39.7 34.6 30.1 34.2 20.5 14.6 23.6 30.5 33.6 39.3 Japanese Japonic 125 million 26.5 23.2 20.5 21.9 17.8 16.6 18.9 22.4 21.7 26.3 Georgian South Caucasian
35.1 40.4 Polish 38 million 30.1 27.5 25.3 26.6 19.9 14.1 21.9 25.2 27.0 30.5 Silesian <1 million 36.1 27.4 22.5 21.9 13.0 6.0 13.5 20.7 21.7 35.2 Slovak 5.2 million 39.7 34.6 30.1 34.2 20.5 14.6 23.6 30.5 33.6 39.3 Japanese Japonic 125 million 26.5 23.2 20.5 21.9 17.8 16.6 18.9 22.4 21.7 26.3 Georgian South Caucasian 4 million 27.5 20.3 11.3 21.5 3.2 1.4 3.0 7.0 12.1 24.4 Korean Koreanic 81 million 29.3 25.1 21.1 24.4 13.9 16.5 19.4 23.8 20.9 29.0 Basque N/A 750,000 30.1 24.7 15.3 23.6 4.9 1.6 2.8 6.2 15.3 28.8 Halh Mongolian Eastern Mongolic 3 million 28.1 8.9 4.4 12.1 1.6 0.9 1.2 4.3 3.5 17.6 Wolof Atlantic 10 million 10.2 5.7 3.9 2.9 4.4 1.4 2.0 5.0 3.5 6.7 Nigerian Fulfulde AtlanticâFula 14 million 6.8 4.1 2.5 2.5 3.9 1.6 1.3 3.5 2.6 5.3 Bemba Bantu 4 million 10.4 6.1 4.3 3.9 5.5 2.1 1.8 4.5 5.1 9.9 Chokwe 1.3 million 5.7 3.5 2.9 1.9 4.0 1.6 1.5 3.1 3.2 5.0 Ganda 7 million 15.0 7.1 4.5 3.0 4.6 1.7 1.9 4.1 4.2 10.1 Kamba 4 million 7.6 5.8 4.3 2.9 4.9 1.6 1.5 4.2 3.4 6.9 Kikongo 7 million 8.8 4.4 3.2 2.6 4.4 1.7 1.4 4.4 3.5 6.0 Kikuyu 8 million 8.2 5.7 3.3 3.2 4.8 1.9 1.3 3.8 3.8 6.5 Kimbundu 3 million 6.0 3.3 2.6 2.3 3.6 1.4 1.2 3.5 3.4 5.5 Kinyarwanda 12 million 27.7 11.3 4.6 3.5 4.1 1.2 1.4 3.8 4.6 17.9 Lingala 8-10 million 16.0 5.8 4.2 3.9 4.9 1.5 1.9 4.7 3.7 7.8 Luba-Kasai 6.5 million 7.7 3.8 2.7 2.9 4.1 2.0 1.8 4.4 3.9 6.8 Northern Sotho 5 million 27.9 9.9 5.4 3.6 4.7 1.8 1.3 5.0 4.4 18.0 Nyanja 12 million 21.9 8.7 4.4 3.8 4.7 1.5 2.3 5.4 6.1 15.3 Rundi 9 million 18.0 6.8 3.6 2.4 3.2 1.4 1.3 3.4 3.1 10.3 Shona 11 million 23.7 8.7 4.6 3.4 4.9 1.7 1.5 5.3 5.4 17.7 Southern Sotho 5.6 million 29.0 9.3 5.0 3.3 5.0 1.6 1.2 4.9 4.4 18.5 Swahili 16 million L1 43.1 35.0 28.8 23.8 8.5 1.5 3.4 9.2 29.5 42.3 Swati 2.5 million 18.2 7.3 4.2 3.3 4.0 1.7 1.6 4.6 3.6 14.1 Tsonga 3 million 18.6 7.3 4.3 3.0 4.7 1.7 1.7 4.1 3.5 9.9 Tswana 5 million 19.5 7.5 4.4 2.7 4.2 1.6 1.0 4.1 4.1 12.9 Tumbuka 2 million 11.7 6.2 3.7 3.2 4.3 1.5 1.4 4.1 4.4 8.6 Umbundu 6 million 5.5 3.0
million L1 43.1 35.0 28.8 23.8 8.5 1.5 3.4 9.2 29.5 42.3 Swati 2.5 million 18.2 7.3 4.2 3.3 4.0 1.7 1.6 4.6 3.6 14.1 Tsonga 3 million 18.6 7.3 4.3 3.0 4.7 1.7 1.7 4.1 3.5 9.9 Tswana 5 million 19.5 7.5 4.4 2.7 4.2 1.6 1.0 4.1 4.1 12.9 Tumbuka 2 million 11.7 6.2 3.7 3.2 4.3 1.5 1.4 4.1 4.4 8.6 Umbundu 6 million 5.5 3.0 2.7 2.2 3.6 1.3 1.0 3.1 3.0 5.0 Xhosa 8.2 million 31.8 10.5 5.4 4.6 5.1 1.5 1.6 5.6 6.8 25.0 Zulu 12 million 33.4 11.1 4.6 3.2 4.2 1.5 1.4 4.7 5.1 24.7 Fon Gbe 1.7 million 3.7 2.4 1.7 1.4 2.8 1.2 0.9 2.3 2.2 3.5 Ewe 7 million 5.1 2.9 2.5 2.1 3.3 1.3 0.8 2.4 2.2 4.3 KabiyĂš Gur 1.2 million 3.8 3.1 1.9 1.6 2.7 1.2 0.5 2.2 2.2 4.5 Mossi 7.5 million 4.5 2.7 2.3 2.4 3.3 1.1 1.4 3.0 2.9 4.5 Akan Kwa 11 million 13.4 7.5 5.0 3.6 5.9 2.2 1.5 5.2 5.3 10.4 Twi 17 million 14.6 9.0 5.4 3.4 5.8 2.3 1.6 5.4 5.6 11.8 Bambara Mande 14 million 5.8 3.0 2.6 2.4 3.9 1.1 1.0 3.7 3.0 5.0 Dyula 3 million 4.2 2.0 1.6 1.8 3.0 1.0 0.8 2.6 2.6 3.6 Igbo VoltaâNiger 27 million 24.0 14.2 5.7 1.6 3.5 1.6 0.9 3.7 5.7 17.6 -- 20 of 25 -- Yoruba 28 million 17.3 8.6 3.9 2.8 3.5 1.2 1.7 4.4 3.4 11.0 Sango Creolized Ubangian 400,000 L1 4.7 3.0 2.3 2.4 3.6 1.1 1.4 3.3 2.7 4.1 Luo Nilotic 4.2 million 6.3 3.6 3.3 2.9 3.9 1.6 1.7 3.9 3.2 5.3 Nuer 1.4 million 3.4 2.0 1.8 1.1 2.2 0.9 0.6 1.7 1.8 3.0 Southwestern Dinka 2 million 6.1 5.0 3.8 3.5 5.0 2.0 1.8 4.0 4.5 6.0 Central Kanuri (Arabic script) Saharan 4 million 2.2 1.1 0.7 0.6 0.9 0.6 0.3 1.3 0.5 1.4 Central Kanuri (Latin script) 4 million 5.9 3.1 2.8 2.9 4.9 2.3 1.2 4.0 2.6 5.3 Ayacucho Quechua Quechua 1 million 6.3 5.6 3.7 2.7 4.3 2.0 1.2 3.6 3.4 5.5 Chinese (Simplified) Sinitic 920 million 28.8 25.4 23.9 24.8 19.8 19.7 24.5 26.4 24.5 28.6 Chinese (Traditional) 31 million 27.4 23.8 21.8 23.4 17.3 16.5 22.5 25.0 22.0 27.3 Yue Chinese 60 million 29.6 14.8 23.5 25.7 19.6 15.7 24.6 26.7 23.6 29.5 Burmese Tibeto-Burman 33 million 21.5 12.1 2.1 14.3 1.3 0.9 1.3 4.2 4.0 17.7 Dzongkha 700,000 0.8 1.5 0.1 0.0 0.1 0.1 0.0 0.3 0.1 1.6 Jingpho
Chunk 57 · 1,993 chars
7 24.5 26.4 24.5 28.6 Chinese (Traditional) 31 million 27.4 23.8 21.8 23.4 17.3 16.5 22.5 25.0 22.0 27.3 Yue Chinese 60 million 29.6 14.8 23.5 25.7 19.6 15.7 24.6 26.7 23.6 29.5 Burmese Tibeto-Burman 33 million 21.5 12.1 2.1 14.3 1.3 0.9 1.3 4.2 4.0 17.7 Dzongkha 700,000 0.8 1.5 0.1 0.0 0.1 0.1 0.0 0.3 0.1 1.6 Jingpho 900,000 4.0 2.5 1.8 1.8 2.7 1.4 0.9 2.5 2.3 3.9 Meitei (Bengali script) 1.8 million 4.4 1.9 1.8 1.0 0.8 0.7 0.3 1.8 0.9 4.1 Mizo 900,000 9.3 8.6 6.8 5.2 5.9 3.1 2.7 5.4 8.3 14.2 Standard Tibetan 1.2 million 1.9 3.5 0.4 0.1 0.6 0.5 0.3 0.7 0.5 3.8 Shan Southwestern Tai 3 million 4.0 6.0 1.7 1.1 2.4 1.7 0.7 1.6 3.2 5.1 Lao Tai 7.5 million 20.1 10.3 2.2 2.1 3.5 2.5 1.8 6.3 3.7 17.8 Thai 36 million 29.6 21.0 23.6 23.0 11.4 10.6 20.1 25.1 23.7 30.6 Guarani TupiâGuarani 6-7 million 16.1 8.9 5.6 4.3 5.6 1.8 2.0 5.5 5.7 10.4 Northern Uzbek Karluk 27 million 32.2 21.5 14.0 21.0 3.3 1.0 3.7 8.7 12.0 28.5 Uyghur 10 million 20.3 7.3 4.4 3.0 0.8 0.4 0.6 2.9 1.5 11.0 Bashkir Kipchak 1.2 million 27.4 16.3 7.9 10.2 3.5 1.2 2.6 6.0 8.7 23.1 Crimean Tatar 300,000 24.6 16.9 11.7 13.8 5.6 2.4 4.9 9.7 11.3 23.0 Kazakh 13 million 33.8 19.6 11.6 20.9 3.1 1.5 4.5 9.3 12.3 28.6 Kyrgyz 4.5 million 22.6 11.1 7.6 13.9 2.5 1.1 3.1 6.4 6.6 17.9 Tatar 5 million 29.1 13.9 10.2 19.1 3.5 1.4 3.0 7.2 8.8 23.3 North Azerbaijani Oghuz 9-10 million 22.8 13.2 13.9 17.2 5.0 2.5 5.0 10.3 13.3 21.7 South Azerbaijani 15-20 million 14.7 5.4 5.6 8.9 2.3 0.9 1.3 3.7 5.5 14.4 Turkish 75 million 37.9 33.4 27.3 28.9 12.8 9.3 18.5 26.0 28.4 37.9 Turkmen 7 million 29.2 15.5 8.7 6.7 3.2 1.6 2.1 5.6 5.9 21.3 Estonian Finnic 1.1 million 38.2 31.3 23.2 28.7 6.2 2.4 8.9 17.5 26.6 36.6 Finnish 5.4 million 35.0 30.5 26.0 28.5 12.2 10.0 11.8 19.6 26.6 34.0 Hungarian Ugric 13 million 35.5 31.7 28.4 29.3 13.8 11.5 11.3 19.6 28.3 35.5 -- 21 of 25 -- E Dataset Size Influence Table 9 in the appendix presents a comprehensive analysis of how dataset size influences translation performance in our low-resource
E Dataset Size Influence

Table 9 in the appendix presents a comprehensive analysis of how dataset size influences translation performance in our low-resource Luxembourgish-English setting. We experimented with dataset sizes ranging from as small as 1% to the full dataset (100%). The results demonstrate a clear, positive correlation between the amount of data used during fine-tuning and the resulting translation quality, as measured by BLEU scores.

In both translation directions (EN→LB and LB→EN), we observed that even very small datasets (e.g., 1%–5%) provide measurable improvements over the baseline models, indicating that the models begin acquiring useful linguistic patterns early in the fine-tuning process. However, substantial performance gains occur predominantly when the dataset size is increased beyond 25%. For instance, moving from 25% to the full dataset nearly doubles the SPBLEU scores in the EN→LB direction, clearly highlighting the importance of sufficient data availability for generating fluent, accurate translations in low-resource languages.

Interestingly, the Gemma-2-2B model displayed a relatively faster learning trajectory than the Llama-3.2-3B model in the smaller data regimes (below 50%). Nevertheless, Gemma-2-2B exhibited a notable attenuation of performance improvements beyond the 50% data threshold, suggesting diminishing returns as the dataset grows. Conversely, the Llama-3.2-3B model showed steadier improvements without significant attenuation up to the full dataset size, potentially indicating better scalability of its linguistic capabilities with increased training data.
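To make the ratio ablation concrete, the sketch below shows one way such subsets could be drawn before fine-tuning. It is illustrative only: the file name en_lb_distilled.jsonl and the per-line JSON layout are assumptions, not the actual data pipeline used in this work.

import json
import random

def subsample(path, ratio, seed=42):
    # Load a parallel corpus stored as one JSON object per line and keep a random fraction.
    with open(path, encoding="utf-8") as f:
        pairs = [json.loads(line) for line in f]
    random.Random(seed).shuffle(pairs)
    k = max(1, int(len(pairs) * ratio))
    return pairs[:k]

# Ratios mirroring Table 9: 1%, 10%, 50%, and 100% of the distilled training set.
for ratio in (0.01, 0.10, 0.50, 1.00):
    subset = subsample("en_lb_distilled.jsonl", ratio)  # hypothetical file name
    print(f"ratio={ratio:.0%}: {len(subset)} training pairs")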
F Case Study

This section presents several noteworthy text cases encountered while generating with the pre-trained LLMs, identified through our manual quality checks.

F.1 Uncontrollable output

A notable translation issue arises from unnecessary explanatory notes appended by the model, which negatively affect automated evaluation metrics. Consider the following example, where the original English input was:

"He added that 'they should not, however, be asked to take on obligations that go beyond their development stage, responsibility and capabilities.'"

The model produced:

"Dëi gouf och gesäftlech gesëtt datt 'si si och net an Obergrenzen vum ieren Entwéckelungsschatz, Responsabilität a Fähigkeeten uerecht bestallt.' (Note: Please keep in mind that Luxembourgish is a complex and nuanced language, and translations may vary depending on regional dialects and contexts. This translation is provided in standard Luxembourgish)"

The corresponding ground-truth translation in Luxembourgish is:

"Hien huet bäigefüügt, datt 'se awer net opgefuerdert sollte ginn, Verflichtungen ze iwwerhuelen, déi iwwer hiren Entwécklungsniveau, hir Verantwortung a Fäegkeeten erausginn.'"

The inclusion of this extraneous note, likely originating from instruction-tuned general translation settings or cautious model outputs designed for conversational contexts, significantly impacts BLEU and related metrics. Such notes introduce irrelevant lexical content and length discrepancies, lowering similarity scores and thereby misrepresenting the model's genuine translation ability.

To mitigate this issue, explicit prompt engineering or fine-tuning with specific stopping criteria and clearer output-formatting instructions is essential. This strategy helps ensure that model-generated translations remain concise, relevant, and closely aligned with evaluation metric expectations.
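As one illustration of such post-processing (not the procedure used in this work), a trailing parenthetical note can be stripped before metrics are computed. The regex below is a sketch and assumes the note always appears in parentheses at the end of the output.

import re

def strip_trailing_note(translation: str) -> str:
    # Remove a trailing "(Note: ...)" that some instruction-tuned models append
    # after the actual translation, so that BLEU and related metrics score only
    # the translation itself.
    return re.sub(r"\s*\(Note:.*\)\s*$", "", translation,
                  flags=re.IGNORECASE | re.DOTALL).strip()

hyp = "Dëi gouf och gesäftlech gesëtt ... (Note: This translation is provided in standard Luxembourgish)"
print(strip_trailing_note(hyp))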
Table 8: Impact of LoRA Rank on Performance During Fine-Tuning, Evaluated Across Three Rank Values (EN-LB)

Rank (LoRA) | Val 300 (SPBLEU / CharF++ / Jaccard) | FLORES-200 (SPBLEU / CharF++ / Jaccard)
Llama 3.2-3B, Base Model | 6.46 / 26.78 / 0.12 | 4.80 / 22.10 / 0.09
Llama 3.2-3B, r = 32 | 12.95 / 33.09 / 0.19 | 9.46 / 29.64 / 0.14
Llama 3.2-3B, r = 64 | 13.05 / 33.59 / 0.19 | 9.23 / 28.93 / 0.14
Llama 3.2-3B, r = 128 | 13.32 / 34.09 / 0.20 | 9.27 / 29.16 / 0.14
Gemma 2-2B, Base Model | 5.82 / 22.71 / 0.10 | 4.61 / 20.78 / 0.07
Gemma 2-2B, r = 32 | 13.07 / 33.36 / 0.21 | 8.88 / 27.93 / 0.16
Gemma 2-2B, r = 64 | 13.17 / 33.35 / 0.21 | 9.12 / 28.06 / 0.16
Gemma 2-2B, r = 128 | 13.31 / 33.69 / 0.21 | 9.21 / 28.20 / 0.16

Table 9: Impact of Dataset Size on the Performance of Fine-Tuning

English to Luxembourgish
Dataset Ratio | Val 300 (SPBLEU / CharF++ / Jaccard) | FLORES-200 (SPBLEU / CharF++ / Jaccard)
Llama 3.2-3B, 0% | 6.46 / 26.78 / 0.12 | 4.80 / 22.10 / 0.09
Llama 3.2-3B, 1% | 9.36 / 31.88 / 0.16 | 6.53 / 26.31 / 0.10
Llama 3.2-3B, 10% | 18.61 / 40.51 / 0.23 | 9.79 / 30.65 / 0.14
Llama 3.2-3B, 50% | 27.75 / 47.52 / 0.30 | 13.39 / 34.67 / 0.17
Llama 3.2-3B, 100% | 42.16 / 57.87 / 0.42 | 23.40 / 42.90 / 0.26
Gemma 2-2B, 0% | 5.82 / 22.71 / 0.10 | 4.61 / 20.78 / 0.07
Gemma 2-2B, 1% | 14.36 / 35.06 / 0.21 | 9.01 / 27.99 / 0.15
Gemma 2-2B, 10% | 30.58 / 49.32 / 0.34 | 15.99 / 36.12 / 0.22
Gemma 2-2B, 50% | 41.32 / 57.18 / 0.42 | 22.30 / 41.69 / 0.27
Gemma 2-2B, 100% | 44.12 / 59.10 / 0.45 | 23.50 / 42.49 / 0.28

Luxembourgish to English
Dataset Ratio | Val 300 (SPBLEU / CharF++ / Jaccard) | FLORES-200 (SPBLEU / CharF++ / Jaccard)
Llama 3.2-3B, Base Model | 26.31 / 45.98 / 0.33 | 17.62 / 36.79 / 0.26
Llama 3.2-3B, 1% | 34.18 / 54.63 / 0.40 | 22.68 / 45.98 / 0.32
Llama 3.2-3B, 10% | 43.28 / 61.86 / 0.48 | 26.11 / 50.51 / 0.36
Llama 3.2-3B, 50% | 49.60 / 67.15 / 0.53 | 29.18 / 54.35 / 0.39
Llama 3.2-3B, 100% | 57.88 / 73.46 / 0.60 | 32.56 / 57.60 / 0.41
Gemma 2-2B, Base Model | 27.11 / 47.44 / 0.34 | 14.99 / 37.77 / 0.26
Gemma 2-2B, 1% | 43.00 / 59.80 / 0.47 | 29.25 / 49.15 / 0.38
Gemma 2-2B, 10% | 54.41 / 68.86 / 0.58 | 36.14 / 55.67 / 0.45
Gemma 2-2B, 50% | 61.26 / 73.91 / 0.64 | 41.06 / 59.94 / 0.49
Gemma 2-2B, 100% | 62.75 / 75.13 / 0.65 | 42.73 / 61.25 / 0.51

Table 10: Performance testing after SFT on Corresponding Validation Dataset (#1000 samples)

Language Pair | Method | SPBLEU | CharF++ | Jaccard | LLMaaJ
As-En | BM | 8.75 | 22.72 | 0.16 | 0.64
As-En | DN | 9.00 | 23.03 | 0.16 | 0.65
As-En | DL | 8.87 | 23.04 | 0.16 | 0.59
As-En | DG | 9.43 | 23.69 | 0.16 | 0.62
En-As | BM | 2.27 | 10.84 | 0.03 | 0.37
En-As | DN | 8.75 | 22.72 | 0.16 | 0.64
En-As | DL | 8.09 | 29.03 | 0.18 | 0.61
En-As | DG | 8.07 | 29.23 | 0.18 | 0.65
Kh-En | BM | 0.63 | 14.66 | 0.06 | 0.05
Kh-En | DN | NA | NA | NA | NA
Kh-En | DL | 2.79 | 18.66 | 0.10 | 0.10
Kh-En | DG | 4.81 | 23.43 | 0.14 | 0.30
En-Kh | BM | 0.22 | 0.50 | 0.00 | 0.00
En-Kh | DN | NA | NA | NA | NA
En-Kh | DL | 4.81 | 16.95 | 0.15 | 0.17
En-Kh | DG | 11.58 | 29.19 | 0.23 | 0.51
Uk-En | BM | 22.50 | 41.35 | 0.30 | 0.72
Uk-En | DN | 25.34 | 44.06 | 0.33 | 0.77
Uk-En | DL | 25.29 | 44.08 | 0.33 | 0.76
Uk-En | DG | 24.81 | 43.76 | 0.32 | 0.78
En-Uk | BM | 13.57 | 30.19 | 0.15 | 0.60
En-Uk | DN | 17.87 | 34.83 | 0.18 | 0.70
En-Uk | DL | 17.97 | 34.83 | 0.19 | 0.69
En-Uk | DG | 18.10 | 34.97 | 0.19 | 0.72
En-Lb | BM | 6.46 | 26.78 | 0.12 | 0.36
En-Lb | DN | 37.98 | 55.41 | 0.37 | 0.82
En-Lb | DL | 40.71 | 59.02 | 0.44 | 0.87
En-Lb | DG | 44.58 | 59.73 | 0.45 | 0.87
Lb-En | BM | 26.31 | 45.98 | 0.33 | 0.58
Lb-En | DN | 42.78 | 59.33 | 0.48 | 0.82
Lb-En | DL | 54.64 | 70.98 | 0.57 | 0.82
Lb-En | DG | 59.88 | 74.97 | 0.63 | 0.90

F.2 Repetition Problem

When generating with the pre-trained models, we frequently encountered hallucination in the form of repetitive generation. This problem was notably less severe in the Gemma model than in the Llama model. The repetitive generation primarily occurred in the English-to-Luxembourgish direction, largely because of the model's insufficient command of Luxembourgish. Consequently, the model frequently confused Luxembourgish with German, which exacerbated the problem further. Below is an example of input:
Translate the following English input text into Luxembourgish. Do not include any additional information or unrelated content.
Input: The United States Strategic Command of the U.S. Department of Defense office is tracking the debris.

Below is an illustrative output that exhibits repetitive generation:

Here is the translation: D'United States Strategic Command vum ustrategeschen vum ustrategeschen Ufuerderungsbüro vum ustrategeschen Ufuerderungsbüro vum ustrategeschen Ufuerderungsbüro vum ustrategeschen Ufuerderungsbüro vum ustrategeschen Ufuerderungsbüro vum ustrategeschen ...

We also observed that when distillation datasets contained even a small amount of hallucinated data, such as infinitely repeating Luxembourgish content (particularly prominent in Llama 3.3 70B), the subsequently fine-tuned models strongly inherited these hallucination patterns, resulting in substantial performance degradation. Maintaining high-quality data for fine-tuning is therefore crucial. In comparison, data distilled from ChatGPT rarely exhibited infinite-repetition issues, enabling the model to learn appropriate stopping criteria during generation. Additionally, we found that incorporating certain generation prompts during training, such as "Here is the translation: ", could partially mitigate the inherited infinite-generation problems originating from the data distilled from Llama.
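A simple way to screen such distillation data, shown purely as an illustration (the 4-gram size and the 0.3 threshold are arbitrary choices, not values from this work), is to drop target sentences whose word n-grams are mostly repeats:

def repetition_ratio(text: str, n: int = 4) -> float:
    # Fraction of repeated word n-grams; looping outputs score close to 1.0.
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def keep_pair(target_text: str, threshold: float = 0.3) -> bool:
    # Keep a distilled training pair only if its target side does not look degenerate.
    return repetition_ratio(target_text) < threshold

print(keep_pair("vum ustrategeschen Ufuerderungsbüro " * 10))  # False: mostly repeated 4-grams
print(keep_pair("Hien huet bäigefüügt, datt se awer net opgefuerdert sollte ginn."))  # True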
F.3 Hallucination to German

Given the English input sentence:

"At first glance, it looks slightly chaotic at the arrival, the children make a game of clapping hands with everyone who arrives, but that also contributes to the special charm of the event. The Widdebierg is not as high as one might assume from the logo, but it is still 386 meters, which is not nothing,"

the model produced the following output:

"Bei der Ankunft blickt man zunächst etwas chaotisch aus, die Kinder machen ein Spiel mit der Klatsch der Hände mit jedem, der kommt, aber das tragen auch zur besonderen Charme vun der Évent vun der Éieren. Dëi Widdebiergëstcher Éäschten, well dass méi méi vun der Logo, mä si méi och 386 Méter, wéi déi éis gët nach Léck, wat déiweis méi och net."

In contrast, the Luxembourgish ground truth is:

"Op den éischte Bléck gesäit et op der Arrivée liicht chaotesch aus, d'Kanner maache sech e Spaass draus, jidderengem, deen ukënnt, an d'Hand ze klatschen, mä och dat dréit zum spezielle Charme vun der Manifestatioun bäi. De Widdebierg ass wuel net esou héich wéi een dat um Logo kéint unhuelen, mä ëmmerhi sinn et 386 Meter, dat ass net grad näischt."

This incorrect output results primarily from the excessive use of German vocabulary in place of proper Luxembourgish expressions. The phenomenon likely arises from several factors:

• Data Sparsity and Language Proximity: Luxembourgish and German share considerable lexical and syntactic similarities. With limited Luxembourgish-specific training data, the model may unintentionally rely heavily on its knowledge of German, leading to significant linguistic interference.

• Pretraining Corpus Bias: The predominance of German text over Luxembourgish in multilingual pretraining datasets likely reinforces German lexical and structural patterns, especially under resource-constrained fine-tuning conditions.
• Limited Distinctive Training Examples: Insufficient distinct Luxembourgish examples during fine-tuning might not effectively guide the model away from Germanic lexical choices, resulting in mixed-language outputs or incorrect lexical selections.

Addressing this issue effectively requires either extensive additional training data or targeted linguistic resources explicitly designed to emphasize the lexical and grammatical distinctions between closely related languages such as Luxembourgish and German.
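As a purely illustrative check for this failure mode (not something used in this work), generated outputs can be passed through an off-the-shelf language identifier such as fastText's lid.176 model and flagged when they are recognized as German rather than Luxembourgish. The model path and probability threshold below are assumptions.

import fasttext  # pip install fasttext; lid.176.bin must be downloaded separately

lid = fasttext.load_model("lid.176.bin")  # path to the pretrained language-ID model (assumed)

def flags_german_interference(output: str, min_lb_prob: float = 0.5) -> bool:
    # Flag a supposedly Luxembourgish output when the identifier prefers German,
    # or assigns Luxembourgish only a low probability.
    labels, probs = lid.predict(output.replace("\n", " "), k=3)
    scores = {label.replace("__label__", ""): float(p) for label, p in zip(labels, probs)}
    return scores.get("de", 0.0) > scores.get("lb", 0.0) or scores.get("lb", 0.0) < min_lb_prob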
G Prompt Design for LLM

G.1 Prompt for LLM-as-a-Judge

For the prompt, we mainly adopt the legal-translation prompt structure of previous work (Niklaus et al., 2025), but simplify it to cover only the translation task, without any domain-specific emphasis. In this paper, we primarily employ google/gemma-3-27b-it as the evaluation model to assess translation quality, given its strong instruction-following capabilities and competitive performance among open-weight LLMs. For efficient model inference, we adopt SGLang as the serving framework, which enables streamlined deployment and low-latency responses for both evaluation and generation tasks.

Your task is to assess the accuracy, clarity, and fidelity of the model's translation to the golden translation. You will be provided the golden translation and the model's translation. Your task is to judge how correct the model's translation is based on the golden translation, and then give a correctness score. The correctness score should be one of the below numbers: 0.0 (totally wrong), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 (totally right). You should give the correctness score directly. The correctness score must strictly follow this format: "[[score]]", e.g., "The correctness score: [[0.5]]."

Golden Translation: {Golden Translation}
Model Translation: {Model's Translation}
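For reference, the judge can be queried by sending this prompt to the SGLang server through its OpenAI-compatible endpoint and extracting the bracketed score from the reply. The sketch below is illustrative: the local URL, port, and dummy API key are assumptions, and the prompt string is abridged relative to the full prompt above.

from openai import OpenAI
import re

# SGLang exposes an OpenAI-compatible API; host, port, and key here are assumptions.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Abridged version of the judge prompt shown above.
JUDGE_PROMPT = (
    "Your task is to assess the accuracy, clarity, and fidelity of the model's translation "
    "to the golden translation and give a correctness score between 0.0 and 1.0. "
    'The correctness score must strictly follow this format: "[[score]]".\n'
    "Golden Translation: {gold}\nModel Translation: {hyp}"
)

def judge(gold: str, hyp: str) -> float:
    # Ask the judge model for a score and parse the "[[score]]" pattern from its reply.
    response = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(gold=gold, hyp=hyp)}],
        temperature=0.0,
    )
    match = re.search(r"\[\[([01](?:\.\d+)?)\]\]", response.choices[0].message.content)
    return float(match.group(1)) if match else float("nan")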
G.2 Prompt for SFT
We primarily adopt the classical SFT approach,
where the model is trained to predict the next
token by minimizing the cross-entropy loss.
Consequently, training data typically consist of
input-output pairs, such as question-answer or
instruction-response formats. The input is usually
referred to as the prompt and the output as the
answer. During training, the prompt and answer are
concatenated and fed into the model, with the
objective of guiding the model to generate the answer
portion. In this work, we employ the following
training template.
Below is an instruction that describes a task,
paired with an input that provides further
context. Write a response that appropriately
completes the request.
### Instruction:
Translate the following English input text
into Luxembourgish. Do not include any
additional information or unrelated content.
### Input:
{The sentence to be translated}
### Response:
{The translated sentence}
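As a minimal illustration of this setup (not the exact training script used in this work), a translation pair can be rendered into the template above and concatenated with the reference before tokenization. The example pair and the end-of-sequence token are placeholders; the actual EOS token depends on the tokenizer of the fine-tuned model.

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "Translate the following English input text into Luxembourgish. Do not include any "
    "additional information or unrelated content.\n\n"
    "### Input:\n{source}\n\n"
    "### Response:\n"
)

def build_sft_example(source: str, target: str, eos_token: str = "</s>") -> dict:
    # Concatenate prompt and answer; during SFT the cross-entropy objective guides the
    # model to generate the answer portion that follows "### Response:".
    prompt = PROMPT_TEMPLATE.format(source=source)
    return {"prompt": prompt, "text": prompt + target + eos_token}

example = build_sft_example("Good morning.", "Gudde Moien.")  # illustrative pair
print(example["text"])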