Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
Summary
This paper introduces LINK (Lexical INterventions for Knowledge transfer), a data-level method to improve cross-lingual knowledge transfer for languages with scarce training data. Existing approaches often require large parallel corpora or auxiliary models, which are unavailable for many low-resource languages. LINK addresses this by randomly substituting words in high-resource English pretraining data with their translations using only a bilingual vocabulary, requiring no additional model training. The authors evaluate LINK on eight languages across five model sizes (137M to 2.7B parameters). Two strategies are tested: uniform interventions across random data subsets and domain-specific interventions targeting scientific content. Results show that LINK consistently improves downstream performance in data-constrained languages, sometimes surpassing baselines trained on significantly more target-language data. Notably, the method achieves up to a 2x speedup in training to reach equivalent performance. While uniform interventions degrade English performance, domain-specific interventions maintain high-resource accuracy while boosting target-language results. Experiments on truly low-resource languages like Swahili and Yoruba confirm benefits, though effectiveness correlates with bilingual vocabulary size. The study demonstrates that simple lexical substitutions during pretraining offer a scalable, near-zero-cost solution for enhancing multilingual language models without relying on expensive translation infrastructure.
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
Anastasiia Sedova1,2, Natalie Schluter1‡, Skyler Seto1‡, Maartje ter Hoeve1‡
1Apple, 2ITU
‡Shared senior authorship
Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
Correspondence: Anastasiia Sedova: asedova@apple.com
Date: May 25, 2026
1 Introduction
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language understanding and knowledge-intensive tasks (Brown et al., 2020; Bubeck et al., 2023; DeepSeek-AI et al., 2025; Gemini Team et al., 2025). These results are enabled by pretraining on massive text corpora, sometimes comprising tens of trillions of tokens.
At this scale, high-quality training data is realistically available only in English (Li et al., 2024a; Penedo et al., 2024), and LLM research has therefore largely concentrated on it. Most other languages, in contrast, lack sufficient high-quality data in public web crawls: even for languages typically considered high-resource, the available data amounts to only 1–10% of the data available for English, while for truly low-resource languages, it can fall below 1B tokens (Penedo et al., 2025). Since data should scale roughly linearly with the size of the model (Hoffmann et al., 2022), this makes it difficult to adapt and effectively train LLMs for such languages (Pakray et al., 2025; Veitsman & Hartmann, 2025). One way to address this scarcity is to leverage data across other languages for multilingual modeling. Seto et al. (2025) demonstrate that the increased capacity of larger models consistently yields better performance than smaller language-specific models. Such models also naturally map semantically similar concepts across languages into a shared representation space, enabling effective cross-lingual knowledge transfer (Conneau et al., 2020b; Hu et al., 2021; Liu & Niehues, 2025). One of the factors impacting this alignment is the presence of pretraining samples in which similar concepts from different languages co-occur in the same sentence, forcing the model to predict tokens in one language from tokens in another (Wang et al., 2025). Existing methods that increase such cross-lingual mixing typically rely on translation systems, larger teacher models, or substantial parallel corpora (Wang et al., 2025; Yoo et al., 2025; Li et al., 2024b),
which are often unavailable for the data-constrained languages that would potentially benefit most.
arXiv:2605.23885v1 [cs.CL] 22 May 2026
Figure 1 Overview of LINK. Using a bilingual vocabulary, we substitute randomly selected words in a portion of the high-resource pretraining data with their translations in the low-resource language. The mix ratio controls what fraction of the high-resource data is subject to replacement, and the replacement ratio determines the share of words replaced within that fraction.
We introduce LINK (Lexical INterventions for Knowledge transfer): a method that facilitates cross-lingual knowledge transfer¹ through simple lexical substitutions in a multilingual pretraining mix (Figure 1). In contrast to previous work (Wang et al., 2025; Yoo et al., 2025; Li et al., 2024b), our method requires only a small bilingual vocabulary to replace the words in a portion of the high-resource training data (controlled by the mix ratio) with their low-resource translations up to a predefined proportion (replacement ratio). This makes LINK broadly applicable to virtually any language for which a bilingual vocabulary can be obtained (which, in practice, includes all written languages, even truly low-resource ones) and easy to integrate into pretraining pipelines at scale.
The experiments are conducted across five model sizes on four data-constrained settings simulated from high- and mid-resource languages by reducing the amount of training data. This setup allows greater flexibility to examine a broader range of scenarios and compare them against each other on reliable benchmarks, which are largely unavailable for truly low-resource languages. We still
validate our findings on four truly low-resource languages. We also analyze the effect of bilingual vocabulary size on transfer performance and experiment with different placements of the interventions: LINK_uni makes replacements on a randomly selected part of the data, and LINK_domain intervenes only on the domain-specific part, thus enabling targeted domain knowledge transfer while preserving most of the high-resource data. Our experiments demonstrate that such targeted interventions enable the model to maintain high performance in both languages. An ablation study further shows that LINK facilitates cross-lingual knowledge transfer even when interventions are applied to data unrelated to the target domain.
Overall, using LINK, we show:
• simple word-level substitutions improve cross-lingual knowledge transfer from a high-resource to a data-constrained language during pretraining, yielding better downstream performance in the data-constrained language and up to 2× speedup in training under imbalanced data settings;
• targeted interventions on a domain-specific portion of the high-resource training data lead to improvements comparable to intervening on the entire corpus, while maintaining high downstream performance in both languages;
• interventions on data not directly related to the end task still contribute to cross-lingual knowledge transfer and improve downstream performance.
¹ In the context of this paper, following previous work (Longpre et al., 2026; Li et al., 2024b; Conneau & Lample, 2019), by cross-lingual knowledge transfer, we refer to the ability of a model pretrained on data from multiple languages to leverage knowledge (such as factual
information or domain-specific concepts) acquired during pretraining from the data in one language when performing tasks in another.
2 Related Work
Data Limitations for Multilingual Language Models. Training language models requires data and model sizes to scale jointly (Hoffmann et al., 2022). As a result, the development of multilingual models has been closely tied to the availability of large-scale multilingual corpora. Widely used multilingual datasets (Wenzek et al., 2020; Xue et al., 2021b; Penedo et al., 2025) provide web-crawled text in over 100 languages, but with a dramatic long-tail distribution: for most languages, data volumes can be 10–100x smaller than for English, making it difficult to train even moderately sized models. This data scarcity propagates through to model performance: models such as BLOOM (BigScience Workshop et al., 2023), Llama 2 (Touvron et al., 2023), and Qwen 2.5 (Team Qwen et al., 2025) have data mixtures that are over 90% English. This is further compounded by the curse of multilinguality: with fixed model capacity, adding more languages degrades per-language performance (Conneau et al., 2020a), disproportionately harming low-resource languages (Chang et al., 2024).
Cross-Lingual Knowledge Transfer. Prior work has shown that multilingual models naturally learn to map semantically similar concepts across languages into a shared representation space (Conneau et al., 2020b; Hu et al., 2021; Liu & Niehues, 2025). Among the factors driving this is the co-occurrence of similar concepts from different languages in the same context during training, which helps the model learn to predict tokens in one
language from tokens in another (Luong et al., 2015; Wang et al., 2025), and even lightweight cross-lingual signal can meaningfully improve transfer (Li et al., 2024b). Such multilingual knowledge transfer is particularly important when target language data is limited: Tanzer et al. (2023) provide a benchmark for learning to translate a new language from a single book, and Seto et al. (2025) demonstrated that bilingual models trained with limited target-language data benefit from scaling auxiliary high-resource data. These and other works show that high-resource data can improve low-resource performance, but doing so without large parallel corpora or auxiliary models remains challenging.
Word- and Segment-Level Interventions. Simple word-level perturbations have seen success as effective data augmentation techniques in natural language processing. Xie et al. (2017) show that noisy word substitution improves language models and machine translation systems, and Wei & Zou (2019) further demonstrate that simple operations such as synonym replacement and random insertion improve text classification in low-data regimes. In machine translation, bilingual dictionaries have been used to improve translation of rare words (Fadaee et al., 2017), while random word replacements have been shown to regularize neural machine translation models (Wang et al., 2018). Kobayashi (2018) extends word substitution by using a language model to predict contextually appropriate replacements. These works establish that even simple lexical perturbations provide meaningful training signal. Recent work with LLM-generated code-switched data included in pretraining further suggests
that code-switching abilities may be key to improving multilingual capabilities (Wang et al., 2025; Yoo et al., 2025; Li et al., 2024b). Our LINK method is most closely related to these methods but differs in two key respects. First, we make no explicit code-switching assumption. LINK is motivated by the topology of today's LLMs: inspired by older research in multilingual word embeddings (Luong et al., 2015), we observe that by generating token representations using a bilingual context, we can enable multilingual representations of tokens for monolingual inputs. Second, LINK is lightweight: it requires no parallel corpora, no auxiliary models, and no additional training stages, but only a bilingual vocabulary obtainable at near-zero cost, making it applicable to truly low-resource settings where none of these resources are available.
3 Lexical Interventions for Cross-Lingual Knowledge Transfer
We assume a data-constrained² target language dataset $D_{LR}$ of limited size and a high-resource language dataset $D_{HR}$ available in effectively unlimited quantity³.
² We use data-constrained rather than low-resource, but retain the standard LR abbreviation for brevity. See Appendix A.8 for additional discussion.
³ Although LINK has no formal limitation on the number of languages, we follow Seto et al. (2025) and focus on the bilingual setup for controllability and leave other experiments for future work.
Before training, we apply lexical interventions to a portion of $D_{HR}$: using a bilingual vocabulary $V_{HR \leftrightarrow LR}$, we randomly replace words in $D_{HR}$ with their target-language translations. For a sample $x = (w_1, \dots, w_n) \in D$, where $D \subseteq D_{HR}$ is the subset
selected for intervention, let $k_x \in [0, n]$ denote the number of tokens to replace. The target replacement ratio $r \in [0, 1]$
controls what fraction of words in each sample is replaced. The intervened data is:
$$D_{HR+LR} = \{\mathrm{Replace}(x, V_{HR \leftrightarrow LR}, k_x, r) \mid x \in D\}, \tag{3.1}$$
where $\mathrm{Replace}(x, V, k, r)$ swaps $k$ randomly selected tokens in $x$ with their translations from $V$, leaving
other words unchanged. Since bilingual vocabularies vary in coverage, the actual fraction of replaced words
may fall below $r$.
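The Replace operation defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `replace_words`, whitespace-split input, lowercase dictionary lookup, and the choice of $k_x = \mathrm{round}(r \cdot n)$ are all assumptions made here for concreteness.

```python
import random

def replace_words(words, vocab, ratio, rng):
    """Swap up to round(ratio * len(words)) words for dictionary translations.

    Hypothetical helper: `vocab` maps lowercase source words to one translation.
    Returns the new word list and the number of replacements actually made.
    """
    k = round(ratio * len(words))  # assumed target count k_x
    # Only positions whose word has a dictionary entry can be replaced.
    candidates = [i for i, w in enumerate(words) if w.lower() in vocab]
    # Coverage may cap the realized count below the target k.
    chosen = rng.sample(candidates, min(k, len(candidates)))
    out = list(words)
    for i in chosen:
        out[i] = vocab[words[i].lower()]
    return out, len(chosen)

# Toy dictionary; real vocabularies would come from a bilingual resource.
vocab = {"the": "der", "wine": "Wein"}
mixed, n_replaced = replace_words("Add the red wine".split(), vocab, 0.7, random.Random(0))
```

In this toy call the target is three replacements, but only two words have dictionary entries, so both are replaced and the realized count stays below the target, which is the coverage effect described above.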
An example of an English sample before and after LINK for a German–English mix with a 70% replacement ratio:
Before: [...] Combine the lamb with the onion mixture. Add the cinnamon, oregano and red wine and cook for a
few minutes. Add the tomatoes and a cup of water or stock. [...] See more Greek recipes. [...]
After: [...] Kombinieren der Lamm mit der Zwiebel mixture. Add der Zimtbaum, oregano und rot Wein
und Koch da a wenig minutes. Add der tomatoes und a Tasse aus Wasser oder Vorrat. [...] Sehen
mehr Greek recipes. [...]
Importantly, we aim to create cross-lingual co-occurrences rather than well-formed bilingual sentences, which is
why we do not filter for grammatical correctness or translation accuracy. This keeps the intervention
computationally cheap: it becomes a simple preprocessing step with negligible overhead, requiring only
a dictionary lookup per token.
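Because the intervention reduces to a per-token lookup, it can run as a single streaming pass over the corpus. The sketch below is a hypothetical illustration, not the paper's pipeline: `apply_to_fraction` is a name introduced here, selecting each sample independently with probability equal to the mix ratio is an assumption (it only approximates a fixed-size subset), and `intervene` stands for any per-sample replacement function.

```python
import random
from typing import Callable, Iterable, Iterator

def apply_to_fraction(
    samples: Iterable[str],
    intervene: Callable[[str], str],
    mix_ratio: float,
    seed: int = 0,
) -> Iterator[str]:
    """Yield every sample, passing roughly a `mix_ratio` share through `intervene`."""
    rng = random.Random(seed)
    for sample in samples:
        # Assumption: independent per-sample selection approximates the mix ratio.
        if rng.random() < mix_ratio:
            yield intervene(sample)
        else:
            yield sample

# Toy dictionary lookup standing in for a real bilingual vocabulary.
toy_vocab = {"water": "Wasser", "wine": "Wein"}

def toy_intervene(text: str) -> str:
    # Replaces every covered word; a real intervention would also apply a replacement ratio.
    return " ".join(toy_vocab.get(w, w) for w in text.split())

docs = ["red wine", "a cup of water"]
all_mixed = list(apply_to_fraction(docs, toy_intervene, mix_ratio=1.0))
none_mixed = list(apply_to_fraction(docs, toy_intervene, mix_ratio=0.0))
```

A generator keeps memory use constant regardless of corpus size, and the fixed seed makes the selected subset reproducible across runs.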
We primarily consider two replacement strategies:
• Uniform interventions (LINK_uni), where words are replaced across a randomly selected portion
of $D_{HR}$. From $D_{HR}$, we randomly select a subset $D^{\text{rand}}_{HR}$, whose size is defined by the mix ratio, and
replace up to a fraction $r$ of the words in each sample $x \in D$ with
their translations from $V_{HR \leftrightarrow LR}$. The remaining portion of the high-resource data, $D'_{HR} = D_{HR} \setminus D^{\text{rand}}_{HR}$, is included in the training set without modifications, as is the data-constrained $D_{LR}$, which is not subject to replacements so as not to reduce the already highly limited data. The resulting training dataset is defined as:
$$D_{\text{train}} = D_{LR} \cup D^{\text{rand}}_{HR+LR} \cup D'_{HR}. \tag{3.2}$$
• Domain-specific interventions (LINK_domain), which apply interventions only to the task- or domain-specific portion of the high-resource data. This is motivated by the practical observation that domain-specific knowledge (e.g., scientific content) is often abundant in the high-resource language but scarce in data-constrained languages, making it a natural candidate for targeted transfer. Under this more conservative setting, we address the lack of such domain data in the target language dataset while minimizing the intervention cost on the high-resource dataset (which inevitably grows with the size of the intervened high-resource data). Let $D^{\text{task}}_{HR} \subset D_{HR}$ denote the domain-specific portion of the high-resource data, and $D^{\text{non-task}}_{HR} := D_{HR} \setminus D^{\text{task}}_{HR}$. The domain-specific mixed subset $D^{\text{task}}_{HR+LR}$ is created by making replacements in $D^{\text{task}}_{HR}$ using $V_{HR \leftrightarrow LR}$. The remaining data are included in the training set without modifications, resulting in:
$$D_{\text{train}} = D_{LR} \cup D^{\text{task}}_{HR+LR} \cup D^{\text{non-task}}_{HR}. \tag{3.3}$$
This approach preserves the majority of the high-resource training data, concentrating cross-lingual signal where it is most needed while maintaining English performance.
4 Experimental Setup
English serves as the high-resource language for all experiments, as it is the only language
with sufficiently large and diverse public web-scale datasets. The data-constrained scenario is simulated by subsampling four languages (German, French, Hindi, and Chinese) to approximately 350M tokens each (roughly 400K documents), comparable to prior work (Seto et al., 2025) and closely mirroring the scale of genuinely low-resource languages in well-established multilingual datasets (Penedo et al., 2025; Xue et al., 2021a). This enables comparison with settings that have more target-language data, broader ablation studies, and evaluation on well-established benchmarks whose translation is often infeasible for truly low-resource languages. In addition, we experiment with four truly low-resource languages: Swahili, Yoruba, Amharic, and Igbo (see Section 5.3 for the results).
The English training data $D_{HR}$ is sampled from the FineWeb corpus (Penedo et al., 2024) in quantities sufficient to avoid repetition within a single model's training. All non-English data $D_{LR}$ is sampled from FineWeb2 (Penedo et al., 2025). A bilingual vocabulary $V_{HR \leftrightarrow LR}$ for each target language is constructed by extracting word entries and their corresponding English translations from a bilingual resource. While any bilingual dictionary or lexicon can serve this purpose, we use Wiktionary (Wikimedia Foundation, 2025) as it provides translation pairs for over 4,400 languages, making bilingual vocabularies readily available even for languages without parallel corpora or translation systems. Dataset and vocabulary sizes are reported in Table 1.
Language        Bilingual vocabulary    Training tokens
English         –                       ∞
German (DE)     48,195                  345M
Chinese (ZH)    45,571                  330M
French (FR)     36,492                  340M
Hindi (HI)      25,001                  342M
Swahili (SW)    4,197                   672M
Amharic (AM)    1,487                   435M
Yoruba (YO)     675                     96M
Igbo (IG)       233                     146M
Table 1 Bilingual vocabulary sizes and training data amounts.
We evaluate LINK_uni and LINK_domain on zero-shot QA tasks: ARC Easy and Challenge (Clark et al., 2018), Hellaswag (Zellers et al., 2019), Lambada (Paperno et al., 2016), PiQA (Bisk et al., 2019), SciQ (Welbl et al., 2017), and Winogrande (Sakaguchi et al., 2021). These are knowledge-based tasks that small models with limited data still perform well on. Non-English evaluations are conducted via translation of the original dataset. We translate primarily using our own systems to ensure that translation artifacts are consistent throughout.
We train decoder-only GPT models at five scales (137M, 345M, 760M, 1.3B, and 2.7B parameters) using the Megatron-LM framework with the Aya multilingual tokenizer (250K vocabulary) (Üstün et al., 2024) and a sequence length of 1024. Training runs for 10K, 30K, 50K, and 100K steps, correspondingly, with a batch size of 1024 samples. In a separate set of experiments, we empirically determined an optimal ratio for combining English and LR language data of 97.5% (high-resource) to 2.5% (data-constrained)⁴.
5 Results
We compare LINK against three baselines: (1) LR (UB) as an upper bound, trained with enough target language data for one training epoch, serving as an oracle and illustrating the gap that data scarcity creates⁵; (2) LR as a lower bound, trained solely on the scarce amount of resource-constrained data; and (3) LR + HR trained on a mixture of the scarce quantity of resource-constrained data and sufficient quantity of
high-resource data. Due to space constraints, we restrict our discussion to results with the 1.3B parameter models, and defer results from smaller (137M, 345M, and 760M parameters) and larger (2.7B parameters) models to Appendix A.3.
5.1 Uniform Interventions
Table 2 reports evaluation results of our LINK_uni experiments. All results are presented for the mix ratio of 90 and replacement ratio of 70 based on the ablation study. In practice, however, vocabulary coverage often caps the actual per-sample replacement rate below the 70% target, meaning every word that can be replaced is replaced under this setup (see more details in Appendix A.6). The results of other replacement configurations are discussed in Section 6.
             | LR performance                                 | HR performance
Setup        | A-E   A-C   HS    LB    PQ    SQ    WG    Avg  | A-E   A-C   HS    LB    PQ    SQ    WG    Avg
German
LR (UB)      | 37.2  28.6  41.9  28.7  62.8  65.9  51.9  45.3 | 33.5  24.1  33.5  28.9  60.2  61.3  52.1  41.9
LR           | 30.8  22.7  29.1  12.7  55.2  47.3  52.0  35.7 | 27.8  23.5  26.8  4.5   52.2  43.1  52.3  32.9
LR + HR      | 38.0  25.8  37.9  26.7  59.0  64.1  52.7  43.5 | 52.1  27.8  57.6  53.0  74.4  74.6  53.7  56.2
LINK_uni     | 40.3  26.6  39.4  26.9  59.8  64.7  53.5  44.5 | 50.0  27.4  54.3  49.4  71.4  72.3  56.4  54.5
LINK_domain  | 39.3  25.0  38.1  26.6  60.3  66.7  52.0  44.0 | 52.6  29.2  57.0  53.4  73.6  76.6  56.7  57.0
French
LR (UB)      | 41.3  27.4  46.7  32.5  66.0  66.3  54.9  47.9 | 38.1  21.8  33.6  30.6  62.0  66.5  53.2  43.7
LR           | 31.9  21.2  30.1  10.5  55.7  54.4  53.2  36.7 | 28.2  21.7  26.6  4.3   52.0  46.6  49.9  32.8
LR + HR      | 38.4  24.9  40.2  28.3  60.7  65.1  54.2  44.5 | 53.3  28.5  57.8  52.4  74.7  77.4  54.9  57.0
LINK_uni     | 39.8  25.8  42.2  30.4  60.6  63.4  54.2  45.2 | 48.7  27.1  53.8  49.5  71.6  70.9  53.9  53.6
LINK_domain  | 40.4  26.2  41.3  29.2  61.0  64.2  52.9  45.0 | 52.0  29.2  57.3  53.0  73.5  75.3  55.0  56.5
Chinese
LR (UB)      | 46.1  26.6  39.0  –     59.6  78.6  51.8  50.3 | 34.4  19.9  29.0  15.4  55.5  64.3  52.7  38.8
LR           | 33.5  23.7  30.5  –     53.8  62.9  49.4  42.3 | 27.3  20.3  26.7  1.4   50.8  40.1  48.7  30.7
LR + HR      | 39.7  24.6  37.3  –     58.3  75.5  51.5  47.8 | 53.2  29.7  56.4  50.2  73.8  73.8  55.2  56.0
LINK_uni     | 41.8  26.8  38.3  –     59.5  76.4  51.2  49.0 | 49.7  27.1  54.3  50.0  73.0  73.0  56.3  54.7
LINK_domain  | 41.5  25.4  37.8  –     59.0  77.3  51.0  48.7 | 50.3  27.1  56.2  51.9  73.8  74.5  55.2  55.6
Hindi
LR           | 30.6  22.9  28.3  –     55.0  –     52.9  37.9 | 27.3  22.4  25.9  2.0   51.0  36.4  51.0  30.9
LR + HR      | 33.3  24.1  30.4  –     53.0  –     51.0  38.4 | 53.4  28.2  57.9  52.7  74.0  77.9  56.3  57.2
LINK_uni     | 35.0  24.6  31.8  –     54.0  –     52.5  39.6 | 48.1  27.3  52.5  47.5  72.1  75.3  53.2  53.7
LINK_domain  | 35.5  25.3  30.9  –     49.0  –     50.3  38.2 | 52.4  27.4  56.4  52.5  74.0  76.6  55.8  56.4
Table 2 LINK_uni and LINK_domain 1.3B model results across German, French, Chinese, and Hindi. Bold indicates best per task-language pair among LR (S), LR+HR, LINK_uni, and LINK_domain (excluding the LR (UB)).
⁴ The data mix ratio of 97.5:2.5 was selected from a range of 50:50 to 99:1 based on final checkpoint perplexity in preliminary English–German experiments.
⁵ Not available for Hindi due to limited data available in the FineWeb2 dataset.
⁶ Note that perplexity values are not directly comparable across languages due to differences in tokenizer fertility. Full scores are provided in Appendix A.1.
LINK_uni yields consistent improvements in data-constrained language performance across all four languages, outperforming both LR and LR+HR baselines, often by a substantial margin. Remarkably, using only a fraction of the data-constrained data available to
LR (UB), LINK_uni even surpasses this upper bound baseline (e.g., ARC-Easy for German). This supports our hypothesis that simple word-level replacements can unlock cross-lingual knowledge transfer powerful enough to close the gap with models trained on significantly more target data.
Figure 2 LINK_uni and LINK_domain percentage change in perplexity relative to the LR+HR baseline for 1.3B models, computed as $(\mathrm{PPL}_{\text{LINK\_uni}} - \mathrm{PPL}_{LR+HR}) / \mathrm{PPL}_{LR+HR}$⁶.
However, these gains come at the cost of high-resource language performance: English downstream scores drop by up to 5.5pp compared to the LR+HR baseline. Figure 2 also confirms this: massive uniform interventions performed by LINK_uni lead to a substantial increase in English perplexity across all bilingual setups, reflecting the degradation caused by modifying a large portion of the high-resource training data. The effect on data-constrained language perplexity is less consistent: it decreases for some languages but increases for others, suggesting that downstream gains from cross-lingual knowledge transfer do not necessarily correlate with lower general-domain perplexity.
5.2 Domain-Specific Interventions
Next, we experiment with LINK_domain, where interventions are applied only to domain-specific data. In contrast to the LINK_uni setup, the only relevant data for replacements in LINK_domain is the domain-specific subset, and, within it, the mix ratio equals 100, while the remaining English data is left unmodified. Our primary motivation for including a domain-specific intervention is to limit the amount of data that is replaced, to protect performance in English. As a representative
domain, we focus on scientific content, targeting the ARC downstream task (Clark et al., 2018). To identify the relevant training data, following Grangier et al. (2025) and Seto et al. (2025), we cluster the English data using k-means with multilingual BERT representations (Devlin et al., 2019) into 32 clusters. The resulting clusters show clear topic-wise separation, with one cluster (comprising roughly 5.3B tokens) corresponding to scientific knowledge. This is confirmed by assigning the English and German ARC validation sets to the same centroids: they are almost entirely attributed to this cluster (see Appendix A.5). Therefore, the interventions were applied to all samples attributed to this cluster while the other samples remain intact. We preserved the original cluster distribution in the pretraining mix.
Figure 3 LINK_domain: ARC-Easy accuracy for 1.3B models. Top: target language evaluation; bottom: English evaluation.
The results are provided in Table 2. Additionally, Figure 3 shows ARC-Easy (i.e., target domain) accuracy throughout training for all four languages. LINK_domain interventions match or exceed LINK_uni interventions on target-language performance (top row) while preserving English performance (bottom row), unlike LINK_uni, which causes a notable drop. Importantly, with both intervention strategies, LINK improves training efficiency for the target language. For example, for German, the LINK models reach the baseline's final performance at 40k steps versus 80k, representing an approximately 2× speedup, while for Hindi both intervention strategies surpass the baseline as early as 30k steps. Furthermore, Figure
2 shows that LINK_domain results in negligible perplexity changes for both languages, in stark contrast to the substantial increases in English perplexity observed with LINK_uni interventions. Together, these results show that targeting interventions to domain-relevant data achieves strong data-constrained language transfer without sacrificing high-resource performance.
5.3 Low-Resource Experiments
Evaluating on truly low-resource languages introduces additional challenges. Severely limited data restricts both the bilingual vocabulary size and the training corpus (e.g., Yoruba has only 96M tokens in FineWeb2 and a vocabulary 1.4% the size of the German one, see Table 1), constraining both what the model can learn from the target language directly and the degree of overlap between the two languages in the data. In addition, standardized evaluation benchmarks are largely unavailable: most established English benchmarks lack translations for these languages, and automatic translation is often unreliable because the same data scarcity prevents training reliable machine translation systems. Evaluation is therefore restricted to the few benchmarks with existing multilingual versions.
Model  Setup        | AM    IG    SW    YO
1.3B   LR           | 44.7  60.1  47.8  38.2
1.3B   LR + HR      | 42.1  61.6  48.5  40.0
1.3B   LINK_uni     | 41.6  60.4  48.5  40.7
1.3B   LINK_domain  | 40.9  60.4  50.1  38.7
345M   LR           | 42.5  51.5  46.5  38.1
345M   LR + HR      | 41.3  57.5  46.4  34.8
345M   LINK_uni     | 42.4  59.1  46.8  39.7
345M   LINK_domain  | 41.7  58.1  47.4  33.2
Table 3 LINK average target-language benchmark results for true low-resource languages.
Table 3 reports the average LINK results for four low-resource languages: Amharic, Igbo, Swahili, and Yoruba. The evaluation is done on the
Table 3 reports the average LINK results for four low-resource languages: Amharic, Igbo, Swahili, and Yoruba. The evaluation is done on the benchmarks that are available for these languages (see the exact list and full scores in Appendix A.4). The effect of domain interventions varies across languages. For Igbo, Swahili, and Yoruba, we see improvement in all settings except the 1.3B Igbo setup. For Swahili, domain-specific interventions yield consistent improvements at both scales, while for Yoruba and Igbo, uniform interventions perform best. This can be explained by the much larger vocabulary available for Swahili (4197) than for the other languages, which motivates our next study of bilingual vocabulary size (see the next section). For Amharic, the LR-only baseline consistently performs best. This may reflect the greater script and typological distance of Amharic from English, which limits the effectiveness of lexical replacement and, more generally, of training with English data.

6 Ablation Studies

LINK introduces several design choices, including the bilingual vocabulary, replacement ratio, mix ratio, and domain relevance. To understand which of these factors most strongly influence target-language performance and to guide practitioners in applying our method, we conduct a series of ablation studies.

Model Sizes
We evaluate across five model sizes (137M–2.7B parameters) to verify that our method's benefits are not limited to a single scale. Both LINK_uni and LINK_domain consistently match or outperform the LR+HR baseline at every scale. Benefits emerge even at 137M and become increasingly pronounced at larger scales: the gap between LINK and the baseline widens as model size grows, with the largest improvements observed at 760M and 1.3B.
This trend is consistent across both close (German, French) and distant (Chinese, Hindi) language pairs, suggesting that larger models are better able to exploit the cross-lingual signal introduced by our method. Full results are reported in Appendix A.3.

Figure 4 LINK ARC-Easy accuracy on German with reduced bilingual vocabularies (1.3B model).

Reduced Vocabulary Size
To examine the effect of vocabulary size on transfer performance, we conduct experiments with German bilingual vocabularies reduced to 50% and 10% of the original size (each vocabulary is a subset of the larger one). Figure 4 presents the results on ARC-Easy for both uniform and domain-specific settings. Both settings show a similar pattern: the full vocabulary (100%) outperforms the LR+HR baseline, while reducing the vocabulary to 50% (around 24,000 word pairs) leads to a modest decline, which is amplified by further reduction to 10% (around 4,800 word pairs). These results indicate that vocabulary size is an important factor in transfer performance, with larger bilingual dictionaries yielding stronger gains. This is in line with the inconsistent gains we observe for some of the low-resource languages, since for the four low-resource languages we show, the vocabulary size is 9% or less of the German vocabulary. Additionally, we reduced the vocabulary even further, to 1%; the results of these experiments are provided in Appendix A.7. Encouragingly, obtaining vocabularies of this size is not a significant barrier, as Wiktionary alone provides at least 1,000 translation pairs for 186 languages and over 10,000 for 69.
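The nested reduced vocabularies used in this ablation can be sketched as follows. This is a minimal illustration that assumes the subsets are drawn uniformly at random; the text only states that each vocabulary is a subset of the larger one, so the sampling scheme, the function name `nested_vocab_subsets`, and the toy vocabulary are assumptions.

```python
import random


def nested_vocab_subsets(vocab, fractions=(1.0, 0.5, 0.1), seed=0):
    """Build nested subsets of a bilingual vocabulary.

    vocab: dict mapping an English word to its translation.
    Returns {fraction: sub_vocab}; every smaller subset is contained in every larger one.
    Uniform random sampling is an assumption, not a detail taken from the paper.
    """
    words = sorted(vocab)                  # fixed order keeps the shuffle reproducible
    random.Random(seed).shuffle(words)
    subsets = {}
    for frac in sorted(fractions, reverse=True):
        # Taking prefixes of a single shuffled order guarantees nesting.
        keep = words[: round(len(words) * frac)]
        subsets[frac] = {w: vocab[w] for w in keep}
    return subsets


# Toy vocabulary with placeholder entries (purely illustrative).
toy_vocab = {f"word{i}": f"translation{i}" for i in range(10)}
subsets = nested_vocab_subsets(toy_vocab)
print({frac: len(sub) for frac, sub in subsets.items()})
```

Drawing all subsets from one shuffled order keeps the comparison across vocabulary sizes controlled: the 10% vocabulary never contains a pair that the 50% vocabulary lacks.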
Repl  Mix    PPL_EN  PPL_DE  A-E_DE
LR           265.4   58.1    30.8
LR + HR      10.0    16.6    38.1
10    10     10.0    16.6    38.5
30    30     10.1    16.6    38.9
50    70     10.4    16.5    38.5
50    90     10.6    16.5    39.6
70    70     10.5    16.6    38.6
70    90     10.9    16.5    40.2

Table 4 LINK across varying replacement (Repl) and mixing (Mix) ratios (English–German, 1.3B).

Replacement & Mix Ratio
Table 4 presents an ablation over the replacement ratio and mixing ratio for the English–German pair, reporting English and German perplexity on the FineWeb and FineWeb-2 validation sets, as well as ARC Easy accuracy in German. Low values for both ratios (10/10, 30/30) have minimal effect on downstream performance. Increasing either ratio improves results, though the mixing ratio appears to have a slightly stronger effect; the best configuration (70/90) scores over 2 points above the non-augmented LR + HR baseline (40.2 vs. 38.1). Notably, target-language perplexity remains stable across all configurations, while English perplexity increases by almost 1 point (from 10.0 to 10.9), further motivating LINK_domain.

      Setup        PPL_EN  PPL_LR  A-E_LR
DE    LR + HR      10.0    16.6    38.1
      Uniform      16.9    16.8    40.3
      Domain       10.0    16.5    39.3
      Non-domain   11.9    16.7    40.5
FR    LR + HR      9.9     8.9     38.4
      Uniform      10.8    9.4     39.8
      Domain       10.1    8.9     40.4
      Non-domain   11.8    9.4     39.8
ZH    LR + HR      10.0    42.4    39.7
      Uniform      10.7    43.9    41.8
      Domain       10.2    42.1    41.5
      Non-domain   12.3    45.4    40.8
HI    LR + HR      9.9     6.0     33.3
      Uniform      11.2    5.8     35.0
      Domain       10.1    6.0     35.5
      Non-domain   12.4    5.8     35.9

Table 5 Comparison of LINK intervention strategies (1.3B). PPL_EN/PPL_LR: perplexity on FineWeb/FineWeb-2 validation; A-E: ARC Easy accuracy.
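To make the two ratios concrete, here is a minimal sketch of word-level substitution over an English corpus. It rests on several assumptions that the text above does not confirm: the mixing ratio is read as the fraction of eligible documents that receive interventions, the replacement ratio as the fraction of words swapped inside each such document, and tokenization is plain whitespace splitting. The name `link_substitute` and the `doc_filter` argument (standing in for a domain selector) are hypothetical. Ratios are given as fractions here, whereas Table 4 lists them as percentages.

```python
import random


def link_substitute(docs, vocab, replace_ratio, mix_ratio, seed=0, doc_filter=None):
    """Sketch of dictionary-based lexical substitution (not the authors' implementation).

    docs: list of English documents (strings).
    vocab: dict mapping a lowercase English word to its target-language translation.
    replace_ratio: fraction of words to swap in each selected document (assumed meaning).
    mix_ratio: fraction of eligible documents that are modified (assumed meaning).
    doc_filter: optional predicate restricting which documents are eligible,
        e.g. a hypothetical domain classifier for domain-specific interventions.
    """
    rng = random.Random(seed)
    eligible = [i for i, d in enumerate(docs) if doc_filter is None or doc_filter(d)]
    chosen = set(rng.sample(eligible, round(len(eligible) * mix_ratio)))

    out = []
    for i, doc in enumerate(docs):
        if i not in chosen:
            out.append(doc)  # document left untouched
            continue
        words = doc.split()  # whitespace tokenization is a simplification
        # Only words that have a dictionary entry can be swapped.
        candidates = [j for j, w in enumerate(words) if w.lower() in vocab]
        k = min(len(candidates), round(len(words) * replace_ratio))
        for j in rng.sample(candidates, k):
            words[j] = vocab[words[j].lower()]
        out.append(" ".join(words))
    return out


# Toy English-German vocabulary, purely illustrative.
toy_vocab = {"cat": "Katze", "water": "Wasser"}
print(link_substitute(["the cat drinks water"], toy_vocab, replace_ratio=1.0, mix_ratio=1.0))
```

With both ratios set to 1.0, every word that has a dictionary entry is swapped, so the toy sentence becomes "the Katze drinks Wasser". Passing a `doc_filter` limits the pool of documents that can be modified, which is one way to express the Uniform versus Domain distinction compared in Table 5.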
Non-domain specific interventions and indirect knowledge transfer
To disentangle the effect of topical alignment between the replaced text and the target language from the effect of lexical exposure itself, we introduce a Non-domain control experiment: interventions are applied to all data except for the domain-specific part. That is, replacements are applied only to $D^{\text{non-task}}_{\text{HR}} = D_{\text{HR}} \setminus D^{\text{task}}_{\text{HR}}$, while the task-relevant data remains unchanged. The final training set is then $D_{\text{train}} = D_{\text{LR}} \cup D^{\text{non-task}}_{\text{HR+LR}} \cup D^{\text{task}}_{\text{HR}}$. In our experimental setup from Section 5.2, intervening on all non-scientific data means intervening on roughly 95.5% of the original English dataset, i.e., more than in the domain-specific (4.5%) or even the uniform (90%) setup. As shown in Table 5, this massive increase in intervention volume leads to LR improvements over the LR + HR baseline for all four languages, with non-domain interventions outperforming both the uniform and domain-specific setups for Hindi and German. This suggests that the sheer volume of lexical exposure matters: the model benefits from encountering target-language vocabulary broadly across the training data, regardless of topical relevance. We observe an expected cost to English performance, but, strikingly, we also observe that our approach unlocks domain transfer even without directly intervening on the relevant data.

7 Conclusion

This work proposes a data-level intervention method for improving language model pretraining in languages with scarce data. Our approach requires only a bilingual vocabulary, making it applicable at near-zero cost to over a thousand languages and at pretraining scale.
Chunk 20 · 1,998 chars
applicable at near-zero cost to over a thousand languages and at pretraining scale. Across different target languages and five model sizes, our method consistently improves downstream performance particularly for languages that are distant from the high-resource language, and remains effective even when interventions are applied to only a small, domain-specific portion of the training data. These findings demonstrate a practical and scalable path toward building stronger language models for data-scarce languages. References BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Ro- man Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual language model, 2023. URL https://arxiv.org/abs/2211.05100. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641. 9 -- 9 of 22 -- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165. Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023. URL https://arxiv.org/abs/2303.12712. Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a
Chunk 21 · 1,995 chars
Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023. URL https://arxiv.org/abs/2303.12712. Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling for 250 high-and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4074–4096, 2024. Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Boda Sadallah, Abeer Kashar, Aitazaz Daud, Abosede Grace Olanihun, et al. Global piqa: Evaluating physical commonsense reasoning across 100+ languages and cultures. ArXiv, abs/2510.24081, 2025. URL https://api.semanticscholar.org/CorpusID:282401377. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018. Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. Curran Associates Inc., Red Hook, NY, USA, 2019. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747/. Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging Cross-lingual Structure in Pretrained Language Models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for
Chunk 22 · 1,993 chars
ps://aclanthology.org/2020.acl-main.747/. Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging Cross-lingual Structure in Pretrained Language Models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6022–6034, Online, July 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.536. URL https://aclanthology.org/2020. acl-main.536/. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, et al. Deepseek-v3 technical report, 2025. URL https://arxiv.org/abs/2412.19437. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423/. Marzieh Fadaee, Arianna Bisazza, and Christof Monz. Data augmentation for low-resource neural machine transla- tion. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 567–573, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-2090. URL https://aclanthology.org/P17-2090/. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja
Chunk 23 · 1,999 chars
inguistics (Volume 2: Short Papers), pp. 567–573, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-2090. URL https://aclanthology.org/P17-2090/. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrit- twieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, et al. Gemini: A family of highly capable multimodal models, 2025. URL https://arxiv.org/abs/2312.11805. David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered- importance sampling. In The Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=p6ncr0eTKE. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol 10 -- 10 of 22 -- Vinyals, Jack W. Rae, and Laurent Sifre. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088. Junjie Hu, Melvin Johnson, Orhan Firat, Aditya Siddhant, and Graham Neubig. Explicit alignment objectives for multilingual bidirectional encoders. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3633–3643, Online, June 2021. Association for Computational Linguistics. doi:
Chunk 24 · 1,998 chars
-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3633–3643, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021. naacl-main.284. URL https://aclanthology.org/2021.naacl-main.284/. Sosuke Kobayashi. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 452– 457, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2072. URL https://aclanthology.org/N18-2072/. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng- Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dav, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-lm: in search of the next generation of training sets for language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024a. Curran Associates
Chunk 25 · 1,998 chars
r, Alexandros G. Dimakis, Yair Carmon, Achal Dav, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-lm: in search of the next generation of training sets for language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024a. Curran Associates Inc. ISBN 9798331314385. Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, and Jiajun Chen. PreAlign: Boosting cross-lingual transfer by early establishment of multilingual alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 10246–10257, Miami, Florida, USA, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.572. URL https://aclanthology.org/2024.emnlp-main.572/. Danni Liu and Jan Niehues. Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15979– 15996, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.778. URL https://aclanthology.org/2025.acl-long.778/. Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, and Sayna Ebrahimi. ATLAS: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=0BkvUY61MX. Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in mind. In Phil Blunsom, Shay Cohen, Paramveer Dhillon, and Percy Liang (eds.), Proceedings of the 1st Workshop on Vector Space Modeling for Natural
Chunk 26 · 1,997 chars
ations, 2026. URL https://openreview.net/forum?id=0BkvUY61MX. Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in mind. In Phil Blunsom, Shay Cohen, Paramveer Dhillon, and Percy Liang (eds.), Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 151–159, Denver, Colorado, June 2015. Association for Computational Linguistics. doi: 10.3115/v1/W15-1521. URL https://aclanthology.org/W15-1521/. Partha Pakray, Alexander Gelbukh, and Sivaji Bandyopadhyay. Natural language processing applications for low- resource languages. Natural Language Processing, 31(2):183–197, 2025. doi: 10.1017/nlp.2024.33. Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144/. Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=n6SCkn2QaG. 11 -- 11 of 22 -- Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. Fineweb2: One pipeline to scale them all – adapting pre-training data processing to every language, 2025. URL https://arxiv.org/abs/2506.20920. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and
Chunk 27 · 1,994 chars
tina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. Fineweb2: One pipeline to scale them all – adapting pre-training data processing to every language, 2025. URL https://arxiv.org/abs/2506.20920. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381. Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual LMs with data constraints in the targeted language. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mo- hammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 19096– 19122, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.977. URL https://aclanthology.org/2025.findings-acl.977/. Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for learning to translate a new language from one grammar book. In Arxiv, 2023. Team Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat
Chunk 28 · 1,988 chars
u, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288. Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15894–15939, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.845. URL https://aclanthology.org/2024.acl-long.845/. Yana Veitsman and Mareike Hartmann. Recent advancements and challenges of Turkic Central Asian language processing. In Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, and Lasitha Uyangodage (eds.), Proceedings of the First Workshop on Language Models for Low-Resource Languages, pp. 309–324, Abu Dhabi, United Arab Emirates, January 2025. Association for Computational Linguistics. URL https://aclanthology.org/2025.loreslm-1.25/. Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. SwitchOut: an efficient data augmentation algorithm for neural machine translation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 856–861, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1100.
Chunk 29 · 1,993 chars
neural machine translation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 856–861, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1100. URL https://aclanthology.org/D18-1100/. Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, and Shujian Huang. Investigating and scaling up code-switching for multilingual language model pre-training. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 11032–11046, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.575. URL https:// aclanthology.org/2025.findings-acl.575/. Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6382–6388, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1670. URL https://aclanthology.org/D19-1670/. Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Der- czynski, Wei Xu, Alan Ritter, and Tim Baldwin (eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URL https://aclanthology.org/W17-4413/. 12 -- 12 of 22 -- Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet:
Chunk 30 · 1,996 chars
ed Text, pp. 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URL https://aclanthology.org/W17-4413/. 12 -- 12 of 22 -- Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 4003–4012, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020. lrec-1.494/. Wikimedia Foundation. Wiktionary: The free dictionary, 2025. URL https://www.wiktionary.org. Accessed: 2025. Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. Data noising as smoothing in neural network language models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=H1VyHY9gg. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021. naacl-main.41/. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin
Chunk 31 · 1,994 chars
onal Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021. naacl-main.41/. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021. naacl-main.41/. Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for multilingual transfer in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 7816–7836, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025. findings-acl.407. URL https://aclanthology.org/2025.findings-acl.407/. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472/. A Appendix A.1 Perplexity Results Table 6 reports validation perplexity on FineWeb-2 (target language, LR) and FineWeb (English, EN) for each setup across all four model sizes. At 137M, the smallest scale,
At 137M, the smallest scale, LINK_uni incurs a substantial English perplexity penalty (e.g., 36.49 vs. 29.53 for German), while LINK_domain keeps English perplexity close to LR+HR across all languages. Target-language perplexity remains comparable to the baseline for both strategies. Starting at 345M, LINK_domain consistently achieves the lowest or near-lowest target-language perplexity while keeping English perplexity close to LR+HR. LINK_uni matches or slightly exceeds the LR+HR English perplexity but shows competitive LR perplexity, with the best Hindi result. This pattern strengthens at 760M, where LINK_domain achieves the best LR perplexity for German and French, with English perplexity remaining within 0.1–0.2 points of LR+HR. LINK_uni also improves LR perplexity over the baseline (e.g., 19.09 vs. 20.37 for German) but at a larger English cost. At 1.3B, the same trends hold: LINK_domain closely matches or improves upon LR+HR on both sides, i.e., its English perplexity remains within 1 point of LR+HR, while its LR perplexity improves for German and French and matches LR+HR for Chinese. LINK_uni yields comparable LR perplexity but incurs a slightly larger English penalty. The LR-only baseline diverges for the larger models (e.g., German target-language perplexity exceeds 10^5 at 760M), highlighting the instability of training on scarce data alone.

| Size | Setup | DE LR | DE EN | FR LR | FR EN | ZH LR | ZH EN | HI LR | HI EN |
|---|---|---|---|---|---|---|---|---|---|
| 137M | LR (UB) | 24.54 | 86.90 | 13.38 | 94.03 | 54.62 | >10^2 | – | – |
| 137M | LR | 35.74 | >10^2 | 19.94 | >10^2 | >10^2 | >10^2 | 9.43 | >10^2 |
| 137M | LR + HR | 62.85 | 29.53 | 30.56 | 29.37 | >10^2 | 29.40 | 18.96 | 29.46 |
| 137M | LINK_uni | 68.43 | 36.49 | 32.41 | 36.34 | >10^2 | 34.79 | 20.11 | 39.40 |
| 137M | LINK_domain | 63.08 | 29.80 | 30.63 | 29.73 | >10^2 | 29.99 | 20.04 | 29.89 |
| 345M | LR (UB) | 10.02 | 32.15 | 6.61 | 32.95 | 14.63 | 67.43 | – | – |
| 345M | LR | 29.40 | >10^2 | 19.94 | >10^2 | >10^2 | >10^2 | 7.87 | >10^2 |
| 345M | LR + HR | 23.32 | 14.82 | 12.53 | 14.82 | 67.90 | 14.39 | 7.70 | 15.64 |
| 345M | LINK_uni | 23.37 | 14.83 | 12.81 | 14.85 | 67.39 | 14.39 | 7.68 | 15.60 |
| 345M | LINK_domain | 23.02 | 13.27 | 11.77 | 13.26 | 63.22 | 13.30 | 7.89 | 13.31 |
| 760M | LR (UB) | 8.39 | 25.55 | 5.75 | 25.94 | 11.09 | 51.25 | – | – |
| 760M | LR | >10^5 | >10^6 | 28.45 | >10^2 | >10^2 | >10^3 | 9.66 | >10^2 |
| 760M | LR + HR | 20.37 | 11.20 | 9.92 | 11.17 | 50.42 | 11.23 | 6.71 | 11.21 |
| 760M | LINK_uni | 19.09 | 12.52 | 10.41 | 12.53 | 52.91 | 12.14 | 6.50 | 13.00 |
| 760M | LINK_domain | 18.73 | 11.29 | 9.83 | 11.29 | 50.47 | 11.33 | 6.67 | 11.35 |
| 1.3B | LR (UB) | 7.33 | 21.42 | 5.17 | 21.68 | 9.12 | 40.76 | – | – |
| 1.3B | LR | 58.12 | >10^2 | 37.44 | >10^2 | >10^2 | >10^3 | 11.87 | >10^2 |
| 1.3B | LR + HR | 16.63 | 9.98 | 8.88 | 9.92 | 42.43 | 9.98 | 5.99 | 9.95 |
| 1.3B | LINK_uni | 16.52 | 10.84 | 9.42 | 10.83 | 43.86 | 10.67 | 5.77 | 11.23 |
| 1.3B | LINK_domain | 16.49 | 10.03 | 8.86 | 10.08 | 42.14 | 10.17 | 5.96 | 10.08 |

Table 6 Validation perplexity on FineWeb (EN) and FineWeb-2 (LR) across model sizes. Bold indicates lowest perplexity among LR (S), LR+HR, LINK_uni, and LINK_domain.

A.2 Global MMLU

Global MMLU scores remain close to the random baseline of 25.0 across all setups and languages, with most values falling in the 25–28 range. At 1.3B scale, the models lack sufficient capacity to perform meaningfully on this knowledge-intensive benchmark. German and Chinese show the largest gains from bilingual training (up to 27.6), while low-resource languages remain near chance level regardless of the intervention strategy. These results suggest that Global MMLU is not sensitive enough to capture the differences between our setups at this model scale, and we therefore focus our analysis on the other benchmarks.
| Size | Setup | High-resource: DE | FR | ZH | HI | Low-resource: AM | IG | SW | YO |
|---|---|---|---|---|---|---|---|---|---|
| 1.3B | LR | 25.8 | – | 26.0 | 24.9 | 25.8 | 24.9 | 25.4 | 25.4 |
| 1.3B | LR + HR | 27.5 | 25.9 | 27.2 | 23.7 | 25.9 | 26.3 | 26.1 | 25.6 |
| 1.3B | LINK_uni | 27.5 | 25.9 | 27.6 | 25.0 | 25.7 | 26.3 | 26.7 | 25.8 |
| 1.3B | LINK_domain | 27.4 | 27.3 | 27.6 | 23.8 | 25.5 | 26.7 | 26.6 | 25.7 |
| 345M | LR | 26.2 | 23.9 | 25.8 | 23.3 | 25.5 | 25.9 | 25.4 | 25.7 |
| 345M | LR + HR | 27.1 | 24.1 | 26.5 | 23.1 | 25.0 | 25.8 | 26.0 | 25.5 |
| 345M | LINK_uni | 26.1 | 24.5 | 26.9 | 22.9 | 25.4 | 25.7 | 26.2 | 26.0 |
| 345M | LINK_domain | 26.4 | 26.0 | 26.6 | 22.9 | 25.1 | 25.9 | 25.6 | 25.6 |

Table 7 Global MMLU results in the target language across all model sizes.

A.3 Model Scaling

This section reports downstream benchmark results broken down by model size. Tables 8, 9, and 10 present individual benchmark scores for the 760M, 345M, and 137M models, respectively. Figure 5 summarizes the average zero-shot accuracy on target-language benchmarks across all four model sizes.

Figure 5 Average benchmark accuracy across four model sizes and all languages.

| Language | Setup | LR A-E | LR A-C | LR HS | LR LB | LR PQ | LR SQ | LR WG | LR Avg | HR A-E | HR A-C | HR HS | HR LB | HR PQ | HR SQ | HR WG | HR Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| German | LR (UB) | 34.2 | 26.1 | 37.9 | 26.7 | 62.4 | 55.2 | 50.4 | 41.8 | 32.0 | 22.2 | 30.5 | 24.6 | 58.2 | 53.1 | 50.5 | 38.7 |
| German | LR | 31.5 | 23.4 | 28.5 | 9.1 | 56.0 | 45.5 | 50.4 | 34.9 | 27.5 | 24.5 | 26.7 | 2.6 | 49.9 | 40.8 | 49.2 | 31.6 |
| German | LR + HR | 34.7 | 24.4 | 32.6 | 21.8 | 55.3 | 62.0 | 51.9 | 40.4 | 49.5 | 26.8 | 49.4 | 45.4 | 71.1 | 72.9 | 54.3 | 52.8 |
| German | LINK_uni | 36.4 | 24.1 | 34.1 | 23.6 | 59.1 | 64.6 | 53.0 | 42.1 | 44.6 | 24.5 | 46.1 | 43.2 | 69.4 | 68.4 | 52.9 | 49.9 |
| German | LINK_domain | 36.8 | 25.5 | 35.2 | 23.7 | 58.8 | 63.0 | 50.7 | 41.9 | 47.4 | 26.4 | 49.4 | 46.1 | 71.1 | 72.9 | 54.5 | 52.5 |
| French | LR (UB) | 39.9 | 25.9 | 41.9 | 29.1 | 64.2 | 61.6 | 55.1 | 45.4 | 35.6 | 20.7 | 30.9 | 23.6 | 58.9 | 64.8 | 48.4 | 40.4 |
| French | LR | 31.4 | 22.9 | 30.1 | 10.7 | 54.8 | 51.8 | 51.9 | 36.2 | 28.5 | 21.7 | 27.4 | 4.7 | 53.1 | 45.8 | 50.9 | 33.2 |
| French | LR + HR | 34.2 | 24.0 | 35.7 | 25.6 | 58.6 | 61.0 | 50.9 | 41.4 | 47.3 | 26.7 | 49.6 | 47.1 | 71.7 | 70.7 | 54.5 | 52.5 |
| French | LINK_uni | 37.4 | 25.2 | 36.9 | 27.0 | 60.3 | 63.2 | 52.3 | 43.2 | 43.9 | 24.5 | 45.6 | 42.8 | 69.4 | 69.6 | 54.4 | 50.0 |
| French | LINK_domain | 36.3 | 25.6 | 36.5 | 26.5 | 59.0 | 61.6 | 52.4 | 42.6 | 46.1 | 24.8 | 49.8 | 46.3 | 71.8 | 70.1 | 53.4 | 51.7 |
| Chinese | LR (UB) | 43.0 | 25.4 | 36.4 | – | 60.1 | 76.7 | 51.4 | 48.8 | 31.9 | 20.5 | 27.9 | 12.1 | 54.0 | 59.3 | 48.9 | 36.4 |
| Chinese | LR | 32.9 | 24.2 | 30.6 | – | 55.6 | 63.5 | 51.2 | 43.0 | 26.9 | 23.1 | 26.7 | 1.6 | 52.5 | 39.5 | 51.1 | 31.6 |
| Chinese | LR + HR | 37.2 | 25.1 | 33.8 | – | 57.0 | 74.0 | 52.1 | 46.5 | 48.0 | 25.4 | 49.8 | 45.6 | 72.4 | 71.7 | 53.8 | 52.4 |
| Chinese | LINK_uni | 39.8 | 25.6 | 35.3 | – | 57.7 | 73.7 | 51.6 | 47.3 | 47.0 | 25.2 | 46.7 | 42.2 | 69.6 | 70.9 | 52.6 | 50.6 |
| Chinese | LINK_domain | 38.3 | 25.6 | 35.2 | – | 56.9 | 75.3 | 50.5 | 47.0 | 47.4 | 26.4 | 49.3 | 45.2 | 70.8 | 71.8 | 52.8 | 51.9 |
| Hindi | LR | 32.1 | 23.4 | 27.6 | – | 53.0 | – | 46.9 | 36.6 | 27.8 | 20.6 | 26.4 | 2.3 | 52.3 | 38.7 | 49.8 | 31.1 |
| Hindi | LR + HR | 32.4 | 22.3 | 28.7 | – | 52.0 | – | 49.1 | 36.9 | 48.4 | 26.4 | 49.8 | 46.8 | 71.6 | 73.1 | 52.6 | 52.7 |
| Hindi | LINK_uni | 33.0 | 24.1 | 30.1 | – | 55.0 | – | 51.6 | 38.8 | 43.8 | 23.8 | 44.6 | 40.5 | 68.4 | 69.2 | 52.1 | 48.9 |
| Hindi | LINK_domain | 33.5 | 24.5 | 29.7 | – | 53.0 | – | 51.2 | 38.4 | 46.5 | 25.2 | 49.0 | 46.0 | 71.2 | 69.7 | 51.2 | 51.3 |

Table 8 760M model results.

| Language | Setup | LR A-E | LR A-C | LR HS | LR LB | LR PQ | LR SQ | LR WG | LR Avg | HR A-E | HR A-C | HR HS | HR LB | HR PQ | HR SQ | HR WG | HR Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| German | LR (UB) | 32.1 | 24.8 | 33.6 | 24.0 | 60.0 | 49.5 | 53.0 | 39.6 | 30.2 | 22.1 | 28.5 | 18.4 | 56.3 | 49.2 | 52.4 | 36.7 |
| German | LR | 32.3 | 24.0 | 28.6 | 15.7 | 54.9 | 51.3 | 51.5 | 36.9 | 29.6 | 20.8 | 26.6 | 6.2 | 51.4 | 50.9 | 50.1 | 33.7 |
| German | LR + HR | 33.1 | 24.5 | 30.6 | 20.4 | 55.6 | 61.6 | 50.8 | 39.5 | 40.6 | 23.2 | 37.8 | 34.9 | 67.0 | 65.3 | 51.2 | 45.7 |
| German | LINK_uni | 33.6 | 23.5 | 30.5 | 19.4 | 55.8 | 61.7 | 53.6 | 39.7 | 41.5 | 24.2 | 38.0 | 34.6 | 66.2 | 66.1 | 50.4 | 45.9 |
| German | LINK_domain | 33.2 | 24.7 | 30.9 | 19.6 | 54.8 | 61.4 | 52.7 | 39.6 | 43.5 | 24.0 | 41.0 | 38.8 | 67.6 | 66.5 | 54.4 | 48.0 |
| French | LR (UB) | 35.5 | 24.5 | 37.1 | 23.9 | 62.2 | 61.8 | 52.7 | 42.5 | 33.1 | 20.6 | 28.7 | 16.9 | 56.1 | 58.1 | 50.2 | 37.7 |
| French | LR | 32.7 | 24.5 | 30.1 | 12.5 | 56.7 | 50.5 | 51.5 | 36.9 | 29.5 | 21.8 | 26.7 | 7.3 | 52.2 | 47.8 | 48.8 | 33.4 |
| French | LR + HR | 35.5 | 23.7 | 32.4 | 22.6 | 57.0 | 59.6 | 52.9 | 40.5 | 41.5 | 23.1 | 37.9 | 35.8 | 66.9 | 66.2 | 50.5 | 46.0 |
| French | LINK_uni | 34.7 | 24.8 | 32.5 | 21.5 | 56.1 | 59.9 | 51.0 | 40.1 | 40.5 | 22.9 | 38.0 | 35.0 | 66.4 | 65.8 | 51.5 | 45.7 |
| French | LINK_domain | 33.3 | 24.3 | 32.0 | 22.3 | 56.5 | 58.0 | 49.9 | 39.5 | 43.9 | 24.1 | 40.8 | 38.1 | 67.6 | 68.5 | 51.7 | 47.8 |
| Chinese | LR (UB) | 39.7 | 25.1 | 33.5 | – | 57.8 | 74.3 | 50.1 | 46.8 | 31.4 | 21.1 | 27.6 | 9.7 | 53.8 | 56.2 | 51.5 | 35.9 |
| Chinese | LR | 31.4 | 25.0 | 30.2 | – | 55.2 | 65.3 | 50.6 | 43.0 | 28.1 | 20.1 | 26.8 | 2.7 | 51.7 | 41.6 | 48.7 | 31.4 |
| Chinese | LR + HR | 36.5 | 24.0 | 32.4 | – | 55.5 | 72.4 | 48.5 | 44.9 | 42.0 | 23.5 | 38.7 | 36.3 | 66.3 | 66.5 | 51.5 | 46.4 |
| Chinese | LINK_uni | 36.6 | 24.2 | 32.1 | – | 57.0 | 71.5 | 48.1 | 44.9 | 41.5 | 24.4 | 38.5 | 36.3 | 67.7 | 65.4 | 49.8 | 46.2 |
| Chinese | LINK_domain | 35.9 | 24.4 | 32.2 | – | 55.1 | 70.9 | 49.9 | 44.7 | 42.9 | 24.4 | 40.7 | 38.2 | 68.5 | 65.0 | 51.9 | 47.4 |
| Hindi | LR | 31.0 | 23.1 | 27.9 | – | 50.0 | – | 50.9 | 36.6 | 29.6 | 20.3 | 26.2 | 3.4 | 52.7 | 42.2 | 49.7 | 32.0 |
| Hindi | LR + HR | 32.1 | 23.5 | 28.9 | – | 47.0 | – | 51.6 | 36.6 | 38.9 | 24.1 | 36.6 | 32.8 | 66.5 | 65.5 | 51.1 | 45.1 |
| Hindi | LINK_uni | 31.9 | 25.0 | 28.6 | – | 49.0 | – | 51.6 | 37.2 | 40.7 | 21.7 | 36.5 | 32.9 | 66.0 | 63.2 | 50.1 | 44.4 |
| Hindi | LINK_domain | 32.5 | 23.1 | 28.4 | – | 56.0 | – | 49.8 | 38.0 | 42.3 | 23.4 | 40.5 | 38.7 | 67.8 | 67.8 | 51.5 | 47.4 |

Table 9 345M model results.

| Language | Setup | LR A-E | LR A-C | LR HS | LR LB | LR PQ | LR SQ | LR WG | LR Avg | HR A-E | HR A-C | HR HS | HR LB | HR PQ | HR SQ | HR WG | HR Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| German | LR (UB) | 27.2 | 23.6 | 26.6 | 12.2 | 53.5 | 41.9 | 50.0 | 33.6 | 28.5 | 21.5 | 27.0 | 5.4 | 52.0 | 41.4 | 49.6 | 32.2 |
| German | LR | 29.7 | 24.2 | 26.9 | 11.6 | 52.5 | 49.5 | 49.9 | 34.9 | 30.6 | 23.7 | 27.1 | 4.2 | 51.2 | 50.8 | 50.5 | 34.0 |
| German | LR + HR | 27.0 | 23.3 | 27.0 | 6.3 | 50.2 | 54.3 | 48.7 | 33.8 | 32.1 | 20.2 | 27.4 | 15.7 | 59.2 | 57.2 | 49.7 | 37.4 |
| German | LINK_uni | 27.8 | 24.1 | 26.6 | 6.5 | 50.9 | 52.2 | 49.6 | 34.0 | 30.9 | 20.1 | 26.9 | 13.0 | 55.2 | 52.4 | 49.1 | 35.3 |
| German | LINK_domain | 28.2 | 21.7 | 26.5 | 6.5 | 49.7 | 52.7 | 50.9 | 33.8 | 31.9 | 20.0 | 27.4 | 14.4 | 57.5 | 53.7 | 47.9 | 36.1 |
| French | LR (UB) | 29.1 | 21.9 | 27.6 | 11.8 | 53.6 | 52.1 | 50.5 | 35.2 | 28.4 | 22.7 | 27.4 | 6.5 | 52.6 | 48.6 | 51.2 | 33.9 |
| French | LR | 28.4 | 23.1 | 27.2 | 10.9 | 52.5 | 49.5 | 50.4 | 34.6 | 29.0 | 22.8 | 27.1 | 4.2 | 52.2 | 46.7 | 51.9 | 33.4 |
| French | LR + HR | 27.9 | 23.1 | 27.3 | 8.8 | 50.6 | 53.8 | 51.9 | 34.8 | 32.6 | 20.3 | 27.6 | 16.1 | 57.6 | 56.8 | 49.8 | 37.3 |
| French | LINK_uni | 28.5 | 23.9 | 27.2 | 9.7 | 51.7 | 52.5 | 49.1 | 34.8 | 30.8 | 20.9 | 26.9 | 13.7 | 55.1 | 51.7 | 51.9 | 35.9 |
| French | LINK_domain | 28.5 | 22.1 | 27.4 | 8.6 | 51.3 | 54.6 | 50.1 | 34.6 | 32.6 | 19.4 | 27.7 | 15.2 | 57.1 | 54.6 | 48.1 | 36.4 |
| Chinese | LR (UB) | 32.4 | 22.9 | 28.2 | – | 52.6 | 65.8 | 51.3 | 42.2 | 29.1 | 21.4 | 27.0 | 3.5 | 52.7 | 45.3 | 49.4 | 32.6 |
| Chinese | LR | 29.9 | 22.5 | 28.0 | – | 50.7 | 52.8 | 51.4 | 39.2 | 28.4 | 24.0 | 26.8 | 1.8 | 50.8 | 38.5 | 48.7 | 31.3 |
| Chinese | LR + HR | 28.7 | 22.2 | 27.6 | – | 52.3 | 61.9 | 50.5 | 40.5 | 33.1 | 20.6 | 27.6 | 16.3 | 58.8 | 53.3 | 48.9 | 37.0 |
| Chinese | LINK_uni | 28.8 | 22.8 | 27.2 | – | 51.0 | 62.1 | 52.4 | 40.7 | 31.1 | 21.6 | 27.5 | 13.2 | 56.9 | 51.5 | 50.1 | 36.0 |
| Chinese | LINK_domain | 30.4 | 22.9 | 27.3 | – | 51.0 | 62.4 | 48.9 | 40.5 | 31.6 | 21.2 | 27.3 | 14.8 | 57.7 | 53.8 | 49.7 | 36.6 |
| Hindi | LR | 29.8 | 22.4 | 26.8 | – | 49.0 | – | 49.6 | 35.5 | 28.3 | 22.6 | 26.4 | 2.9 | 52.5 | 40.5 | 49.9 | 31.9 |
| Hindi | LR + HR | 29.0 | 24.7 | 27.3 | – | 50.0 | – | 49.6 | 36.1 | 32.1 | 20.6 | 27.5 | 15.8 | 58.3 | 57.2 | 49.9 | 37.4 |
| Hindi | LINK_uni | 28.7 | 25.5 | 26.9 | – | 53.0 | – | 49.2 | 36.6 | 28.9 | 21.9 | 27.2 | 11.8 | 55.3 | 48.1 | 50.1 | 34.8 |
| Hindi | LINK_domain | 29.2 | 24.2 | 27.4 | – | 53.0 | – | 50.0 | 36.8 | 31.9 | 19.9 | 27.5 | 15.3 | 58.1 | 53.7 | 47.4 | 36.2 |

Table 10 137M model results.

Additionally, we conducted experiments on German for the 2.7B model; the results are provided in Table 11. The gains of our methods are even more pronounced than for smaller models: for example, LINK_domain achieves 45.2% accuracy on ARC-Easy, surpassing not only all baselines but even the upper-bound model trained on significantly larger amounts of German data.
| Language | Setup | LR A-E | LR A-C | LR HS | LR LB | LR PQ | LR SQ | LR WG | LR Avg | HR A-E | HR A-C | HR HS | HR LB | HR PQ | HR SQ | HR WG | HR Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| German | LR (UB) | 44.6 | 27.6 | 49.7 | 33.1 | 66.9 | 68.4 | 53.9 | 49.2 | 37.5 | 25.0 | 41.4 | 36.8 | 64.9 | 67.9 | 53.8 | 46.8 |
| German | LR | 32.3 | 23.7 | 28.6 | 9.7 | 54.7 | 48.8 | 51.9 | 35.7 | 29.0 | 22.7 | 27.2 | 3.2 | 52.1 | 42.2 | 50.8 | 32.4 |
| German | LR + HR | 42.0 | 28.7 | 44.2 | 28.6 | 62.1 | 66.0 | 54.6 | 46.6 | 59.8 | 34.1 | 67.4 | 60.6 | 76.6 | 82.5 | 60.5 | 63.1 |
| German | LINK_uni | 42.0 | 26.5 | 46.0 | 29.0 | 61.8 | 70.2 | 54.7 | 47.2 | 55.4 | 31.8 | 64.3 | 58.5 | 75.9 | 77.6 | 59.9 | 60.5 |
| German | LINK_domain | 45.2 | 26.6 | 46.3 | 28.9 | 62.5 | 68.6 | 54.6 | 47.6 | 60.6 | 33.5 | 68.0 | 60.9 | 77.3 | 81.3 | 60.4 | 63.1 |

Table 11 2.7B model results.

A.4 Full Low-Resource Results

This section reports downstream benchmark results for the true low-resource languages (Amharic, Igbo, Yoruba, and Swahili). The availability of target-language evaluation benchmarks varies by language: Amharic and Swahili have ARC-Easy, PiQA, and WinoGrande; Igbo has PiQA and WinoGrande; and Yoruba has ARC-Easy and PiQA translations. The results are presented in Tables 12 and 13. Because validation data is limited, we report results for the last checkpoint.

| Language | Setup | LR A-E | LR PQ | LR WG | LR Avg | HR A-E | HR A-C | HR HS | HR LB | HR PQ | HR SQ | HR WG | HR Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amharic | LR | 31.2 | 52.0 | 51.1 | 44.8 | 26.9 | 22.6 | 26.4 | 0.3 | 52.7 | 29.2 | 46.9 | 29.3 |
| Amharic | LR + HR | 23.0 | 52.0 | 51.1 | 42.0 | 53.3 | 29.6 | 57.2 | 51.3 | 74.0 | 75.9 | 57.0 | 56.9 |
| Amharic | LINK_uni | 24.6 | 51.0 | 49.1 | 41.6 | 51.5 | 29.7 | 56.0 | 51.1 | 74.0 | 73.8 | 55.2 | 55.9 |
| Amharic | LINK_domain | 25.9 | 47.0 | 50.0 | 41.0 | 52.5 | 27.7 | 57.1 | 52.8 | 74.4 | 76.9 | 54.6 | 56.6 |
| Igbo | LR | – | 69.0 | 51.1 | 60.0 | 27.2 | 22.9 | 25.0 | 0.2 | 50.8 | 32.5 | 50.1 | 29.8 |
| Igbo | LR + HR | – | 72.0 | 51.2 | 61.6 | 53.4 | 29.2 | 57.7 | 53.1 | 73.1 | 76.1 | 57.3 | 57.1 |
| Igbo | LINK_uni | – | 71.0 | 49.7 | 60.4 | 52.6 | 28.2 | 57.5 | 52.2 | 74.9 | 76.1 | 56.0 | 56.8 |
| Igbo | LINK_domain | – | 71.0 | 49.7 | 60.4 | 53.6 | 28.2 | 57.6 | 52.2 | 73.7 | 75.9 | 56.7 | 56.8 |
| Yoruba | LR | 29.4 | 47.0 | – | 38.2 | 25.8 | 21.8 | 25.9 | 0.1 | 51.3 | 28.0 | 50.7 | 29.1 |
| Yoruba | LR + HR | 24.9 | 55.0 | – | 40.0 | 53.0 | 27.7 | 56.5 | 51.7 | 74.1 | 76.4 | 54.5 | 56.3 |
| Yoruba | LINK_uni | 24.5 | 57.0 | – | 40.8 | 52.7 | 29.0 | 56.8 | 50.8 | 74.3 | 78.2 | 55.7 | 56.8 |
| Yoruba | LINK_domain | 23.5 | 54.0 | – | 38.8 | 53.0 | 28.2 | 56.1 | 52.6 | 73.4 | 75.7 | 56.8 | 56.5 |
| Swahili | LR | 32.2 | 61.0 | 50.3 | 47.8 | 30.6 | 21.6 | 26.5 | 2.2 | 52.6 | 40.0 | 50.0 | 31.9 |
| Swahili | LR + HR | 28.1 | 67.0 | 50.5 | 48.5 | 55.0 | 29.0 | 57.5 | 52.9 | 74.6 | 78.5 | 56.0 | 57.6 |
| Swahili | LINK_uni | 27.7 | 68.0 | 49.8 | 48.5 | 52.0 | 27.3 | 55.1 | 51.2 | 73.2 | 79.2 | 56.0 | 56.3 |
| Swahili | LINK_domain | 28.7 | 70.0 | 51.6 | 50.1 | 53.5 | 29.0 | 57.5 | 52.9 | 73.4 | 77.1 | 55.6 | 57.0 |

Table 12 Results for low-resource languages (1.3B models). Results reported at the last checkpoint. A-E: ARC-Easy, PQ: PiQA, WG: WinoGrande.

To better analyze the performance of our method on truly low-resource languages, we ran an additional set of experiments with an extended number of training steps. We kept the amount of available data the same as in the main experiments but doubled the number of training steps: the 1.3B models were trained for 200,000 steps, and the 345M models were trained for 60,000 steps. We report results across all checkpoints in Figures 6 and 7.

One issue with low-resource evaluations is benchmark noise, caused by the limited number of available benchmarks for these languages as well as their small size. For example, the GlobalPiQA benchmark (Chang et al., 2025) we used for this evaluation contains only 100 samples per language. While target-language benchmarks for high-resource languages such as German exhibit low variance (e.g., 0.6–1.1pp between consecutive checkpoints), low-resource benchmarks are 2–3× noisier: Igbo PiQA fluctuates by up to 16 points between consecutive checkpoints, Yoruba PiQA by up to 10 points, and Amharic PiQA by up to 6 points.
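The scale of this noise can be estimated with a back-of-the-envelope calculation. The sketch below assumes each benchmark item behaves like an independent Bernoulli trial, so accuracy measured on n items has binomial standard error sqrt(p(1-p)/n). The function name and the larger sample sizes are illustrative and not taken from the paper.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate (accuracy in [0, 1])."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# 100 matches the per-language benchmark size mentioned above;
# 1,000 and 10,000 are hypothetical sizes for comparison.
for n in (100, 1_000, 10_000):
    se = accuracy_standard_error(0.5, n)
    print(f"n={n:>6}: one standard error = {100 * se:.1f} points")
```

With 100 items and accuracy near 50%, one standard error is 5 points. As a rough guide under this assumption, swings of several points on a 100-sample benchmark are within what sampling noise alone can produce.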
| Language | Setup | LR A-E | LR PQ | LR WG | LR Avg | HR A-E | HR A-C | HR HS | HR LB | HR PQ | HR SQ | HR WG | HR Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amharic | LR | 24.0 | 52.0 | 50.8 | 42.3 | 28.5 | 22.8 | 26.4 | 2.5 | 50.9 | 38.3 | 49.2 | 31.2 |
| Amharic | LR + HR | 25.9 | 48.0 | 50.1 | 41.3 | 44.4 | 24.8 | 41.4 | 37.8 | 67.1 | 67.1 | 52.0 | 47.8 |
| Amharic | LINK_uni | 27.5 | 50.0 | 49.6 | 42.4 | 42.5 | 24.0 | 40.1 | 36.2 | 67.8 | 65.8 | 50.3 | 46.7 |
| Amharic | LINK_domain | 25.2 | 50.0 | 49.8 | 41.7 | 44.1 | 23.9 | 41.2 | 38.7 | 68.8 | 67.6 | 50.1 | 47.8 |
| Igbo | LR | – | 58.0 | 48.5 | 53.2 | 27.3 | 24.1 | 25.9 | 0.7 | 49.9 | 37.1 | 49.7 | 30.7 |
| Igbo | LR + HR | – | 65.0 | 50.0 | 57.5 | 43.1 | 24.0 | 41.3 | 38.6 | 68.8 | 64.9 | 51.3 | 47.4 |
| Igbo | LINK_uni | – | 68.0 | 50.3 | 59.1 | 43.2 | 23.6 | 40.8 | 37.3 | 68.1 | 66.9 | 52.1 | 47.4 |
| Igbo | LINK_domain | – | 67.0 | 49.2 | 58.1 | 44.3 | 24.3 | 40.9 | 38.8 | 68.1 | 66.8 | 50.4 | 47.7 |
| Yoruba | LR | 23.9 | 53.0 | – | 38.5 | 25.9 | 20.9 | 25.1 | 0.2 | 49.6 | 29.7 | 49.3 | 28.7 |
| Yoruba | LR + HR | 22.7 | 47.0 | – | 34.9 | 44.4 | 23.4 | 40.4 | 38.3 | 67.8 | 67.2 | 53.6 | 47.9 |
| Yoruba | LINK_uni | 24.5 | 55.0 | – | 39.8 | 41.8 | 24.6 | 40.4 | 37.2 | 67.8 | 66.7 | 51.5 | 47.1 |
| Yoruba | LINK_domain | 22.5 | 44.0 | – | 33.2 | 43.1 | 24.7 | 40.9 | 37.5 | 69.0 | 69.4 | 51.3 | 48.0 |
| Swahili | LR | 29.3 | 61.0 | 51.8 | 47.4 | 31.1 | 20.6 | 26.9 | 9.3 | 52.2 | 48.6 | 48.9 | 33.9 |
| Swahili | LR + HR | 29.3 | 61.0 | 49.0 | 46.4 | 44.1 | 22.9 | 41.0 | 38.1 | 69.4 | 68.1 | 52.6 | 48.0 |
| Swahili | LINK_uni | 27.1 | 64.0 | 49.4 | 46.8 | 42.1 | 23.5 | 39.3 | 37.0 | 67.8 | 67.1 | 52.7 | 47.1 |
| Swahili | LINK_domain | 28.5 | 64.0 | 49.6 | 47.4 | 43.2 | 23.6 | 40.6 | 37.7 | 68.1 | 67.9 | 51.9 | 47.6 |

Table 13 Results for low-resource languages (345M models). Results reported at the last checkpoint. A-E: ARC-Easy, PQ: PIQA, WG: WinoGrande.

Figure 6 1.3B models, extended (2×) runs up to 200K steps: per-checkpoint accuracy on low-resource benchmarks (M-ARC Easy, M-PIQA, M-WinoGrande) for Amharic, Igbo, Yoruba, and Swahili. Shaded bands show standard deviation.

Figure 7 345M models, extended (2×) runs up to 60K steps: per-checkpoint accuracy on low-resource benchmarks (M-ARC Easy, M-PIQA, M-WinoGrande) for Amharic, Igbo, Yoruba, and Swahili. Shaded bands show standard deviation.

A.5 Clustering Results

Figure 8 shows the results of k-means clustering (32 clusters) of the English FineWeb training data. Most clusters contain around 2–6B tokens, apart from Cluster 12, which is almost empty (0.21M tokens, i.e., ∼0.0002% of the 118B total tokens). Cluster 5 holds around 5.3B tokens (5.55M samples), which corresponds to 4.46% of tokens (3.76% of samples) of the whole FineWeb dataset.

Figure 8 Distribution of English FineWeb training tokens across the 32 k-means clusters.

Next, we clustered the ARC-Easy and ARC-Challenge benchmarks (both English and German versions) using the same centroids. The results are provided in Figure 9. Both benchmarks are heavily concentrated in a single cluster, Cluster 5, which accounts for more than half of the samples of each benchmark: 57.24% of samples from ARC-Easy (English), 57.67% of samples from ARC-Easy (German), 58.87% of samples from ARC-Challenge (English), and 59.11% of samples from ARC-Challenge (German). Manual inspection of the resulting cluster further confirms the initial hypothesis that scientific knowledge (which is mostly represented in the ARC datasets) is concentrated in one cluster, which allows us to use it for the LINK_domain experiments.

Figure 9 Results of clustering of ARC-Easy and ARC-Challenge validation datasets.
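Reusing centroids amounts to nearest-centroid assignment: each benchmark sample is embedded and given the label of the closest centroid fitted on the training data. The following is a minimal sketch, assuming Euclidean distance and precomputed embedding vectors. The embedding model is not specified in this section, and the toy 2-D vectors and function names are hypothetical.

```python
from collections import Counter


def nearest_centroid(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)),
               key=lambda k: squared_distance(vector, centroids[k]))


def cluster_shares(vectors, centroids):
    """Fraction of vectors assigned to each centroid index."""
    if not vectors:
        raise ValueError("need at least one vector")
    counts = Counter(nearest_centroid(v, centroids) for v in vectors)
    return {k: counts.get(k, 0) / len(vectors) for k in range(len(centroids))}


# Toy stand-ins: three fixed centroids and four "benchmark" embeddings.
toy_centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
toy_benchmark = [(4.8, 5.1), (5.2, 4.9), (0.1, 0.2), (5.0, 5.5)]

shares = cluster_shares(toy_benchmark, toy_centroids)
dominant = max(shares, key=shares.get)
print(f"dominant cluster: {dominant}, share: {shares[dominant]:.0%}")
```

Because the centroids stay fixed, the benchmark shares are directly comparable to the training-data shares per cluster, which is what makes a single dominant cluster usable as a domain selector.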
A.6 Target vs Actual Replacements

Figure 10 shows the relationship between the target and actual replacement ratios for each language. Since replacements can only be performed for words present in the bilingual vocabulary, the actual replacement ratio is bounded by vocabulary coverage. For German, which has the largest vocabulary (48,195 entries), the actual ratio closely tracks the target up to 30%, after which it plateaus around 55–57%. We conducted an extensive ablation over replacement ratios for German (Figure 10, left), and based on the finding that actual replacements saturate beyond a target ratio of 70%, we evaluated only the 50% and 70% settings for the remaining languages.

Figure 10 Target vs. actual replacement ratio for each language. The dashed line indicates the ideal case where all targeted words are replaced. The gap between target and actual ratios reflects bilingual vocabulary coverage: languages with smaller vocabularies (e.g., Hindi) saturate at lower actual replacement rates.

French, Chinese, and Hindi exhibit a similar ceiling effect at target ratios of 50% and 70%, with actual ratios reaching approximately 47–50%, 48–55%, and 49–55%, respectively. The gap between the target and actual ratios varies across languages, reflecting differences in vocabulary size. Based on these results, we set the target replacement ratio to 70% for all experiments, as this effectively maximizes the number of replacements achievable with our vocabularies.

The vocabulary-coverage ceiling is far more pronounced for truly low-resource languages. Figure 11 reports the actual per-document replacement rate at the fixed target of 70% across all eight languages used in this work. The four data-constrained high-resource languages reach the saturation level discussed above (50–56%), whereas the four truly low-resource languages fall dramatically below the target: Swahili reaches only 29.6%, Amharic 13.3%, Yoruba 8.2%, and Igbo 6.1%.
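The gap between target and actual ratios can be illustrated with a small sketch of dictionary-based substitution. It assumes whitespace tokenization, lowercase dictionary lookup, and uniform sampling among replaceable positions. The exact sampling procedure is not spelled out in this appendix, so the function, the toy dictionary, and the toy document are all hypothetical.

```python
import random


def substitute_words(words, bilingual_vocab, target_ratio, seed=0):
    """Swap about target_ratio of the words for dictionary translations.

    Only words present in bilingual_vocab can be swapped, so the achieved
    ratio is capped by the share of covered words in the document.
    """
    rng = random.Random(seed)
    # Positions whose lowercased word has a dictionary entry.
    replaceable = [i for i, w in enumerate(words) if w.lower() in bilingual_vocab]
    wanted = round(target_ratio * len(words))
    chosen = rng.sample(replaceable, min(wanted, len(replaceable)))
    replaced = list(words)
    for i in chosen:
        replaced[i] = bilingual_vocab[words[i].lower()]
    actual_ratio = len(chosen) / len(words) if words else 0.0
    return replaced, actual_ratio


# Hypothetical toy English-to-German dictionary and a 10-word document
# in which 5 of the 10 words are covered by the dictionary.
toy_vocab = {"the": "die", "cat": "Katze", "drinks": "trinkt", "water": "Wasser"}
toy_doc = "The cat drinks cold water near the old barn today".split()

for target in (0.3, 0.7):
    _, actual = substitute_words(toy_doc, toy_vocab, target)
    print(f"target={target:.0%} actual={actual:.0%}")
```

In this toy setup the 30% target is met exactly, while the 70% target saturates at the 50% coverage of the document, mirroring the plateau behaviour described for the real vocabularies.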
These rates are upper-bounded by the size of the available bilingual vocabularies (Table 1); even when every replaceable word is replaced, the per-document rate cannot exceed the fraction of words covered by the dictionary.

Figure 11 Actual per-document replacement rate at target=70% for all eight languages. The dashed red line marks the 70% target. High-resource languages saturate around 50–56%; truly low-resource languages fall far below the target, bounded by the size of the available bilingual vocabularies.

A.7 Reduced Vocabulary Size Experiments

In addition to the experiments discussed in Section 6, we further reduced the original German vocabulary to 1% of its initial size, i.e., to around 480 word pairs. This vocabulary was subsampled from the 10% vocabulary with two different seeds. The results presented in Figure 12 demonstrate that when the bilingual vocabulary is small, it becomes increasingly important which word pairs it contains (not only how many). Across the two independently sampled 1% vocabularies, ARC-Easy accuracy varies by 1.5–1.7pp for both LINK_uni (39.0 vs. 37.4) and LINK_domain (38.5 vs. 36.9), indicating that vocabulary composition becomes a primary driver of transfer at this scale. The stronger of the two subsets still matches or exceeds the LR+HR baseline (38.0), while the weaker one falls slightly below it, implying that with only a few hundred translation pairs available, careful selection of entries matters.

Figure 12 ARC-Easy accuracy on German at 1.3B with bilingual vocabularies reduced to 1% of the original size (around 480 word pairs). Set 1 and Set 2 are two independently subsampled 1% vocabularies (different sampling seeds).
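The nested subsampling described in this section can be sketched as follows. This is a minimal illustration assuming uniform sampling of word pairs without replacement. The helper name, the synthetic dictionary, and the seed values are hypothetical, and the paper's own sampling code may differ.

```python
import random


def subsample_vocab(bilingual_vocab, fraction, seed):
    """Keep a random fraction of the word pairs, reproducibly for a given seed."""
    # Sort first so that the seed alone determines which pairs are drawn.
    pairs = sorted(bilingual_vocab.items())
    keep = max(1, round(fraction * len(pairs)))
    return dict(random.Random(seed).sample(pairs, keep))


# Synthetic stand-in for a bilingual dictionary (1,000 made-up pairs).
full_vocab = {f"word{i}": f"wort{i}" for i in range(1000)}

ten_percent = subsample_vocab(full_vocab, 0.10, seed=0)
set_1 = subsample_vocab(ten_percent, 0.10, seed=1)  # 1% of the full size
set_2 = subsample_vocab(ten_percent, 0.10, seed=2)  # same size, different draw

shared = len(set_1.keys() & set_2.keys())
print(f"|set_1|={len(set_1)} |set_2|={len(set_2)} shared pairs={shared}")
```

Both 1% sets have the same size but generally contain different entries, which is the property that lets the comparison isolate vocabulary composition from vocabulary size.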
A.8 Data-Constrained vs Low-Resource

Throughout this work, we use the term data-constrained rather than low-resource to describe our experimental settings. While the two terms refer to overlapping concepts, they highlight distinct challenges. Our method is broadly applicable to any language for which training data is limited, regardless of whether it is traditionally classified as low-resource. Several factors motivate this distinction. First, truly low-resource languages typically have limited bilingual vocabulary coverage, whereas our simulated settings use languages with large bilingual dictionaries. Second, our simulated settings are constructed by downsampling web-crawled data, which preserves the topical diversity of the original corpus. In practice, low-resource language data is often drawn from a narrow set of sources (religious texts, government documents, or Wikipedia), resulting in a skewed domain distribution that our downsampling procedure does not capture. Third, low-resource languages frequently lack standardized evaluation benchmarks, limiting our ability to assess model performance comprehensively.

To avoid conflating these distinct challenges, we reserve the term low-resource for the experiments in Section 5.3 and use data-constrained elsewhere; supporting truly low-resource languages nonetheless remains a core motivation of this work.

Apple and the Apple logo are trademarks of Apple Inc., registered in the U.S. and other countries and regions.