Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

Summary

This paper introduces LINK (Lexical INterventions for Knowledge transfer), a data-level method to improve cross-lingual knowledge transfer for languages with scarce training data. Existing approaches often require large parallel corpora or auxiliary models, which are unavailable for many low-resource languages. LINK addresses this by randomly substituting words in high-resource English pretraining data with their translations using only a bilingual vocabulary, requiring no additional model training. The authors evaluate LINK on eight languages across five model sizes (137M to 2.7B parameters). Two strategies are tested: uniform interventions across random data subsets and domain-specific interventions targeting scientific content. Results show that LINK consistently improves downstream performance in data-constrained languages, sometimes surpassing baselines trained on significantly more target-language data. Notably, the method achieves up to a 2x speedup in training to reach equivalent performance. While uniform interventions degrade English performance, domain-specific interventions maintain high-resource accuracy while boosting target-language results. Experiments on truly low-resource languages like Swahili and Yoruba confirm benefits, though effectiveness correlates with bilingual vocabulary size. The study demonstrates that simple lexical substitutions during pretraining offer a scalable, near-zero-cost solution for enhancing multilingual language models without relying on expensive translation infrastructure.

Anastasiia Sedova¹,², Natalie Schluter¹‡, Skyler Seto¹‡, Maartje ter Hoeve¹‡
¹Apple, ²ITU
‡Shared senior authorship
Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with
insufficient training data. When target-language data is scarce, the knowledge required for many downstream tasks involving
scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource
language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge
transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are
largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge
transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual
vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training
corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary,
which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model
sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to
reach equivalent performance.
Correspondence: Anastasiia Sedova: asedova@apple.com
Date: May 25, 2026
1 Introduction
Large language models (LLMs) have demonstrated remarkable performance across a wide range of language
understanding and knowledge-intensive tasks (Brown et al., 2020; Bubeck et al., 2023; DeepSeek-AI et al.,
2025; Gemini Team et al., 2025). These results are enabled by pretraining on massive text corpora, sometimes
comprising tens of trillions of tokens. At this scale, high-quality training data is realistically available only
in English (Li et al., 2024a; Penedo et al., 2024), and LLM research has therefore largely concentrated on it.
Most other languages, in contrast, lack sufficient high-quality data in public web crawls: even for languages
typically considered high-resource, the available data amounts to only 1–10% of the data available for English,
while for truly low-resource languages, it can fall below 1B tokens (Penedo et al., 2025). Since data should
scale roughly linearly with the size of the model (Hoffmann et al., 2022), this makes it difficult to adapt and
effectively train LLMs for such languages (Pakray et al., 2025; Veitsman & Hartmann, 2025).
One way to address this scarcity is to leverage data across other languages for multilingual modeling. Seto
et al. (2025) demonstrate that the increased capacity of larger models consistently yields better performance
than smaller language-specific models. Such models also naturally map semantically similar concepts across
languages into a shared representation space, enabling effective cross-lingual knowledge transfer (Conneau
et al., 2020b; Hu et al., 2021; Liu & Niehues, 2025). One of the factors impacting this alignment is the
presence of pretraining samples in which similar concepts from different languages co-occur in the same
sentence, forcing the model to predict tokens in one language from tokens in another (Wang et al., 2025).
Existing methods that increase such cross-lingual mixing typically rely on translation systems, larger teacher
models, or substantial parallel corpora (Wang et al., 2025; Yoo et al., 2025; Li et al., 2024b), which are often
unavailable for the data-constrained languages that would potentially benefit most.
arXiv:2605.23885v1 [cs.CL] 22 May 2026

Figure 1 Overview of LINK. Using a bilingual vocabulary, we substitute randomly selected words in a portion of the
high-resource pretraining data with their translations in the low-resource language. The mix ratio controls what
fraction of the high-resource data is replaced, and the replacement ratio determines the amount of replacements.
We introduce LINK (Lexical INterventions for Knowledge transfer): a method that facilitates cross-lingual
knowledge transfer¹ through simple lexical substitutions in a multilingual pretraining mix (Figure 1). In
contrast to previous work (Wang et al., 2025; Yoo et al., 2025; Li et al., 2024b), our method requires only a
small bilingual vocabulary to replace the words in a portion of the high-resource training data (controlled by
the mix ratio) with their low-resource translations up to a predefined proportion (replacement ratio). This
makes LINK broadly applicable to virtually any language for which a bilingual vocabulary can be obtained
(which, in practice, includes all written languages, even truly low-resource ones) and easy to integrate into
pretraining pipelines at scale.
The experiments are conducted across five model sizes on four data-constrained settings simulated from high-
and mid-resource languages by reducing the amount of training data. This setup allows greater flexibility to
examine a broader range of scenarios and compare them against each other on reliable benchmarks, which are
largely unavailable for truly low-resource languages. We still validate our findings on four truly low-resource
languages. We also analyze the effect of bilingual vocabulary size on transfer performance and experiment
with different placements of the interventions: LINK_uni makes replacements on a randomly selected part
of the data, and LINK_domain intervenes only on the domain-specific part, thus enabling targeted domain
knowledge transfer while preserving most of the high-resource data. Our experiments demonstrate that such
targeted interventions enable the model to maintain high performance in both languages. An ablation study
further shows that LINK facilitates cross-lingual knowledge transfer even when interventions are applied to
data unrelated to the target domain.
Overall, using LINK, we show:
• simple word-level substitutions improve cross-lingual knowledge transfer from a high-resource to a
data-constrained language during pretraining, yielding better downstream performance in the data-constrained
language and up to a 2× speedup in training under imbalanced data settings;
• targeted interventions on a domain-specific portion of the high-resource training data lead to improve-
ments comparable to intervening on the entire corpus, while maintaining high downstream performance
in both languages;
• interventions on data not directly related to the end task still contribute to cross-lingual knowledge
transfer and improve downstream performance.
¹ In the context of this paper, following previous work (Longpre et al., 2026; Li et al., 2024b; Conneau & Lample, 2019), by
cross-lingual knowledge transfer we refer to the ability of a model pretrained on data from multiple languages to leverage
knowledge (such as factual information or domain-specific concepts) acquired during pretraining from the data in one language
when performing tasks in another.

2 Related Work
Data Limitations for Multilingual Language Models Training language models requires data and model sizes to
scale jointly (Hoffmann et al., 2022). As a result, the development of multilingual models has been closely
tied to the availability of large-scale multilingual corpora. Widely used multilingual datasets (Wenzek et al.,
2020; Xue et al., 2021b; Penedo et al., 2025) provide web-crawled text in over 100 languages, but with
a dramatic long-tail distribution: for most languages, data volumes can be 10–100x smaller than English,
making it difficult to train even moderately sized models. This data scarcity propagates through to model
performance: models such as BLOOM (BigScience Workshop et al., 2023), Llama 2 (Touvron et al., 2023),
and Qwen 2.5 (Team Qwen et al., 2025) have data mixtures with over 90% containing English. This is further
compounded by the curse of multilinguality: with fixed model capacity, adding more languages degrades per-
language performance (Conneau et al., 2020a), disproportionately harming low-resource languages (Chang
et al., 2024).
Cross-Lingual Knowledge Transfer Prior work has shown that multilingual models naturally learn to map
semantically similar concepts across languages into a shared representation space (Conneau et al., 2020b; Hu
et al., 2021; Liu & Niehues, 2025). Among the factors driving this is the co-occurrence of similar concepts
from different languages in the same context during training, which helps the model learn to predict tokens
in one language from tokens in another (Luong et al., 2015; Wang et al., 2025), and even a lightweight
cross-lingual signal can meaningfully improve transfer (Li et al., 2024b). Such multilingual knowledge transfer is
particularly important when target language data is limited: Tanzer et al. (2023) provide a benchmark for
learning to translate a new language from a single book, and Seto et al. (2025) demonstrated that bilingual
models trained with limited target-language data benefit from scaling auxiliary high-resource data. These
and other works show that high-resource data can improve low-resource performance, but doing so without
large parallel corpora or auxiliary models remains challenging.
Word- and Segment-Level Interventions Simple word-level perturbations have proven effective as data
augmentation techniques in natural language processing. Xie et al. (2017) show that noisy word substitution
improves language models and machine translation systems, and Wei & Zou (2019) further demonstrate that
simple operations such as synonym replacement and random insertion improve text classification in low-data
regimes. In machine translation, bilingual dictionaries have been used to improve translation of rare
words (Fadaee et al., 2017), while random word replacements have been shown to regularize neural machine
translation models (Wang et al., 2018). Kobayashi (2018) extends word substitution by using a language
model to predict contextually appropriate replacements. These works establish that even simple lexical
perturbations provide a meaningful training signal. Recent work with LLM-generated code-switched data
included in pretraining further suggests that code-switching abilities may be key to improving multilingual
capabilities (Wang et al., 2025; Yoo et al., 2025; Li et al., 2024b).
Our LINK method is most closely related to these methods but differs in two key respects. First, we make
no explicit code-switching assumption. LINK is motivated by the topology of today’s LLMs: inspired by
older research in multilingual word embeddings (Luong et al., 2015), we observe that by generating token
representations using a bilingual context, we can enable multilingual representations of tokens for monolingual
inputs. Second, LINK is lightweight: it requires no parallel corpora, no auxiliary models, and no additional
training stages, but only a bilingual vocabulary obtainable at near-zero cost, making it applicable to truly
low-resource settings where none of these resources are available.
3 Lexical Interventions for Cross-Lingual Knowledge Transfer
We assume a data-constrained² target-language dataset $D_{LR}$ of limited size and a high-resource language
dataset $D_{HR}$ available in effectively unlimited quantity³. Before training, we apply lexical interventions to
a portion of $D_{HR}$: using a bilingual vocabulary $V_{HR \leftrightarrow LR}$, we randomly replace words in $D_{HR}$ with their
target-language translations. For a sample $x = (w_1, \dots, w_n) \in D$, where $D \subseteq D_{HR}$ is the subset selected for
intervention, let $k_x \in [0, n]$ denote the number of tokens to replace. The target replacement ratio $r \in [0, 1]$
controls what fraction of words in each sample are replaced. The intervened data is:
\[ D_{HR+LR} = \{ \mathrm{Replace}(x, V_{HR \leftrightarrow LR}, k_x, r) \mid x \in D \}, \tag{3.1} \]
where $\mathrm{Replace}(x, V, k, r)$ swaps $k$ randomly selected tokens in $x$ with their translations from $V$, leaving
other words unchanged. Since bilingual vocabularies vary in coverage, the actual fraction of replaced words
may fall below $r$.
² We use data-constrained rather than low-resource, but retain the standard LR abbreviation for brevity. See Appendix A.8
for additional discussion.
³ Although LINK has no formal limitation on the number of languages, we follow Seto et al. (2025) and focus on the bilingual
setup for controllability and leave other experiments for future work.
An example of an English sample before and after LINK for German-English mix, 70% replacements:
[. . . ] Combine the lamb with the onion mixture. Add the cinnamon, oregano and red wine and cook for a
few minutes. Add the tomatoes and a cup of water or stock. [. . . ] See more Greek recipes. [. . . ]
[. . . ]Kombinieren der Lamm mit der Zwiebel mixture. Add der Zimtbaum, oregano und rot Wein
und Koch da a wenig minutes. Add der tomatoes und a Tasse aus Wasser oder Vorrat.[. . . ]Sehen
mehr Greek recipes.[. . . ]
Importantly, we aim to create cross-lingual co-occurrences rather than well-formed bilingual sentences; that is
why we do not filter for grammatical correctness or translation accuracy. This approach keeps the intervention
computationally cheap and turns it into a simple preprocessing step with negligible overhead requiring only
a dictionary lookup per token.
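The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `link_replace`, whitespace tokenization, lowercased lookup, flooring of the target count, and the toy vocabulary are all assumptions made for the example.

```python
import random

def link_replace(words, vocab, r, rng):
    """Swap up to a fraction r of the words for dictionary translations.

    words: list of whitespace tokens (tokenization is an assumption here)
    vocab: dict mapping a lowercased English word to a list of translations
    """
    k = int(r * len(words))  # target count; flooring is an assumption
    # Only words with a dictionary entry can be replaced, so vocabulary
    # coverage may cap the number of replacements below the target k.
    candidates = [i for i, w in enumerate(words) if w.lower() in vocab]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    out = list(words)
    for i in chosen:
        out[i] = rng.choice(vocab[words[i].lower()])
    return out

# Toy vocabulary, for illustration only.
toy_vocab = {"the": ["der"], "wine": ["Wein"], "and": ["und"], "water": ["Wasser"]}
sample = "Add the red wine and a cup of water".split()
mixed = link_replace(sample, toy_vocab, r=0.7, rng=random.Random(0))
replaced = sum(a != b for a, b in zip(sample, mixed))
print(" ".join(mixed), f"({replaced} of {len(sample)} words replaced)")
```

In this toy case the 70% target asks for six replacements, but only four of the nine words have a dictionary entry, so four are replaced. This mirrors the coverage cap noted in the text.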
We primarily consider two replacement strategies:
• Uniform interventions (LINK_uni), where the words are replaced across a randomly selected portion
of $D_{HR}$. From $D_{HR}$, we randomly select a subset $D^{rand}_{HR}$, whose size is set by the mix ratio, and
replace up to a fraction $r$ of the words in each sample $x \in D^{rand}_{HR}$ with their translations from $V_{HR \leftrightarrow LR}$,
yielding $D^{rand}_{HR+LR}$. The remaining portion of the high-resource data, $D'_{HR} = D_{HR} \setminus D^{rand}_{HR}$, is included in the training set
without modifications, as is the data-constrained $D_{LR}$, which is not subject to replacements so as not to
reduce the (already highly limited) target-language data. The resulting training dataset is defined as:
\[ D_{train} = D_{LR} \cup D^{rand}_{HR+LR} \cup D'_{HR}. \tag{3.2} \]
• Domain-specific interventions (LINK_domain), which apply interventions only to the task- or
domain-specific portion of the high-resource data. This is motivated by the practical observation
that domain-specific knowledge (e.g., scientific content) is often abundant in the high-resource language but scarce in
data-constrained languages, making it a natural candidate for targeted transfer.
Under this more conservative setting, we address the lack of such domain data in the target-language
dataset while minimizing the intervention cost to the high-resource dataset (which inevitably grows
with the size of the intervened high-resource data). Let $D^{task}_{HR} \subset D_{HR}$ denote the domain-specific
portion of the high-resource data, and $D^{non\text{-}task}_{HR} := D_{HR} \setminus D^{task}_{HR}$. The domain-specific mixed subset
$D^{task}_{HR+LR}$ is created by making replacements in $D^{task}_{HR}$ using $V_{HR \leftrightarrow LR}$. The remaining data are included
in the training set without modifications, resulting in:
\[ D_{train} = D_{LR} \cup D^{task}_{HR+LR} \cup D^{non\text{-}task}_{HR}. \tag{3.3} \]
This approach preserves the majority of the high-resource training data, concentrating cross-lingual
signal where it is most needed while maintaining English performance.
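The two training mixes in Eq. (3.2) and Eq. (3.3) could be assembled roughly as follows. This is a sketch under stated assumptions: documents are plain strings held in lists, `intervene` stands in for the word-replacement step, `is_task` is a hypothetical predicate marking domain-specific documents, and all function names are illustrative.

```python
import random

def build_link_uni(d_lr, d_hr, intervene, mix_ratio, rng):
    """LINK_uni mix: D_LR, an intervened random subset of D_HR, and the rest."""
    idx = list(range(len(d_hr)))
    rng.shuffle(idx)
    n_mix = round(mix_ratio * len(d_hr))  # size of the randomly selected subset
    chosen = set(idx[:n_mix])
    mixed = [intervene(d_hr[i]) for i in sorted(chosen)]
    untouched = [d_hr[i] for i in range(len(d_hr)) if i not in chosen]
    return d_lr + mixed + untouched

def build_link_domain(d_lr, d_hr, intervene, is_task):
    """LINK_domain mix: intervene only on the domain-specific part of D_HR."""
    mixed = [intervene(x) for x in d_hr if is_task(x)]
    untouched = [x for x in d_hr if not is_task(x)]
    return d_lr + mixed + untouched

# Toy data: upper-casing stands in for the lexical intervention.
d_hr = [f"en doc {i}" for i in range(10)]
d_lr = ["de doc"]
uni = build_link_uni(d_lr, d_hr, str.upper, mix_ratio=0.9, rng=random.Random(0))
dom = build_link_domain(d_lr, d_hr, str.upper, lambda s: s.endswith(("0", "1")))
```

In both cases the data-constrained documents pass through unchanged. The two strategies differ only in how the intervened subset of the high-resource data is chosen.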
4 Experimental Setup
English serves as the high-resource language for all experiments, as it is the only language with sufficiently
large and diverse public web-scale datasets. The data-constrained scenario is simulated by subsampling
four languages (German, French, Hindi, and Chinese) to approximately 350M tokens each (roughly 400K
documents), comparable to prior work (Seto et al., 2025) and closely mirroring the scale of genuinely low-
resource languages in well-established multilingual datasets (Penedo et al., 2025; Xue et al., 2021a). This
enables comparison with having more target-language data, broader ablation studies, and evaluation on
well-established benchmarks whose translation is often infeasible for truly low-resource languages. In addition,
we experiment with four truly low-resource languages: Swahili, Yoruba, Amharic,
and Igbo (see Section 5.3 for the results). The English training data $D_{HR}$ is sampled from the FineWeb
corpus (Penedo et al., 2024) in quantities sufficient to avoid repetition within a single model’s training. All
non-English data $D_{LR}$ is sampled from FineWeb2 (Penedo et al., 2025). A bilingual vocabulary for each target
language, $V_{HR \leftrightarrow LR}$, is constructed by extracting word entries and their corresponding English translations from
a bilingual resource. While any bilingual dictionary or lexicon can serve this purpose, we use Wiktionary
(Wikimedia Foundation, 2025) as it provides translation pairs for over 4,400 languages, making bilingual
vocabularies readily available even for languages without parallel corpora or translation systems. Dataset
and vocabulary sizes are reported in Table 1.
Language       Bilingual vocabulary   Training tokens
English        –                      ∞
German (DE)    48,195                 345M
Chinese (ZH)   45,571                 330M
French (FR)    36,492                 340M
Hindi (HI)     25,001                 342M
Swahili (SW)   4,197                  672M
Amharic (AM)   1,487                  435M
Yoruba (YO)    675                    96M
Igbo (IG)      233                    146M
Table 1 Bilingual vocabulary sizes and training data amounts.
We evaluate LINK_uni and LINK_domain on zero-shot QA tasks: ARC
Easy and Challenge (Clark et al., 2018), Hellaswag (Zellers et al., 2019),
Lambada (Paperno et al., 2016), PiQA (Bisk et al., 2019), SciQ (Welbl
et al., 2017), and Winogrande (Sakaguchi et al., 2021). These are
knowledge-based tasks that small models with limited data still per-
form well on. Non-English evaluations are conducted via translation of
the original dataset. We translate primarily using our own systems to
ensure that translation artifacts will be consistent throughout.
We train decoder-only GPT models at five scales (137M, 345M, 760M,
1.3B, and 2.7B parameters) using the Megatron-LM framework with the
Aya multilingual tokenizer (250K vocabulary) (Üstün et al., 2024) and
a sequence length of 1024. Training runs for 10K, 30K, 50K, and 100K
steps, correspondingly, with a batch size of 1024 samples. In a separate
set of experiments, we empirically determined an optimal ratio for combining English and LR language data
of 97.5% (high-resource) to 2.5% (data-constrained)⁴.
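To get a feel for these numbers, the following back-of-the-envelope sketch computes the token budget of one run. It rests on assumptions the text does not state: that every sequence is fully packed to 1024 tokens, that the 97.5:2.5 split applies at the token level, and that the run in question lasts 100K steps with roughly 350M target-language tokens available. The function name is illustrative.

```python
def token_budget(steps, batch_size=1024, seq_len=1024, lr_share=0.025):
    """Return (total tokens seen, tokens drawn from the data-constrained language)."""
    total = steps * batch_size * seq_len
    return total, total * lr_share

total, lr_tokens = token_budget(100_000)
# Number of passes over a 350M-token target corpus implied by the assumptions above.
passes_over_lr = lr_tokens / 350e6
print(f"total: {total / 1e9:.1f}B tokens, target language: {lr_tokens / 1e9:.2f}B tokens")
print(f"implied passes over 350M tokens: {passes_over_lr:.1f}")
```

If these assumptions held, the target-language share of such a run would be several times larger than the available target-language corpus, so that data would be repeated while the English data is not.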
5 Results
We compare LINK against three baselines: (1) LR (UB) as an upper bound, trained with enough target-language
data for one training epoch, serving as an oracle and illustrating the gap that data scarcity creates⁵;
(2) LR as a lower bound, trained solely on the scarce amount of resource-constrained data; and (3) LR +
HR, trained on a mixture of the scarce resource-constrained data and a sufficient quantity of high-resource
data. Due to space constraints, we restrict our discussion to results with the 1.3B parameter models
and defer results from smaller (137M, 345M, and 760M parameters) and larger (2.7B parameters) models to
Appendix A.3.
5.1 Uniform Interventions
Table 2 reports evaluation results of our LINK_uni experiments. All results are presented for the mix ratio
of 90 and replacement ratio of 70 based on the ablation study. In practice, however, vocabulary coverage
often caps the actual per-sample replacement rate below the 70% target, meaning every word that can be
replaced is replaced under this setup (see more details in Appendix A.6). The results of other replacement
configurations are discussed in Section 6.
LINK_uni yields consistent improvements in data-constrained language performance across all four languages,
outperforming both LR and LR+HR baselines, often by a substantial margin. Remarkably, using only a
fraction of the data-constrained data available to LR (UB), LINK_uni even surpasses this upper bound
baseline (e.g., ARC-Easy for German). This supports our hypothesis that simple word-level replacements can
unlock cross-lingual knowledge transfer that is strong enough to close the gap with models
trained on significantly more target data.
⁴ The data mix ratio of 97.5:2.5 was selected from a range of 50:50 to 99:1 based on final checkpoint perplexity in preliminary
English–German experiments.
⁵ Not available for Hindi due to limited data available in the FineWeb2 dataset.
⁶ Note that perplexity values are not directly comparable across languages due to differences in tokenizer fertility. Full scores
are provided in Appendix A.1.
              LR performance                                      HR performance
Setup         A-E   A-C   HS    LB    PQ    SQ    WG    Avg   |   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
German
LR (UB)       37.2  28.6  41.9  28.7  62.8  65.9  51.9  45.3  |   33.5  24.1  33.5  28.9  60.2  61.3  52.1  41.9
LR            30.8  22.7  29.1  12.7  55.2  47.3  52.0  35.7  |   27.8  23.5  26.8  4.5   52.2  43.1  52.3  32.9
LR + HR       38.0  25.8  37.9  26.7  59.0  64.1  52.7  43.5  |   52.1  27.8  57.6  53.0  74.4  74.6  53.7  56.2
LINK_uni      40.3  26.6  39.4  26.9  59.8  64.7  53.5  44.5  |   50.0  27.4  54.3  49.4  71.4  72.3  56.4  54.5
LINK_domain   39.3  25.0  38.1  26.6  60.3  66.7  52.0  44.0  |   52.6  29.2  57.0  53.4  73.6  76.6  56.7  57.0
French
LR (UB)       41.3  27.4  46.7  32.5  66.0  66.3  54.9  47.9  |   38.1  21.8  33.6  30.6  62.0  66.5  53.2  43.7
LR            31.9  21.2  30.1  10.5  55.7  54.4  53.2  36.7  |   28.2  21.7  26.6  4.3   52.0  46.6  49.9  32.8
LR + HR       38.4  24.9  40.2  28.3  60.7  65.1  54.2  44.5  |   53.3  28.5  57.8  52.4  74.7  77.4  54.9  57.0
LINK_uni      39.8  25.8  42.2  30.4  60.6  63.4  54.2  45.2  |   48.7  27.1  53.8  49.5  71.6  70.9  53.9  53.6
LINK_domain   40.4  26.2  41.3  29.2  61.0  64.2  52.9  45.0  |   52.0  29.2  57.3  53.0  73.5  75.3  55.0  56.5
Chinese
LR (UB)       46.1  26.6  39.0  –     59.6  78.6  51.8  50.3  |   34.4  19.9  29.0  15.4  55.5  64.3  52.7  38.8
LR            33.5  23.7  30.5  –     53.8  62.9  49.4  42.3  |   27.3  20.3  26.7  1.4   50.8  40.1  48.7  30.7
LR + HR       39.7  24.6  37.3  –     58.3  75.5  51.5  47.8  |   53.2  29.7  56.4  50.2  73.8  73.8  55.2  56.0
LINK_uni      41.8  26.8  38.3  –     59.5  76.4  51.2  49.0  |   49.7  27.1  54.3  50.0  73.0  73.0  56.3  54.7
LINK_domain   41.5  25.4  37.8  –     59.0  77.3  51.0  48.7  |   50.3  27.1  56.2  51.9  73.8  74.5  55.2  55.6
Hindi
LR            30.6  22.9  28.3  –     55.0  –     52.9  37.9  |   27.3  22.4  25.9  2.0   51.0  36.4  51.0  30.9
LR + HR       33.3  24.1  30.4  –     53.0  –     51.0  38.4  |   53.4  28.2  57.9  52.7  74.0  77.9  56.3  57.2
LINK_uni      35.0  24.6  31.8  –     54.0  –     52.5  39.6  |   48.1  27.3  52.5  47.5  72.1  75.3  53.2  53.7
LINK_domain   35.5  25.3  30.9  –     49.0  –     50.3  38.2  |   52.4  27.4  56.4  52.5  74.0  76.6  55.8  56.4
Table 2 LINK_uni and LINK_domain 1.3B model results across German, French, Chinese, and Hindi. Bold indicates
best per task-language pair among LR (S), LR+HR, LINK_uni, and LINK_domain (excluding the LR (UB)).
Figure 2 LINK_uni and LINK_domain percentage change in perplexity relative to the LR+HR baseline for 1.3B models,
computed as $(\mathrm{PPL}_{\mathrm{LINK\_uni}} - \mathrm{PPL}_{\mathrm{LR+HR}}) / \mathrm{PPL}_{\mathrm{LR+HR}}$⁶.
However, these gains come at the cost of high-resource lan-
guage performance: English downstream scores drop by up
to 5.5pp compared to the LR+HR baseline. Figure 2 also
confirms this: massive uniform interventions performed by
LINK_uni lead to a substantial increase in English perplexity
across all bilingual setups, reflecting the degradation caused
by modifying a large portion of the high-resource training
data. The effect on data-constrained language perplexity
is less consistent — it decreases for some languages but in-
creases for others, suggesting that downstream gains from
cross-lingual knowledge transfer do not necessarily correlate
with lower general-domain perplexity.
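The relative perplexity change plotted in Figure 2 is a simple ratio. The sketch below expresses it as a percentage; the function name and the perplexity values are hypothetical and not taken from the paper.

```python
def ppl_percent_change(ppl_link, ppl_baseline):
    """Percentage change in perplexity of a LINK model relative to the LR+HR baseline."""
    return 100.0 * (ppl_link - ppl_baseline) / ppl_baseline

# Hypothetical values for illustration only.
change = ppl_percent_change(ppl_link=11.0, ppl_baseline=10.0)
print(f"{change:+.1f}%")
```

A positive value means the intervention increased perplexity relative to the baseline, and a negative value means it decreased.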
5.2 Domain-Specific Interventions
Next, we experiment with LINK_domain, where interventions are applied only to domain-specific data. In
contrast to the LINK_uni setup, the only relevant data for replacements in LINK_domain is the domain-
specific subset, and, within it, the mix ratio equals 100, while the remaining English data is left unmodified.
Our primary motivation for including a domain-specific intervention is to limit the amount of data that
is replaced, to protect performance in English. As a representative domain, we focus on scientific content,
targeting the ARC downstream task (Clark et al., 2018). To identify the relevant training data, following
Grangier et al. (2025) and Seto et al. (2025), we cluster the English data using k-means with multilingual BERT
representations (Devlin et al., 2019) into 32 clusters. The resulting clusters show clear topic-wise separation,
with one cluster (comprising roughly 5.3B tokens) corresponding to scientific knowledge. We confirmed this by
assigning the English and German ARC validation sets to the same centroids: they are almost entirely attributed
to this cluster (see Appendix A.5). Therefore, the interventions were applied to all samples attributed to
this cluster, while the other samples remain intact. We preserved the original cluster distribution in the
pretraining mix.
Figure 3 LINK_domain: ARC-Easy accuracy for 1.3B models. Top: target language evaluation; bottom: English
evaluation.
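The centroid-assignment check used to identify the scientific cluster can be illustrated with a small pure-Python sketch. It assumes that the k-means centroids and the embeddings of the validation examples are already computed; the 2-D vectors below are toy stand-ins for multilingual BERT representations, and the function names are illustrative.

```python
from collections import Counter

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda c: sqdist(vec, centroids[c]))

def dominant_cluster(vectors, centroids):
    """Most frequent cluster among the vectors and the share of vectors it receives."""
    counts = Counter(nearest_centroid(v, centroids) for v in vectors)
    cluster, hits = counts.most_common(1)[0]
    return cluster, hits / len(vectors)

# Toy centroids and "validation set" embeddings, for illustration only.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
validation = [(4.8, 5.1), (5.2, 4.7), (5.5, 5.5), (9.0, 0.5)]
cluster, share = dominant_cluster(validation, centroids)
print(cluster, share)
```

A high share for a single cluster suggests that this cluster covers the target domain, which is the criterion the text describes for selecting the data to intervene on.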
The results are provided in Table 2. Additionally, Figure 3 shows ARC-Easy (i.e., target domain) accuracy
throughout training for all four languages. LINK_domain interventions match or exceed LINK_uni interven-
tions on target-language performance (top row) while preserving English performance (bottom row), unlike
LINK_uni, which causes a notable drop. Importantly, with both intervention strategies, LINK improves training
efficiency for the target language. For example, for German, the LINK models reach the
baseline’s final performance at 40k steps versus 80k, an approximately 2× speedup, while for
Hindi both intervention strategies surpass the baseline as early as 30k steps. Furthermore, Figure 2 shows
that LINK_domain results in negligible perplexity changes for both languages, in stark contrast to the substantial
increases in English perplexity observed with LINK_uni interventions. Together, these results show that
targeting interventions to domain-relevant data achieves strong data-constrained language transfer without
sacrificing high-resource performance.
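The speedup figure quoted above is a ratio of training steps: the step at which the baseline attains its final accuracy, divided by the first step at which the LINK model attains the same accuracy. A minimal sketch of that calculation follows; the helper names and the accuracy curves are hypothetical and are not the paper's measurements.

```python
def first_step_reaching(curve, target):
    """curve: list of (step, accuracy) pairs sorted by step.
    Returns the first step whose accuracy is >= target, or None if never reached."""
    for step, acc in curve:
        if acc >= target:
            return step
    return None

def speedup(baseline_curve, method_curve):
    """Baseline's final step divided by the step at which the method
    first matches the baseline's final accuracy (None if it never does)."""
    final_step, final_acc = baseline_curve[-1]
    reached = first_step_reaching(method_curve, final_acc)
    return None if reached is None else final_step / reached

# Hypothetical accuracy curves with checkpoints every 20k steps.
baseline = [(20_000, 33.0), (40_000, 35.5), (60_000, 37.0), (80_000, 38.0)]
link = [(20_000, 34.5), (40_000, 38.2), (60_000, 39.0), (80_000, 40.0)]

print(speedup(baseline, link))  # 2.0
```

The resolution of such an estimate is limited by the checkpoint spacing, so it is best read as approximate.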
5.3 Low-Resource Experiments
Evaluating on truly low-resource languages introduces additional challenges. Severe data scarcity limits both
the bilingual vocabulary size and the training corpus (e.g., Yoruba has only 96M tokens in FineWeb2 and a
vocabulary 1.4% the size of the German one, see Table 1), constraining both what the model can learn from the
target language directly and the degree of overlap between the two languages in the data. In addition, standardized
evaluation benchmarks are largely unavailable: most established English benchmarks lack translations for
these languages, and automatic translation is often unreliable because the same data scarcity prevents
training reliable machine translation systems. Evaluation is therefore restricted to the few benchmarks with
existing multilingual versions.
Table 3 LINK average target-language benchmark results for true low-resource languages.
Model  Setup        AM    IG    SW    YO
1.3B   LR           44.7  60.1  47.8  38.2
1.3B   LR + HR      42.1  61.6  48.5  40.0
1.3B   LINK_uni     41.6  60.4  48.5  40.7
1.3B   LINK_domain  40.9  60.4  50.1  38.7
345M   LR           42.5  51.5  46.5  38.1
345M   LR + HR      41.3  57.5  46.4  34.8
345M   LINK_uni     42.4  59.1  46.8  39.7
345M   LINK_domain  41.7  58.1  47.4  33.2
Table 3 reports the average LINK results for four low-resource
languages: Amharic, Igbo, Swahili, and Yoruba. The evalu-
ation is done on the benchmarks that are available for these
languages (see the exact list and full scores in Appendix A.4).
The effect of domain interventions varies across languages. For
Igbo, Swahili, and Yoruba, we see improvements in all settings
except the 1.3B Igbo setup. For Swahili, domain-specific interventions
yield consistent improvements at both scales, while for
Yoruba and Igbo, uniform interventions perform best. This
can be explained by the much larger vocabulary available for
Swahili (4197) than for the other languages, which motivates our
next study of bilingual vocabulary size (see the next section).
For Amharic, the LR-only baseline consistently performs best.
This may reflect the greater script and typological distance of Amharic from English, which limits
the effectiveness of lexical replacement and, more generally, of training with English data.
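The per-language pattern can be read directly off Table 3. The following sketch copies the table's averages and computes, for each language and model size, the better of the two LINK variants and its gain over the LR + HR baseline; the helper name `best_link_gain` is hypothetical, and ties are resolved in favour of LINK_uni simply because it is listed first.

```python
LANGS = ["AM", "IG", "SW", "YO"]

# Average scores copied from Table 3.
TABLE3 = {
    "1.3B": {
        "LR": [44.7, 60.1, 47.8, 38.2],
        "LR + HR": [42.1, 61.6, 48.5, 40.0],
        "LINK_uni": [41.6, 60.4, 48.5, 40.7],
        "LINK_domain": [40.9, 60.4, 50.1, 38.7],
    },
    "345M": {
        "LR": [42.5, 51.5, 46.5, 38.1],
        "LR + HR": [41.3, 57.5, 46.4, 34.8],
        "LINK_uni": [42.4, 59.1, 46.8, 39.7],
        "LINK_domain": [41.7, 58.1, 47.4, 33.2],
    },
}

def best_link_gain(size):
    """Per language: (best LINK variant, its gain over LR + HR rounded to 1 decimal)."""
    rows = TABLE3[size]
    out = {}
    for i, lang in enumerate(LANGS):
        variant = max(("LINK_uni", "LINK_domain"), key=lambda v: rows[v][i])
        out[lang] = (variant, round(rows[variant][i] - rows["LR + HR"][i], 1))
    return out

for size in TABLE3:
    print(size, best_link_gain(size))
```

A negative gain marks a setting in which neither LINK variant beats LR + HR, which in this table happens only at 1.3B (for Igbo and Amharic).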
6 Ablation Studies
LINK introduces several design choices including the bilingual vocabulary, replacement ratio, mix ratio, and
domain relevance. To understand which of these factors most strongly influence target language performance
and to guide practitioners in applying our method, we conduct a series of ablation studies.
Model Sizes We evaluate across five model sizes (137M–2.7B parameters) to verify that our method’s bene-
fits are not limited to a single scale. Both LINK_uni and LINK_domain consistently match or outperform the
LR+HR baseline at every scale. Benefits emerge even at 137M and become increasingly pronounced at larger
scales: the gap between LINK and the baseline widens as model size grows, with the largest improvements
observed at 760M and 1.3B. This trend is consistent across both close (German, French) and distant (Chi-
nese, Hindi) language pairs, suggesting that larger models are better able to exploit the cross-lingual signal
introduced by our method. Full results are reported in Appendix A.3.
Figure 4 LINK ARC-Easy accuracy on German with reduced
bilingual vocabularies (1.3B model).
Reduced Vocabulary Size To examine the effect
of vocabulary size on transfer performance, we
conduct experiments with the German bilingual
vocabulary reduced to 50% and 10% of its original
size (each smaller vocabulary is a subset of the larger one).
Figure 4 presents the results on ARC-Easy for
both uniform and domain-specific settings. Both
settings show a similar pattern: the full vocabulary
(100%) outperforms the LR+HR baseline, while
reducing the vocabulary to 50% (around 24,000 word
pairs) leads to a modest decline, which is amplified
by further reduction to 10% (around 4,800 word
pairs). These results indicate that vocabulary size
is an important factor in transfer performance,
with larger bilingual dictionaries yielding stronger
gains. This is in line with the inconsistent gains we
observe for some of the low-resource languages: for all four low-resource languages we show,
the vocabulary is at most 9% of the size of the German vocabulary. Additionally, we reduced the
vocabulary even further, to 1%; results of these experiments are provided in Appendix A.7. Encouragingly,
obtaining vocabularies of this size is not a significant barrier, as Wiktionary alone provides at least
1,000 translation pairs for 186 languages and over 10,000 for 69.
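One simple way to obtain reduced vocabularies in which each smaller vocabulary is a subset of the larger one is to shuffle the entries once and keep prefixes of that single ordering. The sketch below illustrates this under the assumption that subsets are drawn uniformly at random; the paper does not state how its subsets were sampled, and the function name and toy English-German entries are hypothetical.

```python
import random

def nested_vocab_subsets(vocab, fractions, seed=0):
    """Return {fraction: sub-vocabulary}; smaller subsets are contained in larger ones.

    vocab: dict mapping a source word to a list of translations.
    """
    words = sorted(vocab)                  # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(words)
    subsets = {}
    for frac in fractions:
        keep = words[: int(len(words) * frac)]   # prefix of one shared shuffle => nesting
        subsets[frac] = {w: vocab[w] for w in keep}
    return subsets

# Hypothetical toy English-German vocabulary with 10 entries.
vocab = {
    "water": ["Wasser"], "house": ["Haus"], "tree": ["Baum"], "light": ["Licht"],
    "book": ["Buch"], "sun": ["Sonne"], "moon": ["Mond"], "dog": ["Hund"],
    "cat": ["Katze"], "bread": ["Brot"],
}
subsets = nested_vocab_subsets(vocab, fractions=[1.0, 0.5, 0.1])
print({frac: len(sub) for frac, sub in subsets.items()})
```

Because every subset is a prefix of the same shuffled list, any difference between the 100%, 50%, and 10% runs comes from removing entries rather than from swapping in different ones.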
Table 4 LINK across varying replacement (Repl) and mixing (Mix) ratios (English–German, 1.3B).
Repl      Mix   PPL_EN  PPL_DE  A-E_DE
LR        -     265.4   58.1    30.8
LR + HR   -     10.0    16.6    38.1
10        10    10.0    16.6    38.5
30        30    10.1    16.6    38.9
50        70    10.4    16.5    38.5
50        90    10.6    16.5    39.6
70        70    10.5    16.6    38.6
70        90    10.9    16.5    40.2
Replacement & Mix Ratio Table 4 presents an ablation over the replacement
ratio and the mixing ratio for the English–German pair, reporting
English and German perplexity on the FineWeb and FineWeb-2 validation
sets, as well as ARC-Easy accuracy in German. Low values for
both ratios (10/10, 30/30) have minimal effect on downstream performance.
Increasing either ratio improves results, though the mixing
ratio appears to have a slightly stronger effect; the best configuration
(70/90) scores over 2 points above the non-augmented LR + HR baseline (40.2 vs. 38.1).
Notably, target-language perplexity remains stable across all configurations,
while English perplexity increases by almost 1 point (from 10.0 to 10.9), further
motivating LINK_domain.
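To make the two knobs concrete, here is a minimal sketch of a word-level intervention controlled by a replacement ratio and a mix ratio. It rests on an assumed reading of the two ratios (the mix ratio as the percentage of documents that receive interventions, the replacement ratio as the percentage of vocabulary-covered words swapped within such a document); the function name, the whitespace tokenization, and the toy vocabulary are hypothetical and are not taken from the authors' implementation.

```python
import random

def link_intervention(docs, vocab, repl_ratio, mix_ratio, seed=0):
    """Apply word-level substitutions to a share of the documents.

    Assumed semantics (percentages, 0-100):
      mix_ratio  - chance that a document receives any interventions
      repl_ratio - chance that a word found in the vocabulary is replaced
    """
    rng = random.Random(seed)
    out = []
    for doc in docs:
        if rng.random() * 100 >= mix_ratio:
            out.append(doc)                  # document left unmodified
            continue
        tokens = []
        for tok in doc.split():              # naive whitespace tokenization
            translations = vocab.get(tok.lower())
            if translations and rng.random() * 100 < repl_ratio:
                tokens.append(rng.choice(translations))
            else:
                tokens.append(tok)
        out.append(" ".join(tokens))
    return out

vocab = {"water": ["Wasser"], "light": ["Licht"]}   # hypothetical toy vocabulary
docs = ["water reflects light", "the sun is bright"]
print(link_intervention(docs, vocab, repl_ratio=100, mix_ratio=100))
print(link_intervention(docs, vocab, repl_ratio=70, mix_ratio=90, seed=1))
```

With both ratios at 100 every covered word in every document is replaced, and with either ratio at 0 the corpus is returned unchanged; the configurations in Table 4 lie between these extremes.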
Table 5 Comparison of LINK intervention strategies (1.3B). PPL_EN/PPL_LR: perplexity on FineWeb/FineWeb-2 validation; A-E: ARC Easy accuracy.
Lang  Setup       PPL_EN  PPL_LR  A-E_LR
DE    LR + HR     10.0    16.6    38.1
DE    Uniform     16.9    16.8    40.3
DE    Domain      10.0    16.5    39.3
DE    Non-domain  11.9    16.7    40.5
FR    LR + HR     9.9     8.9     38.4
FR    Uniform     10.8    9.4     39.8
FR    Domain      10.1    8.9     40.4
FR    Non-domain  11.8    9.4     39.8
ZH    LR + HR     10.0    42.4    39.7
ZH    Uniform     10.7    43.9    41.8
ZH    Domain      10.2    42.1    41.5
ZH    Non-domain  12.3    45.4    40.8
HI    LR + HR     9.9     6.0     33.3
HI    Uniform     11.2    5.8     35.0
HI    Domain      10.1    6.0     35.5
HI    Non-domain  12.4    5.8     35.9
Non-domain specific interventions and indirect knowledge transfer
To disentangle the effect of topical alignment between the re-
placed text and the target language from the effect of lexical ex-
posure itself, we introduce a Non-domain control experiment: in-
terventions are applied to all data except for the domain-specific
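part. That is, replacements are applied only to $D^{\text{non-task}}_{\text{HR}} = D_{\text{HR}} \setminus D^{\text{task}}_{\text{HR}}$, while the task-relevant data remains unchanged.
The final training set is then $D_{\text{train}} = D_{\text{LR}} \cup D^{\text{non-task}}_{\text{HR+LR}} \cup D^{\text{task}}_{\text{HR}}$.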
In our experimental setup from Section 5.2, intervening on
all non-scientific data means intervening on roughly 95.5%
of the original English dataset, i.e., more than with domain-specific
(4.5%) or even uniform (90%) interventions. As shown in Table 5,
this large increase in intervention volume leads to further
improvements in the target (LR) language for all four languages compared to the LR + HR
baseline, with non-domain interventions outperforming both the
uniform and domain-specific setups for Hindi and German. This
suggests that sheer volume of lexical exposure matters: the model
benefits from encountering target-language vocabulary broadly
across the training data, regardless of topical relevance. We
observe an expected cost to English performance but, strikingly,
also find that our approach unlocks domain transfer even
without directly intervening on the relevant data.
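The construction of the non-domain control set can be sketched as follows. This is an illustration under stated assumptions only: documents are kept in plain Python containers, the share is counted in documents rather than tokens (the 95.5% above refers to the English dataset as measured in the paper), and the names `build_non_domain_mix`, `intervened_share`, and `mark` are hypothetical.

```python
def build_non_domain_mix(d_lr, d_hr, task_ids, intervene):
    """Assemble the training set: target-language data, intervened non-task
    high-resource data, and untouched task-relevant high-resource data.

    d_hr: dict doc_id -> text; task_ids: ids of the task-relevant (domain) documents;
    intervene: function applied to each non-task document.
    """
    task_part = [text for doc_id, text in d_hr.items() if doc_id in task_ids]
    non_task_part = [intervene(text) for doc_id, text in d_hr.items()
                     if doc_id not in task_ids]
    return list(d_lr) + non_task_part + task_part

def intervened_share(d_hr, task_ids):
    """Fraction of high-resource documents touched by the non-domain control."""
    return sum(1 for doc_id in d_hr if doc_id not in task_ids) / len(d_hr)

# Hypothetical toy corpus; `mark` stands in for a real lexical intervention.
d_lr = ["Ein kurzer deutscher Text."]
d_hr = {
    0: "plants need light",
    1: "stock prices fell",
    2: "the match ended in a draw",
    3: "a new film opened",
}
task_ids = {0}

def mark(text):
    return "[LINK] " + text

train = build_non_domain_mix(d_lr, d_hr, task_ids, mark)
print(train)
print(intervened_share(d_hr, task_ids))  # 0.75
```

Swapping the membership test (intervening on `task_ids` instead of its complement) would give the domain-specific setup, so the two conditions differ only in which partition is modified.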
7 Conclusion
This work proposes a data-level intervention method for improving language model pretraining in languages
with scarce data. Our approach requires only a bilingual vocabulary, making it applicable at near-zero cost
to over a thousand languages and at pretraining scale. Across different target languages and five model
sizes, our method consistently improves downstream performance, particularly for languages that are distant
from the high-resource language, and remains effective even when interventions are applied to only a small,
domain-specific portion of the training data. These findings demonstrate a practical and scalable path toward
building stronger language models for data-scarce languages.
References
BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow,
Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. Bloom: A 176b-parameter open-access multilingual
language model, 2023. URL https://arxiv.org/abs/2211.05100.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical
commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL
https://arxiv.org/abs/2005.14165.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee,
Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang.
Sparks of artificial general intelligence: Early experiments with gpt-4, 2023. URL https://arxiv.org/abs/2303.12712.
Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling
for 250 high-and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural
Language Processing, pp. 4074–4096, 2024.
Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Boda Sadallah, Abeer Kashar, Aitazaz
Daud, Abosede Grace Olanihun, et al. Global piqa: Evaluating physical commonsense reasoning across 100+
languages and cultures. ArXiv, abs/2510.24081, 2025. URL https://api.semanticscholar.org/CorpusID:282401377.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.
Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. Curran Associates Inc., Red Hook,
NY, USA, 2019.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán,
Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation
learning at scale. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020a. Association
for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747/.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging Cross-lingual Structure in
Pretrained Language Models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6022–6034, Online, July 2020b.
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.536. URL https://aclanthology.org/2020.
acl-main.536/.
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi
Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun
Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, et al. Deepseek-v3
technical report, 2025. URL https://arxiv.org/abs/2412.19437.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019.
Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423/.
Marzieh Fadaee, Arianna Bisazza, and Christof Monz. Data augmentation for low-resource neural machine transla-
tion. In Regina Barzilay and Min-Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pp. 567–573, Vancouver, Canada, July 2017. Association for
Computational Linguistics. doi: 10.18653/v1/P17-2090. URL https://aclanthology.org/P17-2090/.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk,
Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrit-
twieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, et al. Gemini: A family
of highly capable multimodal models, 2025. URL https://arxiv.org/abs/2312.11805.
David Grangier, Simin Fan, Skyler Seto, and Pierre Ablin. Task-adaptive pretrained language models via clustered-
importance sampling. In The Thirteenth International Conference on Learning Representations, 2025. URL https:
//openreview.net/forum?id=p6ncr0eTKE.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego
de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican,
George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol
Vinyals, Jack W. Rae, and Laurent Sifre. Training compute-optimal large language models. In Proceedings of the
36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022.
Curran Associates Inc. ISBN 9781713871088.
Junjie Hu, Melvin Johnson, Orhan Firat, Aditya Siddhant, and Graham Neubig. Explicit alignment objectives for
multilingual bidirectional encoders. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur,
Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pp. 3633–3643, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
naacl-main.284. URL https://aclanthology.org/2021.naacl-main.284/.
Sosuke Kobayashi. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Marilyn
Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 452–
457, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2072.
URL https://aclanthology.org/N18-2072/.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick
Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen,
Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-
Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal,
Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen,
Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle
Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini,
Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt,
and Vaishaal Shankar. Datacomp-lm: in search of the next generation of training sets for language models. In
Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook,
NY, USA, 2024a. Curran Associates Inc. ISBN 9798331314385.
Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, and Jiajun Chen. PreAlign: Boosting cross-lingual transfer
by early establishment of multilingual alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 10246–10257, Miami,
Florida, USA, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.572.
URL https://aclanthology.org/2024.emnlp-main.572/.
Danni Liu and Jan Niehues. Middle-layer representation alignment for cross-lingual transfer in fine-tuned LLMs.
In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the
63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15979–
15996, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi:
10.18653/v1/2025.acl-long.778. URL https://aclanthology.org/2025.acl-long.778/.
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik,
Chen-Yu Lee, and Sayna Ebrahimi. ATLAS: Adaptive transfer scaling laws for multilingual pretraining, finetuning,
and decoding the curse of multilinguality. In The Fourteenth International Conference on Learning Representations,
2026. URL https://openreview.net/forum?id=0BkvUY61MX.
Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual quality in
mind. In Phil Blunsom, Shay Cohen, Paramveer Dhillon, and Percy Liang (eds.), Proceedings of the 1st Workshop
on Vector Space Modeling for Natural Language Processing, pp. 151–159, Denver, Colorado, June 2015. Association
for Computational Linguistics. doi: 10.3115/v1/W15-1521. URL https://aclanthology.org/W15-1521/.
Partha Pakray, Alexander Gelbukh, and Sivaji Bandyopadhyay. Natural language processing applications for low-
resource languages. Natural Language Processing, 31(2):183–197, 2025. doi: 10.1017/nlp.2024.33.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle,
Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a
broad discourse context. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, Berlin, Germany, August
2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144/.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von
Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The
Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL
https://openreview.net/forum?id=n6SCkn2QaG.
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran,
Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. Fineweb2: One pipeline to scale them all –
adapting pre-training data processing to every language, 2025. URL https://arxiv.org/abs/2506.20920.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd
schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381.
URL https://doi.org/10.1145/3474381.
Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. Training bilingual LMs
with data constraints in the targeted language. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mo-
hammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 19096–
19122, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi:
10.18653/v1/2025.findings-acl.977. URL https://aclanthology.org/2025.findings-acl.977/.
Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, and Luke Melas-Kyriazi. A benchmark for learning to
translate a new language from one grammar book. In Arxiv, 2023.
Team Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng
Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren
Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin
Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang
Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report,
2025. URL https://arxiv.org/abs/2412.15115.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models,
2023. URL https://arxiv.org/abs/2307.09288.
Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari,
Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff,
Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual
language model. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15894–15939, Bangkok,
Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.845. URL
https://aclanthology.org/2024.acl-long.845/.
Yana Veitsman and Mareike Hartmann. Recent advancements and challenges of Turkic Central Asian language
processing. In Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith
Premasiri, Fiona Anting Tan, and Lasitha Uyangodage (eds.), Proceedings of the First Workshop on Language
Models for Low-Resource Languages, pp. 309–324, Abu Dhabi, United Arab Emirates, January 2025. Association
for Computational Linguistics. URL https://aclanthology.org/2025.loreslm-1.25/.
Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. SwitchOut: an efficient data augmentation algorithm
for neural machine translation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.),
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 856–861, Brussels,
Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1100. URL
https://aclanthology.org/D18-1100/.
Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng,
and Shujian Huang. Investigating and scaling up code-switching for multilingual language model pre-training.
In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the
Association for Computational Linguistics: ACL 2025, pp. 11032–11046, Vienna, Austria, July 2025. Association
for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.575. URL https://
aclanthology.org/2025.findings-acl.575/.
Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification
tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pp. 6382–6388, Hong Kong, China, November 2019. Association for Computational
Linguistics. doi: 10.18653/v1/D19-1670. URL https://aclanthology.org/D19-1670/.
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Leon Der-
czynski, Wei Xu, Alan Ritter, and Tim Baldwin (eds.), Proceedings of the 3rd Workshop on Noisy User-generated
Text, pp. 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:
10.18653/v1/W17-4413. URL https://aclanthology.org/W17-4413/.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin,
and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta
Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi,
Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis
(eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 4003–4012, Marseille, France,
May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.
lrec-1.494/.
Wikimedia Foundation. Wiktionary: The free dictionary, 2025. URL https://www.wiktionary.org. Accessed: 2025.
Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. Data noising as
smoothing in neural network language models. In International Conference on Learning Representations, 2017.
URL https://openreview.net/forum?id=H1VyHY9gg.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and
Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova,
Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy
Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021a.
Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41/.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and
Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova,
Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy
Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021b.
Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41/.
Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, and Hwaran Lee. Code-switching curriculum learning for
multilingual transfer in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher
Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 7816–7836, Vienna,
Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.407. URL https://aclanthology.org/2025.findings-acl.407/.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish
your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association
for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472/.
A Appendix
A.1 Perplexity Results
Table 6 reports validation perplexity on FineWeb-2 (target language, LR) and FineWeb (English, EN) for
each setup across all four model sizes. At 137M, the smallest scale, LINK_uni incurs a substantial English
perplexity penalty (e.g., 36.49 vs. 29.53 for German), while LINK_domain keeps English perplexity close
to LR+HR across all languages. Target-language perplexity remains comparable to the baseline for both
strategies.
Starting at 345M, LINK_domain consistently achieves the lowest or near-lowest target-language perplexity
while keeping English perplexity close to LR+HR. LINK_uni matches or slightly exceeds the LR+HR English
perplexity but shows competitive LR perplexity, with the best Hindi result. This pattern strengthens at
760M, where LINK_domain achieves the best LR perplexity for German and French, with English perplexity
remaining within 0.1–0.2 points of LR+HR. LINK_uni also improves LR perplexity over the baseline (e.g.,
19.09 vs. 20.37 for German) but at a larger English cost. At 1.3B, the same trends hold: LINK_domain
closely matches or improves upon LR+HR on both sides, i.e., its English perplexity remains within 1 point
of LR+HR, while its LR perplexity improves for German and French and matches LR+HR for Chinese.
LINK_uni yields comparable LR perplexity but incurs a slightly larger English penalty. The LR-only baseline
diverges as models grow larger, highlighting the instability of training on scarce data alone.
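For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below is a minimal illustration under that standard definition; the function names and toy inputs are assumptions for illustration, and the evaluation code behind Table 6 is not part of this appendix. The second helper expresses the English penalty as a relative increase, using the German 137M values quoted above.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_nlls = [math.log(4.0)] * 10
ppl_uniform = perplexity(uniform_nlls)

# English perplexity for German at 137M: LINK_uni 36.49 vs. LR+HR 29.53.
penalty = relative_increase(36.49, 29.53)
```

Here `penalty` comes out at roughly 0.24, i.e., LINK_uni raises English perplexity by about 24% relative to LR+HR in that setting.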
             DE              FR              ZH              HI
Setup        LR      EN      LR      EN      LR      EN      LR      EN
137M
LR (UB)      24.54   86.90   13.38   94.03   54.62   >10^2   –       –
LR           35.74   >10^2   19.94   >10^2   >10^2   >10^2   9.43    >10^2
LR + HR      62.85   29.53   30.56   29.37   >10^2   29.40   18.96   29.46
LINK_uni     68.43   36.49   32.41   36.34   >10^2   34.79   20.11   39.40
LINK_domain  63.08   29.80   30.63   29.73   >10^2   29.99   20.04   29.89
345M
LR (UB)      10.02   32.15   6.61    32.95   14.63   67.43   –       –
LR           29.40   >10^2   19.94   >10^2   >10^2   >10^2   7.87    >10^2
LR + HR      23.32   14.82   12.53   14.82   67.90   14.39   7.70    15.64
LINK_uni     23.37   14.83   12.81   14.85   67.39   14.39   7.68    15.60
LINK_domain  23.02   13.27   11.77   13.26   63.22   13.30   7.89    13.31
760M
LR (UB)      8.39    25.55   5.75    25.94   11.09   51.25   –       –
LR           >10^5   >10^6   28.45   >10^2   >10^2   >10^3   9.66    >10^2
LR + HR      20.37   11.20   9.92    11.17   50.42   11.23   6.71    11.21
LINK_uni     19.09   12.52   10.41   12.53   52.91   12.14   6.50    13.00
LINK_domain  18.73   11.29   9.83    11.29   50.47   11.33   6.67    11.35
1.3B
LR (UB)      7.33    21.42   5.17    21.68   9.12    40.76   –       –
LR           58.12   >10^2   37.44   >10^2   >10^2   >10^3   11.87   >10^2
LR + HR      16.63   9.98    8.88    9.92    42.43   9.98    5.99    9.95
LINK_uni     16.52   10.84   9.42    10.83   43.86   10.67   5.77    11.23
LINK_domain  16.49   10.03   8.86    10.08   42.14   10.17   5.96    10.08
Table 6 Validation perplexity on FineWeb (EN) and FineWeb-2 (LR) across model sizes. Bold indicates lowest
perplexity among LR (S), LR+HR, LINK_uni, and LINK_domain.
A.2 Global MMLU
Global MMLU scores remain close to the random baseline of 25.0 across all setups and languages, with most
values falling in the 25–28 range. Even at the 1.3B scale, the models lack sufficient capacity to perform meaningfully
on this knowledge-intensive benchmark. German and Chinese show the largest gains from bilingual training
(up to 27.6), while low-resource languages remain near chance level regardless of the intervention strategy.
These results suggest that Global MMLU is not sensitive enough to capture the differences between our
setups at this model scale, and we therefore focus our analysis on the other benchmarks.
                   High-resource           Low-resource
Size  Setup        DE    FR    ZH    HI    AM    IG    SW    YO
1.3B  LR           25.8  –     26.0  24.9  25.8  24.9  25.4  25.4
      LR + HR      27.5  25.9  27.2  23.7  25.9  26.3  26.1  25.6
      LINK_uni     27.5  25.9  27.6  25.0  25.7  26.3  26.7  25.8
      LINK_domain  27.4  27.3  27.6  23.8  25.5  26.7  26.6  25.7
345M  LR           26.2  23.9  25.8  23.3  25.5  25.9  25.4  25.7
      LR + HR      27.1  24.1  26.5  23.1  25.0  25.8  26.0  25.5
      LINK_uni     26.1  24.5  26.9  22.9  25.4  25.7  26.2  26.0
      LINK_domain  26.4  26.0  26.6  22.9  25.1  25.9  25.6  25.6
Table 7 Global MMLU results in the target language across all model sizes.
A.3 Model Scaling
This section reports downstream benchmark results broken down by model size. Tables 8, 9, and 10 present
individual benchmark scores for the 760M, 345M, and 137M models, respectively. Figure 5 summarizes the
average zero-shot accuracy on target-language benchmarks across all four model sizes.
Figure 5 Average benchmark accuracy across four model sizes and all languages.
             LR performance                                  HR performance
Setup        A-E   A-C   HS    LB    PQ    SQ    WG    Avg   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
German
LR (UB)      34.2  26.1  37.9  26.7  62.4  55.2  50.4  41.8  32.0  22.2  30.5  24.6  58.2  53.1  50.5  38.7
LR           31.5  23.4  28.5  9.1   56.0  45.5  50.4  34.9  27.5  24.5  26.7  2.6   49.9  40.8  49.2  31.6
LR + HR      34.7  24.4  32.6  21.8  55.3  62.0  51.9  40.4  49.5  26.8  49.4  45.4  71.1  72.9  54.3  52.8
LINK_uni     36.4  24.1  34.1  23.6  59.1  64.6  53.0  42.1  44.6  24.5  46.1  43.2  69.4  68.4  52.9  49.9
LINK_domain  36.8  25.5  35.2  23.7  58.8  63.0  50.7  41.9  47.4  26.4  49.4  46.1  71.1  72.9  54.5  52.5
French
LR (UB)      39.9  25.9  41.9  29.1  64.2  61.6  55.1  45.4  35.6  20.7  30.9  23.6  58.9  64.8  48.4  40.4
LR           31.4  22.9  30.1  10.7  54.8  51.8  51.9  36.2  28.5  21.7  27.4  4.7   53.1  45.8  50.9  33.2
LR + HR      34.2  24.0  35.7  25.6  58.6  61.0  50.9  41.4  47.3  26.7  49.6  47.1  71.7  70.7  54.5  52.5
LINK_uni     37.4  25.2  36.9  27.0  60.3  63.2  52.3  43.2  43.9  24.5  45.6  42.8  69.4  69.6  54.4  50.0
LINK_domain  36.3  25.6  36.5  26.5  59.0  61.6  52.4  42.6  46.1  24.8  49.8  46.3  71.8  70.1  53.4  51.7
Chinese
LR (UB)      43.0  25.4  36.4  –     60.1  76.7  51.4  48.8  31.9  20.5  27.9  12.1  54.0  59.3  48.9  36.4
LR           32.9  24.2  30.6  –     55.6  63.5  51.2  43.0  26.9  23.1  26.7  1.6   52.5  39.5  51.1  31.6
LR + HR      37.2  25.1  33.8  –     57.0  74.0  52.1  46.5  48.0  25.4  49.8  45.6  72.4  71.7  53.8  52.4
LINK_uni     39.8  25.6  35.3  –     57.7  73.7  51.6  47.3  47.0  25.2  46.7  42.2  69.6  70.9  52.6  50.6
LINK_domain  38.3  25.6  35.2  –     56.9  75.3  50.5  47.0  47.4  26.4  49.3  45.2  70.8  71.8  52.8  51.9
Hindi
LR           32.1  23.4  27.6  –     53.0  –     46.9  36.6  27.8  20.6  26.4  2.3   52.3  38.7  49.8  31.1
LR + HR      32.4  22.3  28.7  –     52.0  –     49.1  36.9  48.4  26.4  49.8  46.8  71.6  73.1  52.6  52.7
LINK_uni     33.0  24.1  30.1  –     55.0  –     51.6  38.8  43.8  23.8  44.6  40.5  68.4  69.2  52.1  48.9
LINK_domain  33.5  24.5  29.7  –     53.0  –     51.2  38.4  46.5  25.2  49.0  46.0  71.2  69.7  51.2  51.3
Table 8 760M model results.
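The Avg columns in Tables 8 to 10 appear to be the mean over the benchmarks available for a given language, with missing entries (–) skipped; the rows spot-checked below are consistent with that reading. The helper name `average_available` and the use of `None` for a missing score are illustrative assumptions.

```python
from typing import Optional, Sequence


def average_available(scores: Sequence[Optional[float]]) -> float:
    """Mean of the scores that are present, rounded to one decimal place."""
    present = [s for s in scores if s is not None]
    if not present:
        raise ValueError("no scores available")
    return round(sum(present) / len(present), 1)


# Table 8 (760M), LR performance columns: A-E, A-C, HS, LB, PQ, SQ, WG.
german_lr_ub = [34.2, 26.1, 37.9, 26.7, 62.4, 55.2, 50.4]
chinese_lr_ub = [43.0, 25.4, 36.4, None, 60.1, 76.7, 51.4]  # no LB score
hindi_lr = [32.1, 23.4, 27.6, None, 53.0, None, 46.9]       # no LB or SQ score

avg_de = average_available(german_lr_ub)
avg_zh = average_available(chinese_lr_ub)
avg_hi = average_available(hindi_lr)
```

The three results reproduce the reported Avg entries for those rows (41.8, 48.8, and 36.6). Because the number of available benchmarks differs by language, averages are not directly comparable across languages.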
             LR performance                                  HR performance
Setup        A-E   A-C   HS    LB    PQ    SQ    WG    Avg   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
German
LR (UB)      32.1  24.8  33.6  24.0  60.0  49.5  53.0  39.6  30.2  22.1  28.5  18.4  56.3  49.2  52.4  36.7
LR           32.3  24.0  28.6  15.7  54.9  51.3  51.5  36.9  29.6  20.8  26.6  6.2   51.4  50.9  50.1  33.7
LR + HR      33.1  24.5  30.6  20.4  55.6  61.6  50.8  39.5  40.6  23.2  37.8  34.9  67.0  65.3  51.2  45.7
LINK_uni     33.6  23.5  30.5  19.4  55.8  61.7  53.6  39.7  41.5  24.2  38.0  34.6  66.2  66.1  50.4  45.9
LINK_domain  33.2  24.7  30.9  19.6  54.8  61.4  52.7  39.6  43.5  24.0  41.0  38.8  67.6  66.5  54.4  48.0
French
LR (UB)      35.5  24.5  37.1  23.9  62.2  61.8  52.7  42.5  33.1  20.6  28.7  16.9  56.1  58.1  50.2  37.7
LR           32.7  24.5  30.1  12.5  56.7  50.5  51.5  36.9  29.5  21.8  26.7  7.3   52.2  47.8  48.8  33.4
LR + HR      35.5  23.7  32.4  22.6  57.0  59.6  52.9  40.5  41.5  23.1  37.9  35.8  66.9  66.2  50.5  46.0
LINK_uni     34.7  24.8  32.5  21.5  56.1  59.9  51.0  40.1  40.5  22.9  38.0  35.0  66.4  65.8  51.5  45.7
LINK_domain  33.3  24.3  32.0  22.3  56.5  58.0  49.9  39.5  43.9  24.1  40.8  38.1  67.6  68.5  51.7  47.8
Chinese
LR (UB)      39.7  25.1  33.5  –     57.8  74.3  50.1  46.8  31.4  21.1  27.6  9.7   53.8  56.2  51.5  35.9
LR           31.4  25.0  30.2  –     55.2  65.3  50.6  43.0  28.1  20.1  26.8  2.7   51.7  41.6  48.7  31.4
LR + HR      36.5  24.0  32.4  –     55.5  72.4  48.5  44.9  42.0  23.5  38.7  36.3  66.3  66.5  51.5  46.4
LINK_uni     36.6  24.2  32.1  –     57.0  71.5  48.1  44.9  41.5  24.4  38.5  36.3  67.7  65.4  49.8  46.2
LINK_domain  35.9  24.4  32.2  –     55.1  70.9  49.9  44.7  42.9  24.4  40.7  38.2  68.5  65.0  51.9  47.4
Hindi
LR           31.0  23.1  27.9  –     50.0  –     50.9  36.6  29.6  20.3  26.2  3.4   52.7  42.2  49.7  32.0
LR + HR      32.1  23.5  28.9  –     47.0  –     51.6  36.6  38.9  24.1  36.6  32.8  66.5  65.5  51.1  45.1
LINK_uni     31.9  25.0  28.6  –     49.0  –     51.6  37.2  40.7  21.7  36.5  32.9  66.0  63.2  50.1  44.4
LINK_domain  32.5  23.1  28.4  –     56.0  –     49.8  38.0  42.3  23.4  40.5  38.7  67.8  67.8  51.5  47.4
Table 9 345M model results.
             LR performance                                  HR performance
Setup        A-E   A-C   HS    LB    PQ    SQ    WG    Avg   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
German
LR (UB)      27.2  23.6  26.6  12.2  53.5  41.9  50.0  33.6  28.5  21.5  27.0  5.4   52.0  41.4  49.6  32.2
LR           29.7  24.2  26.9  11.6  52.5  49.5  49.9  34.9  30.6  23.7  27.1  4.2   51.2  50.8  50.5  34.0
LR + HR      27.0  23.3  27.0  6.3   50.2  54.3  48.7  33.8  32.1  20.2  27.4  15.7  59.2  57.2  49.7  37.4
LINK_uni     27.8  24.1  26.6  6.5   50.9  52.2  49.6  34.0  30.9  20.1  26.9  13.0  55.2  52.4  49.1  35.3
LINK_domain  28.2  21.7  26.5  6.5   49.7  52.7  50.9  33.8  31.9  20.0  27.4  14.4  57.5  53.7  47.9  36.1
French
LR (UB)      29.1  21.9  27.6  11.8  53.6  52.1  50.5  35.2  28.4  22.7  27.4  6.5   52.6  48.6  51.2  33.9
LR           28.4  23.1  27.2  10.9  52.5  49.5  50.4  34.6  29.0  22.8  27.1  4.2   52.2  46.7  51.9  33.4
LR + HR      27.9  23.1  27.3  8.8   50.6  53.8  51.9  34.8  32.6  20.3  27.6  16.1  57.6  56.8  49.8  37.3
LINK_uni     28.5  23.9  27.2  9.7   51.7  52.5  49.1  34.8  30.8  20.9  26.9  13.7  55.1  51.7  51.9  35.9
LINK_domain  28.5  22.1  27.4  8.6   51.3  54.6  50.1  34.6  32.6  19.4  27.7  15.2  57.1  54.6  48.1  36.4
Chinese
LR (UB)      32.4  22.9  28.2  –     52.6  65.8  51.3  42.2  29.1  21.4  27.0  3.5   52.7  45.3  49.4  32.6
LR           29.9  22.5  28.0  –     50.7  52.8  51.4  39.2  28.4  24.0  26.8  1.8   50.8  38.5  48.7  31.3
LR + HR      28.7  22.2  27.6  –     52.3  61.9  50.5  40.5  33.1  20.6  27.6  16.3  58.8  53.3  48.9  37.0
LINK_uni     28.8  22.8  27.2  –     51.0  62.1  52.4  40.7  31.1  21.6  27.5  13.2  56.9  51.5  50.1  36.0
LINK_domain  30.4  22.9  27.3  –     51.0  62.4  48.9  40.5  31.6  21.2  27.3  14.8  57.7  53.8  49.7  36.6
Hindi
LR           29.8  22.4  26.8  –     49.0  –     49.6  35.5  28.3  22.6  26.4  2.9   52.5  40.5  49.9  31.9
LR + HR      29.0  24.7  27.3  –     50.0  –     49.6  36.1  32.1  20.6  27.5  15.8  58.3  57.2  49.9  37.4
LINK_uni     28.7  25.5  26.9  –     53.0  –     49.2  36.6  28.9  21.9  27.2  11.8  55.3  48.1  50.1  34.8
LINK_domain  29.2  24.2  27.4  –     53.0  –     50.0  36.8  31.9  19.9  27.5  15.3  58.1  53.7  47.4  36.2
Table 10 137M model results.
Additionally, we conducted experiments on German for the 2.7B model; the results are provided in Table 11.
The gains of our methods are even more pronounced than for smaller models: for example, LINK_domain
achieves 45.2% accuracy on ARC-Easy, surpassing not only all baselines but even the upper-bound model
trained on significantly larger amounts of German data.
             LR performance                                  HR performance
Setup        A-E   A-C   HS    LB    PQ    SQ    WG    Avg   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
German
LR (UB)      44.6  27.6  49.7  33.1  66.9  68.4  53.9  49.2  37.5  25.0  41.4  36.8  64.9  67.9  53.8  46.8
LR           32.3  23.7  28.6  9.7   54.7  48.8  51.9  35.7  29.0  22.7  27.2  3.2   52.1  42.2  50.8  32.4
LR + HR      42.0  28.7  44.2  28.6  62.1  66.0  54.6  46.6  59.8  34.1  67.4  60.6  76.6  82.5  60.5  63.1
LINK_uni     42.0  26.5  46.0  29.0  61.8  70.2  54.7  47.2  55.4  31.8  64.3  58.5  75.9  77.6  59.9  60.5
LINK_domain  45.2  26.6  46.3  28.9  62.5  68.6  54.6  47.6  60.6  33.5  68.0  60.9  77.3  81.3  60.4  63.1
Table 11 2.7B model results.
A.4 Full Low-Resource Results
This section reports downstream benchmark results for the true low-resource languages (Amharic, Igbo,
Yoruba, and Swahili). The availability of target-language evaluation benchmarks varies by language: Amharic
and Swahili have ARC-Easy, PiQA, and WinoGrande; Igbo has PiQA and WinoGrande; and Yoruba has
ARC-Easy and PiQA translations. The results are presented in Tables 12 and 13. Because validation data are
limited, we report results for the last checkpoint.
             LR performance          HR performance
Setup        A-E   PQ    WG    Avg   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
Amharic
LR           31.2  52.0  51.1  44.8  26.9  22.6  26.4  0.3   52.7  29.2  46.9  29.3
LR + HR      23.0  52.0  51.1  42.0  53.3  29.6  57.2  51.3  74.0  75.9  57.0  56.9
LINK_uni     24.6  51.0  49.1  41.6  51.5  29.7  56.0  51.1  74.0  73.8  55.2  55.9
LINK_domain  25.9  47.0  50.0  41.0  52.5  27.7  57.1  52.8  74.4  76.9  54.6  56.6
Igbo
LR           –     69.0  51.1  60.0  27.2  22.9  25.0  0.2   50.8  32.5  50.1  29.8
LR + HR      –     72.0  51.2  61.6  53.4  29.2  57.7  53.1  73.1  76.1  57.3  57.1
LINK_uni     –     71.0  49.7  60.4  52.6  28.2  57.5  52.2  74.9  76.1  56.0  56.8
LINK_domain  –     71.0  49.7  60.4  53.6  28.2  57.6  52.2  73.7  75.9  56.7  56.8
Yoruba
LR           29.4  47.0  –     38.2  25.8  21.8  25.9  0.1   51.3  28.0  50.7  29.1
LR + HR      24.9  55.0  –     40.0  53.0  27.7  56.5  51.7  74.1  76.4  54.5  56.3
LINK_uni     24.5  57.0  –     40.8  52.7  29.0  56.8  50.8  74.3  78.2  55.7  56.8
LINK_domain  23.5  54.0  –     38.8  53.0  28.2  56.1  52.6  73.4  75.7  56.8  56.5
Swahili
LR           32.2  61.0  50.3  47.8  30.6  21.6  26.5  2.2   52.6  40.0  50.0  31.9
LR + HR      28.1  67.0  50.5  48.5  55.0  29.0  57.5  52.9  74.6  78.5  56.0  57.6
LINK_uni     27.7  68.0  49.8  48.5  52.0  27.3  55.1  51.2  73.2  79.2  56.0  56.3
LINK_domain  28.7  70.0  51.6  50.1  53.5  29.0  57.5  52.9  73.4  77.1  55.6  57.0
Table 12 Results for low-resource languages (1.3B models). Results reported at the last checkpoint. A-E: ARC-Easy,
PQ: PiQA, WG: WinoGrande.
In order to better analyze the performance of our method for true low-resource languages, we ran an additional
set of experiments with an extended number of training steps. We kept the amount of available data the
same as in the main experiments but doubled the number of training steps: the 1.3B models
were trained for 200,000 steps, and the 345M models were trained for 60,000 steps. We report results across all
checkpoints in Figures 6 and 7.
One issue with low-resource evaluations is benchmark noise, caused by the limited number of available
benchmarks for these languages as well as their small size. For example, the GlobalPiQA benchmark (Chang et al.,
2025) we used for this evaluation contains only 100 samples per language. While target-language
benchmarks for high-resource languages such as German exhibit low variance (e.g., 0.6–1.1pp between consecutive
checkpoints), low-resource benchmarks are 2–3× noisier: Igbo PiQA fluctuates by up to 16 points between
consecutive checkpoints, Yoruba PiQA by up to 10 points, and Amharic PiQA by up to 6 points.
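A rough way to see why 100-sample benchmarks are noisy is the binomial standard error of an accuracy estimate. The sketch below assumes independent test items and is not an analysis taken from the paper; the function name is illustrative.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimate, as a fraction."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# 100 samples per language, as in the GlobalPiQA setting described above.
se_100 = accuracy_standard_error(0.60, 100)     # about 0.049, i.e. ~4.9 points
# A hypothetical benchmark with 10,000 samples would be ten times tighter.
se_10k = accuracy_standard_error(0.60, 10_000)  # about 0.0049
```

Under this simplified model, differences of a few points on a 100-sample benchmark are hard to distinguish from sampling noise, which is one reason to read the low-resource results with caution.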
             LR performance          HR performance
Setup        A-E   PQ    WG    Avg   A-E   A-C   HS    LB    PQ    SQ    WG    Avg
Amharic
LR           24.0  52.0  50.8  42.3  28.5  22.8  26.4  2.5   50.9  38.3  49.2  31.2
LR + HR      25.9  48.0  50.1  41.3  44.4  24.8  41.4  37.8  67.1  67.1  52.0  47.8
LINK_uni     27.5  50.0  49.6  42.4  42.5  24.0  40.1  36.2  67.8  65.8  50.3  46.7
LINK_domain  25.2  50.0  49.8  41.7  44.1  23.9  41.2  38.7  68.8  67.6  50.1  47.8
Igbo
LR           –     58.0  48.5  53.2  27.3  24.1  25.9  0.7   49.9  37.1  49.7  30.7
LR + HR      –     65.0  50.0  57.5  43.1  24.0  41.3  38.6  68.8  64.9  51.3  47.4
LINK_uni     –     68.0  50.3  59.1  43.2  23.6  40.8  37.3  68.1  66.9  52.1  47.4
LINK_domain  –     67.0  49.2  58.1  44.3  24.3  40.9  38.8  68.1  66.8  50.4  47.7
Yoruba
LR           23.9  53.0  –     38.5  25.9  20.9  25.1  0.2   49.6  29.7  49.3  28.7
LR + HR      22.7  47.0  –     34.9  44.4  23.4  40.4  38.3  67.8  67.2  53.6  47.9
LINK_uni     24.5  55.0  –     39.8  41.8  24.6  40.4  37.2  67.8  66.7  51.5  47.1
LINK_domain  22.5  44.0  –     33.2  43.1  24.7  40.9  37.5  69.0  69.4  51.3  48.0
Swahili
LR           29.3  61.0  51.8  47.4  31.1  20.6  26.9  9.3   52.2  48.6  48.9  33.9
LR + HR      29.3  61.0  49.0  46.4  44.1  22.9  41.0  38.1  69.4  68.1  52.6  48.0
LINK_uni     27.1  64.0  49.4  46.8  42.1  23.5  39.3  37.0  67.8  67.1  52.7  47.1
LINK_domain  28.5  64.0  49.6  47.4  43.2  23.6  40.6  37.7  68.1  67.9  51.9  47.6
Table 13 Results for low-resource languages (345M models). Results reported at the last checkpoint. A-E: ARC-Easy,
PQ: PiQA, WG: WinoGrande.
Figure 6 1.3B models, extended (2×) runs up to 200K steps: per-checkpoint accuracy on low-resource benchmarks
(M-ARC Easy, M-PIQA, M-WinoGrande) for Amharic, Igbo, Yoruba, and Swahili. Shaded bands show standard
deviation.
Figure 7 345M models, extended (2×) runs up to 60K steps: per-checkpoint accuracy on low-resource benchmarks
(M-ARC Easy, M-PIQA, M-WinoGrande) for Amharic, Igbo, Yoruba, and Swahili. Shaded bands show standard
deviation.
A.5 Clustering Results
Figure 8 shows the results of k-means clustering (32 clusters) of the English FineWeb training
data. Most clusters contain around 2–6B tokens, apart from cluster 12, which is almost empty (0.21M tokens,
i.e., ∼0.0002% of the 118B total tokens). Cluster 5 holds around 5.3B tokens (5.55M samples), which
corresponds to 4.46% of the tokens (3.76% of the samples) in the whole FineWeb dataset.
Figure 8 Distribution of English FineWeb training tokens across the 32 k-means clusters.
Next, we clustered ARC-Easy and ARC-Challenge benchmarks (both English and German versions) using
the same centroids. The results are provided in Figure 9.
Both benchmarks are heavily concentrated in a single cluster, Cluster 5, which accounts for more than half of the
samples of each benchmark: 57.24% of samples from ARC-Easy (English), 57.67% of samples
from ARC-Easy (German), 58.87% of samples from ARC-Challenge (English), and 59.11% of samples from
ARC-Challenge (German). Manual inspection of the resulting cluster further confirms the initial hypothesis that
scientific knowledge (which is what the ARC datasets mostly cover) is concentrated in one cluster, which
allows us to use it for the LINK_domain experiments.
Figure 9 Results of clustering of ARC-Easy and ARC-Challenge validation datasets.
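Assigning benchmark samples to existing clusters amounts to a nearest-centroid lookup. The following is a minimal sketch, assuming samples and centroids are already embedded in the same vector space and that Euclidean distance is used; the embedding model, the distance choice, and all function names are assumptions rather than details taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda k: squared_distance(vector, centroids[k]))


def cluster_shares(vectors: Sequence[Vector], centroids: Sequence[Vector]) -> Dict[int, float]:
    """Fraction of vectors assigned to each cluster index."""
    counts = Counter(nearest_centroid(v, centroids) for v in vectors)
    return {k: counts.get(k, 0) / len(vectors) for k in range(len(centroids))}


# Toy data: three 2-D centroids and four already-embedded samples.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
samples = [(0.5, 0.2), (9.0, 1.0), (0.1, 0.3), (1.0, 8.0)]
shares = cluster_shares(samples, centroids)
```

In this toy case half of the samples land in cluster 0. Applied to benchmark embeddings with the 32 FineWeb centroids, the same per-cluster shares would correspond to the percentages reported for Cluster 5.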
A.6 Target vs Actual Replacements
Figure 10 shows the relationship between the target and actual replacement ratios for each language. Since
replacements can only be performed for words present in the bilingual vocabulary, the actual replacement
ratio is bounded by vocabulary coverage. For German, which has the largest vocabulary (48,195 entries), the
actual ratio closely tracks the target up to 30%, after which it plateaus around 55–57%. We conducted an
extensive ablation over replacement ratios for German (Figure 10, left) and, because actual
replacements saturate beyond a target ratio of 70%, evaluated only the 50% and 70% settings for the remaining
languages.
Figure 10 Target vs. actual replacement ratio for each language. The dashed line indicates the ideal case where
all targeted words are replaced. The gap between target and actual ratios reflects bilingual vocabulary coverage —
languages with smaller vocabularies (e.g., Hindi) saturate at lower actual replacement rates.
French, Chinese, and Hindi exhibit a similar ceiling effect at target ratios of 50% and 70%, with actual ratios
reaching approximately 47–50%, 48–55%, and 49–55%, respectively. The gap between the target and actual
ratios varies across languages, reflecting differences in vocabulary size. Based on these results, we set the
target replacement ratio to 70% for all experiments, as this effectively maximizes the number of replacements
achievable with our vocabularies.
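The gap between target and actual ratios can be illustrated with a small sketch. It assumes whitespace tokenization, lowercase dictionary lookup, and sampling of replaceable positions without replacement; these choices, the function name `lexical_intervention`, and the toy vocabulary are illustrative assumptions, not the paper's exact procedure.

```python
import random
from typing import Dict, List, Tuple


def lexical_intervention(
    words: List[str],
    vocab: Dict[str, str],
    target_ratio: float,
    rng: random.Random,
) -> Tuple[List[str], float]:
    """Replace up to target_ratio of the words with dictionary translations.

    Returns the modified word list and the actual replacement ratio.
    """
    if not words:
        return [], 0.0
    # Only positions whose word has a dictionary entry can be replaced.
    replaceable = [i for i, w in enumerate(words) if w.lower() in vocab]
    wanted = round(target_ratio * len(words))
    chosen = rng.sample(replaceable, min(wanted, len(replaceable)))
    out = list(words)
    for i in chosen:
        out[i] = vocab[words[i].lower()]
    return out, len(chosen) / len(words)


words = "the cat sat on the mat".split()
vocab = {"cat": "Katze", "mat": "Matte"}  # toy English-German entries
mixed, actual = lexical_intervention(words, vocab, target_ratio=0.7, rng=random.Random(0))
```

With a 70% target but only two of six words covered by the toy vocabulary, the actual ratio is capped at 2/6, which mirrors the coverage ceiling described above.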
The vocabulary-coverage ceiling is far more pronounced for truly low-resource languages. Figure 11 reports
the actual per-document replacement rate at the fixed target of 70% across all eight languages used in this
work. The four data-constrained high-resource languages reach the saturation level discussed above (50–56%),
whereas the four truly low-resource languages fall dramatically below the target: Swahili reaches only 29.6%,
Amharic 13.3%, Yoruba 8.2%, and Igbo 6.1%. These rates are upper-bounded by the size of the available
bilingual vocabularies (Table 1); even when every replaceable word is replaced, the per-document rate cannot
exceed the fraction of words covered by the dictionary.
Figure 11 Actual per-document replacement rate at target=70% for all eight languages. The dashed red line marks the
70% target. High-resource languages saturate around 50–56%; truly low-resource languages fall far below the target,
bounded by the size of the available bilingual vocabularies.
A.7 Reduced Vocabulary Size Experiments
Figure 12 ARC-Easy accuracy on German at 1.3B with bilingual vocabularies reduced to 1% of the original size (around
480 word pairs). Set 1 and Set 2 are two independently subsampled 1% vocabularies (different sampling seeds).
In addition to the experiments discussed in Section 6, we further reduced the original German vocabulary to 1% of
its initial size, i.e., to around 480 word pairs. This vocabulary was subsampled from the 10% vocabulary with two
different seeds. The results presented in Figure 12 demonstrate that when the bilingual vocabulary is small, it becomes
increasingly important which word pairs it contains (not only how many). Across the two independently sampled 1%
vocabularies, ARC-Easy accuracy varies by 1.5–1.7pp for both LINK_uni (39.0 vs. 37.4) and LINK_domain (38.5 vs. 36.9),
indicating that vocabulary composition becomes a primary driver of transfer at this scale. The stronger of the two
subsets still matches or exceeds the LR+HR baseline (38.0), while the weaker one falls slightly below it, implying that
with only a few hundred translation pairs available, careful selection of entries matters.
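The nested subsampling described above (full vocabulary, then 10%, then 1% with two seeds) could be sketched as follows. The helper `subsample_vocab`, the seeds, and the placeholder entries are hypothetical; only the vocabulary size of 48,195 entries comes from the text.

```python
import random
from typing import Dict


def subsample_vocab(vocab: Dict[str, str], fraction: float, seed: int) -> Dict[str, str]:
    """Return a random subset of `vocab` with about `fraction` of its entries."""
    keys = sorted(vocab)  # fixed order, so the seed fully determines the sample
    k = max(1, round(fraction * len(keys)))
    chosen = random.Random(seed).sample(keys, k)
    return {key: vocab[key] for key in chosen}


# Placeholder vocabulary with the size reported for German (48,195 entries).
full_vocab = {f"en_{i}": f"de_{i}" for i in range(48_195)}
vocab_10pct = subsample_vocab(full_vocab, 0.10, seed=0)
set_1 = subsample_vocab(vocab_10pct, 0.10, seed=1)
set_2 = subsample_vocab(vocab_10pct, 0.10, seed=2)
```

Both 1% sets have the same size but different contents, so any difference in downstream accuracy between them reflects which word pairs were kept rather than how many.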
A.8 Data-Constrained vs Low-Resource
Throughout this work, we use the term data-constrained rather than low-resource to describe our experimental
settings. While the two terms refer to overlapping concepts, they highlight distinct challenges. Our method is
broadly applicable to any language for which training data is limited, regardless of whether it is traditionally
classified as low-resource.
Several factors motivate this distinction. First, truly low-resource languages typically have limited bilingual
vocabulary coverage, whereas our simulated settings use languages with large bilingual dictionaries. Second,
our simulated settings are constructed by downsampling web-crawled data, which preserves the topical
diversity of the original corpus. In practice, low-resource language data is often drawn from a narrow set of
sources (religious texts, government documents, or Wikipedia), resulting in a skewed domain distribution that
our downsampling procedure does not capture. Third, low-resource languages frequently lack standardized
evaluation benchmarks, limiting our ability to assess model performance comprehensively.
To avoid conflating these distinct challenges, we reserve the term low-resource for the experiments in
Section 5.3 and use data-constrained elsewhere. Supporting truly low-resource languages nevertheless remains a
core motivation of this work.
Apple and the Apple logo are trademarks of Apple Inc., registered in the U.S. and other countries and regions.