English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
Summary
This study investigates the impact of multilingual post-training on large language models (LLMs), challenging the common practice of English-centric fine-tuning. Using 220 training runs across two model families (Qwen-3 and Gemma-3) and 22 multilingual data mixtures, the researchers evaluated mathematical reasoning and API calling tasks. Key findings show that increasing language coverage during post-training generally improves performance across all languages, with low-resource languages benefiting the most. Even minimal multilinguality (adding a single non-English language) enhances English performance and cross-lingual generalization, making English-only training suboptimal. High language diversity enables strong zero-shot cross-lingual transfer, often matching or exceeding direct language inclusion, though typologically distant, low-resource languages see limited gains. The study also reveals that model capacity affects outcomes: the smallest model (Qwen-3 0.6B) shows slight degradation on API calling at higher multilinguality. These results highlight the benefits of diverse multilingual training for improving both English and non-English performance, while acknowledging limitations in representing global linguistic diversity and the potential for translation artifacts in dataset construction.
Preprint. Under review.

English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

Mehak Dhaliwal1*, Shashwat Chaurasia2, Yao Qin1, Dezhi Hong2†, Thomas Butler2
1UC Santa Barbara, 2Amazon

Abstract

Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

1 Introduction

Large Language Models (LLMs) have achieved strong performance across a wide range of tasks, driven by capabilities such as reasoning, instruction following, and structured generation. These capabilities are largely enabled by a multi-stage pipeline of large-scale pre-training followed by post-training on curated datasets (Shaham et al., 2024). Despite growing global use of LLMs, post-training remains predominantly English-centric. While task-specific English fine-tuning enables some degree of cross-lingual transfer, significant performance disparities
persist across languages (Khanuja et al., 2023). Prior work on multilingual post-training suggests a nuanced picture: replacing small amounts of English data with a limited number of additional languages can improve multilingual task performance, while larger substitutions often yield diminishing gains and may degrade English performance (Shaham et al., 2024; Kew et al., 2024). Similarly, work has shown that augmenting English task-specific data with multilingual data improves multilingual outcomes (Lai & Nissim, 2024; Ji & Chen, 2025; Shimabucoro et al., 2025).

*Work conducted during an internship at Amazon. Correspondence to: mdhaliwal@ucsb.edu
†Work conducted while at Amazon.
arXiv:2604.13286v1 [cs.CL] 14 Apr 2026

However, these findings remain fragmented and leave several questions unresolved. Existing studies typically focus on a narrow set of languages, tasks, and/or model scales, making it difficult to understand how increasing language coverage systematically affects model behavior. In particular, we lack a clear characterization of how trade-offs manifest across (i) English vs. non-English performance, (ii) different task types, and (iii) varying model capacities. This gap is especially important in light of multilingual pre-training literature, which highlights a fundamental trade-off under fixed model capacity: expanding language coverage can dilute performance in high-resource languages, a phenomenon often termed the "curse of multilinguality" or "negative interference" (Conneau et al., 2020; Chang et al., 2024; Longpre et al., 2025). While cross-lingual transfer can provide gains, it is still unclear
when benefits outweigh interference effects during post-training and how these effects scale with model size and task complexity.

[Figure 1 diagram: a multilingual parallel data pool (core languages En, Es, Ja, Bn, Sw; additional languages Fr, De, Ru, Zh, Th) feeds 22 data mixtures of 1 to 10 languages, each used to train 5 models on 2 tasks: 2 tasks x 22 data mixtures x 5 models = 220 training runs.]

Figure 1: Overview of the experimental design. We start from a task-specific multilingual parallel data pool consisting of five core languages, which are used to construct exhaustive data mixture combinations, and five additional languages that enable scaling the experiments to up to ten languages. We generate 22 data mixtures with increasing language counts and varying combinations; within the multilingual data mixtures, languages shown in blue indicate one possible combination, while languages shown in grey are fixed for the corresponding number of languages. We train five models from two model families on each mixture for two tasks, resulting in 22 × 2 × 5 = 220 total training runs.

We present a systematic study of multilingual task-specific post-training across two tasks, two model families, and 220 training runs. Our controlled scaling design varies
multilinguality by increasing the number of languages in the post-training mixture, using parallel translated task data to avoid conflating our results with the effect of simply increasing the dataset size with additional iid data (Shimabucoro et al., 2025; Ji & Chen, 2025). We train models on mixtures of up to 10 languages (Figure 1) using the Qwen-3 family (0.6B, 1.7B, 8B) (Yang et al., 2025) and Gemma-3 (1B, 4B) (Kamath et al., 2025). Evaluation covers mathematical reasoning and API calling across six languages that span different resource levels, language families, regions, and scripts (Table 1). We summarize our key findings as follows:

1. Multilingual scaling is largely beneficial across tasks and model scales: increasing language coverage generally improves or maintains performance across tasks and model scales, with low-resource languages continuing to benefit from added multilinguality while high-resource languages plateau rather than degrade.

2. Even limited language diversity helps: adding parallel data in even a single non-English language typically results in gains that generalize beyond the added language to other languages, including English, making English-only post-training consistently suboptimal.

3. High diversity enables strong zero-shot cross-lingual transfer: increased language diversity during post-training enables strong zero-shot cross-lingual transfer that can match or exceed the effects of direct language inclusion in low-diversity settings, though with limitations for typologically distant, low-resource languages.

Together, these findings highlight the limitations of predominantly English-centric post-training, showing
that increasing language diversity can systematically improve both English performance and cross-lingual generalization.

2 Related Work

Multilingual Post-Training. Prior work on multilingual post-training primarily falls into two paradigms: data substitution, which replaces part of English data with multilingual data, and data augmentation, which adds multilingual data on top of an English baseline. Substitution-based studies find that small replacements can improve multilingual performance, but larger substitutions often yield diminishing returns, degrade English performance, and offer limited benefits for low-resource languages, highlighting trade-offs under a fixed data budget (Shaham et al., 2024; Kew et al., 2024).

Our work adopts the data augmentation paradigm, reflecting practical post-training settings where English data remains fixed and additional multilingual data is added to improve generalization (Shimabucoro et al., 2025). Prior augmentation work typically samples data uniformly from a fixed set of languages and reports mixed findings: some observe gains only for parameter-efficient methods like LoRA (Chen et al., 2024), while others also find improvements under full fine-tuning (Shimabucoro et al., 2025; Lai & Nissim, 2024). However, uniform sampling makes it difficult to disentangle whether gains stem from increased data volume or from language diversity. To better isolate these effects, we explicitly control language coverage during training. Similar to Ji & Chen (2025), we incrementally add languages, but rather than following a fixed expansion order, we systematically vary both the number
and composition of languages in each mixture. While Ji & Chen (2025) report strong gains from including the test language during training, our results further highlight the role of language diversity in driving strong cross-lingual generalization, even in the absence of direct test-language inclusion.

Task-Dependent Effects of Multilinguality. Beyond training mixture design, prior work suggests that the impact of multilinguality also depends on the task. Linguistically driven generative tasks, such as summarization or open-ended dialogue, tend to benefit more from multilingual training than highly structured tasks such as classification or reasoning (Shimabucoro et al., 2025; Kew et al., 2024). We build on this observation by evaluating multilingual post-training across two complementary task types: mathematical reasoning (symbolic reasoning) and API calling (structured generation). While mathematical reasoning has been examined in prior multilingual studies (Lai & Nissim, 2024; Shimabucoro et al., 2025), the multilingual dimension of API calling has received less systematic attention. Huang et al. (2025) include a multilingual function-calling evaluation task across 17 languages as part of a broader benchmark suite, and Kulkarni et al. (2025) introduce MASSIVE-Agents, a multilingual function-calling benchmark spanning 52 languages. However, neither work studies how multilingual post-training mixtures affect function-calling.

3 Experimental Details

3.1 Tasks and Datasets

We evaluate multilingual fine-tuning on two tasks covering different aspects of model capability: Mathematical Reasoning for symbolic reasoning, and API calling for
structured generation.

Task 1: Mathematical Reasoning. Mathematical reasoning is a widely used task for evaluating the reasoning capabilities of large language models (Shi et al., 2022; Lai & Nissim, 2024). We examine how multilingual exposure during fine-tuning impacts mathematical reasoning performance across our evaluated languages. For training, we use mCoT-MATH (Lai & Nissim, 2024), a large-scale multilingual math reasoning dataset that provides chain-of-thought solutions for math word problems in 11 languages. During both training and evaluation, we elicit chain-of-thought reasoning by prompting the model with the language-specific equivalent of the phrase "Let's think step by step" immediately before answer generation, consistent with the prompting format used in mCoT-MATH. For each language, we sample 10,000 parallel examples for training and 200 examples for validation. For testing, we use the MGSM benchmark (Shi et al., 2022), a human-translated set of 250 grade-school arithmetic reasoning problems for multilingual evaluation.

Language | Resource | Family | Sub-Family | Script
Evaluation + Training Languages
English (En) | High | Indo-European | Germanic | Latin
Spanish (Es) | High | Indo-European | Romance | Latin
Japanese (Ja) | High | Japonic | - | Kanji+Kana
Bengali (Bn) | Low | Indo-European | Indic | Bengali
Swahili (Sw) | Low | Niger-Congo | - | Latin
Unseen Evaluation Language
Telugu (Te) | Low | Dravidian | Indic | Telugu
Training-Only Languages
French (Fr) | High | Indo-European | Romance | Latin
German (De) | High | Indo-European | Germanic | Latin
Russian (Ru) | High | Indo-European | Slavic | Cyrillic
Chinese (Zh) | High | Sino-Tibetan | - | Han
Thai (Th) | Low | Kra-Dai | - | Thai
Table 1: Languages used in our study, grouped by their role in training and evaluation.

Evaluation: Following prior work, we extract the model's final predicted answer from its generated output and compute accuracy against the ground-truth answer.

Task 2: API Calling. Tool-augmented LLMs offer numerous benefits such as improved real-time access, reduced hallucination, and more efficient workflows (Qu et al., 2025); however, most current datasets remain English-centric, limiting multilingual tool-use. Therefore, we introduce mAPICall-Bank, a multilingual dataset for training and evaluating API calling across 11 languages. Built on API-Bank (Li et al., 2023), which assesses LLMs' ability to call external tools in realistic, multi-turn dialogue scenarios across diverse domains and APIs, mAPICall-Bank focuses on the API calling subtask where models generate the correct API invocation given a user utterance and a candidate API pool. We construct the dataset by translating API-Bank into 11 languages using a state-of-the-art LLM (see Appendix A for the prompt); it contains 3,174 training and 399 test examples per language, with non-overlapping APIs between splits. In our experiments, we hold out 150 examples from the training split for validation. To our knowledge, mAPICall-Bank is one of the first multilingual API calling datasets, and we publicly release scripts to regenerate it with different LLMs as the translation engine, to support future research. We show dataset statistics in Table 4 in Appendix A.

Evaluation: We parse the model's output to extract the API name and the dictionary of argument-value pairs. A prediction is marked correct only if the
API name, all argument names, and all corresponding argument values exactly match the ground truth.

3.2 Multilingual Training Setup

We use a set of eleven typologically diverse languages (Table 1), following prior multilingual work (Shi et al., 2022; Lai & Nissim, 2024), which form the basis for our training mixtures and evaluation settings. Our evaluation focuses on five "core" languages: English (En), Spanish (Es), Japanese (Ja), Bengali (Bn), and Swahili (Sw), chosen to span a range of resource levels, language families, geographic regions, and writing systems. To further test generalization to unseen languages, we additionally evaluate on Telugu (Te), a low-resource language not included in any training mixture. The remaining languages, French (Fr), Thai (Th), German (De), Russian (Ru), and Chinese (Zh), are used only to expand multilingual training mixtures beyond the evaluation set.

Using these languages, we construct a series of training mixtures that progressively increase language diversity while keeping comparisons controlled (Figure 1). In each mixture, all included languages contribute the same number of parallel examples, which isolates the impact of language coverage from differences in data volume. Starting from the five core languages, we construct mixtures with progressively more languages, introducing additional ones to increase diversity.
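As a concrete illustration, the 22 mixtures suggested by Figure 1 can be enumerated in a few lines. The language codes, grouping, and exact subset choices below are one consistent reading of the figure, inferred rather than taken from released code:

```python
from itertools import combinations

CORE = ["en", "es", "ja", "bn", "sw"]          # evaluation + training languages
ADDITIONAL = ["fr", "de", "ru", "zh", "th"]    # training-only languages

def mixtures():
    """Enumerate 22 training mixtures matching Figure 1's counts.

    Assumed reading of the figure: N=1 is English only (x1); N=2 pairs
    English with each other core language (x4); N=4 takes all 4-subsets
    of the core set (x5); N=5 is all core languages (x1); N=6 adds one
    additional language to the core (x5); N=9 adds 4 of the 5 additional
    languages (x5); N=10 uses all ten languages (x1).
    """
    mixes = [("en",)]                                                 # N=1
    mixes += [("en", lang) for lang in CORE if lang != "en"]          # N=2
    mixes += list(combinations(CORE, 4))                              # N=4
    mixes += [tuple(CORE)]                                            # N=5
    mixes += [tuple(CORE) + (a,) for a in ADDITIONAL]                 # N=6
    mixes += [tuple(CORE) + extra for extra in combinations(ADDITIONAL, 4)]  # N=9
    mixes += [tuple(CORE + ADDITIONAL)]                               # N=10
    return mixes
```

Within each mixture, every included language would then contribute the same number of parallel examples, per the balanced-sampling design of Section 3.2.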
For intermediate diversity levels (e.g., 2, 4, 6, or 9 languages), we evaluate multiple combinations of the core languages, as indicated by the "x4" or "x5" configurations in Figure 1. This design ensures that observed effects reflect general trends across language combinations rather than any single language, while still enabling controlled analysis of individual language inclusion.

Figure 2: Effect of increasing training language coverage on model performance for Qwen-3 and Gemma-3 models. Plots show average accuracy (%) with 95% Wilson confidence intervals as a function of the number of training languages, grouped by high-resource and low-resource evaluation languages, for API calling (top) and math reasoning (bottom).

3.3 Model Backbones

We experiment with two open-weight LLM families: Qwen-3 (0.6B, 1.7B, 8B) (Yang et al., 2025) and Gemma-3 (1B, 4B) (Kamath et al., 2025). For each model, we use the officially released pretraining-stage checkpoints to ensure that observed performance reflects task-specific fine-tuning on our dataset rather than prior instruction tuning.

3.4 Training Details

We train each model on each data mixture for six epochs using 8 NVIDIA A100 80GB GPUs. All models use the AdamW optimizer with a learning rate of 1e-5, a cosine learning rate scheduler with a 3% warmup ratio, and a weight decay of 0.01.
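The error bars in Figure 2 are 95% Wilson score intervals over per-language accuracies. A minimal sketch of that interval (the standard formula, not the paper's evaluation code):

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for an accuracy of k correct out of n.

    Sketch of the Figure 2 error bars; z=1.96 gives the 95% level.
    """
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half
```

For example, 200 correct out of MGSM's 250 test items gives an interval of roughly (0.746, 0.845) around the 0.8 point estimate.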
We maintain an effective batch size of 64 per GPU (global batch size of 512), adjusting the micro-batch size and gradient accumulation steps as needed based on model memory requirements. Gradient checkpointing is enabled to reduce memory usage. For each mixture, we select the checkpoint with the highest average validation set task accuracy across all languages in that mixture, ensuring that selection does not implicitly favor any individual language within the mix.

4 Results

4.1 Multilingual Scaling is Largely Beneficial Across Tasks and Model Scales

We first analyze the overall impact of scaling multilinguality by increasing training language coverage. Figure 2 reports mean accuracy (95% Wilson confidence intervals) for Qwen-3 and Gemma-3 models of varying sizes, evaluated across both tasks and grouped by high- and low-resource languages, as a function of the number of training languages.

Figure 3: Median accuracy (%) change from bilingual versus English-only post-training across evaluation settings for API calling and math reasoning. Error bars show 95% bootstrap confidence intervals. Bilingual training yields consistent gains across evaluation settings, with the largest improvements under direct exposure and smaller but reliable gains under cross-lingual transfer.

Across both model families and tasks, performance generally improves or remains stable as more training languages are added, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. This holds
consistently across model scales of 1B parameters and above, suggesting that at sufficient capacity, increasing language coverage during post-training is largely beneficial without incurring negative transfer.

An exception to this trend arises for the smallest model (Qwen-3 0.6B) on API calling, where performance initially improves with increasing multilinguality (up to four languages for high-resource and five for low-resource settings) before showing a slight decline as additional languages are introduced, suggesting capacity-driven multilingual interference at this scale. We do not observe similar degradation for mathematical reasoning or in larger models, indicating that this effect is confined to sub-1B models and the more structured API calling task. Due to experimental noise, we validate these trends using a pooled regression analysis in Section 4.4 across the full set of experiments. This provides additional evidence for the benefits of increasing language coverage during training. Additional per-language trends are provided in Appendix C.

In the following subsections, we zoom in on specific regions of this trend to better understand the effects of multilinguality. We first examine the low-diversity (bilingual) setting (Section 4.2), and then analyze the high-diversity regime (Section 4.3), where increased language coverage enables stronger cross-lingual generalization.

4.2 English Is Not All You Need: Even Minimal Multilinguality Helps

We next examine the low-diversity (bilingual) setting to understand whether even minimal multilinguality is beneficial. Specifically, we compare English-only post-training to bilingual training that
includes English and one additional language. Figure 3 reports median accuracy differences across English and non-English evaluations (95% bootstrap confidence intervals), while Table 2 summarizes win rates across configurations. For non-English evaluations, we distinguish between direct exposure, where the evaluation language is included during training, and cross-lingual transfer, where it is not.

Across evaluation settings, bilingual training consistently outperforms English-only post-training. The largest gains occur under direct exposure, yielding median improvements of 9.27% for API calling and 8.4% for mathematical reasoning, with wins in 87.5% of configurations.

Evaluation Setting | Win Rate | 95% CI
EN | 75.0% | [59.8, 85.8]
Non-EN (Direct Exposure) | 87.5% | [73.9, 94.5]
Non-EN (Cross-Lingual Transfer) | 74.4% | [67.1, 80.5]

Table 2: Win rates of bilingual post-training over English-only post-training across evaluation settings, aggregated across tasks and models. A win is defined as a configuration where bilingual training achieves higher accuracy than English-only training.

These benefits extend beyond the added language. Even when the evaluation language is absent from training, bilingual models improve performance through cross-lingual transfer (+3.38% for API calling and +1.6% for mathematical reasoning), outperforming English-only training in 74.4% of configurations. Notably, multilinguality also improves English performance, with median gains of 0.88% for API calling and 3.4% for mathematical reasoning, and wins in 75% of configurations.
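Figure 3 reports median accuracy changes with 95% bootstrap confidence intervals. A percentile-bootstrap sketch over per-configuration accuracy differences (resampling details are assumptions, not the paper's exact procedure):

```python
import random
import statistics

def bootstrap_median_ci(diffs, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of accuracy differences.

    diffs: per-configuration accuracy changes (bilingual minus
    English-only), e.g. in percentage points. Resamples with
    replacement and takes the empirical (alpha/2, 1 - alpha/2) band.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo = medians[int(alpha / 2 * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

An interval that excludes zero would then correspond to a reliable median gain in the sense of the Figure 3 error bars.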
We provide fine-grained per-language results for English-only and bilingual post-training relative to pretraining in Appendix B, shown as heatmaps over source-target language pairs. As expected, post-training in nearly any language and setting improves performance across all languages.

Together, these results show that English-only post-training is suboptimal: even minimal multilingual exposure yields gains that generalize across languages, tasks, and model scales. We next turn to higher-diversity settings to examine how increasing language coverage further shapes cross-lingual generalization.

4.3 High Linguistic Diversity Supports Generalization Comparable to Direct Exposure

Figure 4: Comparison of zero-shot cross-lingual transfer versus direct bilingual exposure at varying levels of language diversity for Qwen-3 8B (top) and Gemma-3 4B (bottom), with 4 (left), 6 (middle), and 9 (right) training languages. High-resource languages (red) tend to cluster near the diagonal, indicating strong zero-shot generalization that can compensate for the absence of direct inclusion. Low-resource languages (blue) more often fall below the diagonal, suggesting greater benefit of direct inclusion.

Having established that even limited multilingual diversity is beneficial, we now ask how far these generalization effects extend. Specifically, can sufficient language diversity during post-training compensate entirely for the absence of the target language in training?
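The per-language comparisons in this section rely on significance tests between zero-shot and direct-exposure accuracies (the paper reports two-sided t-tests). As a rough, closely related stand-in on per-item correctness counts, a pooled two-proportion z-test conveys the idea; this is an illustrative assumption, not the paper's exact procedure:

```python
import math

def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided two-proportion z-test for a difference in accuracies.

    k1/n1: correct/total for one system (e.g. zero-shot transfer);
    k2/n2: for the other (e.g. bilingual direct exposure).
    Returns (z, p_value) using the pooled standard error.
    """
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled success rate under the null
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Normal CDF via the error function; two-sided tail probability.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With MGSM's 250 items per language, two systems at 80% vs. 60% accuracy differ significantly, while identical accuracies give p = 1.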
Figure 4 compares a low-diversity bilingual setting with direct exposure to the evaluation language against higher-diversity settings (4, 6, or 9 languages) where the evaluation language is excluded and performance relies entirely on zero-shot cross-lingual transfer. In the figure, the diagonal indicates parity between the two settings, with points near or above it indicating that zero-shot transfer matches or exceeds direct exposure. We present results for the largest models in each family (Qwen-3 8B and Gemma-3 4B) in the main text, with additional model sizes provided in Appendix D.

High-resource non-English languages generalize well under increased diversity. Results for high-resource languages (red) tend to lie close to the diagonal, especially at higher diversity levels (6 and 9 languages). In particular, zero-shot performance is statistically indistinguishable from or significantly exceeds that of bilingual direct exposure in 100% of cases for the 9- and 6-language settings, and in 87.5% of cases for the 4-language setting (two-sided t-test), suggesting that linguistic diversity alone is sufficient to drive strong generalization for these languages.

Low-resource, typologically distant languages benefit more from direct inclusion. Results for low-resource languages (blue) more often fall below the diagonal, indicating that zero-shot cross-lingual transfer is less able to compensate for the absence of direct training exposure. Even in the 9-language setting, zero-shot performance is statistically indistinguishable from direct exposure in 62.5% of cases, but never significantly exceeds it, in contrast to
high-resource languages, where zero-shot transfer matches or exceeds direct exposure, indicating a stronger reliance on explicit inclusion for these languages.

English also benefits from zero-shot cross-lingual transfer. Despite being the dominant language in post-training, English performance also improves with increased language diversity even without direct English supervision. Across tasks, zero-shot cross-lingual performance significantly exceeds or matches bilingual direct exposure in 100% of cases for the 9- and 6-language settings, and in 75% of cases for the 4-language setting. API calling shows stronger gains from increased language diversity, with cross-lingual performance significantly exceeding direct exposure in 50% of cases and never significantly underperforming, suggesting that models can leverage structural and lexical variation from diverse languages even without direct English supervision.

Prior work has shown a strong reliance on English for reasoning processes even in multilingual settings, suggesting that mathematical reasoning may require more English supervision for optimal performance (Etxaniz et al., 2024; Schut et al., 2025). Our results show that this dependence diminishes with increased language diversity. Even for mathematical reasoning evaluated in English, zero-shot cross-lingual performance matches direct exposure when 6 or more languages are included during training, though results are more variable at lower diversity (4 languages), where 50% of cases perform worse than direct exposure. Overall, these results show that at sufficient language diversity, zero-shot cross-lingual transfer can match or even exceed
Overall, these results show that at sufficient language diversity, zero-shot cross-lingual transfer can match or even exceed direct language inclusion for high-resource and English evaluations, though limitations remain for low-resource, typologically distant languages. This interpretation is further supported by the pooled regression in Section 4.4, which shows that broader training-language coverage remains positively associated with performance even after controlling for direct target-language inclusion.

4.4 Pooled Regression Analysis

To test whether the patterns in Sections 4.1–4.3 persist in aggregate, we fit a pooled regression model over evaluation instances, where each instance corresponds to evaluating a trained model on a task-language pair. The regression includes model family, task, pretrained-only status, whether the evaluation language appears in the training mixture, log10 model size, and a transformed measure of training-language coverage, defined as √(L/Lmax) to allow diminishing returns with additional languages. We fit the regression on a random 70% split of evaluation instances and evaluate on the remaining 30%, where it achieves an out-of-sample R² of 80.5%. We report coefficients with bootstrap 95% confidence intervals. We interpret this analysis as complementing the analyses above with an aggregate summary over evaluation instances.
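On synthetic data, the regression setup described above can be sketched as follows; the feature encoding, the Lmax value, and the OLS-plus-bootstrap implementation are our assumptions for illustration, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)
L_MAX = 11  # assumed maximum number of training languages

# Synthetic evaluation instances (illustrative only, not the paper's data).
n = 500
size = rng.uniform(8.6, 9.9, n)                        # log10 parameter count
qwen = rng.integers(0, 2, n).astype(float)             # model-family indicator
math_task = rng.integers(0, 2, n).astype(float)        # task indicator
pretrained = rng.integers(0, 2, n).astype(float)       # pretrained-only indicator
included = rng.integers(0, 2, n).astype(float)         # eval language in training mix
coverage = np.sqrt(rng.integers(1, L_MAX + 1, n) / L_MAX)  # sqrt(L / Lmax)
X = np.column_stack([np.ones(n), size, qwen, math_task, pretrained, included, coverage])
beta_true = np.array([-2.0, 0.30, 0.27, -0.21, -0.11, 0.09, 0.05])
y = X @ beta_true + rng.normal(0, 0.05, n)

# Random 70/30 split, ordinary least squares, out-of-sample R^2.
idx = rng.permutation(n)
train, test = idx[:350], idx[350:]
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
resid = y[test] - X[test] @ beta
r2 = 1 - resid.var() / y[test].var()

# Bootstrap 95% CI for the coverage coefficient (last feature).
boot = []
for _ in range(500):
    b = rng.integers(0, len(train), len(train))
    bb, *_ = np.linalg.lstsq(X[train][b], y[train][b], rcond=None)
    boot.append(bb[-1])
lo, hi = np.percentile(boot, [2.5, 97.5])
```

Because √(L/Lmax) is concave in L, a positive coefficient on this feature implies diminishing marginal returns from each additional training language.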
Feature                      Coef.    2.5%    97.5%
log10 model size             0.302    0.285    0.319
Qwen model family            0.265    0.248    0.281
Math task                   -0.207   -0.221   -0.193
Pretrained only             -0.110   -0.167   -0.047
Eval. language in training   0.089    0.074    0.104
√(L/Lmax)                    0.053    0.018    0.089

Table 3: Pooled regression over evaluation instances, where each instance corresponds to evaluating a trained model on a task-language pair. The model is trained on a random 70% split of instances and evaluated on the remaining 30%, achieving an out-of-sample R² of 80.5%. Coefficients are reported with bootstrap 95% confidence intervals. Here, L denotes the number of training languages and Lmax the maximum number of training languages in our experiments.

The pooled analysis confirms several expected trends, including benefits from larger models, post-training, and direct inclusion of the evaluation language. Importantly, broader training-language coverage is positively associated with performance even after controlling for direct target-language inclusion (β = 0.053, 95% CI [0.018, 0.089]). This is consistent with the discussion above, indicating that multilingual gains are not explained solely by direct language exposure.

5 Conclusion

This work presents a study of multilingual post-training under realistic conditions, examining how language coverage and composition interact with model capacity and task structure. Our analysis spans both reasoning and tool-use settings. Our findings show that English-centric post-training is typically suboptimal for cross-lingual transfer. Increasing language coverage helps performance for all languages, including English, and can in some cases compensate for a lack of direct inclusion of the target language in training, although these gains are limited for typologically distant, low-resource languages. We find that, with limited exceptions for API calling at the smallest scales (≤ 1B), models benefit from increased multilingual diversity, particularly for low-resource languages, without degrading performance in high-resource languages.
6 Limitations

Our work provides a systematic study of multilingual post-training under controlled conditions, but several limitations remain. First, our analysis focuses on 11 languages, which do not capture the full global linguistic diversity of thousands of languages worldwide. To partially address this, we select languages spanning a broad range of resource levels, language families, geographic regions, and scripts. Second, although our experiments span multiple model scales and two model families, the largest model we consider contains 8B parameters, and it remains an open question how multilingual scaling effects evolve at larger model sizes. To isolate the effects of language coverage, we hold the amount of data per language constant and therefore do not examine how jointly scaling data volume and language diversity interacts with model capacity or task difficulty. Additionally, our multilingual datasets are constructed via translation of English task data, which may introduce translation-specific artifacts (e.g., "translationese"); future work could investigate whether similar effects hold for naturally occurring multilingual data. Finally, our analysis focuses on mathematical reasoning and API calling as post-training tasks; other task domains may exhibit different behavior under multilingual scaling.

References

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? Language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4074–4096, 2024.

Pinzhen Chen, Shaoxiong Ji,
Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. Monolingual or multilingual instruction tuning: Which makes a better Alpaca? In Findings of the Association for Computational Linguistics: EACL 2024, pp. 1347–1356, 2024.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, 2020.

Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, and Mikel Artetxe. Do multilingual language models think better in English? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 550–564, 2024.

Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. BenchMAX: A comprehensive multilingual evaluation suite for large language models. arXiv preprint arXiv:2502.07346, 2025.

Shaoxiong Ji and Pinzhen Chen. How many languages make good multilingual instruction tuning? A case study on BLOOM. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 2575–2581, 2025.

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

Tannon Kew, Florian Schottmann, and Rico Sennrich. Turning English-centric LLMs into polyglots:
How much multilinguality is needed? In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13097–13124, 2024.

Simran Khanuja, Sebastian Ruder, and Partha Talukdar. Evaluating the diversity, equity, and inclusion of NLP technology: A case study for Indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pp. 1763–1777, 2023.

Mayank Kulkarni, Vittorio Mazzia, Judith Gaspers, Christopher Hench, Jack FitzGerald, and Amazon AGI. MASSIVE-Agents: A benchmark for multilingual function-calling in 52 languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 20193–20215, 2025.

Huiyuan Lai and Malvina Nissim. mCoT: Multilingual instruction tuning for reasoning consistency in language models. arXiv preprint arXiv:2406.02301, 2024.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244, 2023.

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. arXiv preprint arXiv:2510.22037, 2025.

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. Frontiers of Computer Science, 19(8):198343, 2025.

Lisa Schut, Yarin Gal, and Sebastian Farquhar. Do multilingual LLMs think in English? arXiv preprint arXiv:2502.15603, 2025.
Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Multilingual instruction tuning with just a pinch of multilinguality. arXiv preprint arXiv:2401.01854, 2024.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.

Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, and Sebastian Ruder. A post-trainer's guide to multilingual training data: Uncovering cross-lingual transfer dynamics. arXiv preprint arXiv:2504.16677, 2025.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Figure 5: Example showing the final turns of a parallel multi-turn API calling interaction from mAPICall-Bank in three languages: (a) English, (b) Spanish, and (c) Bengali.

Figure 6: Prompt used to create mAPICall-Bank.

A mAPICall-Bank Construction

We construct mAPICall-Bank by translating the API calling subset of the original English API-Bank dataset into multiple target languages using a state-of-the-art large language model. The full translation prompt is provided in Figure 6.
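Structural fidelity of translated tool data of this kind can be checked mechanically, e.g. asserting that API names and argument keys survive translation and that outputs remain parseable. A minimal sketch, where the field names `api_name` and `arguments` are hypothetical stand-ins rather than the dataset's actual schema:

```python
import json

def parse_call(raw: str):
    """Parse a raw output string into (api_name, arguments), or None if malformed."""
    try:
        obj = json.loads(raw)
        return obj["api_name"], obj["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

def validate_translation(source: dict, translated: dict) -> list:
    """Check that translation kept the API name and argument keys unchanged
    (i.e., in English), while argument values are free to change."""
    errors = []
    if translated.get("api_name") != source.get("api_name"):
        errors.append("api_name changed by translation")
    src_keys = set(source.get("arguments", {}))
    tgt_keys = set(translated.get("arguments", {}))
    if src_keys != tgt_keys:
        errors.append(f"argument keys differ: {sorted(src_keys ^ tgt_keys)}")
    return errors

# A Spanish translation that keeps the API name and keys but translates the value.
src = {"api_name": "EmergencyKnowledge", "arguments": {"symptom": "headache"}}
tgt = {"api_name": "EmergencyKnowledge", "arguments": {"symptom": "dolor de cabeza"}}
ok = validate_translation(src, tgt) == []
```

A check of this shape passes the example above (only the user-facing value changed) and flags any translation that renames the API or drops an argument key.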
During translation, we preserve the full structure of each example, including user utterances, system responses, API names, and argument schemas. To maintain compatibility with tool specifications, API names and parameter keys are kept in English across all languages, while user-facing text and, when appropriate, argument values are translated into the target language. The translation prompt was iteratively refined to ensure structural validity and parseability of model outputs. In particular, we enforce that translated examples retain the original formatting and can be reliably parsed into API name and argument-value pairs. We perform automated checks to validate output structure (e.g., JSON format) and manually inspect a subset of examples across languages, with native speakers verifying linguistic accuracy and consistency with the source data.

Figure 6 shows the prompt used to translate API-Bank to create mAPICall-Bank. Figure 5 shows an example of a parallel multi-turn interaction involving a single API call across three languages. Dataset statistics, including the number of queries, unique APIs, and example APIs in the train and test splits of mAPICall-Bank, are shown in Table 4.

Split   #Queries   #APIs   Example APIs
Train   3,174      1,535   Get_All_Sessions, get_device_details, get_relaxation_techniques
Test    399        49      ModifyRegistration, Calculator, EmergencyKnowledge

Table 4: Statistics of the mAPICall-Bank dataset, including the number of user queries, unique APIs, and examples of API names in each split.

B Monolingual vs. Bilingual Transfer Heatmaps

Figure 7 presents heatmaps of accuracy change relative to the pretrained baseline for monolingual and bilingual training settings. Each cell shows the difference in performance when training on a given source language (or language pair) and evaluating on a target language. These results complement the findings in Section 4.2.
Consistent with the observation that English-only post-training is suboptimal, we find that adding a second language generally leads to positive gains across a wide range of evaluation languages. Notably, improvements are not limited to the added language: many bilingual configurations yield gains even when the evaluation language is not included in training. Further, the heatmaps show that gains are broadly distributed across language pairs rather than concentrated in a small subset, suggesting that the benefits of multilingual post-training are not driven by specific language combinations but reflect a more general effect of multilingual exposure.

Figure 7: Change in accuracy relative to the pretrained baseline for bilingual versus English-only post-training. Heatmaps show performance differences across evaluation languages when English is paired with a single additional language (top: API calling; bottom: math reasoning). Subfigures (a) and (b) correspond to Qwen-3 and Gemma-3.

Figure 8: Per-language multilingual scaling trends for (a) Qwen-3 models and (b) Gemma-3 models. Each cell shows mean accuracy as a function of the number of training languages for API calling (top rows) and math reasoning (bottom rows), with scatter points indicating individual results.
C Multilingual Scaling Trends Per Language

Figure 8 presents scaling trends for individual evaluation languages as a function of the number of training languages, for (a) Qwen-3 and (b) Gemma-3 models. These results complement the aggregated trends shown in Figure 2. Consistent with the main findings, we observe that, with the exception of the smallest 0.6B model, increasing the number of training languages generally improves or maintains performance across most languages and model sizes. Gains are particularly pronounced for low-resource languages, while high-resource languages tend to plateau as additional languages are introduced.

D Additional Results on Linguistic Diversity Driven Cross-Lingual Generalization

Figure 9 presents the corresponding plots from Section 4.3 for smaller-scale models (Qwen-3 1.7B, Qwen-3 0.6B, and Gemma-3 1B). We observe trends consistent with those reported for larger models: high-resource languages (in red) tend to lie close to the diagonal, indicating that language diversity enables strong zero-shot cross-lingual transfer that matches or approaches the performance of direct inclusion at these scales. Low-resource languages often benefit from explicit language inclusion, particularly at the smallest scale (Qwen-3 0.6B). This is consistent with our findings in Section 4.1, where we observe capacity-driven interference at higher levels of multilinguality for this model.
Figure 9: Comparison of zero-shot cross-lingual transfer and direct bilingual exposure for smaller models. Rows correspond to Qwen-3 1.7B (top), Qwen-3 0.6B (middle), and Gemma-3 1B (bottom), with columns showing 4 (left), 6 (middle), and 9 (right) training languages. Axes plot zero-shot accuracy against bilingual direct exposure for API calling and math reasoning across EN, ES, JA, BN, and SW.