English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
Summary
This study investigates the impact of multilingual post-training on large language models (LLMs), challenging the common practice of English-centric fine-tuning. Using 220 training runs across two model families (Qwen-3 and Gemma-3) and 22 multilingual data mixtures, the researchers evaluated mathematical reasoning and API calling tasks. Key findings show that increasing language coverage during post-training generally improves performance across all languages, with low-resource languages benefiting the most. Even minimal multilinguality (adding a single non-English language) enhances English performance and cross-lingual generalization, making English-only training suboptimal. High language diversity enables strong zero-shot cross-lingual transfer, often matching or exceeding direct language inclusion, though typologically distant, low-resource languages see limited gains. The study also reveals that model capacity affects outcomes: the smallest model (Qwen-3 0.6B) shows slight degradation on API calling at higher multilinguality. These results highlight the benefits of diverse multilingual training for improving both English and non-English performance, while acknowledging limitations in representing global linguistic diversity and the potential for translation artifacts in dataset construction.
Preprint. Under review.

English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

Mehak Dhaliwal1*, Shashwat Chaurasia2, Yao Qin1, Dezhi Hong2†, Thomas Butler2
1UC Santa Barbara, 2Amazon

Abstract

Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

1 Introduction

Large Language Models (LLMs) have achieved strong performance across a wide range of tasks, driven by capabilities such as reasoning, instruction following, and structured generation. These capabilities are largely enabled by a multi-stage pipeline of large-scale pre-training followed by post-training on curated datasets (Shaham et al., 2024). Despite growing global use of LLMs, post-training remains predominantly English-centric. While task-specific English fine-tuning enables some degree of cross-lingual transfer, significant performance disparities
persist across languages (Khanuja et al., 2023). Prior work on multilingual post-training suggests a nuanced picture: replacing small amounts of English data with a limited number of additional languages can improve multilingual task performance, while larger substitutions often yield diminishing gains and may degrade English performance (Shaham et al., 2024; Kew et al., 2024). Similarly, work has shown that augmenting English task-specific data with multilingual data improves multilingual outcomes (Lai & Nissim, 2024; Ji & Chen, 2025; Shimabucoro et al., 2025).

*Work conducted during an internship at Amazon. Correspondence to: mdhaliwal@ucsb.edu
†Work conducted while at Amazon.
arXiv:2604.13286v1 [cs.CL] 14 Apr 2026

However, these findings remain fragmented and leave several questions unresolved. Existing studies typically focus on a narrow set of languages, tasks, and/or model scales, making it difficult to understand how increasing language coverage systematically affects model behavior. In particular, we lack a clear characterization of how trade-offs manifest across (i) English vs. non-English performance, (ii) different task types, and (iii) varying model capacities. This gap is especially important in light of multilingual pre-training literature, which highlights a fundamental trade-off under fixed model capacity: expanding language coverage can dilute performance in high-resource languages, a phenomenon often termed the "curse of multilinguality" or "negative interference" (Conneau et al., 2020; Chang et al., 2024; Longpre et al., 2025). While cross-lingual transfer can provide gains, it is still unclear
when benefits outweigh interference effects during post-training and how these effects scale with model size and task complexity.

[Figure 1 diagram: a multilingual parallel data pool (core languages En, Es, Ja, Bn, Sw; additional languages Fr, De, Ru, Zh, Th) feeds 22 data mixtures of 1 to 10 languages, each used to train 5 models on 2 tasks: 2 tasks x 22 data mixtures x 5 models = 220 training runs.]

Figure 1: Overview of the experimental design. We start from a task-specific multilingual parallel data pool consisting of five core languages, which are used to construct exhaustive data mixture combinations, and five additional languages that enable scaling the experiments to up to ten languages. We generate 22 data mixtures with increasing language counts and varying combinations; within the multilingual data mixtures, languages shown in blue indicate one possible combination, while languages shown in grey are fixed for the corresponding number of languages. We train five models from two model families on each mixture for two tasks, resulting in 22 × 2 × 5 = 220 total training runs.

We present a systematic study of multilingual task-specific post-training across two tasks, two model families, and 220 training runs. Our controlled scaling design varies
multilinguality by increasing the number of languages in the post-training mixture, using parallel translated task data to avoid conflating our results with the effect of simply increasing the dataset size with additional iid data (Shimabucoro et al., 2025; Ji & Chen, 2025). We train models on mixtures of up to 10 languages (Figure 1) using the Qwen-3 family (0.6B, 1.7B, 8B) (Yang et al., 2025) and Gemma-3 (1B, 4B) (Kamath et al., 2025). Evaluation covers mathematical reasoning and API calling across six languages that span different resource levels, language families, regions, and scripts (Table 1). We summarize our key findings as follows:

1. Multilingual scaling is largely beneficial across tasks and model scales: increasing language coverage generally improves or maintains performance across tasks and model scales, with low-resource languages continuing to benefit from added multilinguality while high-resource languages plateau rather than degrade.

2. Even limited language diversity helps: adding parallel data in even a single non-English language typically results in gains that generalize beyond the added language to other languages, including English, making English-only post-training consistently suboptimal.

3. High diversity enables strong zero-shot cross-lingual transfer: increased language diversity during post-training enables strong zero-shot cross-lingual transfer that can match or exceed the effects of direct language inclusion in low-diversity settings, though with limitations for typologically distant, low-resource languages.

Together, these findings highlight the limitations of predominantly English-centric post-training, showing
that increasing language diversity can systematically improve both English performance and cross-lingual generalization.

2 Related Work

Multilingual Post-Training. Prior work on multilingual post-training primarily falls into two paradigms: data substitution, which replaces part of English data with multilingual data, and data augmentation, which adds multilingual data on top of an English baseline. Substitution-based studies find that small replacements can improve multilingual performance, but larger substitutions often yield diminishing returns, degrade English performance, and offer limited benefits for low-resource languages, highlighting trade-offs under a fixed data budget (Shaham et al., 2024; Kew et al., 2024).

Our work adopts the data augmentation paradigm, reflecting practical post-training settings where English data remains fixed and additional multilingual data is added to improve generalization (Shimabucoro et al., 2025). Prior augmentation work typically samples data uniformly from a fixed set of languages and reports mixed findings: some observe gains only for parameter-efficient methods like LoRA (Chen et al., 2024), while others also find improvements under full fine-tuning (Shimabucoro et al., 2025; Lai & Nissim, 2024). However, uniform sampling makes it difficult to disentangle whether gains stem from increased data volume or from language diversity. To better isolate these effects, we explicitly control language coverage during training. Similar to Ji & Chen (2025), we incrementally add languages, but rather than following a fixed expansion order, we systematically vary both the number
and composition of languages in each mixture. While Ji & Chen (2025) report strong gains from including the test language during training, our results further highlight the role of language diversity in driving strong cross-lingual generalization, even in the absence of direct test-language inclusion.

Task-Dependent Effects of Multilinguality. Beyond training mixture design, prior work suggests that the impact of multilinguality also depends on the task. Linguistically driven generative tasks, such as summarization or open-ended dialogue, tend to benefit more from multilingual training than highly structured tasks such as classification or reasoning (Shimabucoro et al., 2025; Kew et al., 2024). We build on this observation by evaluating multilingual post-training across two complementary task types: mathematical reasoning (symbolic reasoning) and API calling (structured generation). While mathematical reasoning has been examined in prior multilingual studies (Lai & Nissim, 2024; Shimabucoro et al., 2025), the multilingual dimension of API calling has received less systematic attention. Huang et al. (2025) include a multilingual function-calling evaluation task across 17 languages as part of a broader benchmark suite, and Kulkarni et al. (2025) introduce MASSIVE-Agents, a multilingual function-calling benchmark spanning 52 languages. However, neither work studies how multilingual post-training mixtures affect function-calling.

3 Experimental Details

3.1 Tasks and Datasets

We evaluate multilingual fine-tuning on two tasks covering different aspects of model capability: Mathematical Reasoning for symbolic reasoning, and API calling for
structured generation.

Task 1: Mathematical Reasoning. Mathematical reasoning is a widely used task for evaluating the reasoning capabilities of large language models (Shi et al., 2022; Lai & Nissim, 2024). We examine how multilingual exposure during fine-tuning impacts mathematical reasoning performance across our evaluated languages. For training, we use mCoT-MATH (Lai & Nissim, 2024), a large-scale multilingual math reasoning dataset that provides chain-of-thought solutions for math word problems in 11 languages. During both training and evaluation, we elicit chain-of-thought reasoning by prompting the model with the language-specific equivalent of the phrase "Let's think step by step" immediately before answer generation, consistent with the prompting format used in mCoT-MATH. For each language, we sample 10,000 parallel examples for training and 200 examples for validation. For testing, we use the MGSM benchmark (Shi et al., 2022), a human-translated set of 250 grade-school arithmetic reasoning problems for multilingual evaluation.

Language | Resource | Family | Sub-Family | Script
Evaluation + Training Languages
English (En) | High | Indo-European | Germanic | Latin
Spanish (Es) | High | Indo-European | Romance | Latin
Japanese (Ja) | High | Japonic | - | Kanji+Kana
Bengali (Bn) | Low | Indo-European | Indic | Bengali
Swahili (Sw) | Low | Niger-Congo | - | Latin
Unseen Evaluation Language
Telugu (Te) | Low | Dravidian | Indic | Telugu
Training-Only Languages
French (Fr) | High | Indo-European | Romance | Latin
German (De) | High | Indo-European | Germanic | Latin
Russian (Ru) | High | Indo-European | Slavic | Cyrillic
Chinese (Zh) | High | Sino-Tibetan | - | Han
Thai (Th) | Low | Kra-Dai | - | Thai
Table 1: Languages used in our study, grouped by their role in training and evaluation.

Evaluation: Following prior work, we extract the model's final predicted answer from its generated output and compute accuracy against the ground-truth answer.

Task 2: API Calling. Tool-augmented LLMs offer numerous benefits such as improved real-time access, reduced hallucination, and more efficient workflows (Qu et al., 2025); however, most current datasets remain English-centric, limiting multilingual tool-use. Therefore, we introduce mAPICall-Bank, a multilingual dataset for training and evaluating API calling across 11 languages. Built on API-Bank (Li et al., 2023), which assesses LLMs' ability to call external tools in realistic, multi-turn dialogue scenarios across diverse domains and APIs, mAPICall-Bank focuses on the API calling subtask where models generate the correct API invocation given a user utterance and a candidate API pool. We construct the dataset by translating API-Bank into 11 languages using a state-of-the-art LLM (see Appendix A for the prompt); it contains 3,174 training and 399 test examples per language, with non-overlapping APIs between splits. In our experiments, we hold out 150 examples from the training split for validation. To our knowledge, mAPICall-Bank is one of the first multilingual API calling datasets, and we publicly release scripts to regenerate it with different LLMs as the translation engine, to support future research. We show dataset statistics in Table 4 in Appendix A.

Evaluation: We parse the model's output to extract the API name and the dictionary of argument-value pairs. A prediction is marked correct only if the
API name, all argument names, and all corresponding argument values exactly match the ground truth.

3.2 Multilingual Training Setup

We use a set of eleven typologically diverse languages (Table 1), following prior multilingual work (Shi et al., 2022; Lai & Nissim, 2024), which form the basis for our training mixtures and evaluation settings. Our evaluation focuses on five "core" languages: English (En), Spanish (Es), Japanese (Ja), Bengali (Bn), and Swahili (Sw), chosen to span a range of resource levels, language families, geographic regions, and writing systems. To further test generalization to unseen languages, we additionally evaluate on Telugu (Te), a low-resource language not included in any training mixture. The remaining languages, French (Fr), Thai (Th), German (De), Russian (Ru), and Chinese (Zh), are used only to expand multilingual training mixtures beyond the evaluation set.

Using these languages, we construct a series of training mixtures that progressively increase language diversity while keeping comparisons controlled (Figure 1). In each mixture, all included languages contribute the same number of parallel examples, which isolates the impact of language coverage from differences in data volume. Starting from the five core languages, we construct mixtures with progressively more languages, introducing additional ones to increase diversity.
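As a concrete illustration, the 22 mixtures suggested by Figure 1 can be enumerated in a few lines. The language codes, grouping, and exact subset choices below are one consistent reading of the figure, inferred rather than taken from released code:

```python
from itertools import combinations

CORE = ["en", "es", "ja", "bn", "sw"]          # evaluation + training languages
ADDITIONAL = ["fr", "de", "ru", "zh", "th"]    # training-only languages

def mixtures():
    """Enumerate 22 training mixtures matching Figure 1's counts.

    Assumed reading of the figure: N=1 is English only (x1); N=2 pairs
    English with each other core language (x4); N=4 takes all 4-subsets
    of the core set (x5); N=5 is all core languages (x1); N=6 adds one
    additional language to the core (x5); N=9 adds 4 of the 5 additional
    languages (x5); N=10 uses all ten languages (x1).
    """
    mixes = [("en",)]                                                 # N=1
    mixes += [("en", lang) for lang in CORE if lang != "en"]          # N=2
    mixes += list(combinations(CORE, 4))                              # N=4
    mixes += [tuple(CORE)]                                            # N=5
    mixes += [tuple(CORE) + (a,) for a in ADDITIONAL]                 # N=6
    mixes += [tuple(CORE) + extra for extra in combinations(ADDITIONAL, 4)]  # N=9
    mixes += [tuple(CORE + ADDITIONAL)]                               # N=10
    return mixes
```

Within each mixture, every included language would then contribute the same number of parallel examples, per the balanced-sampling design of Section 3.2.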
For intermediate diversity levels (e.g., 2, 4, 6, or 9 languages), we evaluate multiple combinations of the core languages, as indicated by the "x4" or "x5" configurations in Figure 1. This design ensures that observed effects reflect general trends across language combinations rather than any single language, while still enabling controlled analysis of individual language inclusion.

Figure 2: Effect of increasing training language coverage on model performance for Qwen-3 and Gemma-3 models. Plots show average accuracy (%) with 95% Wilson confidence intervals as a function of the number of training languages, grouped by high-resource and low-resource evaluation languages, for API calling (top) and math reasoning (bottom).

3.3 Model Backbones

We experiment with two open-weight LLM families: Qwen-3 (0.6B, 1.7B, 8B) (Yang et al., 2025) and Gemma-3 (1B, 4B) (Kamath et al., 2025). For each model, we use the officially released pretraining-stage checkpoints to ensure that observed performance reflects task-specific fine-tuning on our dataset rather than prior instruction tuning.

3.4 Training Details

We train each model on each data mixture for six epochs using 8 NVIDIA A100 80GB GPUs. All models use the AdamW optimizer with a learning rate of 1e-5, a cosine learning rate scheduler with a 3% warmup ratio, and a weight decay of 0.01.
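The error bars in Figure 2 are 95% Wilson score intervals over per-language accuracies. A minimal sketch of that interval (the standard formula, not the paper's evaluation code):

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for an accuracy of k correct out of n.

    Sketch of the Figure 2 error bars; z=1.96 gives the 95% level.
    """
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half
```

For example, 200 correct out of MGSM's 250 test items gives an interval of roughly (0.746, 0.845) around the 0.8 point estimate.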
We maintain an effective batch size of 64 per GPU (global batch size of 512), adjusting the micro-batch size and gradient accumulation steps as needed based on model memory requirements. Gradient checkpointing is enabled to reduce memory usage. For each mixture, we select the checkpoint with the highest average validation set task accuracy across all languages in that mixture, ensuring that selection does not implicitly favor any individual language within the mix.

4 Results

4.1 Multilingual Scaling is Largely Beneficial Across Tasks and Model Scales

We first analyze the overall impact of scaling multilinguality by increasing training language coverage. Figure 2 reports mean accuracy (95% Wilson confidence intervals) for Qwen-3 and Gemma-3 models of varying sizes, evaluated across both tasks and grouped by high- and low-resource languages, as a function of the number of training languages.

Figure 3: Median accuracy (%) change from bilingual versus English-only post-training across evaluation settings for API calling and math reasoning. Error bars show 95% bootstrap confidence intervals. Bilingual training yields consistent gains across evaluation settings, with the largest improvements under direct exposure and smaller but reliable gains under cross-lingual transfer.

Across both model families and tasks, performance generally improves or remains stable as more training languages are added, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. This holds
consistently across model scales of 1B parameters and above, suggesting that at sufficient capacity, increasing language coverage during post-training is largely beneficial without incurring negative transfer.

An exception to this trend arises for the smallest model (Qwen-3 0.6B) on API calling, where performance initially improves with increasing multilinguality (up to four languages for high-resource and five for low-resource settings) before showing a slight decline as additional languages are introduced, suggesting capacity-driven multilingual interference at this scale. We do not observe similar degradation for mathematical reasoning or in larger models, indicating that this effect is confined to sub-1B models and the more structured API calling task. Due to experimental noise, we validate these trends using a pooled regression analysis in Section 4.4 across the full set of experiments. This provides additional evidence for the benefits of increasing language coverage during training. Additional per-language trends are provided in Appendix C.

In the following subsections, we zoom in on specific regions of this trend to better understand the effects of multilinguality. We first examine the low-diversity (bilingual) setting (Section 4.2), and then analyze the high-diversity regime (Section 4.3), where increased language coverage enables stronger cross-lingual generalization.

4.2 English Is Not All You Need: Even Minimal Multilinguality Helps

We next examine the low-diversity (bilingual) setting to understand whether even minimal multilinguality is beneficial. Specifically, we compare English-only post-training to bilingual training that
includes English and one additional language. Figure 3 reports median accuracy differences across English and non-English evaluations (95% bootstrap confidence intervals), while Table 2 summarizes win rates across configurations. For non-English evaluations, we distinguish between direct exposure, where the evaluation language is included during training, and cross-lingual transfer, where it is not.

Across evaluation settings, bilingual training consistently outperforms English-only post-training. The largest gains occur under direct exposure, yielding median improvements of 9.27% for API calling and 8.4% for mathematical reasoning, with wins in 87.5% of configurations.

Evaluation Setting | Win Rate | 95% CI
EN | 75.0% | [59.8, 85.8]
Non-EN (Direct Exposure) | 87.5% | [73.9, 94.5]
Non-EN (Cross-Lingual Transfer) | 74.4% | [67.1, 80.5]

Table 2: Win rates of bilingual post-training over English-only post-training across evaluation settings, aggregated across tasks and models. A win is defined as a configuration where bilingual training achieves higher accuracy than English-only training.

These benefits extend beyond the added language. Even when the evaluation language is absent from training, bilingual models improve performance through cross-lingual transfer (+3.38% for API calling and +1.6% for mathematical reasoning), outperforming English-only training in 74.4% of configurations. Notably, multilinguality also improves English performance, with median gains of 0.88% for API calling and 3.4% for mathematical reasoning, and wins in 75% of configurations.
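Figure 3 reports median accuracy changes with 95% bootstrap confidence intervals. A percentile-bootstrap sketch over per-configuration accuracy differences (resampling details are assumptions, not the paper's exact procedure):

```python
import random
import statistics

def bootstrap_median_ci(diffs, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of accuracy differences.

    diffs: per-configuration accuracy changes (bilingual minus
    English-only), e.g. in percentage points. Resamples with
    replacement and takes the empirical (alpha/2, 1 - alpha/2) band.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo = medians[int(alpha / 2 * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

An interval that excludes zero would then correspond to a reliable median gain in the sense of the Figure 3 error bars.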
We provide fine-grained per-language results for English-only and bilingual post-training relative to pretraining in Appendix B, shown as heatmaps over source-target language pairs. As expected, post-training in nearly any language and setting improves performance across all languages.

Together, these results show that English-only post-training is suboptimal: even minimal multilingual exposure yields gains that generalize across languages, tasks, and model scales. We next turn to higher-diversity settings to examine how increasing language coverage further shapes cross-lingual generalization.

4.3 High Linguistic Diversity Supports Generalization Comparable to Direct Exposure

Figure 4: Comparison of zero-shot cross-lingual transfer versus direct bilingual exposure at varying levels of language diversity for Qwen-3 8B (top) and Gemma-3 4B (bottom), with 4 (left), 6 (middle), and 9 (right) training languages. High-resource languages (red) tend to cluster near the diagonal, indicating strong zero-shot generalization that can compensate for the absence of direct inclusion. Low-resource languages (blue) more often fall below the diagonal, suggesting greater benefit of direct inclusion.

Having established that even limited multilingual diversity is beneficial, we now ask how far these generalization effects extend. Specifically, can sufficient language diversity during post-training compensate entirely for the absence of the target language in training?
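The per-language comparisons in this section rely on significance tests between zero-shot and direct-exposure accuracies (the paper reports two-sided t-tests). As a rough, closely related stand-in on per-item correctness counts, a pooled two-proportion z-test conveys the idea; this is an illustrative assumption, not the paper's exact procedure:

```python
import math

def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided two-proportion z-test for a difference in accuracies.

    k1/n1: correct/total for one system (e.g. zero-shot transfer);
    k2/n2: for the other (e.g. bilingual direct exposure).
    Returns (z, p_value) using the pooled standard error.
    """
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled success rate under the null
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Normal CDF via the error function; two-sided tail probability.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With MGSM's 250 items per language, two systems at 80% vs. 60% accuracy differ significantly, while identical accuracies give p = 1.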
Figure 4 compares a low-diversity bilingual setting with direct exposure to the evaluation language against higher-diversity settings (4, 6, or 9 languages) where the evaluation language is excluded and performance relies entirely on zero-shot cross-lingual transfer. In the figure, the diagonal indicates parity between the two settings, with points near or above it indicating that zero-shot transfer matches or exceeds direct exposure. We present results for the largest models in each family (Qwen-3 8B and Gemma-3 4B) in the main text, with additional model sizes provided in Appendix D.

High-resource non-English languages generalize well under increased diversity. Results for high-resource languages (red) tend to lie close to the diagonal, especially at higher diversity levels (6 and 9 languages). In particular, zero-shot performance is statistically indistinguishable from or significantly exceeds that of bilingual direct exposure in 100% of cases for the 9- and 6-language settings, and in 87.5% of cases for the 4-language setting (two-sided t-test), suggesting that linguistic diversity alone is sufficient to drive strong generalization for these languages.

Low-resource, typologically distant languages benefit more from direct inclusion. Results for low-resource languages (blue) more often fall below the diagonal, indicating that zero-shot cross-lingual transfer is less able to compensate for the absence of direct training exposure. Even in the 9-language setting, zero-shot performance is statistically indistinguishable from direct exposure in 62.5% of cases, but never significantly exceeds it, in contrast to
high-resource languages, where zero-shot transfer matches or exceeds direct exposure, indicating a stronger reliance on explicit inclusion for these languages.

English also benefits from zero-shot cross-lingual transfer. Despite being the dominant language in post-training, English performance also improves with increased language diversity even without direct English supervision. Across tasks, zero-shot cross-lingual performance significantly exceeds or matches bilingual direct exposure in 100% of cases for the 9- and 6-language settings, and in 75% of cases for the 4-language setting. API calling shows stronger gains from increased language diversity, with cross-lingual performance significantly exceeding direct exposure in 50% of cases and never significantly underperforming, suggesting that models can leverage structural and lexical variation from diverse languages even without direct English supervision.

Prior work has shown a strong reliance on English for reasoning processes even in multilingual settings, suggesting that mathematical reasoning may require more English supervision for optimal performance (Etxaniz et al., 2024; Schut et al., 2025). Our results show that this dependence diminishes with increased language diversity. Even for mathematical reasoning evaluated in English, zero-shot cross-lingual performance matches direct exposure when 6 or more languages are included during training, though results are more variable at lower diversity (4 languages), where 50% of cases perform worse than direct exposure. Overall, these results show that at sufficient language diversity, zero-shot cross-lingual transfer can match or even exceed
Overall, these results show that at sufficient language diversity, zero-shot cross-lingual transfer can match or even exceed direct language inclusion for high-resource and English evaluations, though limitations remain for low-resource, typologically distant languages. This interpretation is further supported by the pooled regression in Section 4.4, which shows that broader training-language coverage remains positively associated with performance even after controlling for direct target-language inclusion.

4.4 Pooled Regression Analysis

To test whether the patterns in Sections 4.1–4.3 persist in aggregate, we fit a pooled regression model over evaluation instances, where each instance corresponds to evaluating a trained model on a task-language pair. The regression includes model family, task, pretrained-only status, whether the evaluation language appears in the training mixture, log10 model size, and a transformed measure of training-language coverage, defined as √(L/Lmax) to allow diminishing returns with additional languages. We fit the regression on a random 70% split of evaluation instances and evaluate on the remaining 30%, where it achieves an out-of-sample R² of 80.5%. We report coefficients with bootstrap 95% confidence intervals. We interpret this analysis as complementing the analyses above with an aggregate summary over evaluation instances.
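On synthetic data, the regression setup described above can be sketched as follows; the feature encoding, the Lmax value, and the OLS-plus-bootstrap implementation are our assumptions for illustration, not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)
L_MAX = 11  # assumed maximum number of training languages

# Synthetic evaluation instances (illustrative only, not the paper's data).
n = 500
size = rng.uniform(8.6, 9.9, n)                        # log10 parameter count
qwen = rng.integers(0, 2, n).astype(float)             # model-family indicator
math_task = rng.integers(0, 2, n).astype(float)        # task indicator
pretrained = rng.integers(0, 2, n).astype(float)       # pretrained-only indicator
included = rng.integers(0, 2, n).astype(float)         # eval language in training mix
coverage = np.sqrt(rng.integers(1, L_MAX + 1, n) / L_MAX)  # sqrt(L / Lmax)
X = np.column_stack([np.ones(n), size, qwen, math_task, pretrained, included, coverage])
beta_true = np.array([-2.0, 0.30, 0.27, -0.21, -0.11, 0.09, 0.05])
y = X @ beta_true + rng.normal(0, 0.05, n)

# Random 70/30 split, ordinary least squares, out-of-sample R^2.
idx = rng.permutation(n)
train, test = idx[:350], idx[350:]
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
resid = y[test] - X[test] @ beta
r2 = 1 - resid.var() / y[test].var()

# Bootstrap 95% CI for the coverage coefficient (last feature).
boot = []
for _ in range(500):
    b = rng.integers(0, len(train), len(train))
    bb, *_ = np.linalg.lstsq(X[train][b], y[train][b], rcond=None)
    boot.append(bb[-1])
lo, hi = np.percentile(boot, [2.5, 97.5])
```

Because √(L/Lmax) is concave in L, a positive coefficient on this feature implies diminishing marginal returns from each additional training language.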
Feature                      Coef.    2.5%    97.5%
log10 model size             0.302    0.285    0.319
Qwen model family            0.265    0.248    0.281
Math task                   -0.207   -0.221   -0.193
Pretrained only             -0.110   -0.167   -0.047
Eval. language in training   0.089    0.074    0.104
√(L/Lmax)                    0.053    0.018    0.089

Table 3: Pooled regression over evaluation instances, where each instance corresponds to evaluating a trained model on a task-language pair. The model is trained on a random 70% split of instances and evaluated on the remaining 30%, achieving an out-of-sample R² of 80.5%. Coefficients are reported with bootstrap 95% confidence intervals. Here, L denotes the number of training languages and Lmax the maximum number of training languages in our experiments.

The pooled analysis confirms several expected trends, including benefits from larger models, post-training, and direct inclusion of the evaluation language. Importantly, broader training-language coverage is positively associated with performance even after controlling for direct target-language inclusion (β = 0.053, 95% CI [0.018, 0.089]). This is consistent with the discussion above, indicating that multilingual gains are not explained solely by direct language exposure.

5 Conclusion

This work presents a study of multilingual post-training under realistic conditions, examining how language coverage and composition interact with model capacity and task structure. Our analysis spans both reasoning and tool-use settings. Our findings show that English-centric post-training is typically suboptimal for cross-lingual transfer. Increasing language coverage helps performance for all languages, including English, and can in some cases compensate for a lack of direct inclusion of the target language in training, although these gains are limited for typologically distant, low-resource languages. We find that, with limited exceptions for API calling at the smallest scales (≤ 1B), models benefit from increased multilingual diversity, particularly for low-resource languages, without degrading performance in high-resource languages.
6 Limitations

Our work provides a systematic study of multilingual post-training under controlled conditions, but several limitations remain. First, our analysis focuses on 11 languages, which do not capture the full global linguistic diversity of thousands of languages worldwide. To partially address this, we select languages spanning a broad range of resource levels, language families, geographic regions, and scripts. Second, although our experiments span multiple model scales and two model families, the largest model we consider contains 8B parameters, and it remains an open question how multilingual scaling effects evolve at larger model sizes. To isolate the effects of language coverage, we hold the amount of data per language constant and therefore do not examine how jointly scaling data volume and language diversity interacts with model capacity or task difficulty. Additionally, our multilingual datasets are constructed via translation of English task data, which may introduce translation-specific artifacts (e.g., "translationese"); future work could investigate whether similar effects hold for naturally occurring multilingual data. Finally, our analysis focuses on mathematical reasoning and API calling as post-training tasks; other task domains may exhibit different behavior under multilingual scaling.

References

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? Language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4074–4096, 2024.

Pinzhen Chen, Shaoxiong Ji,
Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. Monolingual or multilingual instruction tuning: Which makes a better Alpaca? In Findings of the Association for Computational Linguistics: EACL 2024, pp. 1347–1356, 2024.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, 2020.

Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, and Mikel Artetxe. Do multilingual language models think better in English? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 550–564, 2024.

Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. BenchMAX: A comprehensive multilingual evaluation suite for large language models. arXiv preprint arXiv:2502.07346, 2025.

Shaoxiong Ji and Pinzhen Chen. How many languages make good multilingual instruction tuning? A case study on BLOOM. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 2575–2581, 2025.

Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

Tannon Kew, Florian Schottmann, and Rico Sennrich. Turning English-centric LLMs into polyglots:
How much multilinguality is needed? In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13097–13124, 2024.

Simran Khanuja, Sebastian Ruder, and Partha Talukdar. Evaluating the diversity, equity, and inclusion of NLP technology: A case study for Indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pp. 1763–1777, 2023.

Mayank Kulkarni, Vittorio Mazzia, Judith Gaspers, Christopher Hench, Jack FitzGerald, and Amazon AGI. MASSIVE-Agents: A benchmark for multilingual function-calling in 52 languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 20193–20215, 2025.

Huiyuan Lai and Malvina Nissim. mCoT: Multilingual instruction tuning for reasoning consistency in language models. arXiv preprint arXiv:2406.02301, 2024.

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244, 2023.

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. arXiv preprint arXiv:2510.22037, 2025.

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. Tool learning with large language models: A survey. Frontiers of Computer Science, 19(8):198343, 2025.

Lisa Schut, Yarin Gal, and Sebastian Farquhar. Do multilingual LLMs think in English? arXiv preprint arXiv:2502.15603, 2025.
Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Multilingual instruction tuning with just a pinch of multilinguality. arXiv preprint arXiv:2401.01854, 2024.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.

Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, and Sebastian Ruder. A post-trainer's guide to multilingual training data: Uncovering cross-lingual transfer dynamics. arXiv preprint arXiv:2504.16677, 2025.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Figure 5: Example showing the final turns of a parallel multi-turn API calling interaction from mAPICall-Bank in three languages: (a) English, (b) Spanish, and (c) Bengali.

Figure 6: Prompt used to create mAPICall-Bank.

A mAPICall-Bank Construction

We construct mAPICall-Bank by translating the API calling subset of the original English API-Bank dataset into multiple target languages using a state-of-the-art large language model. The full translation prompt is provided in Figure 6.
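Structural fidelity of translated tool data of this kind can be checked mechanically, e.g. asserting that API names and argument keys survive translation and that outputs remain parseable. A minimal sketch, where the field names `api_name` and `arguments` are hypothetical stand-ins rather than the dataset's actual schema:

```python
import json

def parse_call(raw: str):
    """Parse a raw output string into (api_name, arguments), or None if malformed."""
    try:
        obj = json.loads(raw)
        return obj["api_name"], obj["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

def validate_translation(source: dict, translated: dict) -> list:
    """Check that translation kept the API name and argument keys unchanged
    (i.e., in English), while argument values are free to change."""
    errors = []
    if translated.get("api_name") != source.get("api_name"):
        errors.append("api_name changed by translation")
    src_keys = set(source.get("arguments", {}))
    tgt_keys = set(translated.get("arguments", {}))
    if src_keys != tgt_keys:
        errors.append(f"argument keys differ: {sorted(src_keys ^ tgt_keys)}")
    return errors

# A Spanish translation that keeps the API name and keys but translates the value.
src = {"api_name": "EmergencyKnowledge", "arguments": {"symptom": "headache"}}
tgt = {"api_name": "EmergencyKnowledge", "arguments": {"symptom": "dolor de cabeza"}}
ok = validate_translation(src, tgt) == []
```

A check of this shape passes the example above (only the user-facing value changed) and flags any translation that renames the API or drops an argument key.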
During translation, we preserve the full structure of each example, including user utterances, system responses, API names, and argument schemas. To maintain compatibility with tool specifications, API names and parameter keys are kept in English across all languages, while user-facing text and, when appropriate, argument values are translated into the target language. The translation prompt was iteratively refined to ensure structural validity and parseability of model outputs. In particular, we enforce that translated examples retain the original formatting and can be reliably parsed into API name and argument-value pairs. We perform automated checks to validate output structure (e.g., JSON format) and manually inspect a subset of examples across languages, with native speakers verifying linguistic accuracy and consistency with the source data.

Figure 6 shows the prompt used to translate API-Bank to create mAPICall-Bank. Figure 5 shows an example of a parallel multi-turn interaction involving a single API call across three languages. Dataset statistics, including the number of queries, unique APIs, and example APIs in the train and test splits of mAPICall-Bank, are shown in Table 4.

Split   #Queries   #APIs   Example APIs
Train   3,174      1,535   Get_All_Sessions, get_device_details, get_relaxation_techniques
Test    399        49      ModifyRegistration, Calculator, EmergencyKnowledge

Table 4: Statistics of the mAPICall-Bank dataset, including the number of user queries, unique APIs, and examples of API names in each split.

B Monolingual vs. Bilingual Transfer Heatmaps

Figure 7 presents heatmaps of accuracy change relative to the pretrained baseline for monolingual and bilingual training settings. Each cell shows the difference in performance when training on a given source language (or language pair) and evaluating on a target language. These results complement the findings in Section 4.2.
Consistent with the observation that English-only post-training is suboptimal, we find that adding a second language generally leads to positive gains across a wide range of evaluation languages. Notably, improvements are not limited to the added language: many bilingual configurations yield gains even when the evaluation language is not included in training. Further, the heatmaps show that gains are broadly distributed across language pairs rather than concentrated in a small subset, suggesting that the benefits of multilingual post-training are not driven by specific language combinations but reflect a more general effect of multilingual exposure.

Figure 7: Change in accuracy relative to the pretrained baseline for bilingual versus English-only post-training. Heatmaps show performance differences across evaluation languages when English is paired with a single additional language (top: API calling; bottom: math reasoning). Subfigures (a) and (b) correspond to Qwen-3 and Gemma-3.

Figure 8: Per-language multilingual scaling trends for (a) Qwen-3 models and (b) Gemma-3 models. Each cell shows mean accuracy as a function of the number of training languages for API calling (top rows) and math reasoning (bottom rows), with scatter points indicating individual results.
C Multilingual Scaling Trends Per Language

Figure 8 presents scaling trends for individual evaluation languages as a function of the number of training languages, for (a) Qwen-3 and (b) Gemma-3 models. These results complement the aggregated trends shown in Figure 2. Consistent with the main findings, we observe that, with the exception of the smallest 0.6B model, increasing the number of training languages generally improves or maintains performance across most languages and model sizes. Gains are particularly pronounced for low-resource languages, while high-resource languages tend to plateau as additional languages are introduced.

D Additional Results on Linguistic Diversity Driven Cross-Lingual Generalization

Figure 9 presents the corresponding plots from Section 4.3 for smaller-scale models (Qwen-3 1.7B, Qwen-3 0.6B, and Gemma-3 1B). We observe trends consistent with those reported for larger models: high-resource languages (in red) tend to lie close to the diagonal, indicating that language diversity enables strong zero-shot cross-lingual transfer that matches or approaches the performance of direct inclusion at these scales. Low-resource languages often benefit from explicit language inclusion, particularly at the smallest scale (Qwen-3 0.6B). This is consistent with our findings in Section 4.1, where we observe capacity-driven interference at higher levels of multilinguality for this model.
Figure 9: Comparison of zero-shot cross-lingual transfer and direct bilingual exposure for smaller models. Rows correspond to Qwen-3 1.7B (top), Qwen-3 0.6B (middle), and Gemma-3 1B (bottom), with columns showing 4 (left), 6 (middle), and 9 (right) training languages. Axes plot zero-shot accuracy against bilingual direct exposure for API calling and math reasoning across EN, ES, JA, BN, and SW.