Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
Summary
This paper introduces XBridge, a framework that combines large language models (LLMs) with pretrained multilingual encoder-decoder translation models to enhance multilingual capabilities. While LLMs excel in general knowledge and reasoning, their multilingual performance is imbalanced, especially for low-resource or unseen languages. XBridge addresses this by offloading multilingual understanding and generation to a pretrained translation model, while keeping the LLM as a frozen English-centric knowledge core. The architecture includes lightweight cross-model mapping layers and an optimal transport-based alignment objective to resolve representation misalignment between components. Experiments on four LLMs across translation, reasoning, and summarization tasks show that XBridge outperforms strong baselines, particularly on low-resource languages, without retraining the LLM. The method uses a three-stage training strategy to progressively align representations and adapt the model for downstream tasks. Results indicate that XBridge effectively bridges the gap between LLMs and multilingual generation, achieving performance close to external translation models while preserving the LLM's core strengths.
Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality

Mengyu Bu 1,2,3, Yang Feng 1,2,3†
1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2 State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
3 University of Chinese Academy of Sciences, Beijing, China
bumengyu23z@ict.ac.cn, fengyang@ict.ac.cn

Abstract

Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.1

1 Introduction

Large language models (LLMs) have demonstrated remarkable general intelligence and reasoning abilities (Touvron et al., 2023; Üstün et al., 2024; Qwen et al., 2025), which are largely grounded in a unified semantic knowledge space.
† Corresponding author: Yang Feng.
1 https://github.com/ictnlp/XBridge

Figure 1: Overview of XBridge. Pretrained multilingual NMT models provide broad language coverage but limited general reasoning capability, while English-centric LLMs excel at general reasoning yet struggle with low-resource or unseen languages. XBridge harmonizes these strengths through model composition, offloading multilingual processing to the pretrained multilingual model while leveraging the LLM as a knowledge core.

However, despite possessing substantial cross-lingual knowledge, LLMs exhibit imbalanced multilingual performance: while performing reliably in English and a few high-resource languages, they often fail to robustly understand or generate text in low-resource or unseen languages (Zhu et al., 2023; Chang et al., 2024). This suggests that the core limitation of LLMs lies not in the absence of knowledge, but in the difficulty of interfacing this knowledge with diverse linguistic representation spaces.
Fortunately, a wealth of encoder-decoder based neural machine translation (NMT) models (Xue et al., 2021; Team et al., 2022) specialize in multilingual understanding and generation, and thus provide complementary capabilities to LLMs. These models support semantic transfer across hundreds of languages, including many low-resource ones, by learning a shared semantic representation space across languages. In such models, the encoder maps input text from different languages into the shared semantic space, while the decoder subsequently projects these shared representations into target-language outputs. This closed semantic loop between understanding and generation, along with the modular design of encoder and decoder, naturally complements LLMs. Realizing such a composition would provide LLMs with extensible multilingual capability, particularly for low-resource or unseen languages that are well modeled by NMT systems but remain challenging for LLMs.

However, existing approaches address this goal only partially: they integrate multilingual encoders to improve multilingual understanding by injecting encoder representations into LLM inputs (Yoon et al., 2024; Huang et al., 2024; Ruan et al., 2025). While effective for input understanding, these approaches leave generation largely English-centric. A natural extension is to further incorporate the multilingual decoder, but doing so introduces a fundamental structural challenge. In NMT, the encoder and decoder are jointly trained within a unified representation space, whereas inserting a frozen LLM in between introduces a transformation from the LLM input space to a different output space shaped by the LLM's internal knowledge processing. Consequently, the LLM outputs no longer match the decoder's expected cross-attention representations, resulting in semantic misalignment that cannot be resolved by simple projection.
To address this challenge, we propose XBridge, which composes LLMs with pretrained multilingual NMT models for extensible multilinguality. XBridge adopts an encoder-LLM-decoder architecture, where a multilingual encoder provides robust semantic representations for multilingual inputs, a frozen LLM serves as an English-centric core for knowledge processing, and a multilingual decoder generates outputs in the target language. From a representation perspective, XBridge constructs a semantic bridge that transforms representations from the multilingual semantic space to the LLM input space, through the LLM output space after knowledge transformation, and finally into the decoder's generation space. By explicitly aligning heterogeneous representation spaces across these modules, XBridge resolves the semantic mismatch introduced by inserting a frozen LLM, achieving extensible and generalizable multilingual understanding and generation.

We evaluate XBridge on four LLMs across multilingual understanding, reasoning, summarization, and generation tasks. XBridge outperforms strong baselines, with significant gains on low-resource and unseen languages while preserving the LLM's core capability. With minimal additional parameters, limited training data, and parameter-efficient training, XBridge brings low-resource and unseen language performance close to that of external NMT models, substantially narrowing the gap across languages without retraining the LLM.
2 Related Work

2.1 Data-Level Multilingual Enhancement for LLMs

A line of work augments the multilingual capabilities of LLMs at the data level by constructing multilingual training corpora using pretrained multilingual or machine translation models (Li et al., 2023; Zhang et al., 2023, 2024a,b). Typical approaches translate English instructions into multiple languages (Chen et al., 2024), pre-translate non-English inputs into English before task execution (Qin et al., 2023; Chai et al., 2025), or leverage Mixture-of-Experts (MoE) for language expansion (Zhang et al., 2025b). Such approaches generally require continual multilingual training of LLMs, which may introduce translation noise and interfere with existing language capabilities. In practice, balancing performance across high- and low-resource languages remains challenging, as gains on low-resource languages often come at the cost of degradation on high-resource ones (Gao et al., 2024). In contrast, XBridge achieves multilingual generalization through model composition without multilingual retraining of the LLM.

2.2 Encoder-Augmented Multilingual LLMs

Another line of work augments LLMs with pretrained multilingual encoders, injecting encoder representations into the LLM to improve multilingual understanding. Yoon et al. (2024) leverage multilingual encoders to support cross-lingual understanding, while Huang et al. (2024) reintroduce multilingual inputs to better exploit the complementary strengths of language understanding and reasoning in LLMs.
Ruan et al. (2025) further explore layer-wise fusion strategies to enhance the utilization of encoder semantics. These approaches primarily focus on improving multilingual understanding at the input side, while generation remains governed by the LLM's native language distribution, typically English. Moreover, due to differences in training objectives and tokenization schemes, representation gaps persist between multilingual encoders and LLMs, which limit the effective exploitation of encoder semantics. XBridge differs from prior encoder-augmented methods by additionally incorporating a multilingual decoder to support multilingual generation and by explicitly aligning representations across models, enabling more effective end-to-end multilingual behavior.

Figure 2: Left: XBridge composes a pretrained multilingual encoder-decoder with an LLM via lightweight mapping layers for multilingual understanding and generation, keeping the LLM frozen as a knowledge core. Right: A three-stage training strategy progressively aligns heterogeneous representations and adapts the encoder and decoder.
3 Method

Figure 2 presents XBridge, a compositional multilingual framework that integrates a pretrained encoder-decoder NMT model with an LLM. XBridge efficiently offloads the multilingual burden to the external NMT model while preserving the LLM as an English-centric core for general knowledge processing. XBridge adopts an encoder-LLM-decoder architecture, connected by lightweight cross-model mapping layers (Section 3.1). To facilitate fine-grained semantic transfer for multilingual generation, we introduce an optimal transport-based token alignment objective at the LLM-decoder interface (Section 3.2). For stable optimization, XBridge employs a three-stage training strategy that decouples coarse-grained cross-model alignment from task-specific adaptation (Section 3.3).

3.1 Architecture

XBridge adopts an encoder-LLM-decoder architecture to compose a pretrained encoder-decoder NMT model with an LLM for extensible multilingual understanding and generation.

Formally, given an input sequence x = (x1, . . . , xn) in language Lx, we first encode it with the pretrained multilingual encoder Enc(·), producing contextual representations Hx ∈ R^{n×d_e}. To bridge the representation gap between the multilingual encoder and the LLM, we apply a lightweight mapping Mapping_enc(·) that projects Hx into the LLM representation space, yielding H̃x ∈ R^{n×d_l}. The mapped encoder representations are then injected into the LLM together with a high-resource (English) instruction prompt, enabling the LLM to perform general knowledge processing conditioned on encoder semantics.
Let z = (z1, . . . , zm) denote the sequence of English tokens generated by the LLM. Rather than using the final-layer hidden states, we extract the penultimate-layer hidden states, denoted as Hz′ ∈ R^{m×d_l}, as Zhang et al. (2025a) show that the last layer is often tightly aligned with the output vocabulary space, while non-final layers retain richer semantic information.

To support multilingual generation, XBridge further integrates a pretrained multilingual decoder Dec(·) at the output side. Specifically, we apply a decoder-side mapping Mapping_dec(·) to project the LLM hidden states into the decoder representation space, obtaining H̃z′ ∈ R^{m×d_d}, which are used as key-value representations for cross-attention in the decoder. Given target-language tokens ⟨y⟩ in language Ly as decoder inputs, the decoder generates the output sequence y by attending to H̃z′, producing text that follows the target-language distribution while remaining semantically grounded in the LLM's knowledge processing results.
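To make the data flow concrete, the following PyTorch sketch traces one forward pass through the composed architecture. It is a minimal illustration, not the released implementation: the module stand-ins, MLP shape of the mappings, and dimensions (d_enc, d_llm, d_dec) are assumptions based on the description above.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; in the paper the components are an NLLB encoder,
# a frozen 7-8B LLM, and an NLLB decoder.
d_enc, d_llm, d_dec = 1024, 4096, 1024

class MappingMLP(nn.Module):
    """Lightweight cross-model mapping: projects one model's hidden states
    into another model's representation space (a two-layer MLP here)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_llm), nn.GELU(),
                                 nn.Linear(d_llm, d_out))
    def forward(self, h):
        return self.net(h)

mapping_enc = MappingMLP(d_enc, d_llm)  # encoder space -> LLM input space
mapping_dec = MappingMLP(d_llm, d_dec)  # LLM output space -> decoder space

def xbridge_forward(encoder, llm, decoder, x_ids, instr_emb, tgt_lang_ids):
    # 1) The multilingual encoder reads the source-language input x.
    H_x = encoder(x_ids)                  # (n, d_enc)
    H_x_tilde = mapping_enc(H_x)          # (n, d_llm), injected as soft embeddings

    # 2) The frozen LLM processes [mapped input; English instruction]; we keep
    #    the penultimate-layer hidden states of its English generation z.
    llm_inputs = torch.cat([H_x_tilde, instr_emb], dim=0)
    H_z_prime = llm(llm_inputs)           # (m, d_llm), penultimate layer

    # 3) The decoder cross-attends to the mapped LLM states; the target
    #    language token <y> selects the output language.
    H_z_tilde = mapping_dec(H_z_prime)    # (m, d_dec)
    return decoder(tgt_lang_ids, memory=H_z_tilde)
```

Note the design choice this makes explicit: only the two mappings carry the burden of translating between representation spaces, so the encoder, LLM, and decoder can each remain pretrained and (for the LLM) frozen.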
3.2 Optimal Transport-Based Alignment

Although the mapped LLM representations H̃z′ can be directly used as cross-attention inputs for multilingual decoding, token-level semantic misalignment may arise due to heterogeneous tokenizations and representation spaces across models. To encourage fine-grained semantic consistency at the LLM-decoder interface, we introduce an optimal transport (OT)-based alignment objective.

Specifically, given the English token sequence z = (z1, . . . , zm) generated by the LLM, we re-encode it using the same multilingual encoder Enc(·), obtaining encoder representations Hz ∈ R^{k×d_e}, where the sequence length k may differ from m due to heterogeneous tokenizers. Since Hz and the decoder-side LLM representations H̃z′ are both derived from the same LLM output, they are semantically equivalent in expectation, despite residing in different representation spaces. We therefore align Hz with H̃z′ to enforce token-level semantic alignment.

Due to the sequence length mismatch introduced by heterogeneous tokenizers, we formulate the alignment as an optimal transport problem (Peyré et al., 2019), which computes a soft, many-to-many matching between the two sequences. Concretely, we define the OT distance between Hz and H̃z′ as:

D*(Hz, H̃z′) = min_{T≥0} Σ_{i,j} T_{ij} · c(Hz_i, H̃z′_j),
s.t. Σ_{j=1}^{m} T_{ij} = m^z_i, ∀ i ∈ {1, . . . , k},   (1)

where T_{ij} denotes the transport mass from Hz_i to H̃z′_j, and c(·, ·) is the transport cost computed using cosine distance. The mass distribution {m^z_i} is obtained by normalizing Hz. Appendix A presents details of the OT formulation and optimization.

The OT loss provides flexible, token-level supervision that is robust to length mismatch. By regularizing the decoder-side mapping with encoder-derived representations of the LLM's own outputs, the OT objective encourages H̃z′ to preserve semantic structures compatible with the multilingual encoder-decoder space. This alignment not only improves multilingual generation quality, but also indirectly facilitates more effective utilization of multilingual encoder signals by the LLM.
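A minimal sketch of how an objective like Eq. (1) can be computed in practice, using cosine cost and entropic regularization solved with Sinkhorn iterations (Cuturi, 2013). Two caveats: the paper's Eq. (1) constrains only the source marginals, whereas this sketch uses a standard balanced Sinkhorn with a uniform target marginal for simplicity; and the mass normalization, ε, and iteration count are illustrative assumptions (the paper's exact procedure is in its Appendix A). It also assumes d_e = d_d so cosine distance is defined, which holds when encoder and decoder share a hidden size.

```python
import torch
import torch.nn.functional as F

def ot_alignment_loss(H_z, H_z_tilde, n_iters=50, eps=0.1):
    """Soft many-to-many alignment between re-encoded LLM outputs H_z (k, d)
    and decoder-mapped LLM states H_z_tilde (m, d) via entropic OT."""
    k, m = H_z.size(0), H_z_tilde.size(0)

    # Transport cost: cosine distance between every token pair.
    C = 1 - F.cosine_similarity(H_z.unsqueeze(1),
                                H_z_tilde.unsqueeze(0), dim=-1)   # (k, m)

    # Mass distributions: source mass from normalized representation norms
    # (one plausible reading of "normalizing H_z"); uniform target mass.
    a = H_z.norm(dim=-1) / H_z.norm(dim=-1).sum()                 # (k,)
    b = torch.full((m,), 1.0 / m, device=H_z.device)              # (m,)

    # Sinkhorn iterations in log space for numerical stability.
    log_K = -C / eps
    log_u = torch.zeros(k, device=H_z.device)
    log_v = torch.zeros(m, device=H_z.device)
    for _ in range(n_iters):
        log_u = torch.log(a) - torch.logsumexp(log_K + log_v, dim=1)
        log_v = torch.log(b) - torch.logsumexp(log_K.t() + log_u, dim=1)
    T = torch.exp(log_u.unsqueeze(1) + log_K + log_v.unsqueeze(0))  # plan (k, m)

    return (T * C).sum()  # OT distance, used as the alignment loss L_OT
```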
3.3 Three-Stage Training Strategy

To ensure stable optimization across models and objectives, XBridge employs a three-stage training strategy that progressively aligns heterogeneous representations and adapts the model to downstream tasks, keeping the LLM frozen throughout.

Stage 1: Cross-Model Mapping. Due to the substantial representation gaps between the multilingual encoder and the LLM, as well as between the LLM and the multilingual decoder, directly bridging heterogeneous components is non-trivial. We therefore first establish coarse-grained semantic alignment among the multilingual encoder, the LLM, and the multilingual decoder using trilingual translation data (x, z, y), where z is an English sequence generated by the LLM. In this stage, only the encoder-side mapping Mapping_enc, the decoder-side mapping Mapping_dec, and the decoder cross-attention layers are trained, optimizing the LLM English generation loss, the multilingual decoder generation loss, and the optimal transport alignment loss. This stage enables the LLM to interpret multilingual encoder representations and allows the decoder to attend to LLM hidden states for multilingual generation.

Stage 2: Encoder-Side Adaptation. After cross-model semantic alignment is established, the second stage adapts multilingual input representations to downstream instruction-following tasks. We fine-tune only the encoder-side mapping layer Mapping_enc on task-specific instruction data by optimizing the LLM English generation loss, while keeping all decoder-related components frozen. This stage teaches the LLM how to use multilingual representations to perform tasks, building upon the aligned representation space learned in stage 1.

Stage 3: Decoder-Side Adaptation. The third stage focuses on improving multilingual generation quality by adapting the LLM-decoder interface. We update only Mapping_dec and the decoder cross-attention layers, optimizing the multilingual decoder generation loss together with the optimal transport alignment loss. Separating this stage from stage 2 avoids conflicts between LLM and decoder objectives: stage 2 first stabilizes the conditional distribution of the LLM outputs, which stage 3 then exploits to enhance decoder performance without degrading task understanding.
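The staging amounts to a schedule over which parameter groups receive gradients. A hedged sketch of that schedule follows; the attribute names (mapping_enc, decoder.cross_attention, etc.) are illustrative rather than the released code.

```python
def set_trainable(model, stage):
    """Freeze everything, then re-enable only the modules trained in `stage`.
    The LLM stays frozen in all stages. Module names are illustrative."""
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:    # cross-model mapping: both mappings + decoder cross-attn
        trainable = [model.mapping_enc, model.mapping_dec,
                     model.decoder.cross_attention]
    elif stage == 2:  # encoder-side adaptation: encoder-side mapping only
        trainable = [model.mapping_enc]
    elif stage == 3:  # decoder-side adaptation: decoder mapping + cross-attn
        trainable = [model.mapping_dec, model.decoder.cross_attention]

    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
```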
Training Objectives. Given the encoder input sequence x with encoder representations Hx, the LLM-generated English sequence z with penultimate-layer hidden states Hz′, decoder-mapped representations H̃z′, and the multilingual decoder output sequence y, the cross-entropy losses of the LLM and the decoder are defined as:

L_CE_LLM = −log p_LLM(z | x, inst),   (2)

L_CE_Dec = −log p_Dec(y | H̃z′, ⟨y⟩).   (3)

Across stages, the overall training objective is:

L = λ1 · L_CE_LLM + λ2 · L_CE_Dec + λ3 · L_OT,   (4)

where different loss terms are activated depending on the training stage, as illustrated in Figure 2.
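Combining Eqs. (2)-(4) with the stage schedule, the overall loss can be assembled as below. The weights follow Section 4.1 (λ1 = λ2 = 1.0, λ3 = 6.0 when active), and which terms are active per stage follows Figure 2; the dictionary encoding of that schedule is our reading of the text.

```python
# Which loss terms are active in each training stage (per Figure 2),
# with the weights reported in the experimental setup.
STAGE_LOSSES = {
    1: {"ce_llm": 1.0, "ce_dec": 1.0, "ot": 6.0},  # full cross-model mapping
    2: {"ce_llm": 1.0},                            # encoder-side adaptation
    3: {"ce_dec": 1.0, "ot": 6.0},                 # decoder-side adaptation
}

def total_loss(stage, ce_llm, ce_dec, ot):
    """Weighted sum L = l1*L_CE_LLM + l2*L_CE_Dec + l3*L_OT with
    stage-dependent activation of each term."""
    weights = STAGE_LOSSES[stage]
    terms = {"ce_llm": ce_llm, "ce_dec": ce_dec, "ot": ot}
    return sum(w * terms[name] for name, w in weights.items())
```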
System               Bn-En  En-Bn  Sw-En  En-Sw  Ja-En  En-Ja  De-En  En-De  X-En   En-X
NLLB-200-1.3B        37.78  32.83  42.66  36.28  29.60  19.07  46.23  39.91  37.51  31.00
MetaMath-7B           1.46   0.67   3.33   1.75  27.62  16.76  34.36  19.42  18.62  11.92
  MindMerger         30.76  -      39.43  -      22.50  -      40.05  -      31.57  -
  LayAlign           30.91  -      39.02  -      22.36  -      39.43  -      31.98  -
  XBridge (Ours)     35.47  29.23  42.02  34.28  24.52  19.60  41.42  35.39  33.37  29.80
LLaMA3-8B            29.83  13.18  35.87  19.31  27.71  25.40  45.28  36.24  35.19  27.36
  MindMerger         33.86  -      41.81  -      25.48  -      42.52  -      33.88  -
  LayAlign           32.95  -      41.35  -      24.62  -      41.29  -      33.18  -
  XBridge (Ours)     37.09  28.42  44.73  34.68  27.63  20.12  45.75  35.45  36.21  29.82
Aya-23-8B             8.59   2.43   7.89   1.16  29.11  29.34  45.46  38.03  28.13  23.71
  MindMerger         33.41  -      41.56  -      24.96  -      41.78  -      33.44  -
  LayAlign           32.42  -      40.22  -      24.16  -      41.44  -      32.92  -
  XBridge (Ours)     34.67  28.00  42.88  34.25  26.35  19.14  44.40  33.78  33.70  28.85
Qwen2.5-7B-Instruct  22.15   8.30  15.05   4.35  25.92  25.76  42.32  32.10  30.21  24.75
  MindMerger         34.20  -      42.75  -      25.43  -      43.46  -      34.71  -
  LayAlign           33.39  -      41.26  -      26.11  -      42.12  -      34.02  -
  XBridge (Ours)     35.89  27.59  43.24  34.55  25.50  18.66  44.55  33.02  34.69  28.64

Table 1: FLORES-101 translation results (BLEU) for stage 1. For clarity, we report results on two low-resource languages (Bengali, Swahili) and two high-resource languages (Japanese, German), with complete results and COMET scores in Appendix C. "X" denotes all languages except English. We bold the best scores for each LLM group.

4 Experiment

4.1 Experiment Setup

Base Models. We evaluate XBridge on four representative base LLMs: MetaMath-7B-V1.0 (Yu et al., 2024), LLaMA3-8B (Grattafiori et al., 2024), Aya-23-8B (Üstün et al., 2024), and Qwen2.5-7B-Instruct (Qwen et al., 2025). As the pretrained encoder-decoder NMT model, we adopt NLLB-200-1.3B (Team et al., 2022), which covers 200 languages with strong multilingual capacity.
Baselines. We compare XBridge with the following strong baselines: (1) SFT performs multilingual instruction fine-tuning directly on each base LLM. (2) Translate-Test (Artetxe et al., 2023) translates inputs to English, queries the English-SFT LLM, and translates the output back to the target language. (3) MindMerger (Huang et al., 2024) augments the LLM input with a pretrained multilingual encoder to enhance multilingual understanding, forming a strong multilingual-to-English system. (4) LayAlign (Ruan et al., 2025) further extends MindMerger with layer-wise fusion strategies to better integrate encoder representations into the LLM.

Language Setup. Following Chen et al. (2024), we experiment on ten languages: Bengali (Bn), German (De), English (En), Spanish (Es), French (Fr), Japanese (Ja), Russian (Ru), Swahili (Sw), Thai (Th), and Chinese (Zh). These languages span diverse language families and resource levels. We treat Bn, Sw, and Th as low-resource languages, and the remaining as high-resource ones.

Training Datasets. For stage 1 training, we extract English-centric translation pairs from OPUS-100 (Zhang et al., 2020). For XBridge, we further translate the English sentences into other languages Ly using NLLB-200-3.3B, constructing trilingual x-en-y data. For stage 2 and stage 3, we adopt multilingual mathematical reasoning data from Ruan et al. (2025) and multilingual abstractive summarization data from XL-Sum (Hasan et al., 2021). For XBridge, we construct bilingual responses using NLLB-200-3.3B. Appendix B presents details about data processing and statistics.
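The trilingual x-en-y construction can be reproduced with an off-the-shelf NLLB checkpoint from Hugging Face transformers, roughly as sketched below. The checkpoint name matches the paper (NLLB-200-3.3B), but batching, sampling, and filtering details are assumptions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NLLB-200-3.3B, as used by the paper for data construction.
name = "facebook/nllb-200-3.3B"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

def translate_en(sentence: str, tgt_lang: str) -> str:
    """Translate an English sentence into `tgt_lang` (an NLLB code such as
    'swh_Latn' for Swahili) to build the y side of trilingual x-en-y data."""
    inputs = tok(sentence, return_tensors="pt")
    out = model.generate(
        **inputs,
        # NLLB selects the output language by forcing its language token first.
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=128,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

# e.g. pair an OPUS-100 x-en sentence with translate_en(en_side, "swh_Latn")
```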
Figure 3: Multilingual reasoning accuracy on MGSM and multilingual summarization Rouge-L on XL-Sum, with complete results in Appendix C. Models with the same base LLM share the same color scheme, where lighter shades denote baselines and darker shades denote XBridge. "XBridge-LLM" refers to English reasoning by the LLM, while "XBridge-Dec" refers to multilingual reasoning by the composed decoder. For XL-Sum, since the baselines produce English-only summaries, we translate them into target languages using NLLB-200-1.3B for evaluation.

Evaluation Benchmarks. For stage 1, we evaluate cross-model mapping quality on FLORES-101 (Goyal et al., 2022). Given the strong English ability of LLMs, we use x-en and en-x translation performance to measure multilingual understanding and generation, respectively, and report BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020) scores. For base LLMs, we leverage the MMT-LLM (Zhu et al., 2024) framework to evaluate translation capability in a 1-shot setting. For stage 2 and stage 3, we evaluate multilingual reasoning on MGSM (Shi et al., 2023) with accuracy, and multilingual abstractive summarization on XL-Sum with multilingual Rouge-L (Lin, 2004).
Model Configuration and Training Details. The encoder-side mapping is implemented as a two-layer multi-layer perceptron (MLP), while the decoder-side mapping is a four-layer MLP composed of two stacked two-layer MLP blocks. All intermediate dimensions are aligned with the LLM hidden size. We use the AdamW optimizer with a learning rate of 2 × 10^-5, train each stage for 3 epochs with a batch size of 128, and conduct experiments on 8 NVIDIA H800 GPUs. We empirically set λ1 = 1.0, λ2 = 1.0, and λ3 = 6.0 when the corresponding losses are active, with detailed activation schedules described in Section 3.3.
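Per this configuration, the two mapping modules can be written down directly. The layer counts and the rule that intermediate dimensions match the LLM hidden size come from the paper; the concrete sizes and the GELU activation are assumptions.

```python
import torch.nn as nn

d_enc, d_llm, d_dec = 1024, 4096, 1024  # NMT / LLM hidden sizes (illustrative)

def two_layer_mlp(d_in, d_out, d_hidden=d_llm):
    # Intermediate dimensions are aligned with the LLM hidden size.
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(),
                         nn.Linear(d_hidden, d_out))

# Encoder-side mapping: a single two-layer MLP into the LLM space.
mapping_enc = two_layer_mlp(d_enc, d_llm)

# Decoder-side mapping: two stacked two-layer MLP blocks (four layers total).
mapping_dec = nn.Sequential(
    two_layer_mlp(d_llm, d_llm),
    two_layer_mlp(d_llm, d_dec),
)
```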
4.2 Experimental Results

XBridge effectively offloads multilingual capability to the external multilingual model, while preserving the LLM as a knowledge and reasoning core. Table 1 evaluates the cross-model mapping learned in stage 1 on FLORES-101. Across all base LLMs, XBridge substantially improves both multilingual understanding and generation, with especially large gains on low-resource languages where base LLMs have limited capability. The performance of XBridge approaches that of the external NLLB-200-1.3B and outperforms encoder-augmented baselines, showing that XBridge can effectively offload multilingual ability to external NMT models while keeping the LLM frozen as a knowledge and reasoning core. Importantly, performance on high-resource languages remains comparable to base LLMs, indicating that offloading does not degrade the original strengths of LLMs.

Encoder adaptation improves multilingual understanding without degrading English performance. Figure 3 presents multilingual reasoning accuracy on MGSM after encoder adaptation. XBridge outperforms the base LLM, encoder-only baselines, and the Translate-Test pipeline. Since MGSM accuracy is language-agnostic, these gains directly reflect better semantic transfer between multilingual encoder representations and the LLM reasoning space. These results indicate that encoder-side adaptation facilitates more effective utilization of multilingual representations by the LLM, improving multilingual reasoning without sacrificing its English-centric reasoning capability.
Decoder adaptation achieves faithful multilingual generation. We further evaluate decoder adaptation on MGSM and XL-Sum in Figure 3. On MGSM, decoder-generated multilingual reasoning (XBridge-Dec) achieves accuracy comparable to English LLM outputs, suggesting that the decoder can faithfully express reasoning content across languages. On XL-Sum, XBridge consistently outperforms encoder-augmented baselines and achieves better average performance than the SFT baseline, with particularly clear gains on languages where multilingual generation is more challenging. While translation-cascaded systems are limited by the NMT model, XBridge directly leverages the LLM's knowledge through decoder adaptation, resulting in more stable multilingual generation across languages. These results demonstrate the importance of decoder adaptation for robust multilingual generation.

5 Analysis

5.1 Ablation Analysis

We conduct an ablation study on MetaMath-7B-V1.0 to analyze the contribution of each component and training strategy in XBridge, and evaluate ablated variants on FLORES-101, MGSM, and XL-Sum. Figure 4 presents the results, and Appendix C provides detailed results.

Figure 4: Ablation analysis of XBridge. We compare different ablated variants of XBridge: encoder-only augmentation "w/o Decoder", loss ablation "w/o OT", removal of stage 1 "w/o Stage 1", and joint optimization of stage 2&3 "Joint Stage 2&3". "Lrl", "Hrl", and "Avg" denote low-, high-resource, and average performance, respectively.

Encoder-Decoder Collaboration. Removing the decoder (w/o Decoder) achieves competitive multilingual-to-English understanding but fails to support multilingual generation, and underperforms XBridge on MGSM. This confirms that encoder-only augmentation is insufficient for multilingual reasoning and generation.
OT Alignment Objectives. Similarly, removing the OT alignment (w/o OT) leads to performance degradation on all benchmarks, particularly for multilingual generation, indicating that token-level soft alignment plays a crucial role in bridging the heterogeneous representation spaces between the LLM and the multilingual decoder.

Stage-Wise Optimization. Skipping stage 1 (w/o Stage 1) results in a substantial performance drop across all metrics, suggesting that direct task-level adaptation is insufficient when the representation gap between the LLM and the multilingual model remains large. Moreover, jointly training stage 2 and stage 3 (Joint Stage 2&3) underperforms the stage-wise optimization, reflecting a trade-off between LLM- and decoder-side generation objectives. These results support the design of stage-wise adaptation, where coarse-grained cross-model alignment is first established, followed by fine-grained encoder and decoder specialization, enabling XBridge to achieve stable and effective multilingual reasoning and generation.
Figure 5: Cross-lingual generalization to 41 untuned languages in FLORES-101 (BLEU). Left: X→En direction. Right: En→X direction. We directly evaluate the ablation variants described in Section 5.1. Appendix C lists the included untuned languages and provides detailed results.

5.2 Generalization to Untuned Languages

To examine whether the cross-model mappings learned by XBridge are language-agnostic rather than simply tied to specific training languages, we evaluate cross-model cross-lingual transfer on 41 untuned languages (listed in Table 3) in Figure 5, based on the variants of Section 5.1.

XBridge yields substantial gains on untuned languages over the base LLM, with performance approaching the external NLLB model. This indicates that stage 1 cross-model mapping learns language-agnostic semantic transfer that generalizes beyond tuned languages, rather than language-specific mappings. Meanwhile, performance in the En→X directions highlights the importance of optimal transport. Removing the OT objective leads to a substantial drop in generation quality, particularly where tokenization length differs across tokenizers. These results suggest that OT enables robust alignment between heterogeneous tokenizations, which is crucial for generalizable multilingual generation. Overall, the results demonstrate that cross-model semantic alignment generalizes across languages, while OT is crucial for achieving reliable generation-level generalization.
Figure 6: Evaluation for language-on-demand generation. Appendix C presents detailed results.

5.3 Language-on-Demand Generation

We verify the language-on-demand property of XBridge by switching the target language token ⟨y⟩ to generate outputs in arbitrary languages without retraining, as shown in Figure 6.

On FLORES-101, we evaluate translation between all language pairs. With the target language fixed, changing the source language causes only minor performance differences, while variations primarily depend on the target language. On MGSM, we force the decoder to generate responses in languages different from the input query language. For each input language, performance remains largely stable across different output languages. These results indicate that XBridge enables stable language-on-demand generation, supporting flexible multilingual outputs while preserving a language-agnostic reasoning core in the LLM.
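Operationally, language-on-demand means that only the decoder's forced language token changes between requests; the encoder, LLM, and mappings are untouched. A hedged sketch under that reading (the `xbridge` object and its method names are hypothetical, not the released API):

```python
def answer_in(language_token: str, query_ids, xbridge):
    """Generate the same LLM-grounded answer in any supported language by
    swapping the target-language token <y>; nothing is retrained."""
    memory = xbridge.mapping_dec(xbridge.llm_hidden_states(query_ids))
    return xbridge.decoder.generate(start_token=language_token, memory=memory)

# Same Swahili query, answers on demand (NLLB-style language codes):
# answer_in("swh_Latn", q, model)  -> Swahili response
# answer_in("jpn_Jpan", q, model)  -> Japanese response
```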
5.4 Representation Visualization

To analyze the effect of optimal transport (OT) on aligning heterogeneous representations, we visualize sentence-level hidden states for each language. Specifically, we compare encoder representations of LLM English outputs Hz with decoder-side representations after mapping H̃z′. We obtain sentence-level vectors via average pooling and project them to two dimensions for visualization using t-SNE (Van der Maaten and Hinton, 2008).

Figure 7: Visualization of sentence-level representation alignment for Chinese (Zh). We compare models trained without OT (left) and with OT (right) using t-SNE.

As shown in Figure 7, without OT, the two sets of representations form largely separate clusters, reflecting a substantial distribution gap at the LLM-decoder interface. In contrast, when OT is applied, the two sets of representations overlap substantially, with density contours largely merged, indicating that OT promotes fine-grained semantic consistency and reduces token-level misalignment across model components.
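This visualization pipeline is standard and can be reproduced roughly as follows (mean pooling, joint t-SNE via scikit-learn); the perplexity and plotting details are assumptions, and both sets are assumed to share a hidden size so they can be embedded jointly.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_alignment(H_z_list, H_z_tilde_list):
    """H_z_list / H_z_tilde_list: per-sentence token matrices (numpy arrays).
    Mean-pool tokens into sentence vectors, then embed both sets jointly."""
    enc_vecs = np.stack([h.mean(axis=0) for h in H_z_list])        # encoder side
    dec_vecs = np.stack([h.mean(axis=0) for h in H_z_tilde_list])  # mapped LLM side
    pts = TSNE(n_components=2, perplexity=30).fit_transform(
        np.concatenate([enc_vecs, dec_vecs], axis=0))
    n = len(enc_vecs)
    plt.scatter(pts[:n, 0], pts[:n, 1], label="Encoder (H_z)")
    plt.scatter(pts[n:, 0], pts[n:, 1], label="Mapping_dec (H_z')")
    plt.legend()
    plt.show()
```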
5.5 Composing with Different NMT Models

To further examine the generality of XBridge beyond a specific NMT backbone, we replace the multilingual NMT model with M2M100-1.2B (Fan et al., 2021) in Figure 8, while keeping the same training and evaluation settings as in Section 4.

Figure 8: LLaMA3-8B composed with M2M100-1.2B. "Lrl", "Hrl", and "Avg" denote low-resource, high-resource, and average performance, respectively. Hollow markers placed on the bottom boundary indicate models that lack En→X translation capability. Appendix C presents detailed results.

XBridge remains effective when composed with M2M100-1.2B. On FLORES-101, XBridge achieves strong cross-model transfer across low- and high-resource language directions, demonstrating that the lightweight mapping layers can reliably bridge NMT models and LLMs. On MGSM, XBridge outperforms the translation-cascaded baseline, indicating that the benefits of XBridge extend beyond translation quality to multilingual reasoning. Overall, these results demonstrate that XBridge is an architecture-agnostic framework that generalizes across both different LLM backbones and multilingual NMT backbones.

5.6 Impact of NMT Model Size

We further investigate the impact of NMT model capacity on XBridge by comparing NLLB-200-600M and NLLB-200-1.3B, both integrated with MetaMath-7B. Figure 9 presents the results.

Figure 9: MetaMath-7B composed with NLLB in different sizes (600M vs. 1.3B). "Lrl", "Hrl", and "Avg" denote low-, high-resource, and average performance, respectively. Appendix C presents detailed results.

On FLORES-101, larger NLLB models consistently improve multilingual capability across languages, indicating that stronger multilingual capacity in the composed NMT model leads to better multilingual understanding and generation. On MGSM, increasing the NLLB size brings only marginal changes in reasoning accuracy, suggesting that reasoning performance is primarily determined by the LLM core. These results align with our design, indicating that the quality of the composed NMT model directly influences cross-model mapping, while reasoning remains governed by the LLM.
5.7 Supplementary Analysis

We conduct supplementary analyses, including an efficiency analysis (Appendix D.1), a case study on MGSM (Appendix D.2), and an evaluation on multilingual commonsense reasoning (Appendix D.3).

6 Conclusion

In this paper, we propose XBridge, a compositional framework that offloads multilingual capability to an external encoder-decoder NMT model, while preserving the LLM as an English-centric core for general knowledge processing. Extensive experiments demonstrate that XBridge enables efficient multilingual extension, raising low-resource and unseen language performance to near that of external NMT models, without compromising the LLM's core abilities.

Limitations

While XBridge substantially mitigates multilingual imbalance, notably improving performance on low-resource and previously unseen languages for LLMs, the overall model still exhibits some imbalance in multilingual capabilities. This is primarily due to the combined influence of the external encoder-decoder NMT model and the base LLM, which limits complete uniformity across languages. Future work could further explore strategies to harmonize these components.

Acknowledgements

We thank all the anonymous reviewers for their insightful and valuable comments on this paper. This work was supported by a grant from the Beijing Natural Science Foundation (No. L257006).

References
Mikel Artetxe, Vedanuj Goswami, Shruti Bhosale, Angela Fan, and Luke Zettlemoyer. 2023. Revisiting machine translation for cross-lingual classification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6489–6499, Singapore. Association for Computational Linguistics.

Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xinnian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, and 1 others. 2025. xCoT: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23550–23558.

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. 2024. When is multilinguality a curse? Language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4074–4096, Miami, Florida, USA. Association for Computational Linguistics.

Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. 2024. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7001–7016.

Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, and 1 others. 2021. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48.

Changjiang Gao, Hongda Hu, Peng Hu, Jiajun Chen, Jixing Li, and Shujian Huang. 2024. Multilingual pretraining and instruction tuning improve cross-lingual knowledge alignment, but only shallowly. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6101–6117, Mexico City, Mexico. Association for Computational Linguistics.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.

Zixian Huang, Wenhao Zhu, Gong Cheng, Lei Li, and Fei Yuan. 2024. MindMerger: Efficiently boosting LLM reasoning in non-English languages. Advances in Neural Information Processing Systems, 37:34161–34187.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966. PMLR.
Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023. Bactrian-X: Multilingual replicable instruction-following models with low-rank adaptation. arXiv preprint arXiv:2305.15011.

Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren. 2021. Common sense beyond English: Evaluating and improving multilingual language models for commonsense reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1274–1287, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA.

Gabriel Peyré, Marco Cuturi, and 1 others. 2019. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607.

Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che. 2023. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2695–2709, Singapore. Association for Computational Linguistics.

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Zhiwen Ruan, Yixia Li, He Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang, Yun Chen, and Guanhua Chen. 2025. LayAlign: Enhancing multilingual reasoning in large language models via layer-wise adaptive fusion and alignment strategy. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1481–1495, Albuquerque, New Mexico. Association for Computational Linguistics.

Adriaan MJ Schakel and Benjamin J Wilson. 2015. Measuring word significance using distributed representations of words. arXiv preprint arXiv:1508.02297.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, and 1 others. 2023. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, and 20 others. 2022. No language left behind: Scaling human-centered machine translation. Preprint, arXiv:2207.04672.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, and 1 others. 2024. Aya model: An instruction finetuned open-access multilingual language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15894–15939.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).

Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. 2020. A fast proximal point method for computing exact Wasserstein distance. In Uncertainty in Artificial Intelligence, pages 433–453. PMLR.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Sho Yokoi, Ryo Takahashi, Reina Akama, Jun Suzuki, and Kentaro Inui. 2020. Word rotator's distance. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2944–2960.
Chunk 28 · 1,995 chars
onal Linguistics: Human Language Technologies, pages 483–498, On- line. Association for Computational Linguistics. Sho Yokoi, Ryo Takahashi, Reina Akama, Jun Suzuki, and Kentaro Inui. 2020. Word rotator’s distance. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2944–2960. Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, and Minjoon Seo. 2024. Lang- Bridge: Multilingual reasoning without multilingual supervision. In Proceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 7502–7522, Bangkok, Thailand. Association for Computational Linguistics. Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2024. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Con- ference on Learning Representations. Biao Zhang, Philip Williams, Ivan Titov, and Rico Sen- nrich. 2020. Improving massively multilingual neu- ral machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 1628– 1639, Online. Association for Computational Linguis- tics. 11 -- 11 of 23 -- Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhen- grui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, and 1 oth- ers. 2023. Bayling: Bridging cross-lingual alignment and instruction following through interactive trans- lation for large language models. arXiv preprint arXiv:2306.10968. Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, and Yang Feng. 2024a. Bayling 2: A multilingual large language model with efficient language alignment. arXiv preprint arXiv:2411.16300. Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, and Jiajun Chen. 2025a. How does alignment enhance
Chunk 29 · 1,986 chars
Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, and Yang Feng. 2024a. Bayling 2: A multilingual large language model with efficient language alignment. arXiv preprint arXiv:2411.16300. Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu, Yeyun Gong, Shujian Huang, and Jiajun Chen. 2025a. How does alignment enhance llms’ multilingual capabilities? a language neurons per- spective. Preprint, arXiv:2505.21505. Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, and Jie Zhou. 2025b. Less, but better: Efficient multilingual expansion for LLMs via layer-wise mixture-of-experts. In Proceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 17948–17963, Vienna, Austria. Associa- tion for Computational Linguistics. Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, and Yang Liu. 2024b. Enhancing multilingual capabilities of large language models through self-distillation from resource-rich languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11189– 11204, Bangkok, Thailand. Association for Compu- tational Linguistics. Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2024. Multilingual machine translation with large language models: Empirical results and anal- ysis. In Findings of the Association for Computa- tional Linguistics: NAACL 2024, pages 2765–2781, Mexico City, Mexico. Association for Computational Linguistics. Wenhao Zhu, Yunzhe Lv, Qingxiu Dong, Fei Yuan, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2023. Extrapolating large language models to non-english by aligning languages. arXiv preprint arXiv:2308.04948. A Optimal Transport Algorithm for Heterogeneous Representations In this section, we briefly review the standard OT formulation and describe how it is adapted to
Chunk 30 · 1,998 chars
uang, Lingpeng Kong, Jiajun
Chen, and Lei Li. 2023. Extrapolating large language
models to non-english by aligning languages. arXiv
preprint arXiv:2308.04948.
A Optimal Transport Algorithm for
Heterogeneous Representations
In this section, we briefly review the standard OT
formulation and describe how it is adapted to align
heterogeneous and unequal-length representation
sequences in XBridge.
A.1 Optimal Transport Between Discrete Distributions
Optimal Transport (OT) provides a principled framework for measuring the discrepancy between two probability distributions by minimizing the cost of transporting probability mass from one distribution to the other (Peyré et al., 2019). Consider the following discrete transport problem: given two probability distributions $P$ and $Q$,

$$
P = \{(w_i, m_i)\}_{i=1}^{n}, \;\; \text{s.t.} \; \sum_{i=1}^{n} m_i = 1, \qquad
Q = \{(w'_j, m'_j)\}_{j=1}^{n'}, \;\; \text{s.t.} \; \sum_{j=1}^{n'} m'_j = 1, \tag{5}
$$
where each support point $w_i, w'_j \in \mathbb{R}^d$ is associated with a non-negative probability mass $m_i, m'_j$. Given a cost function $c(w_i, w'_j)$ that measures the unit cost of transporting mass from $w_i$ to $w'_j$, the transport cost between $P$ and $Q$ is defined as:

$$
\begin{aligned}
D(P, Q) = \min_{T \ge 0} \sum_{i,j} T_{ij}\, c(w_i, w'_j), \quad \text{s.t.} \;
& \sum_{j=1}^{n'} T_{ij} = m_i, \;\; \forall i \in \{1, \dots, n\}, \\
& \sum_{i=1}^{n} T_{ij} = m'_j, \;\; \forall j \in \{1, \dots, n'\},
\end{aligned} \tag{6}
$$

where $T_{ij}$ denotes the mass transported from $w_i$ to $w'_j$.
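To make Eq. (6) concrete, the following is a minimal sketch that solves a toy instance with an off-the-shelf linear-programming solver. The two distributions, the Euclidean cost, and all variable names are illustrative assumptions, not the paper's configuration (XBridge uses cosine distance over token representations).

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of Eq. (6): P has 3 support points, Q has 2, in R^2.
w = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # support of P
w_p = np.array([[0.5, 0.5], [1.0, 1.0]])             # support of Q
m = np.full(3, 1 / 3)                                # masses m_i
m_p = np.full(2, 1 / 2)                              # masses m'_j

# Cost c(w_i, w'_j); Euclidean distance here for readability.
C = np.linalg.norm(w[:, None, :] - w_p[None, :, :], axis=-1)
n, n_p = C.shape

# Both marginal constraints of Eq. (6), written over the flattened
# plan T (row-major, so T_ij sits at index i * n_p + j).
A_eq, b_eq = [], np.concatenate([m, m_p])
for i in range(n):            # sum_j T_ij = m_i
    row = np.zeros(n * n_p)
    row[i * n_p:(i + 1) * n_p] = 1.0
    A_eq.append(row)
for j in range(n_p):          # sum_i T_ij = m'_j
    col = np.zeros(n * n_p)
    col[j::n_p] = 1.0
    A_eq.append(col)

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print("D(P, Q) =", res.fun)                     # exact transport cost
print("T =\n", res.x.reshape(n, n_p).round(3))  # optimal plan
```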
A.2 OT for Aligning Unequal-Length Representation Sequences
In XBridge, we apply OT to align two heterogeneous token representation sequences:

$$
H^z = (H^z_1, \dots, H^z_k), \qquad
\tilde{H}^{z'} = (\tilde{H}^{z'}_1, \dots, \tilde{H}^{z'}_m), \tag{7}
$$
where $k \neq m$ in general due to different tokenization schemes. Both sequences originate from the same underlying LLM output but are obtained through different encoding pathways, making explicit token-wise correspondence unavailable.

We formulate their alignment as the following OT problem:

$$
\begin{aligned}
D(H^z, \tilde{H}^{z'}) = \min_{T \ge 0} \sum_{i,j} T_{ij}\, c(H^z_i, \tilde{H}^{z'}_j), \quad \text{s.t.} \;
& \sum_{j=1}^{m} T_{ij} = m^z_i, \;\; \forall i \in \{1, \dots, k\}, \\
& \sum_{i=1}^{k} T_{ij} = m^{z'}_j, \;\; \forall j \in \{1, \dots, m\},
\end{aligned} \tag{8}
$$
where the cost function $c(\cdot, \cdot)$ is defined as the cosine distance. The probability masses $m^z_i$ and $m^{z'}_j$ are obtained by normalizing the $\ell_1$ norms of the corresponding representations. This choice is motivated by prior work (Schakel and Wilson, 2015; Yokoi et al., 2020) showing that embedding norms correlate with token importance, with semantically salient words exhibiting larger magnitudes.
A.3 Approximate OT via Relaxed Marginal Constraints
Solving the exact OT problem requires $O(n^3)$ linear programming, which is computationally prohibitive for long sequences. While entropic regularization methods such as Sinkhorn (Cuturi, 2013) or IPOT (Xie et al., 2020) provide approximate solutions, they still introduce significant overhead during training.

Following Kusner et al. (2015), we adopt a relaxed OT formulation by removing the second marginal constraint:

$$
D^{*}(H^z, \tilde{H}^{z'}) = \min_{T \ge 0} \sum_{i,j} T_{ij}\, c(H^z_i, \tilde{H}^{z'}_j), \quad
\text{s.t.} \; \sum_{j=1}^{m} T_{ij} = m^z_i, \;\; \forall i \in \{1, \dots, k\}. \tag{9}
$$
This relaxation yields a lower bound on the exact OT distance and admits a closed-form solution: each representation $H^z_i$ transports all of its probability mass to the most similar $\tilde{H}^{z'}_j$ under the cosine distance. The resulting transport plan naturally supports unequal-length alignments, making it well suited for sequences with heterogeneous tokenizations; a minimal sketch of this closed form is given below.
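The closed form of Eq. (9) reduces to a mass-weighted nearest-neighbor search. Below is a minimal PyTorch sketch under our reading of Eqs. (8)-(9); the function name and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def relaxed_ot_distance(H_z: torch.Tensor, H_zp: torch.Tensor) -> torch.Tensor:
    """Relaxed OT distance D*(H^z, H~^z') of Eq. (9).

    H_z:  (k, d) one representation sequence.
    H_zp: (m, d) the other sequence (e.g., the frozen encoder side).
    """
    # Masses m^z_i: normalized l1 norms of the rows (Appendix A.2).
    mass = H_z.abs().sum(dim=-1)
    mass = mass / mass.sum()

    # Cosine-distance cost matrix C[i, j] = 1 - cos(H_z[i], H_zp[j]).
    cost = 1.0 - F.normalize(H_z, dim=-1) @ F.normalize(H_zp, dim=-1).T

    # Closed form: with only the first marginal kept, each H_z[i]
    # ships all of m^z_i to its cheapest target, so the objective is
    # a mass-weighted nearest-neighbor distance.
    return (mass * cost.min(dim=1).values).sum()
```

Because the minimum is taken row-wise, the plan is valid for any $k \neq m$, which is what makes the objective convenient for heterogeneous tokenizations.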
A.4 Role in XBridge

The proposed OT alignment provides a principled mechanism for aligning heterogeneous representations without assuming positional correspondence. Moreover, since the multilingual encoder is frozen during training, the relaxed OT objective anchors the alignment to the encoder-defined semantic geometry, encouraging decoder-side representations to remain compatible with the multilingual encoder-decoder space. Despite its simplicity, this approximation is sufficient for our setting, as the goal is semantic compatibility regularization rather than exact distribution matching.

B Details for Training Data

Translation Data in Stage 1

We sample English-centric translation pairs from OPUS-100 (Zhang et al., 2020) and filter out the off-target pairs, keeping 50K samples per translation direction. For XBridge, we further translate the English sentences into other languages $L_y$ using NLLB-200-3.3B to construct trilingual x-en-y data (see the sketch below). To mitigate translation noise in generation, we train XBridge using y-en-x, where the encoder processes translated sentences and the decoder processes natural sentences.
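Concretely, the trilingual construction above amounts to translating the English side of each pair; a minimal sketch with the public NLLB-200 checkpoint follows. The helper function and sample pair are illustrative assumptions, and in practice this would run batched over the 50K pairs per direction.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Public NLLB-200 checkpoint; the paper uses the 3.3B variant.
name = "facebook/nllb-200-3.3B"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
mt = AutoModelForSeq2SeqLM.from_pretrained(name)

def en_to(text: str, tgt: str) -> str:
    """Translate English text into the FLORES-200 language code `tgt`."""
    batch = tok(text, return_tensors="pt")
    # NLLB selects the output language via a forced BOS language token.
    ids = mt.generate(
        **batch,
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt),
        max_new_tokens=256,
    )
    return tok.batch_decode(ids, skip_special_tokens=True)[0]

# One OPUS-100 x-en pair (illustrative) becomes an x-en-y triple.
pair = {"sw": "Je, Dunia ni mviringo au ni mraba?",
        "en": "Is the Earth round or square?"}
triple = {**pair, "de": en_to(pair["en"], "deu_Latn")}
```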
Multilingual Reasoning Data and Multilingual Abstractive Summarization Data in Stage 2 and Stage 3

We extract multilingual reasoning data from Ruan et al. (2025), which contains 30K samples per language across ten languages (the same as in Section 4.1). We extract multilingual abstractive summarization data from XL-Sum (Hasan et al., 2021). Since XL-Sum is imbalanced across languages, we cap the data at 30K samples per language. For XBridge, we additionally construct bilingual responses using NLLB-200-3.3B.

Data Statistics

Figure 10 presents detailed statistics of the training data.

Task | Languages | Data Composition | Total Size
Translation | Bn, De, En, Es, Fr, Ja, Ru, Sw, Th, Zh | 50K × 72 translation directions | 3.6M
Multilingual Reasoning | Bn, De, En, Es, Fr, Ja, Ru, Sw, Th, Zh | 30K × 10 languages | 300K
Multilingual Abstractive Summarization | Bn, En, Es, Fr, Ja, Ru, Sw, Th, Zh | Bn-8K, En-30K, Es-30K, Fr-9K, Ja-7K, Ru-30K, Sw-8K, Th-7K, Zh-30K | 158K

Figure 10: Statistics of training datasets used in different stages.

C Detailed Results

Table 4, Table 6, and Table 7 present detailed BLEU scores on FLORES-101 (COMET scores in Table 5), accuracy on MGSM, and multilingual Rouge-L on XL-Sum for the main experiments in Section 4. Table 8, Table 9, and Table 10 present results for the ablation study in Section 5.1. Table 3 lists the included untuned languages and their language codes, and Table 11 presents detailed results for untuned-language generalization in Section 5.2. Table 12 presents BLEU scores for cross-lingual generation on FLORES-101, and Table 13 presents accuracy for language-on-demand generation on MGSM in Section 5.3. Table 14 and Table 15 present results for LLaMA3-8B composed with M2M100-1.2B in Section 5.5. Table 16 and Table 17 present results for MetaMath-7B composed with NLLB-200 in different sizes (600M vs. 1.3B) in Section 5.6.

D Supplementary Analysis

D.1 Efficiency Analysis

We compare the training and inference efficiency of XBridge with SFT (LLM-only), MindMerger (encoder-LLM), and the cascaded Translate-Test pipeline in Table 2. XBridge introduces only limited training overhead despite the additional encoder and decoder, thanks to its parameter-efficient design. For inference, XBridge is slower than the LLM-only method due to the additional decoding for multilingual generation, but it remains faster than the cascaded Translate-Test pipeline.
Overall, XBridge trades moderate computational cost for improved multilingual generation quality and robustness, while avoiding the inefficiency and error accumulation of cascaded systems.

System | Training | Inference
SFT | 1.0x | 1.0x
Translate-Test | - | 0.55x
MindMerger | 1.42x | 0.85x
XBridge | 0.91x | 0.66x

Table 2: Relative speed comparison.

D.2 Case Study on MGSM

In the case study, we compare the outputs of MindMerger, LayAlign, and XBridge in Figure 11. MindMerger and LayAlign adopt encoder-augmented architectures, which only enable multilingual-to-English processing. As a result, their generated responses are always in English, which is unfriendly to multilingual users who expect outputs in their native languages. In contrast, XBridge supports controllable target-language generation by explicitly specifying the decoder language token, allowing the model to produce responses in different languages as required (a minimal text-level sketch of this control follows). This demonstrates the advantage of the compositional encoder-LLM-decoder design in providing flexible multilingual generation.
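XBridge conditions its decoder on the LLM's hidden states, so the snippet below is only a text-level stand-in that isolates the control mechanism: the same English response is decoded into several languages by forcing a different language token at the start of decoding. The checkpoint and strings are illustrative assumptions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Text-level stand-in for the decoder-side language token of Figure 11;
# XBridge itself feeds LLM hidden states, not re-encoded text.
name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
dec = AutoModelForSeq2SeqLM.from_pretrained(name)

english_response = ("The number of blue fibers needed is 2. Half of that "
                    "number is 2/2 = 1. So 3 fibers are needed in total.")
batch = tok(english_response, return_tensors="pt")
for lang in ["swh_Latn", "deu_Latn", "zho_Hans"]:
    # The forced BOS token selects the output language on demand.
    ids = dec.generate(
        **batch,
        forced_bos_token_id=tok.convert_tokens_to_ids(lang),
        max_new_tokens=96,
    )
    print(f"<{lang}>", tok.batch_decode(ids, skip_special_tokens=True)[0])
```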
D.3 Multilingual Commonsense Reasoning

We further extend XBridge to X-CSQA (Lin et al., 2021) to evaluate both cross-lingual understanding and logical reasoning. We report results on MetaMath-7B and LLaMA3-8B across eight languages in Figure 12. XBridge outperforms MindMerger and LayAlign on MetaMath-7B across most languages, yielding the highest average accuracy. Notably, the gains are more pronounced on non-English languages, while English performance is preserved. This suggests that XBridge effectively enhances multilingual understanding without degrading the base model's reasoning ability, and achieves more balanced multilingual reasoning performance across languages.

Query: 制作一件袍子需要 2 匹蓝色纤维布料和这个数量一半的白色纤维布料。它一共需要用掉多少匹布料? (English: Making a robe takes 2 bolts of blue fiber and half that amount of white fiber. How many bolts does it take in total?)

System Responses

MindMerger: If 2 blue fibers are needed, then half that amount of white fibers is 2/2 = 1 white fiber.\nSo, in total, 2 blue fibers + 1 white fiber = 3 fibers are needed.\n#### 3\nThe answer is: 3

LayAlign: If it takes 2 yards of blue fabric to make a robe, then it takes 2/2 = 1 yard of white fabric.\nIn total, it takes 2 yards of blue fabric + 1 yard of white fabric = 3 yards of fabric.\n#### 3\nThe answer is: 3

XBridge:
<ben_Beng> নীল ফাইবারের সংখ্যা ২। অর্ধেক সংখ্যা ২/২ = ১। সুতরাং, সাদা ফাইবারের সংখ্যা ১। সুতরাং, ফাইবারের মোট সংখ্যা ২টি নীল ফাইবার + ১টি সাদা ফাইবার = ৩টি ফাইবার। #### 3 উত্তর হল: ৩
<deu_Latn> Die Anzahl der gebrauchten blauen Stoffe beträgt 2. Die Hälfte dieser Anzahl ist 2/2 = 1. Also ist die Anzahl der benötigten weißen Stoffe 1. Daher beträgt die Gesamtzahl der benötigten Stoffe 2 blaue Stoffe + 1 weiße Stoffe = 3 Stoffe.
<eng_Latn> The number of blue fibers needed is 2. Half of that number is 2/2 = 1. So, the number of white fibers needed is 1. Therefore, the total number of fibers needed is 2 blue fibers + 1 white fiber = 3 fibers. #### 3 The answer is: 3
<spa_Latn> El número de fibras azules necesarias es 2. La mitad de ese número es 2/2 = 1. Así, el número de fibras blancas necesarias es 1. Por lo tanto, el número total de fibras necesarias es 2 fibras azules + 1 fibra blanca = 3 fibras. #### 3 La respuesta es: 3
<fra_Latn> Le nombre de fibres bleues nécessaires est de 2. La moitié de ce nombre est de 2/2 = 1. Donc, le nombre de fibres blanches nécessaires est de 1. Par conséquent, le nombre total de fibres nécessaires est de 2 fibres bleues + 1 fibres blanches = 3 fibres.
<jpn_Jpan> 青い繊維の数は2です. その数の半分は2/2 = 1 です. したがって,必要な白い繊維の数は1です. したがって,必要な繊維の総数は2つの青い繊維+1つの白い繊維=3つの繊維です. #### 3 答えは: 3
<rus_Cyrl> Количество синих тканей, необходимых, равняется 2. Половина этого числа равняется 2/2 = 1. Таким образом, количество белых тканей, необходимых, равняется 1. Таким образом, общее количество тканей, необходимых, равняется 2 синим тканям + 1 белому тканю = 3 тканям. #### 3 Ответ: 3
<swh_Latn> Idadi ya nyuzi za bluu zinahitajika ni 2. Nusu ya idadi hiyo ni 2/2 = 1. Kwa hiyo, idadi ya nyuzi nyeupe zinahitajika ni 1. Kwa hiyo, jumla ya idadi ya nyuzi zinahitajika ni nyuzi 2 za bluu + nyuzi 1 nyeupe = nyuzi 3. #### 3 Jibu ni: 3
<tha_Thai> จํานวนเชื้อสีฟ้าที่ต้องการคือ 2 ครึ่งหนึ่งของจํานวนนั้นคือ 2/2 = 1 ดังนั้น จํานวนเชื้อสีขาวที่ต้องการคือ 1 ดังนั้น จํานวนเชื้อที่ต้องการทั้งหมดคือ 2 เชื้อสีฟ้า + 1 เชื้อสีขาว = 3 เชื้อ
<zho_Hans> 蓝色的数量是2. 需要的蓝色数量是2/2=1. 所以,需要的白色数量是1. 因此,所需的总数是2个蓝色的数量+1个白色的数量=3个. #### 3 答案是: 3

Figure 11: Case study on multilingual reasoning. The language token at the start of each XBridge response (e.g., <deu_Latn>) is fed to the decoder and controls the target generation language.

Figure 12: Radar plot comparison on X-CSQA (panels: MetaMath-7B and LLaMA3-8B; axes: De, En, Es, Fr, Ja, Ru, Sw, Zh; systems: MindMerger, LayAlign, XBridge).
ISO 639-1 | Language | ISO 639-1 | Language | ISO 639-1 | Language
Af | Afrikaans | It | Italian | Pl | Polish
Ar | Arabic | Ka | Georgian | Ps | Pashto
Az | Azerbaijani | Kk | Kazakh | Pt | Portuguese
Cs | Czech | Km | Khmer | Ro | Romanian
El | Modern Greek | Ko | Korean | Sl | Slovenian
Et | Estonian | Lt | Lithuanian | Sv | Swedish
Fa | Persian | Lv | Latvian | Ta | Tamil
Fi | Finnish | Mk | Macedonian | Te | Telugu
Gl | Galician | Ml | Malayalam | Tr | Turkish
Gu | Gujarati | Mn | Mongolian | Uk | Ukrainian
He | Hebrew | Mr | Marathi | Ur | Urdu
Hi | Hindi | My | Burmese | Vi | Vietnamese
Hr | Croatian | Ne | Nepali | Xh | Xhosa
Id | Indonesian | Nl | Dutch | |

Table 3: The included 41 untuned languages and corresponding language codes.

System | Bn-En | En-Bn | De-En | En-De | Es-En | En-Es | Fr-En | En-Fr | Ja-En | En-Ja
NLLB-200-1.3B | 37.78 | 32.83 | 46.23 | 39.91 | 35.37 | 31.35 | 47.14 | 51.33 | 29.60 | 19.07
MetaMath-7B:
MetaMath-7B | 1.46 | 0.67 | 34.36 | 19.42 | 36.33 | 27.42 | 15.54 | 11.64 | 27.62 | 16.76
MindMerger | 30.76 | - | 40.05 | - | 28.77 | - | 38.65 | - | 22.50 | -
LayAlign | 30.91 | - | 39.43 | - | 30.16 | - | 40.48 | - | 22.36 | -
XBridge (Ours) | 35.47 | 29.23 | 41.42 | 35.39 | 30.51 | 28.88 | 42.14 | 49.49 | 24.52 | 19.60
LLaMA3-8B:
LLaMA3-8B | 29.83 | 13.18 | 45.28 | 36.24 | 33.65 | 28.86 | 46.18 | 47.04 | 27.71 | 25.40
MindMerger | 33.86 | - | 42.52 | - | 30.70 | - | 41.47 | - | 25.48 | -
LayAlign | 32.95 | - | 41.29 | - | 30.36 | - | 39.40 | - | 24.62 | -
XBridge (Ours) | 37.09 | 28.42 | 45.75 | 35.45 | 32.00 | 29.59 | 46.10 | 49.38 | 27.63 | 20.12
Aya-23-8B:
Aya-23-8B | 8.59 | 2.43 | 45.46 | 38.03 | 33.22 | 30.50 | 46.17 | 49.45 | 29.11 | 29.34
MindMerger | 33.41 | - | 41.78 | - | 30.75 | - | 40.26 | - | 24.96 | -
LayAlign | 32.42 | - | 41.44 | - | 30.75 | - | 39.86 | - | 24.16 | -
XBridge (Ours) | 34.67 | 28.00 | 44.40 | 33.78 | 31.36 | 28.71 | 42.05 | 48.47 | 26.35 | 19.14
Qwen2.5-7B-Instruct:
Qwen2.5-7B-Instruct | 22.15 | 8.30 | 42.32 | 32.10 | 32.15 | 29.31 | 43.86 | 45.09 | 25.92 | 25.76
MindMerger | 34.20 | - | 43.46 | - | 31.99 | - | 42.00 | - | 25.43 | -
LayAlign | 33.39 | - | 42.12 | - | 31.17 | - | 40.58 | - | 26.11 | -
XBridge (Ours) | 35.89 | 27.59 | 44.55 | 33.02 | 31.52 | 28.22 | 42.64 | 47.80 | 25.50 | 18.66

System | Ru-En | En-Ru | Sw-En | En-Sw | Th-En | En-Th | Zh-En | En-Zh | X-En | En-X
NLLB-200-1.3B | 38.06 | 33.72 | 42.66 | 36.28 | 30.82 | 17.46 | 29.90 | 17.07 | 37.51 | 31.00
MetaMath-7B:
MetaMath-7B | 26.21 | 19.34 | 3.33 | 1.75 | 3.82 | 0.72 | 18.93 | 9.58 | 18.62 | 11.92
MindMerger | 32.82 | - | 39.43 | - | 26.63 | - | 24.49 | - | 31.57 | -
LayAlign | 34.28 | - | 39.02 | - | 26.56 | - | 24.59 | - | 31.98 | -
XBridge (Ours) | 34.97 | 31.03 | 42.02 | 34.28 | 28.30 | 17.23 | 20.97 | 23.04 | 33.37 | 29.80
LLaMA3-8B:
LLaMA3-8B | 37.31 | 30.49 | 35.87 | 19.31 | 30.39 | 19.80 | 30.46 | 25.90 | 35.19 | 27.36
MindMerger | 34.16 | - | 41.81 | - | 29.14 | - | 25.76 | - | 33.88 | -
LayAlign | 34.32 | - | 41.35 | - | 28.66 | - | 25.63 | - | 33.18 | -
XBridge (Ours) | 37.08 | 30.57 | 44.73 | 34.68 | 30.61 | 17.09 | 24.89 | 23.11 | 36.21 | 29.82
Aya-23-8B:
Aya-23-8B | 37.10 | 33.01 | 7.89 | 1.16 | 14.72 | 2.11 | 30.89 | 27.32 | 28.13 | 23.71
MindMerger | 34.50 | - | 41.56 | - | 28.34 | - | 25.40 | - | 33.44 | -
LayAlign | 34.64 | - | 40.22 | - | 27.62 | - | 25.15 | - | 32.92 | -
XBridge (Ours) | 33.74 | 28.90 | 42.88 | 34.25 | 28.95 | 16.52 | 18.94 | 21.87 | 33.70 | 28.85
Qwen2.5-7B-Instruct:
Qwen2.5-7B-Instruct | 32.81 | 27.98 | 15.05 | 4.35 | 26.27 | 19.89 | 31.34 | 29.99 | 30.21 | 24.75
MindMerger | 35.47 | - | 42.75 | - | 29.52 | - | 27.59 | - | 34.71 | -
LayAlign | 34.97 | - | 41.26 | - | 28.91 | - | 27.66 | - | 34.02 | -
XBridge (Ours) | 35.06 | 29.77 | 43.24 | 34.55 | 29.01 | 16.26 | 24.78 | 21.89 | 34.69 | 28.64

Table 4: Detailed BLEU scores on FLORES-101 for stage 1. "X" denotes all languages except for English. We bold the best scores for each LLM group.

System | Bn-En | En-Bn | De-En | En-De | Es-En | En-Es | Fr-En | En-Fr | Ja-En | En-Ja
NLLB-200-1.3B | 88.30 | 85.67 | 88.80 | 86.12 | 86.72 | 85.68 | 88.83 | 87.05 | 87.05 | 86.45
MetaMath-7B:
MetaMath-7B | 50.85 | 35.10 | 86.25 | 76.59 | 84.39 | 78.49 | 86.84 | 78.45 | 82.62 | 76.35
MindMerger | 86.87 | - | 87.81 | - | 85.52 | - | 87.64 | - | 84.94 | -
LayAlign | 87.10 | - | 87.77 | - | 85.75 | - | 87.77 | - | 84.78 | -
XBridge (Ours) | 88.03 | 84.98 | 88.17 | 85.31 | 86.16 | 85.26 | 88.07 | 86.53 | 86.04 | 85.99
LLaMA3-8B:
LLaMA3-8B | 86.21 | 69.38 | 88.83 | 85.62 | 86.89 | 85.22 | 88.96 | 86.45 | 86.90 | 87.47
MindMerger | 87.52 | - | 88.29 | - | 85.89 | - | 87.86 | - | 85.68 | -
LayAlign | 87.35 | - | 87.88 | - | 85.45 | - | 87.10 | - | 85.53 | -
XBridge (Ours) | 88.53 | 84.77 | 88.97 | 85.57 | 86.66 | 85.23 | 88.92 | 86.53 | 87.01 | 85.33
Aya-23-8B:
Aya-23-8B | 68.76 | 40.01 | 88.40 | 85.27 | 86.42 | 85.26 | 88.37 | 86.85 | 86.97 | 89.75
MindMerger | 87.89 | - | 88.32 | - | 86.28 | - | 88.08 | - | 86.11 | -
LayAlign | 87.66 | - | 88.28 | - | 86.05 | - | 87.84 | - | 85.70 | -
XBridge (Ours) | 88.12 | 84.40 | 88.80 | 84.45 | 86.48 | 84.86 | 88.33 | 86.22 | 86.78 | 85.45
Qwen2.5-7B-Instruct:
Qwen2.5-7B-Instruct | 83.02 | 61.32 | 88.13 | 84.31 | 86.28 | 85.35 | 88.21 | 86.42 | 86.06 | 88.76
MindMerger | 88.07 | - | 88.75 | - | 86.52 | - | 88.31 | - | 85.94 | -
LayAlign | 88.07 | - | 88.52 | - | 86.32 | - | 88.01 | - | 86.65 | -
XBridge (Ours) | 88.25 | 84.65 | 88.80 | 84.36 | 86.58 | 84.70 | 88.52 | 85.87 | 86.48 | 84.87

System | Ru-En | En-Ru | Sw-En | En-Sw | Th-En | En-Th | Zh-En | En-Zh | X-En | En-X
NLLB-200-1.3B | 86.20 | 88.02 | 85.11 | 83.07 | 86.39 | 80.17 | 85.66 | 76.98 | 87.01 | 84.36
MetaMath-7B:
MetaMath-7B | 83.41 | 72.45 | 49.89 | 43.48 | 58.68 | 39.53 | 82.32 | 45.69 | 73.92 | 60.68
MindMerger | 85.06 | - | 84.88 | - | 85.14 | - | 83.96 | - | 85.76 | -
LayAlign | 85.51 | - | 84.95 | - | 85.30 | - | 84.42 | - | 85.93 | -
XBridge (Ours) | 85.38 | 86.65 | 84.87 | 80.23 | 86.07 | 81.38 | 82.00 | 83.52 | 86.09 | 84.43
LLaMA3-8B:
LLaMA3-8B | 86.19 | 87.17 | 82.74 | 74.06 | 87.04 | 83.54 | 86.33 | 85.93 | 86.68 | 82.76
MindMerger | 85.33 | - | 85.30 | - | 85.83 | - | 84.48 | - | 86.24 | -
LayAlign | 85.24 | - | 85.25 | - | 85.74 | - | 84.13 | - | 85.96 | -
XBridge (Ours) | 86.23 | 86.27 | 85.63 | 80.19 | 86.79 | 81.12 | 84.64 | 83.46 | 87.04 | 84.27
Aya-23-8B:
Aya-23-8B | 85.87 | 88.02 | 54.36 | 31.49 | 76.40 | 48.21 | 86.14 | 86.88 | 80.19 | 71.31
MindMerger | 85.80 | - | 85.43 | - | 86.07 | - | 84.53 | - | 86.50 | -
LayAlign | 85.74 | - | 85.32 | - | 85.78 | - | 84.55 | - | 86.32 | -
XBridge (Ours) | 85.88 | 85.87 | 85.36 | 80.03 | 86.39 | 81.03 | 83.23 | 82.80 | 86.60 | 83.90
Qwen2.5-7B-Instruct:
Qwen2.5-7B-Instruct | 84.22 | 84.68 | 64.52 | 46.92 | 84.94 | 84.00 | 86.86 | 88.30 | 83.58 | 78.90
MindMerger | 85.95 | - | 85.82 | - | 86.49 | - | 85.62 | - | 86.83 | -
LayAlign | 85.82 | - | 85.66 | - | 86.42 | - | 85.61 | - | 86.79 | -
XBridge (Ours) | 85.91 | 85.80 | 85.43 | 80.23 | 86.45 | 80.51 | 84.99 | 82.79 | 86.82 | 83.75

Table 5: Detailed COMET scores on FLORES-101 for stage 1. "X" denotes all languages except for English. We bold the best scores for each LLM group.

System | Bn | De | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Avg
MetaMath-7B:
SFT | 30.00 | 40.40 | 48.00 | 39.60 | 39.20 | 32.40 | 39.60 | 35.20 | 32.80 | 36.00 | 37.32
Translate-Test | 34.80 | 49.20 | 53.20 | 47.60 | 40.80 | 45.20 | 49.20 | 43.20 | 32.40 | 46.00 | 44.16
MindMerger | 43.60 | 57.20 | 67.20 | 59.20 | 60.00 | 45.60 | 57.60 | 49.20 | 42.40 | 48.80 | 53.08
LayAlign | 40.40 | 54.80 | 62.80 | 59.20 | 52.40 | 51.60 | 56.40 | 53.20 | 47.20 | 51.20 | 52.92
XBridge_LLM (Ours) | 48.40 | 54.80 | 67.20 | 60.80 | 57.20 | 47.20 | 57.60 | 46.40 | 45.60 | 52.80 | 53.80
XBridge_Dec (Ours) | 47.20 | 52.80 | 66.80 | 60.00 | 51.60 | 44.80 | 56.00 | 45.60 | 44.40 | 52.00 | 52.12
LLaMA3-8B:
SFT | 38.80 | 46.40 | 54.00 | 54.80 | 44.40 | 38.80 | 45.20 | 40.80 | 44.00 | 49.20 | 45.64
Translate-Test | 40.40 | 56.00 | 62.00 | 54.80 | 54.80 | 51.60 | 57.20 | 46.80 | 35.20 | 48.00 | 50.68
MindMerger | 49.60 | 54.80 | 60.00 | 56.80 | 55.60 | 50.00 | 55.20 | 48.00 | 46.40 | 50.80 | 52.72
LayAlign | 50.00 | 56.00 | 62.40 | 60.40 | 57.60 | 50.00 | 56.80 | 47.60 | 48.80 | 52.80 | 54.24
XBridge_LLM (Ours) | 47.20 | 59.20 | 65.20 | 62.00 | 62.00 | 50.80 | 60.40 | 49.60 | 54.40 | 56.40 | 56.72
XBridge_Dec (Ours) | 46.00 | 56.80 | 62.80 | 61.20 | 57.20 | 46.40 | 58.40 | 49.60 | 51.20 | 55.20 | 54.48
Aya-23-8B:
SFT | 33.60 | 52.80 | 50.80 | 47.60 | 43.60 | 46.40 | 46.40 | 38.80 | 39.20 | 42.00 | 44.12
Translate-Test | 34.00 | 50.00 | 57.20 | 48.40 | 45.60 | 42.80 | 50.40 | 43.60 | 33.20 | 50.80 | 45.60
MindMerger | 40.80 | 49.60 | 48.00 | 48.00 | 46.00 | 42.40 | 48.00 | 42.40 | 33.60 | 40.40 | 43.92
LayAlign | 41.60 | 45.20 | 49.20 | 48.80 | 48.40 | 44.00 | 46.80 | 41.20 | 37.60 | 46.00 | 44.88
XBridge_LLM (Ours) | 44.40 | 49.20 | 56.80 | 53.20 | 50.00 | 47.20 | 49.20 | 44.80 | 37.60 | 46.00 | 47.84
XBridge_Dec (Ours) | 44.80 | 46.40 | 56.80 | 53.20 | 46.40 | 46.40 | 49.20 | 44.80 | 34.00 | 44.00 | 46.60
Qwen2.5-7B-Instruct:
SFT | 49.60 | 64.00 | 66.40 | 59.60 | 58.40 | 54.40 | 60.00 | 48.40 | 62.80 | 67.60 | 59.12
Translate-Test | 43.20 | 67.60 | 74.40 | 62.40 | 60.40 | 59.60 | 63.20 | 58.40 | 44.00 | 62.80 | 59.60
MindMerger | 60.40 | 71.60 | 83.60 | 76.40 | 73.60 | 64.80 | 77.20 | 65.20 | 66.40 | 70.00 | 70.92
LayAlign | 65.60 | 75.20 | 83.20 | 78.40 | 74.00 | 62.80 | 74.40 | 66.80 | 65.60 | 68.80 | 71.48
XBridge_LLM (Ours) | 63.20 | 74.40 | 83.60 | 77.20 | 72.80 | 65.20 | 74.00 | 67.60 | 68.40 | 71.60 | 71.80
XBridge_Dec (Ours) | 60.80 | 72.40 | 83.20 | 76.00 | 65.60 | 59.20 | 72.40 | 66.80 | 64.00 | 70.40 | 69.08

Table 6: Detailed accuracy results on MGSM. "XBridge_LLM" denotes English reasoning outputs from the LLM, while "XBridge_Dec" denotes multilingual reasoning outputs via the external decoder. We bold the best scores for each LLM group.

System | Bn | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Avg
MetaMath-7B:
SFT | 16.79 | 33.27 | 23.68 | 28.42 | 29.20 | 25.42 | 27.01 | 18.68 | 20.16 | 24.74
MindMerger | 17.93 | 27.27 | 21.03 | 26.05 | 28.11 | 22.25 | 26.49 | 19.20 | 18.56 | 22.99
LayAlign | 17.94 | 28.84 | 21.49 | 26.31 | 27.55 | 22.22 | 26.29 | 20.09 | 19.02 | 23.30
XBridge (Ours) | 18.93 | 31.66 | 22.75 | 27.96 | 32.12 | 24.64 | 28.90 | 20.70 | 24.12 | 25.75
LLaMA3-8B:
SFT | 20.87 | 33.59 | 23.84 | 30.18 | 30.42 | 26.12 | 28.73 | 22.51 | 21.72 | 26.44
MindMerger | 18.76 | 31.29 | 22.88 | 27.84 | 29.02 | 23.97 | 27.63 | 20.47 | 19.75 | 24.62
LayAlign | 18.78 | 30.78 | 22.98 | 27.47 | 28.55 | 23.59 | 27.58 | 20.48 | 19.56 | 24.42
XBridge (Ours) | 20.65 | 33.94 | 24.87 | 29.62 | 33.74 | 26.42 | 30.75 | 23.19 | 25.91 | 27.68
Aya-23-8B:
SFT | 17.68 | 34.97 | 24.88 | 29.78 | 31.43 | 26.96 | 26.11 | 20.14 | 21.70 | 25.96
MindMerger | 18.49 | 31.87 | 22.54 | 27.40 | 28.21 | 23.49 | 27.36 | 20.23 | 19.45 | 24.34
LayAlign | 17.93 | 31.56 | 22.23 | 26.83 | 28.17 | 23.07 | 27.25 | 20.35 | 19.23 | 24.07
XBridge (Ours) | 19.03 | 33.10 | 23.70 | 27.59 | 32.82 | 25.48 | 28.78 | 20.96 | 25.23 | 26.30
Qwen2.5-7B-Instruct:
SFT | 19.52 | 33.52 | 24.79 | 29.14 | 30.81 | 26.20 | 26.13 | 22.16 | 21.41 | 25.97
MindMerger | 18.17 | 30.38 | 22.06 | 25.95 | 28.33 | 22.92 | 26.40 | 20.04 | 18.85 | 23.68
LayAlign | 18.27 | 30.51 | 22.35 | 26.56 | 28.90 | 22.82 | 26.32 | 20.18 | 19.34 | 23.92
XBridge (Ours) | 19.22 | 31.52 | 23.48 | 27.72 | 32.77 | 24.53 | 27.95 | 22.19 | 25.02 | 26.04

Table 7: Detailed multilingual Rouge-L results on XL-Sum. For XL-Sum, since the baselines produce English summaries only, we translate them into target languages using NLLB-200-1.3B for comparison. We bold the best scores for each LLM group.

Variants | Lrl-En | En-Lrl | Hrl-En | En-Hrl | X-En | En-X
MetaMath-7B | 2.87 | 1.05 | 26.50 | 17.36 | 18.62 | 11.92
w/o OT | 35.33 | 22.20 | 32.98 | 25.30 | 33.76 | 24.27
w/o Decoder | 36.22 | - | 35.01 | - | 35.41 | -
XBridge | 33.93 | 25.56 | 31.98 | 29.37 | 32.63 | 28.10

Table 8: Ablation results on FLORES-101 for stage 1. "Lrl" and "Hrl" denote low- and high-resource languages, respectively. Following Shi et al. (2023), we treat Bn, Sw, and Th as low-resource languages, and the remaining as high-resource ones. "X" denotes all languages except for English. We bold the best scores.

Variants | Lrl-LLM | Lrl-Dec | Hrl-LLM | Hrl-Dec | Avg-LLM | Avg-Dec
MetaMath-7B | 5.60 | - | 50.06 | - | 36.72 | -
SFT | 32.67 | - | 39.31 | - | 37.32 | -
w/o OT | 48.27 | 47.20 | 56.23 | 54.57 | 53.84 | 52.36
w/o Decoder | 48.27 | - | 56.91 | - | 54.32 | -
w/o Stage 1 | 29.33 | 28.13 | 52.51 | 51.37 | 45.56 | 44.40
Joint Stage 2&3 | 43.07 | 42.53 | 48.63 | 47.26 | 46.96 | 45.84
XBridge | 48.67 | 47.87 | 57.49 | 55.09 | 54.84 | 52.92

Table 9: Ablation accuracy results on MGSM. "Lrl" and "Hrl" denote low- and high-resource languages, respectively. "-LLM" denotes English reasoning outputs from the LLM, while "-Dec" denotes multilingual reasoning outputs via the external decoder. We bold the best scores.

Variants | Bn | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Lrl | Hrl | Avg
SFT | 16.79 | 33.27 | 23.68 | 28.42 | 29.20 | 25.42 | 27.01 | 18.68 | 20.16 | 20.83 | 26.69 | 24.74
w/o OT | 18.55 | 31.49 | 22.68 | 27.86 | 32.05 | 24.27 | 28.35 | 20.27 | 23.81 | 22.39 | 27.03 | 25.48
XBridge | 18.85 | 31.50 | 22.84 | 28.18 | 32.26 | 24.36 | 28.57 | 21.08 | 24.40 | 22.83 | 27.26 | 25.78

Table 10: Ablation Rouge-L results of XL-Sum on MetaMath-7B. "Lrl" and "Hrl" denote low- and high-resource languages, respectively. We bold the best scores.

Languages | NLLB (X→En) | MetaMath (X→En) | w/o OT (X→En) | XBridge (X→En) | NLLB (En→X) | MetaMath (En→X) | w/o OT (En→X) | XBridge (En→X)
Af | 58.33 | 17.64 | 53.69 | 54.90 | 42.69 | 12.79 | 37.30 | 41.75
Ar | 42.42 | 24.59 | 40.61 | 38.20 | 33.13 | 10.94 | 16.81 | 26.37
Az | 28.25 | 28.97 | 25.35 | 23.57 | 19.39 | 16.23 | 12.25 | 16.18
Cs | 41.17 | 1.34 | 40.47 | 38.54 | 35.85 | 0.32 | 23.39 | 30.03
El | 37.68 | 3.59 | 37.19 | 35.43 | 28.16 | 0.80 | 19.07 | 24.34
Et | 38.22 | 29.84 | 33.99 | 33.30 | 29.77 | 9.62 | 19.61 | 26.28
Fa | 36.91 | 8.67 | 35.10 | 33.06 | 26.86 | 2.27 | 17.28 | 22.33
Fi | 45.36 | 31.28 | 42.34 | 41.34 | 38.09 | 18.29 | 30.98 | 34.92
Gl | 43.18 | 2.67 | 40.69 | 39.12 | 37.72 | 0.78 | 25.71 | 33.11
Gu | 44.32 | 28.17 | 43.02 | 40.53 | 36.21 | 18.51 | 19.81 | 28.92
He | 42.93 | 0.21 | 40.75 | 39.82 | 38.07 | 0.12 | 26.84 | 33.31
Hi | 39.18 | 29.48 | 39.40 | 37.57 | 31.65 | 13.01 | 20.71 | 26.83
Hr | 45.60 | 24.70 | 44.42 | 42.83 | 46.27 | 12.54 | 35.95 | 40.81
Id | 38.74 | 7.49 | 34.99 | 33.54 | 34.57 | 2.30 | 24.36 | 28.92
It | 30.15 | 3.87 | 28.25 | 26.93 | 27.28 | 1.12 | 18.13 | 23.72
Ka | 34.89 | 5.43 | 31.69 | 30.60 | 27.81 | 1.23 | 18.33 | 24.24
Kk | 33.11 | 5.87 | 29.29 | 27.58 | 15.01 | 1.92 | 8.74 | 13.13
Km | 30.88 | 0.08 | 27.63 | 26.25 | 24.84 | 0.16 | 15.51 | 20.58
Ko | 22.49 | 36.64 | 15.39 | 6.83 | 0.18 | 13.52 | 0.70 | 0.51
Lt | 34.38 | 2.85 | 33.89 | 32.04 | 28.25 | 1.59 | 19.97 | 24.62
Lv | 39.26 | 0.23 | 36.20 | 34.78 | 29.87 | 0.09 | 18.89 | 24.50
Mk | 37.98 | 14.19 | 35.15 | 33.72 | 25.91 | 4.29 | 17.13 | 21.20
Ml | 44.49 | 2.05 | 42.90 | 41.80 | 37.80 | 0.45 | 27.26 | 32.83
Mn | 26.73 | 29.23 | 25.88 | 24.91 | 16.95 | 18.22 | 9.37 | 14.18
Mr | 29.24 | 25.51 | 26.04 | 26.38 | 18.53 | 5.84 | 13.92 | 18.26
My | 36.07 | 37.48 | 34.04 | 32.39 | 31.54 | 30.64 | 22.08 | 27.72
Ne | 42.52 | 3.61 | 38.93 | 37.98 | 28.37 | 1.22 | 20.28 | 25.64
Nl | 33.18 | 32.03 | 31.83 | 30.20 | 25.95 | 16.41 | 16.94 | 20.80
Pl | 52.13 | 0.26 | 49.96 | 48.55 | 49.84 | 0.26 | 38.40 | 44.17
Ps | 34.68 | 10.20 | 32.04 | 30.48 | 19.93 | 2.18 | 9.93 | 16.49
Pt | 47.08 | 25.82 | 43.58 | 42.52 | 37.89 | 16.67 | 28.34 | 34.26
Ro | 36.87 | 41.00 | 36.54 | 33.82 | 30.79 | 29.45 | 20.87 | 26.02
Sl | 35.35 | 27.14 | 30.83 | 29.90 | 31.36 | 20.21 | 24.99 | 27.54
Sv | 49.10 | 40.61 | 49.21 | 46.40 | 43.34 | 26.96 | 32.16 | 38.06
Ta | 35.56 | 12.89 | 32.42 | 31.67 | 31.99 | 2.05 | 20.41 | 27.55
Te | 41.87 | 10.57 | 37.98 | 37.60 | 38.21 | 3.45 | 24.07 | 29.98
Tr | 40.87 | 18.17 | 37.78 | 35.94 | 34.67 | 7.62 | 20.11 | 28.19
Uk | 41.39 | 36.98 | 39.21 | 33.53 | 32.30 | 22.16 | 22.45 | 27.34
Ur | 37.23 | 0.56 | 35.27 | 33.65 | 27.06 | 0.27 | 15.98 | 24.49
Vi | 38.84 | 25.91 | 36.37 | 34.11 | 39.95 | 11.68 | 29.97 | 35.84
Xh | 38.49 | 20.87 | 37.20 | 35.22 | 27.76 | 9.99 | 20.24 | 26.42
Avg | 38.71 | 17.29 | 36.28 | 34.57 | 30.78 | 8.98 | 21.10 | 26.64

Table 11: Detailed results on 41 untuned languages, composing MetaMath-7B with NLLB-200-1.3B. We evaluate cross-model mapping quality on FLORES-101, following the main experiments. "X" denotes all languages except for English. We bold the best scores for the LLM group.

Source | Bn | De | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Avg
Bn | - | 22.47 | 37.09 | 20.56 | 31.36 | 16.69 | 21.03 | 22.85 | 13.62 | 17.42 | 22.57
De | 24.74 | - | 45.75 | 25.57 | 39.82 | 18.73 | 26.51 | 26.34 | 14.82 | 19.81 | 26.90
En | 28.42 | 35.45 | - | 29.59 | 49.38 | 20.12 | 30.57 | 34.68 | 17.09 | 23.11 | 29.82
Es | 22.21 | 23.42 | 32.00 | - | 33.74 | 16.82 | 21.78 | 22.01 | 12.93 | 17.57 | 22.50
Fr | 24.26 | 28.62 | 46.10 | 26.64 | - | 17.75 | 25.90 | 26.99 | 14.47 | 19.96 | 25.63
Ja | 20.76 | 20.05 | 27.63 | 18.59 | 26.94 | - | 18.82 | 19.76 | 12.60 | 17.16 | 20.26
Ru | 22.89 | 25.65 | 37.08 | 23.41 | 34.80 | 17.85 | - | 23.96 | 13.91 | 18.82 | 24.26
Sw | 23.16 | 25.36 | 44.73 | 22.37 | 36.35 | 17.44 | 23.27 | - | 14.24 | 18.18 | 25.01
Th | 21.59 | 20.98 | 30.61 | 19.34 | 29.39 | 16.37 | 19.78 | 21.86 | - | 17.12 | 21.89
Zh | 20.36 | 19.78 | 24.89 | 18.90 | 27.12 | 15.89 | 18.74 | 20.21 | 12.94 | - | 19.87
Avg | 23.15 | 24.64 | 36.21 | 22.77 | 34.32 | 17.52 | 22.93 | 24.30 | 14.07 | 18.79 | 23.87

Table 12: BLEU scores for cross-lingual generation of XBridge, composing LLaMA3-8B with NLLB-200-1.3B. Rows denote the source language and columns denote the target language. The source text is encoded by the multilingual encoder, the LLM produces an English translation, and the decoder generates the target-language text conditioned on the target-language token. For X→En directions, we directly evaluate the LLM outputs.

Query | Bn | De | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Avg
Bn | 46.00 | 44.40 | 45.20 | 45.60 | 40.80 | 42.80 | 44.00 | 46.00 | 44.00 | 44.80 | 44.36
De | 58.40 | 56.40 | 58.40 | 59.20 | 53.20 | 54.80 | 58.40 | 58.00 | 58.40 | 59.20 | 57.44
En | 63.60 | 62.80 | 63.20 | 63.60 | 59.60 | 61.60 | 62.80 | 63.20 | 61.20 | 64.00 | 62.56
Es | 60.00 | 57.60 | 61.60 | 61.20 | 55.20 | 56.80 | 60.40 | 60.80 | 57.60 | 60.80 | 59.20
Fr | 58.80 | 58.40 | 60.40 | 60.80 | 57.20 | 56.80 | 60.80 | 59.60 | 58.00 | 60.00 | 59.08
Ja | 49.20 | 47.20 | 49.20 | 50.00 | 43.60 | 46.80 | 48.80 | 48.80 | 48.00 | 48.80 | 48.04
Ru | 57.60 | 58.40 | 59.60 | 60.00 | 55.20 | 57.60 | 58.40 | 58.40 | 56.00 | 58.00 | 57.92
Sw | 49.20 | 48.00 | 50.00 | 50.40 | 45.60 | 48.00 | 49.20 | 49.60 | 48.40 | 48.80 | 48.72
Th | 50.40 | 49.60 | 52.80 | 50.80 | 49.20 | 49.20 | 50.00 | 52.40 | 50.00 | 52.40 | 50.68
Zh | 55.20 | 53.60 | 55.20 | 55.60 | 50.80 | 51.20 | 54.40 | 55.20 | 52.80 | 55.20 | 53.92
Avg | 54.84 | 53.64 | 55.56 | 55.72 | 51.04 | 52.56 | 54.72 | 55.20 | 53.44 | 55.20 | 54.19

Table 13: Accuracy for language-on-demand generation of XBridge, composing LLaMA3-8B with NLLB-200-1.3B. Rows denote the query language and columns denote the response language. The query is first processed by the encoder, the LLM produces an English response, and the decoder then generates the target-language response via the target language token.

System | Lrl-En | En-Lrl | Hrl-En | En-Hrl | X-En | En-X
M2M100-1.2B | 28.00 | 22.25 | 33.11 | 32.55 | 31.41 | 29.12
LLaMA3-8B | 32.03 | 17.43 | 36.77 | 32.32 | 35.19 | 27.36
MindMerger | 31.37 | - | 32.67 | - | 32.24 | -
LayAlign | 31.28 | - | 32.47 | - | 32.08 | -
XBridge (Ours) | 32.24 | 21.66 | 35.86 | 28.67 | 34.65 | 26.33

Table 14: Detailed FLORES-101 translation results for LLaMA3-8B composed with M2M100-1.2B. "Lrl" and "Hrl" denote low- and high-resource languages, respectively. "X" denotes all languages except for English. We bold the best scores for the LLM group.

System | Bn | De | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Avg
SFT | 38.80 | 46.40 | 54.00 | 54.80 | 44.40 | 38.80 | 45.20 | 40.80 | 44.00 | 49.20 | 45.64
Translate-Test | 40.40 | 56.00 | 62.00 | 54.80 | 54.80 | 51.60 | 57.20 | 46.80 | 35.20 | 48.00 | 50.68
MindMerger | 52.40 | 55.60 | 62.40 | 55.60 | 59.60 | 50.40 | 58.40 | 47.60 | 50.80 | 56.40 | 54.92
LayAlign | 51.20 | 60.00 | 65.20 | 58.00 | 55.60 | 50.00 | 59.20 | 49.20 | 50.80 | 54.00 | 55.32
XBridge_LLM (Ours) | 49.20 | 57.20 | 66.00 | 60.80 | 52.80 | 54.40 | 54.80 | 47.20 | 52.80 | 60.00 | 55.52
XBridge_Dec (Ours) | 48.40 | 53.20 | 62.80 | 59.20 | 48.40 | 53.20 | 54.00 | 44.40 | 48.40 | 58.80 | 53.08

Table 15: Detailed MGSM results for LLaMA3-8B composed with M2M100-1.2B. "XBridge_LLM" denotes English reasoning outputs from the LLM, while "XBridge_Dec" denotes multilingual reasoning outputs via the external decoder. We bold the best scores for the LLM group.
System | Lrl-En | En-Lrl | Hrl-En | En-Hrl | X-En | En-X
NLLB-200-600M | 33.59 | 26.43 | 34.83 | 29.52 | 34.42 | 28.49
NLLB-200-1.3B | 37.09 | 28.86 | 37.72 | 32.08 | 37.51 | 31.00
MetaMath-7B | 2.87 | 1.05 | 26.50 | 17.36 | 18.62 | 11.92
XBridge (600M) | 31.20 | 23.42 | 29.46 | 26.55 | 30.04 | 25.50
XBridge (1.3B) | 35.26 | 26.91 | 32.42 | 31.24 | 33.37 | 29.80

Table 16: Detailed FLORES-101 translation results for MetaMath-7B composed with NLLB-200 in different sizes (600M vs. 1.3B). "Lrl" and "Hrl" denote low- and high-resource languages, respectively. "X" denotes all languages except for English. We bold the best scores for the LLM group.

System | Bn | De | En | Es | Fr | Ja | Ru | Sw | Th | Zh | Avg
SFT | 38.80 | 46.40 | 54.00 | 54.80 | 44.40 | 38.80 | 45.20 | 40.80 | 44.00 | 49.20 | 45.64
XBridge_LLM (600M) | 47.60 | 56.00 | 65.60 | 56.00 | 56.00 | 47.20 | 56.00 | 50.80 | 49.20 | 50.00 | 53.44
XBridge_Dec (600M) | 45.20 | 52.80 | 65.60 | 55.60 | 51.60 | 44.40 | 54.40 | 50.80 | 46.00 | 49.20 | 51.56
XBridge_LLM (1.3B) | 48.40 | 54.80 | 67.20 | 60.80 | 57.20 | 47.20 | 57.60 | 46.40 | 45.60 | 52.80 | 53.80
XBridge_Dec (1.3B) | 47.20 | 52.80 | 66.80 | 60.00 | 51.60 | 44.80 | 56.00 | 45.60 | 44.40 | 52.00 | 52.12

Table 17: Detailed MGSM accuracy results for MetaMath-7B composed with NLLB-200 in different sizes (600M vs. 1.3B). "XBridge_LLM" denotes English reasoning outputs from the LLM, while "XBridge_Dec" denotes multilingual reasoning outputs via the external decoder. We bold the best scores for the LLM group.