Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

Authors: Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing HuangarXiv:2510.14466 Source

Summary

This paper introduces LiRA (Linguistic Robust Anchoring), a framework to improve cross-lingual performance of large language models (LLMs) for low-resource languages. LLMs typically struggle with these languages due to limited training data, translation noise, and unstable cross-lingual alignment. LiRA addresses these issues by jointly optimizing representation stability and cross-lingual semantic consistency through two components: Arca, which aligns low-resource inputs to an English semantic space using anchor-based alignment and collaborative encoding, and LaSR, a lightweight head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. The framework is theoretically grounded, guaranteeing bounded representation deviation and stable downstream performance under controlled anchoring error and translation-induced bias. The authors release a new multilingual product retrieval dataset covering seven Southeast Asian and South Asian languages. Extensive experiments across retrieval, ranking, question answering, and reasoning tasks show consistent improvements over existing methods. LiRA is plug-and-play, requiring only lightweight fine-tuning on top of existing pretrained models.

PDF viewer

Chunks(49)

Chunk 0 · 1,993 chars

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages
Haolin Li * 1 2 3 Haipeng Zhang * 3 Mang Li 3 Yaohua Wang 1 3 Lijie Wen 2 Yu Zhang 3 Biqing Huang 1
Abstract
Large language models (LLMs) continue to strug-
gle with low-resource languages, primarily due to
limited training data, translation noise, and unsta-
ble cross-lingual alignment. To address these chal-
lenges, we propose LiRA (Linguistic Robust An-
choring for LLMs)—a plug-and-play framework
that requires only lightweight fine-tuning on top
of existing pretrained backbones. LiRA jointly op-
timizes representation stability and cross-lingual
semantic consistency by combining two key com-
ponents: Arca (Anchored Representation Compo-
sition Architecture), which aligns low-resource
inputs to a shared English semantic space through
anchor-based alignment and collaborative encod-
ing; and LaSR (Language-coupled Semantic Rea-
soner), a lightweight, language-aware head that
enforces consistency regularization for unified
cross-lingual understanding, retrieval, and rea-
soning. We theoretically show that under con-
trolled anchoring error and translation-induced
bias, LiRA guarantees bounded representation
deviation and stable downstream performance un-
der local Lipschitz continuity. To facilitate re-
search, we release a new multilingual product
retrieval dataset covering five Southeast Asian
and two South Asian languages. Extensive exper-
iments across diverse low-resource benchmarks
demonstrate consistent improvements in retrieval,
ranking, question answering, and reasoning tasks.
Code will be publicly available on GitHub, and
the dataset will be hosted on Hugging Face.
1. Introduction
Large language models (LLMs) have made remarkable
progress in natural language understanding and reasoning.
Yet these gains are unevenly distributed: performance is con-
centrated in high-resource languages such as English and
1Department of Automation, Tsinghua University 2School of
Software, Tsinghua

Chunk 1 · 1,997 chars

1. Introduction
Large language models (LLMs) have made remarkable
progress in natural language understanding and reasoning.
Yet these gains are unevenly distributed: performance is con-
centrated in high-resource languages such as English and
1Department of Automation, Tsinghua University 2School of
Software, Tsinghua University 3Alibaba Group. Correspondence
to: Yu Zhang <¡x@email¿>, Biqing Huang <¡xx@email¿>.
Preprint. January 30, 2026.
Chinese, while low-resource languages (LRLs) continue to
lag far behind. This gap is driven by long-tailed pretraining
distributions (Haddow et al., 2022) (Figure 1a), limited or
noisy parallel data (Ataman et al., 2025), and unstable cross-
lingual alignment (Xu et al., 2025). As a result, directly
deploying LLMs for retrieval and reasoning in LRLs often
leads to degraded accuracy and brittle, inconsistent behavior,
limiting the promise of truly inclusive NLP systems.
Existing cross-lingual adaptation methods typically fall into
two families: machine translation (MT)-based pipelines
(Artetxe et al., 2023; Shubham, 2024) and multilingual ap-
proaches (Singh et al., 2024). Although MT-based pipelines
can be effective, they are prone to error propagation (Wu
et al., 2019) and semantic drift (Beinborn & Choenni, 2020),
especially for complex settings that require multi-step rea-
soning or nuanced interpretation. Multilingual encoders,
in contrast, offer more language-agnostic representations
but often fail to inherit the strong English-centric reasoning
ability exhibited by LLMs (Hu et al., 2020). Recent sys-
tems such as MindMerger (Huang et al., 2024) attempt to
narrow this gap by integrating the capabilities of LLMs and
multilingual models, yet its reliance on parallel translation
data may still lead to error propagation and semantic drift.
Similarly, Lusifer (Man et al., 2025) connects a multilingual
encoder with an LLM-based embedding model to enable
cross-lingual transfer; however, its training paradigm lacks
information from

Chunk 2 · 1,997 chars

and
multilingual models, yet its reliance on parallel translation
data may still lead to error propagation and semantic drift.
Similarly, Lusifer (Man et al., 2025) connects a multilingual
encoder with an LLM-based embedding model to enable
cross-lingual transfer; however, its training paradigm lacks
information from low-resource languages, which may com-
promise the performance of the transfer during inference.
Moreover, despite empirical gains, these lines of work re-
main theoretically underdeveloped and do not consistently
close the gap on key low-resource tasks such as retrieval,
ranking, and reasoning.
Our work is motivated by the observation that cross-
lingual adaptation still relies heavily on machine translation
pipelines or multilingual encoders. In real-world settings,
such as low-resource e-commerce, these solutions are often
undermined by translation noise and unstable cross-lingual
representations. More importantly, they rarely model how
semantic drift and representation-mapping errors propagate
through the system, making robustness difficult to analyze
and even harder to improve systematically. To address these
challenges, we introduce LiRA (Linguistic Representation
Anchoring), a plug-and-play framework that anchors low-
resource languages to an English semantic space while pre-
1
arXiv:2510.14466v2 [cs.CL] 29 Jan 2026

-- 1 of 26 --

Submission and Formatting Instructions for ICML 2026
EN CN TH BD PK
Lang Distribution
LLM Encoder EN
PK
CN
BD
TH
Semantic Embedding Space 	Reward Chart
LLM EN
CN
AR
BD
PK
🌟🌟🌟🌟🌟🌟
🌟🌟🌟🌟🌟
🌟🌟
🌟
🌟
Query_lr
Arca
EN
PK
CN
BD
TH
Emb_lr
LaSR
Query_En 	LLM Encoder
LLM Transformer
➕
Emb_lr 	Emb_en
Retrieval or Reasoning ✅
(a) Long-tailed pre-training distribution leads to large
cross-language performance disparity.
(b) Overview of our model framework. lr and en represent
low-resource language and English, respectively.
Figure 1. Challenge and overview of our LiRA.
serving LLM-level reasoning (Figure 1b). LiRA can

Chunk 3 · 1,999 chars

trieval or Reasoning ✅
(a) Long-tailed pre-training distribution leads to large
cross-language performance disparity.
(b) Overview of our model framework. lr and en represent
low-resource language and English, respectively.
Figure 1. Challenge and overview of our LiRA.
serving LLM-level reasoning (Figure 1b). LiRA can be
attached to a variety of pretrained backbones and requires
only lightweight fine-tuning to improve cross-lingual rep-
resentations. Unlike prior approaches, LiRA is grounded
in a rigorous theoretical framework that provides formal
guarantees on the completeness and stability of the learned
representations. LiRA integrates two complementary com-
ponents: (i) Arca, which reduces semantic drift by aligning
multilingual representations with English via critic–actor in-
teraction and feature anchoring; and (ii) LaSR, a lightweight
head that fuses multilingual and English embeddings with
queue-based objectives to support robust retrieval and rea-
soning in low-resource settings. In this way, LiRA leverages
strong English capabilities and transfers them effectively to
underrepresented languages.
Our main contributions are summarized as follows:
• We propose LiRA, a plug-and-play cross-lingual frame-
work that transfers LLMs’ strong English capability to
mid- and low-resource languages.
• We establish solid theoretical foundations, providing rig-
orous guarantees of LiRA’s completeness and stability.
• We release a product retrieval dataset covering 5 South-
east Asian and 2 South Asian mid- and low-resource lan-
guages, enabling further research in this underexplored
area.
• Extensive experiments across ranking, retrieval, and rea-
soning tasks demonstrate that LiRA achieves new state-
of-the-art performance.
2. Related Work
Cross-lingual Information Retrieval Early CLIR systems
relied on multilingual PLMs (e.g., mBERT (Devlin et al.,
2019), XLM-R (Conneau et al., 2020)) for alignment; recent
work moves toward supervision-light transfer and LLM-
embedding

Chunk 4 · 1,991 chars

ks demonstrate that LiRA achieves new state-
of-the-art performance.
2. Related Work
Cross-lingual Information Retrieval Early CLIR systems
relied on multilingual PLMs (e.g., mBERT (Devlin et al.,
2019), XLM-R (Conneau et al., 2020)) for alignment; recent
work moves toward supervision-light transfer and LLM-
embedding adaptation. LUSIFER integrates a multilingual
encoder with an English-specialized LLM-based embedding
model via a connector to yield strong zero-shot multilingual
retrieval (Man et al., 2025). However, the zero-shot train-
ing paradigm limits further performance gains. Beyond
passage-level retrieval, CCPR formulates contextualized
phrase-level cross-lingual retrieval to mitigate polysemy (Li
et al., 2024), while XRAG benchmarks end-to-end cross-
lingual RAG from retrieval to generation (Liu et al., 2025).
Domain resources (e.g., CrossMath) and synthetic datasets
(e.g., SWIM-IR) further broaden evaluation and training
coverage (Gore et al., 2024; Thakur et al., 2024). Existing
methods often rely on heuristic designs without a unified
theoretical framework, which may restrict their generality
and theoretical interpretability. By contrast, our approach is
grounded in a principled framework that unifies multilingual
latent embedding spaces, mitigates semantic drift, and sup-
ports plug-and-play modular training. Furthermore, cross-
lingual contamination can inflate reported performance (Yao
et al., 2024), prompting us to propose new real-world data
for fair comparison.
LLM-based Cross-lingual Reasoning Recent studies
(Cueva et al., 2024) have intensified attention to reason-
ing tasks in low-resource language settings (Kim et al.,
2023). MindMerger merges external multilingual under-
standing into LLMs and trains collaborative use of internal
reasoning and external language skills, substantially boost-
ing reasoning in non-English settings (Huang et al., 2024).
However, MindMerger relies heavily on parallel translation
corpora, which may introduce

Chunk 5 · 1,988 chars

3). MindMerger merges external multilingual under-
standing into LLMs and trains collaborative use of internal
reasoning and external language skills, substantially boost-
ing reasoning in non-English settings (Huang et al., 2024).
However, MindMerger relies heavily on parallel translation
corpora, which may introduce semantic drift due to trans-
lation noise. Process-level improvements—such as Chain-
of-Preference Optimization and Chain-of-Code—provide
transferable reasoning scaffolds that can be combined with
language anchoring strategies (Liu et al., 2024; Zhang et al.,
2024). Mechanism-focused studies reveal cross-lingual
knowledge barriers and even “cross-lingual collapse” of
reasoning traces toward dominant pretraining languages;
mitigation includes mixed-language finetuning, reward shap-
ing for language consistency, and representation steering to
strengthen non-English token representations (Chua et al.,
2024; Zhao et al., 2024; Park et al., 2025; Mahmoud et al.,
2025). Recent surveys systematize multilingual reasoning
evaluations and resources (Ghosh et al., 2025). These meth-
ods typically rely on a single embedding-space alignment
2

-- 2 of 26 --

Submission and Formatting Instructions for ICML 2026
primarily designed for reasoning tasks, and their perfor-
mance may not transfer well to broader settings such as
ranking and retrieval. In contrast, our proposed Arca min-
imizes translation-induced semantic drift, provides better
interpretability, and can be extended to support multiple
tasks—including reasoning, ranking, and retrieval—thereby
improving overall generalization.
3. Theoretical Foundations
3.1. Preliminaries
Setup. We study cross-lingual retrieval/reasoning under
low-resource inputs. Let X be the low-resource language
(LRL) sentence space and Y the English sentence space. For
each x ∈ X , an LLM-based translator T : X → Y produces
an observed translation y = T (x), and we introduce an (un-
observed) ideal translation y⋆ ∈ Y that is

Chunk 6 · 1,996 chars

study cross-lingual retrieval/reasoning under
low-resource inputs. Let X be the low-resource language
(LRL) sentence space and Y the English sentence space. For
each x ∈ X , an LLM-based translator T : X → Y produces
an observed translation y = T (x), and we introduce an (un-
observed) ideal translation y⋆ ∈ Y that is semantically
equivalent to x. We consider two representation paths into a
shared d-dimensional space: a trained multilingual encoder
g : X → Rd that embeds x directly into an English-aligned
semantic space, and an English encoder h : Y → Rd that
embeds English sentences. Our system forms the concate-
nated representation z(x) := [ g(x); h(y) ] ∈ R2d, which is
then fed into a downstream f (e.g., for retrieval, ranking or
reasoning). For analysis, we define the ideal reference rep-
resentation z⋆(x) := [ h(y⋆); h(y⋆) ] ∈ R2d, correspond-
ing to the target behavior where both paths agree on the
same English semantics; our goal is to bound the represen-
tation deviation ∥z(x) − z⋆(x)∥2 and the induced deviation
||f (z(x)) − f (z⋆(x))||2. See A.3 for theoretic explanation
for why we concanate two representation paths.
Assumption 1 (Semantic Anchoring). For all x ∈ X , the
mismatch between the anchor and the English encoding of
its translation is bounded:
∥g(x) − h(y∗)∥2 ≤ ϵ1, ϵ1 ≥ 0.
Assumption 2 (Translation fidelity). Let s be a latent se-
mantic variable with conditionals p(s | x) and p(s | y). The
translator T preserves semantics up to ϵ2 in KL:
DKL
p(s | x) ∥ p(s | T (x)) ≤ ϵ2, ϵ2 ≥ 0.
Definition 1 (RKHS representation). We model h(y) as the
kernel mean embedding (KME) of the semantic distribution
p(s | y) into an RKHS H induced by a positive–definite
kernel k:
h(y) = μp(s|y) := Es∼p(s|y)
φ(s), φ(s) := k(s, ·).
The kernel embedding maps any probability measure p over
the semantic space into the RKHS specified by k (i.e., μp =
Es∼p[φ(s)]). For bounded inputs, the kernel satisfies
0 < k(s, s) = k(s, ·), k(s, ·) H ≤ C2,
for some constant C > 0. See

Chunk 7 · 1,997 chars

positive–definite
kernel k:
h(y) = μp(s|y) := Es∼p(s|y)
φ(s), φ(s) := k(s, ·).
The kernel embedding maps any probability measure p over
the semantic space into the RKHS specified by k (i.e., μp =
Es∼p[φ(s)]). For bounded inputs, the kernel satisfies
0 < k(s, s) = k(s, ·), k(s, ·) H ≤ C2,
for some constant C > 0. See Appendix A.2 for details.
Definition 2 (Data-local Lipschitzness). On a finite discrete
domain, any encoder admits a (local) Lipschitz constant.
Concretely, over the dataset Ydata, we define the data-local
Lipschitz constant at y (with neighborhood radius δ) as
Lloc(y; δ) := max
y′∈Nδ (y)
fLLM(z) − fLLM(z⋆) 2
z − z⋆ 2
.
Here z = [ g(x) ; h(y) ], Nδ (y) = { y′ ∈ Ydata : 0 <
dtok(y, y′) ≤ δ } denotes the token-edit neighborhood (we
use δ = 1 in most experiments). We also report the empirical
q-quantile L(q)(δ), where q ∈ (0, 1) is the quantile level;
L(q)(δ) is the q-th empirical quantile of Lloc(y; δ) over y ∈
Ydata. ; for example, with q = 0.95 we observe L(0.95) ≈
0.034. See Appendix A.2 for details.
3.2. Theorem
Theorem (Representation deviation) Under Assumptions 1–
2 and Definitions 1–2, let z⋆ = [ h(y⋆); h(y⋆) ] and z =
[ g(x); h(y) ]. Then
∥z − z⋆∥2 ≤ ϵ1 + C √2 ϵ2 . (1)
Corollary (Downstream stability) For fLLM that is locally
Lipschitz with constant Lloc(y; δ) as in Definition 2, we
have
fLLM(z) − fLLM(z⋆) 2 ≤ Lloc(y; δ) ϵ1 + C√2 ϵ2
.
(2)
As ϵ1, ϵ2 → 0, we obtain
∥z − z⋆∥2 → 0 and ∥fLLM(z) − fLLM(z⋆)∥2 → 0.
The theorem and its corollary imply that, by minimizing ϵ1
and ϵ2, the model obtains high-fidelity and robust representa-
tions that effectively support downstream tasks. Full proofs
of the theorem and corollary are provided in Appendix A.1.
3.3. Theory assumptions and practical validity.
Our theoretical analysis does not require “ideal translations”
or “ideal semantic anchoring” to hold in practice. Instead, it
only models two ubiquitous sources of error in cross-lingual
alignment: (i) representation mapping error between an
ideal target

Chunk 8 · 1,985 chars

in Appendix A.1.
3.3. Theory assumptions and practical validity.
Our theoretical analysis does not require “ideal translations”
or “ideal semantic anchoring” to hold in practice. Instead, it
only models two ubiquitous sources of error in cross-lingual
alignment: (i) representation mapping error between an
ideal target vector z∗ and the observed representation z, and
(ii) translation-induced noise. Here, z∗ is introduced purely
as a mathematical reference for z—not as a strict real-world
prerequisite—and the assumptions are used to derive ob-
jectives that explicitly target semantic drift and vector-level
misalignment. Empirically, we observe that the gap between
z and z∗ (measured by similarity) decreases as training pro-
ceeds, and the loss consistently drops across different Arca
training tasks, supporting the practical relevance of our the-
oretical formulation. See D.2 for empirically measurement
during training process.
3

-- 3 of 26 --

Submission and Formatting Instructions for ICML 2026
T1: which quality
of …?
T2: which
attribute of …?
T3: which feature
of the Giza…?
T4: which
characteristic…?
Translators(<4B)
Llama-3.2-1B
Other MT Model
Qwen3-1.7B
Deepseek-R1-
Distill-Qwen-1.5B
Candidates
LLM Embedding	Multi-lingual Encoder
E_lr 	E_en
LLM Critic
Semantic Fidelity
Emotional
Consistency
Pragmatic Tone
Embeds Critic
MLP
E_en1
E_lr
E_en4
E_en2
E_en3
Adaptor
Pooling
Actor Model
✅
Based on the information in the
excerpt, which attribute of the
Pyramids at Giza would the
Egyptians of the New Kingdom
likely not have found surprising?
उता$यातील मा*हतीनुसार,
1गझा 4परॅ7म8स9या
कोण=या गुणव?ेवर Aयू
Cकंगडम इिजिIशयन
लोकांना आLचयN वाटPयाची
शQयता नRहती?
Figure 2. An overview of Arca.
4. Method
4.1. Arca
Let X be a low–resource source space and Y the English
space. For any x ∈ X , a translator T : X → Y yields
y = T (x). We introduce two representation paths: an
anchoring map g : X → Rd that lands source sentences
directly in an “English semantic” space, and an

Chunk 9 · 1,991 chars

नRहती?
Figure 2. An overview of Arca.
4. Method
4.1. Arca
Let X be a low–resource source space and Y the English
space. For any x ∈ X , a translator T : X → Y yields
y = T (x). We introduce two representation paths: an
anchoring map g : X → Rd that lands source sentences
directly in an “English semantic” space, and an English
encoder h : Y → Rd. We concatenate z = [g(x); h(y)] and
score it with an LLM critic. Arca aims to reduce the two
terms appearing in our generalization bound (Sec. 3.2): the
anchoring error ϵ1 and the translation distortion ϵ2. Arca
(Figure 2) comprises three modules: (i) a Translation Critic
that judges candidates with semantic/emotional/pragmatic
scores; (ii) an Embedding Critic that anchors feature paths
to translation paths via a regression-style penalty; and (iii)
an Actor trained with policy gradients that fuses both critics.
4.1.1. FEATURE ANCHORING (MINIMIZING ϵ1 )
Given x ∈ X with observed translation y = T (x) ∈ Y,
we align the multilingual path g(x) and the English path
h(y) by resolving tokenizer-length and dimensional mis-
matches via temporal pooling and a shared Adaptor. Let
gtok(x) ∈ RLx×dg and htok(y) ∈ RLy ×dh denote the token-
level streams before sentence pooling. We apply a temporal
pooling operator with Sfeat bins (a hyperparameter) to obtain
fixed-length sequences:
G(x) = PSfeat
gtok(x) ∈ RSfeat×dg ,
H(y) = PSfeat
htok(y) ∈ RSfeat×dh . (3)
A single shared Adaptor A(·) maps either side into a com-
mon d-dimensional space (with internal projections when
dg̸ = dh). We then pool along the temporal axis to form
sentence-level vectors:
Elr = poolA(G(x)) ∈ Rd,
Een = poolA(H(y)) ∈ Rd. (4)
The feature-anchoring objective contracts the path discrep-
ancy by cosine alignment:
Lanchor = 1 − cosElr , Een
. (5)
Minimizing Eq. (5) reduces the anchoring radius ϵ1 af-
ter length normalization (Eq. (3)) and feature alignment
(Eq. (4)).
4.1.2. TRANSLATION CRITIC (MINIMIZING ϵ2 )
Given a source x and its candidate set {yk}K
k=1,

Chunk 10 · 1,997 chars

oring objective contracts the path discrep-
ancy by cosine alignment:
Lanchor = 1 − cosElr , Een
. (5)
Minimizing Eq. (5) reduces the anchoring radius ϵ1 af-
ter length normalization (Eq. (3)) and feature alignment
(Eq. (4)).
4.1.2. TRANSLATION CRITIC (MINIMIZING ϵ2 )
Given a source x and its candidate set {yk}K
k=1, a
lightweight LLM judge produces three calibrated scores
sk, ek, pk ∈ [1, 10] for semantic fidelity, emotional consis-
tency, and pragmatic tone. We collect
rk = [ sk, ek, pk ]⊤.
Role for ϵ2. These scores probe adequacy and well-
formedness of yk with respect to x, serving as a proxy
for small semantic divergence between p(s | x) and p(s | yk);
maximizing their contribution in the policy drives smaller
ϵ2.
4

-- 4 of 26 --

Submission and Formatting Instructions for ICML 2026
उता$यातील मा*हतीनुसार, 1गझा
4परॅ7म8स9या कोण=या गुणव?ेवर Aयू
Cकंगडम इिजिIशयन लोकांना आLचयN
वाटPयाची शQयता नRहती?
Large Language Model
System Prompt:
Given a question,
retrieve passages
that answer the
question:
Based on the information in the excerpt,
which attribute of the Pyramids at Giza
would the Egyptians of the New
Kingdom likely not have found surprising?
LLM Encoder 	Multi-lingual Encoder
LLM Transformer
➕	E_en 	E_lr
out
query
documents
CorrQueue
Gold 	Pred
Rank Task FIFO Queue
Retrieval Task FIFO Queue
DocQueue
Doc 	Lang 	Doc Embeds
Training steps
(a) An overview of LaSR—given any-language input, shared encoders produce English
and multilingual features that are fused by a lightweight transformer into a single normalized
embedding (q/d) for retrieval and ranking.
(b) Two FIFO buffers: CorrQueue caches (pred, gold) for correlation-
based objectives; DocQueue caches (doc id, lang, embed) for listwise
nDCG with in-language negatives.)
Figure 3. An overview of LaSR.
4.1.3. OVERALL OBJECTIVE
For each candidate we form the policy feature by concate-
nating the critic scores with the adaptor similarity:
ck = [ sk, ek, pk, simk ]⊤. (6)
A small MLP produces logits gϕ(ck) and the

Chunk 11 · 1,993 chars

caches (doc id, lang, embed) for listwise
nDCG with in-language negatives.)
Figure 3. An overview of LaSR.
4.1.3. OVERALL OBJECTIVE
For each candidate we form the policy feature by concate-
nating the critic scores with the adaptor similarity:
ck = [ sk, ek, pk, simk ]⊤. (6)
A small MLP produces logits gϕ(ck) and the policy πϕ(k |
c1:K ) = softmax([gϕ(c1), . . . , gϕ(cK )])k. We use the
composite reward
Rk = 0.1 · (αsk + βek + γpk) + δ simk, (7)
sample a ∼ πϕ, and optimize with REINFORCE:
LRL = − Ea∼πϕ [Ra] ≈ − log πϕ(a | c1:K ) · Ra. (8)
The full Arca objective is
L = LRL + η Lanchor. (9)
Here, Lanchor reduces the anchoring error ϵ1, while LRL
favors low-distortion candidates, shrinking ϵ2—jointly tight-
ening the bound in Sec. 3.2.
4.2. LaSR
Given any-language text, a multilingual encoder yields Elr
while a shared English encoder (prompted LLM encoder)
yields Een. Elr denotes the text embedding in the low-
resource language, and Een denotes the embedding of the
corresponding English text. We fuse Een and Elr into a
single ℓ2-normalized embedding used for all ranking, re-
trieval and reasoning tasks. Training is supported by two
FIFO buffers: (i) CorrQueue for correlation-based objec-
tives under small batches, and (ii) DocQueue for listwise
nDCG with in-language negatives. Both queues are stop-
grad for cached entries and are updated in FIFO manner
with a maximum size K.
Encoders. Each branch follows “tokenize → encoder →
pooling”: Een and Elr, as shwon in Eq. (4). Pooling is fixed
and the two streams are concatenated and linearly projected
before entering the LLM Transformer. The transformer
attends across the two streams and returns a fused hidden
vector z, which is then normalized ˆz = z/∥z∥.
Training steps (Figure 3b). Each optimization step pro-
cesses two mini-batches: (1) query batch and (2) document
batch. We forward them through the shared encoders and
the LLM Transformer to obtain {ˆzq } and {ˆzd} and compute
s(q, d) for the task at hand. Two FIFO

Chunk 12 · 1,996 chars

idden
vector z, which is then normalized ˆz = z/∥z∥.
Training steps (Figure 3b). Each optimization step pro-
cesses two mini-batches: (1) query batch and (2) document
batch. We forward them through the shared encoders and
the LLM Transformer to obtain {ˆzq } and {ˆzd} and compute
s(q, d) for the task at hand. Two FIFO buffers are updated
in the background:
• CorrQueue (Rank task). Caches tuples (Pred, Gold)
produced on ranking datasets (e.g., STS). At step t we
concatenate the current predictions/labels with up to K
cached pairs (detached, no gradient) and compute a cor-
relation loss: LcorrQ = α (1 − Pearson) + (1 − α) (1 −
Soft-Spearman). See Appendix D.4 for details.
• DocQueue (Retrieval task). Caches (doc id, language,
ˆzd) from recent steps for in-language hard-negative min-
ing. Given a query, we form a candidate list (1 posi-
tive + mined negatives), compute differentiable ranks
and a soft nDCG@k objective Lndcg = 1 − nDCG@k.
To avoid mislabeled near negatives, we apply a tem-
peratured down-weighting on very similar non-positives
(“safe negatives”), and add two light regularizers (top-
1 hinge, mean/variance control) to stabilize training:
Lretr = Lndcg + λhLhinge + λr Lmv. See Appendix D.4
for details.
4.3. Motivation and Discussion.
A key challenge in low-resource multilingual reasoning is
the representation imbalance across languages: modern
LLMs typically acquire substantially stronger reasoning ca-
pability in high-resource languages (notably English), where
reasoning-oriented supervision is abundant, while compa-
rable signals are scarce in many low-resource languages.
Under this data constraint, our goal is to make the most of
5

-- 5 of 26 --

Submission and Formatting Instructions for ICML 2026
Table 1. Evaluation on our new LazRetrieval dataset. bold indicates the best result; underlined indicates the second best.
METHOD PARAMS BD ID MY PK TH PH VN AVG.
SENTENCE-T5-XXL 2021 4.8B 34.11 71.77 49.19 27.84 23.58 84.20 28.61 44.56
GTR-XXL 2021 4.8B

Chunk 13 · 1,989 chars

5 of 26 --

Submission and Formatting Instructions for ICML 2026
Table 1. Evaluation on our new LazRetrieval dataset. bold indicates the best result; underlined indicates the second best.
METHOD PARAMS BD ID MY PK TH PH VN AVG.
SENTENCE-T5-XXL 2021 4.8B 34.11 71.77 49.19 27.84 23.58 84.20 28.61 44.56
GTR-XXL 2021 4.8B 34.85 75.92 49.36 30.39 22.94 84.94 46.15 48.17
SIMCSE 2021 330M 31.76 66.71 43.27 27.68 32.32 74.00 44.16 44.88
CONTRIEVER 2022 110M 39.95 74.95 48.90 35.71 15.43 83.74 64.75 51.00
GTE-LARGE 2023 335M 39.36 77.59 51.88 36.76 17.21 87.65 65.34 52.61
BGE-EN-V1.5 2023 335M 41.06 78.78 53.28 37.52 18.35 88.18 68.72 54.09
E5-LARGE-V2 2024 335M 41.04 78.78 53.77 37.22 17.63 87.24 61.31 52.84
E5-MISTRAL-7B 2024 7.24B 48.27 75.43 71.01 53.62 61.75 83.18 65.44 64.51
QWEN3-E-0.6B 2025 0.6B 38.36 63.95 62.37 40.73 55.28 74.38 59.46 56.31
WITH LIRA-LARGE OURS 8.5B 48.60 74.43 71.26 49.84 66.39 83.90 70.67 66.44
QWEN3-E-4B 2025 4B 49.81 68.92 71.45 49.30 73.39 78.47 68.31 65.66
WITH LIRA-LARGE OURS 11.9B 56.24 75.29 77.03 55.69 77.72 84.81 74.92 71.67
QWEN3-E-8B 2025 8B 50.59 76.01 73.36 50.37 69.67 84.78 71.56 68.05
WITH LIRA-BASE OURS 14.4B 52.91 76.59 75.20 52.30 71.11 85.92 73.23 69.61
WITH LIRA-LARGE OURS 15.9B 57.70 77.33 77.93 56.67 78.52 86.03 75.86 72.86
WITH LIRA-MAX OURS 103B 66.30 78.53 81.54 68.53 83.12 87.44 78.48 77.71
Table 2. Evaluation on main public retrieval datasets, and the
effect of adding LiRA-Max to representative embedding back-
bones. bold indicates the best; underlined indicates the second
best. Column abbreviations: MLQA-R = MLQARetrieval, Bele-R
= BelebeleRetrieval.
METHOD MLQA-R BELE-R STS22 AVG.
SIMCSE 2021 7.41 18.35 37.95 21.24
ST5-XXL 2021 20.82 41.68 59.02 40.51
GTR-XXL 2021 20.19 38.02 60.11 39.44
CONTRIEVER 2022 9.75 22.94 41.72 24.80
GTE-LARGE 2023 16.99 31.82 53.79 34.20
,→ WITH LIRA 21.43 38.29 59.17 39.63
BGE-EN-1.5 2023 16.64 31.19 50.77 32.87
,→ WITH LIRA 21.01 35.64 53.94 36.83
E5-LARGE 2024 17.04 31.12 54.31

Chunk 14 · 1,993 chars

.41 18.35 37.95 21.24
ST5-XXL 2021 20.82 41.68 59.02 40.51
GTR-XXL 2021 20.19 38.02 60.11 39.44
CONTRIEVER 2022 9.75 22.94 41.72 24.80
GTE-LARGE 2023 16.99 31.82 53.79 34.20
,→ WITH LIRA 21.43 38.29 59.17 39.63
BGE-EN-1.5 2023 16.64 31.19 50.77 32.87
,→ WITH LIRA 21.01 35.64 53.94 36.83
E5-LARGE 2024 17.04 31.12 54.31 34.16
E5-MISTRAL 2024 31.54 54.75 71.37 52.55
,→ WITH LIRA 34.23 57.24 76.55 56.01
LUSIFER 2025 36.68 57.81 70.49 54.99
QWEN3-E-8B 2025 81.13 85.94 71.64 79.57
,→ WITH LIRA 82.01 87.03 75.00 81.35
the strong reasoning capabilities that LLMs have acquired
in high-resource languages and transfer them to underrepre-
sented languages in a robust and analyzable manner.
LiRA addresses the imbalance by explicitly modeling and
shrinking two ubiquitous sources of cross-lingual shift that
degrade downstream retrieval and reasoning: (i) semantic
drift caused by translation noise and linguistic variation,
and (ii) representation mapping error between multilin-
gual and English embedding spaces. Arca reduces both
shifts via anchor-based alignment and critic–actor training,
which aims to mitigate translation-induced noise rather than
propagate it. Importantly, LiRA is plug-and-play: it can be
attached to different pretrained backbones with lightweight
fine-tuning, and it does not assume ideal translations in
practice. On top of Arca, LaSR performs a two-way cou-
pling between multilingual and English representations with
queue-based objectives, balancing cross-lingual alignment
and task-driven discrimination so that English-centric rea-
soning signals can be leveraged without destabilizing multi-
lingual embeddings.
5. Experiment
5.1. Experimental Details
All experiments are conducted on a single server with 4×
A100-80GB GPUs. We evaluate LiRA on three families
of tasks to assess generality: (i) retrieval, measured by
nDCG@10; (ii) sentence ranking, measured by Pearson
correlation; and (iii) reading comprehension and mathemat-
ical reasoning, both measured by

Chunk 15 · 1,984 chars

l Details
All experiments are conducted on a single server with 4×
A100-80GB GPUs. We evaluate LiRA on three families
of tasks to assess generality: (i) retrieval, measured by
nDCG@10; (ii) sentence ranking, measured by Pearson
correlation; and (iii) reading comprehension and mathemat-
ical reasoning, both measured by accuracy. For retrieval
and sentence ranking we use Qwen3-Embedding-8B as the
backbone encoder. For reading comprehension and mathe-
matical reasoning we use Qwen3-8B as the backbone model.
Following Sec. 3.2, we prefer backbones with a smaller Lips-
chitz constant L to tighten error propagation; we thus select
a comparatively more stable LLM as the backbone (see
Appendix A.2 and Table 7). Model choices, full hyperpa-
rameters and additional implementation details are reported
in Appendix C and D, including Arca translation selection
strategy (C.1), prompt engineering (C.2) and bad case anal-
ysis (C.3). As discussed in D.3, we analyze training and
inference efficiency in detail. Despite adding parameters,
our method incurs no noticeable increase in wall-clock train-
ing or inference time.
Due to potential discrepancies in pretraining data, we en-
sured fairness by fine-tuning all models used in our ex-
6

-- 6 of 26 --

Submission and Formatting Instructions for ICML 2026
Table 3. MGSM accuracy (%). bold indicates the best; underlined indicates the second best.
METHOD 	PARAMS YEAR BN TH SW JA ZH DE FR RU ES EN AVG.
MONOREASON 7B 2024 6.8 7.2 6.8 36.4 38.4 55.2 54.4 52.0 57.2 68.8 38.3
MULTIREASON-LORA 7B 2022 29.6 35.2 28.0 52.0 54.8 59.6 58.4 62.4 59.6 64.8 50.4
MULTIREASON-SFT 7B 2024 33.2 40.0 42.0 42.0 42.0 45.2 44.8 45.2 48.0 52.0 43.4
QALIGN 	7B 2024 39.6 40.4 44.0 44.0 48.4 54.8 56.8 52.4 59.6 68.0 49.6
LANGBRIDGE 	7B 2024 42.8 50.4 43.2 40.0 45.2 56.4 50.8 52.4 58.0 63.2 50.2
TRANSLATE-EN 3.3B 2023 48.4 37.6 37.6 49.2 46.8 60.4 56.4 47.6 59.6 65.5 50.6
MINDMERGER-HARD 10.7B 2024 46.0 36.0 48.4 52.4 54.4 60.4 56.0 60.4 62.0 71.2

Chunk 16 · 1,992 chars

5.2 48.0 52.0 43.4
QALIGN 	7B 2024 39.6 40.4 44.0 44.0 48.4 54.8 56.8 52.4 59.6 68.0 49.6
LANGBRIDGE 	7B 2024 42.8 50.4 43.2 40.0 45.2 56.4 50.8 52.4 58.0 63.2 50.2
TRANSLATE-EN 3.3B 2023 48.4 37.6 37.6 49.2 46.8 60.4 56.4 47.6 59.6 65.5 50.6
MINDMERGER-HARD 10.7B 2024 46.0 36.0 48.4 52.4 54.4 60.4 56.0 60.4 62.0 71.2 54.7
MINDMERGER-SOFT 10.7B 2024 50.4 52.8 57.2 54.4 53.6 61.2 57.6 60.8 58.4 66.8 57.3
QWEN3-8B 	8B 2025 66.4 69.3 59.1 64.5 67.9 71.0 69.3 72.5 75.1 78.9 69.4
WITH LIRA-LARGE 15.9B OURS 69.6 72.5 61.9 69.1 70.3 69.3 69.8 73.9 75.0 79.2 71.1
Table 4. X-CSQA accuracy (%). bold indicates the best; underlined indicates the second best.
METHOD SW UR HI AR VI JA PL ZH NL RU IT DE PT FR ES EN AVG.
TRANSLATE-EN 2023 36.5 41.3 48.4 44.6 51.8 47.1 53.3 51.5 55.0 56.3 57.3 54.7 57.2 55.5 71.3 71.3 52.3
MULTIREASON-LORA 2022 25.1 32.0 39.2 42.2 56.6 55.9 60.6 62.2 61.3 62.8 66.3 64.9 66.2 67.4 67.7 79.3 56.9
MULTIREASON-SFT 2024 27.6 29.2 32.0 28.7 38.8 38.7 45.5 43.8 45.9 46.5 50.2 49.1 51.2 52.1 54.3 67.2 43.8
MONOREASON 2024 24.2 25.1 32.9 32.3 50.9 49.1 50.6 56.5 57.5 56.0 56.0 61.2 61.7 63.5 64.0 76.3 51.3
QALIGN 2024 35.1 32.6 37.8 36.3 50.5 49.2 57.1 54.8 56.3 58.3 58.3 58.8 59.8 60.3 63.1 75.7 52.3
LANGBRIDGE 2024 31.8 30.5 30.6 30.6 33.3 33.9 39.8 39.8 38.4 39.1 37.4 36.4 33.8 38.2 38.8 44.4 36.1
MINDMERGER-HARD 2024 33.1 29.9 40.4 37.7 52.9 49.9 54.7 55.4 58.0 59.7 58.6 61.9 62.5 63.6 75.2 75.2 53.1
MINDMERGER-SOFT 2024 45.5 46.2 48.4 51.4 60.6 53.9 63.3 62.9 63.8 66.8 67.0 67.1 68.1 69.1 75.2 78.1 61.0
QWEN3-8B 2025 35.7 51.6 52.8 60.9 63.0 59.3 62.5 66.6 64.7 64.1 67.6 66.9 68.2 69.8 70.1 82.8 62.9
WITH LIRA-LARGE OURS 40.8 52.7 55.6 63.9 65.0 61.3 64.2 68.3 67.9 66.3 69.7 70.8 72.0 70.7 74.6 84.2 65.5
periments on each dataset with identical training hyper-
parameters before evaluation. Interestingly, we observed
that fine-tuning Qwen3 brought no gains on existing public
datasets. This phenomenon may be attributed to Qwen3
having already seen

Chunk 17 · 1,995 chars

5.6 63.9 65.0 61.3 64.2 68.3 67.9 66.3 69.7 70.8 72.0 70.7 74.6 84.2 65.5
periments on each dataset with identical training hyper-
parameters before evaluation. Interestingly, we observed
that fine-tuning Qwen3 brought no gains on existing public
datasets. This phenomenon may be attributed to Qwen3
having already seen portions of these public datasets during
its pretraining phase. For the critic, the weights are set to
α, β, γ, δ = 0.4, 0.4, 0.3, 1.0.
We instantiate LiRA at three parameter scales. LiRA-Base
uses three lightweight MT models—OPUS-MT, m2m100-
418M, and nllb-200-600M—together with Qwen3-1.7B,
which also serves as the critic. LiRA-Large upgrades
the translator set to Qwen3-1.7B, DeepSeek-R1-Distill-
Qwen-1.5B, and Llama3.2-1B-Instruct, with Qwen3-1.7B
again acting as the critic model. Finally, LiRA-Max
employs three large LLMs—Qwen3-32B, DeepSeek-R1-
Distill-Qwen-32B, and Gemma-2-27B—where Qwen3-32B
is used as the critic. See D.1 for more detailed information.
Our analysis predicts that two noise can be amplified by the
backbone’s local sensitivity (captured by a local Lipschitz-
type factor). We estimate Lemp(δ) for Qwen3-Embedding
backbones (Tab. 7) and observe that larger models are con-
sistently more stable. This stability ranking matches the
retrieval trend on LazRetrieval (Tab. A.2): performance in-
creases from Qwen3-E-0.6B→4B→8B, and LiRA built on
more stable backbones yields larger gains (LiRA-Large on
4B/8B > on 0.6B; LiRA-Max further improves the aver-
age). Overall, Tab. 7–A.2 provide empirical evidence that
reducing local sensitivity mitigates error amplification and
improves downstream retrieval.
5.2. Datasets
We use standard public datasets: BelebeleRetrieval (Ban-
darkar et al., 2024), MLQARetrieval (Enevoldsen et al.,
2025), STS22 (Enevoldsen et al., 2025), MGSM (Shi et al.,
2022), and X-CSQA (Lin et al., 2021). The first two are
retrieval benchmarks, STS22 evaluates sentence-level corre-
lation, and MGSM/X-CSQA assess mathematical

Chunk 18 · 1,993 chars

dard public datasets: BelebeleRetrieval (Ban-
darkar et al., 2024), MLQARetrieval (Enevoldsen et al.,
2025), STS22 (Enevoldsen et al., 2025), MGSM (Shi et al.,
2022), and X-CSQA (Lin et al., 2021). The first two are
retrieval benchmarks, STS22 evaluates sentence-level corre-
lation, and MGSM/X-CSQA assess mathematical reasoning
and reading comprehension, respectively. As we have ob-
served above that contamination in cross-lingual training
data may inflate the reported performance, we addition-
ally introduce a new real-world dataset to enable a fair
comparison. We release a de-identified e-commerce re-
trieval dataset, LazRetrieval, and a larger companion set,
LazRetrieval-mega for further research. Both cover seven
Southeast-Asian languages: Vietnamese (Vi), Thai (Th),
Indonesian (Id), Malay (Ms), Urdu (Ur), Bengali (Bn), and
Filipino/Tagalog (Ph). LazRetrieval contains 10 k examples
per language; LazRetrieval-mega contains 1,000 k exam-
ples per language and is intended for pretraining/supporting
large-scale adaptation. Unless otherwise noted, our exper-
iments use LazRetrieval. Since MGSM and X-CSQA have
no training split in our setup, we evaluate in a zero-shot
fashion: the Arca is trained on BelebeleRetrieval, while
the lightweight LaSR head is left untrained for these tasks.
Detailed information about the datasets can be found in B.
5.3. Results Analysis
Retrieval & sentence ranking. On public benchmarks (Ta-
ble 2), LiRA consistently improves the base model (Qwen3-
E-8B) on all three metrics: MLQARetrieval 82.01 vs. 81.13
7

-- 7 of 26 --

Submission and Formatting Instructions for ICML 2026
(+0.88), BelebeleRetrieval 87.03 vs. 85.94 (+1.09), and
STS22 75.00 vs. 71.64 (+3.36), yielding a higher macro
average 81.35 (+1.78). On our new LazRetrieval-70K (Ta-
ble 1), Qwen3-Embedding with LiRA-Large also improves
the average from 68.05 to 72.86 (+4.81). The gains are par-
ticularly pronounced on relatively low-resource locales (e.g.,
pk +6.40, vn +4.30, my

Chunk 19 · 1,998 chars

09), and
STS22 75.00 vs. 71.64 (+3.36), yielding a higher macro
average 81.35 (+1.78). On our new LazRetrieval-70K (Ta-
ble 1), Qwen3-Embedding with LiRA-Large also improves
the average from 68.05 to 72.86 (+4.81). The gains are par-
ticularly pronounced on relatively low-resource locales (e.g.,
pk +6.40, vn +4.30, my +4.57), suggesting that anchoring
g(x) to the English space and the LaSR head together re-
duce translation/representation noise. We also observe better
rank-correlation, matching our design of queue-augmented
CorrQ and listwise soft-nDCG.
Scaling behaviour on LazRetrieval. On LazRetrieval, we
observe a consistent scaling trend across Qwen3 embed-
ding backbones of different sizes (Tab. 1). For the smallest
backbone (Qwen3-E-0.6B), attaching LiRA-Large yields
substantial gains on low-resource languages such as Bd and
Pk (e.g., +10.24 and +9.11 absolute points, respectively),
indicating that LiRA can effectively compensate for limited
model capacity. A similar pattern holds for Qwen3-E-4B,
where LiRA-Large improves Bd from 49.81 to 56.24 and
Pk from 49.30 to 55.69. For the full 8B backbone, LiRA
forms a smooth performance–capacity curve: the average
score increases from 68.05 to 69.61 with LiRA-Base, 72.86
with LiRA-Large, and 77.71 with LiRA-Max, accompa-
nied by consistent improvements on all seven LazRetrieval
languages. These results demonstrate that LiRA provides
monotonic benefits across parameter scales and is particu-
larly effective in enhancing smaller cross-lingual encoders.
Mathematic. On MGSM (Table 3), LiRA brings a small
but consistent gain over Qwen3-8B: macro avg 69.4 vs.
71.1 (+1.7). Per-language analysis shows improvements or
matches on 9/11 languages (e.g., Bn/Th/Zh), while perfor-
mance on several high-resource languages remains compara-
ble (De, Es). This indicates that the concatenated represen-
tation [g(x); h(y)] improves reasoning robustness without
compromising performance on high-resource languages.
Comprehension. On X-CSQA

Chunk 20 · 1,997 chars

nts or
matches on 9/11 languages (e.g., Bn/Th/Zh), while perfor-
mance on several high-resource languages remains compara-
ble (De, Es). This indicates that the concatenated represen-
tation [g(x); h(y)] improves reasoning robustness without
compromising performance on high-resource languages.
Comprehension. On X-CSQA (Table 4), LiRA outper-
forms Qwen3-8B on 15/16 languages and raises the macro
average from 62.9 to 65.5 (+2.6). Improvements concen-
trate on lower-resource or typologically distant languages
(Ur/Hi/Ar/Vi/Zh), consistent with our motivation that con-
catenating g(x) and h(y) adds complementary information
and mitigates information bottlenecks.
Cross-backbone robustness. Table 2 also evaluates LiRA
as a pluggable module on three representative encoders.
Across MLQA Retrieval, BelebeleRetrieval, and STS22,
LiRA consistently improves over the corresponding back-
bones. The averaged gains over the three tasks are positive
for all backbones tested, suggesting that the effect is not tied
to a single encoder family.
Table 5. Ablation study of LiRA on retrieval, sentence ranking,
and reasoning tasks.
METHOD NDCG@10 PEARSON ACC.
LIRA (FULL) 77.71 75.00 71.1
,→ – LLM CRITIC 71.29 72.19 68.9
,→ – EMBEDS CRITIC 65.77 61.78 67.3
,→ – TRANSLATIONS 75.48 74.39 70.5
,→ – MULTI-ENC 75.59 72.43 69.5
,→ – FIFO QUEUE 64.29 69.82 –
5.4. Ablation
The ablation results in Table 5 demonstrate the contribution
of each component in LiRA. Removing the LLM Critic or
Embeds Critic leads to the most significant performance
drop, particularly on Pearson correlation and accuracy, high-
lighting the importance of dual-level critics for effective
supervision. The translation and multilingual encoder mod-
ules also provide consistent gains, showing their role in
enhancing cross-lingual generalization. Finally, eliminating
the FIFO loss queue results in the largest degradation on
nDCG@10, confirming its necessity in stabilizing optimiza-
tion. Overall, each component is essential, and

Chunk 21 · 1,999 chars

ation and multilingual encoder mod-
ules also provide consistent gains, showing their role in
enhancing cross-lingual generalization. Finally, eliminating
the FIFO loss queue results in the largest degradation on
nDCG@10, confirming its necessity in stabilizing optimiza-
tion. Overall, each component is essential, and their synergy
ensures the robustness of LiRA across tasks. In our ablation
study, the three reported metrics are obtained on different
tasks: nDCG@10 is evaluated on LazRetrieval, Pearson is
evaluated on STS22, and Acc. is evaluated on MGSM.
5.5. Supplementary Experiments
To offer additional insight into the practical behavior of
our framework, we provide supplementary analyses in the
appendix. In particular, we report (i) an empirical break-
down of ARCA’s translation candidate selections across
datasets (Appendix C, Figure 10), which helps reveal po-
tential evaluator-induced preferences; (ii) estimates of the
RKHS boundedness surrogate b	C across backbone scales
(Appendix A, Table 6); and (iii) data-local Lipschitz statis-
tics Lemp(δ) under token-level perturbations (Appendix A,
Table 7). Together, these results provide practical guid-
ance for instantiating our theoretical bounds and interpreting
model behavior beyond the primary benchmarks.
6. Conclusion
We proposed LiRA, a framework for robust multilingual
LLM adaptation that unifies retrieval, sentence ranking,
and reasoning tasks under a common anchoring principle.
By combining anchored representations with critic-guided
alignment and queue-based objectives, LiRA consistently
improves over strong Qwen3 baselines across both public
benchmarks and our newly introduced LazRetrieval dataset.
Ablation studies further validate the complementary con-
tributions of each component. We hope our dataset and
framework can inspire future work on multilingual LLM
adaptation.
8

-- 8 of 26 --

Submission and Formatting Instructions for ICML 2026
References
Artetxe, M., Goswami, V., Bhosale, S., Fan, A., and

Chunk 22 · 1,987 chars

dataset.
Ablation studies further validate the complementary con-
tributions of each component. We hope our dataset and
framework can inspire future work on multilingual LLM
adaptation.
8

-- 8 of 26 --

Submission and Formatting Instructions for ICML 2026
References
Artetxe, M., Goswami, V., Bhosale, S., Fan, A., and Zettle-
moyer, L. Revisiting machine translation for cross-lingual
classification. arXiv preprint arXiv:2305.14240, 2023.
Ataman, D., Birch, A., Habash, N., Federico, M., Koehn,
P., and Cho, K. Machine translation in the era of large
language models: a survey of historical and emerging
problems. Information, 16(9):723, 2025.
Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla,
S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer,
L., and Khabsa, M. The belebele benchmark: a parallel
reading comprehension dataset in 122 language variants.
In Proceedings of the 62nd Annual Meeting of the Asso-
ciation for Computational Linguistics (Volume 1: Long
Papers), pp. 749–775, 2024.
Beinborn, L. and Choenni, R. Semantic drift in multilingual
representations. Computational Linguistics, 46(3):571–
603, 2020.
Chua, L. et al. Crosslingual capabilities and knowledge
barriers in large language models. OpenReview: BCyAl-
Moyx5, 2024. ICLR 2025 submission.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V.,
Wenzek, G., Guzm ´an, F., Grave, E., Ott, M., Zettlemoyer,
L., and Stoyanov, V. Unsupervised cross-lingual represen-
tation learning at scale. In Proceedings of the 58th annual
meeting of the association for computational linguistics,
pp. 8440–8451, 2020.
Cueva, E., Monroy, A. L., S ´anchez-Vega, F., and Solorio,
T. Adaptive cross-lingual text classification through in-
context one-shot demonstrations. In Proceedings of the
2024 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Lan-
guage Technologies (Volume 1: Long Papers), pp. 8317–
8335, 2024.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.

Chunk 23 · 1,996 chars

ingual text classification through in-
context one-shot demonstrations. In Proceedings of the
2024 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Lan-
guage Technologies (Volume 1: Long Papers), pp. 8317–
8335, 2024.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert:
Pre-training of deep bidirectional transformers for lan-
guage understanding. In Proceedings of the 2019 confer-
ence of the North American chapter of the association for
computational linguistics: human language technologies,
volume 1 (long and short papers), pp. 4171–4186, 2019.
Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur,
A., Stap, D., Gala, J., Siblini, W., Krzemi ´nski, D., Winata,
G. I., et al. Mmteb: Massive multilingual text embedding
benchmark. arXiv preprint arXiv:2502.13595, 2025.
Ghosh, A., Dutta, D., Saha, S., and Agarwal, C. A survey of
multilingual reasoning in language models. Findings of
the Association for Computational Linguistics: EMNLP,
2025:8920–8936, 2025.
Gore, J. et al. Crossmath: Towards cross-lingual math
information retrieval. In ACM Digital Libraries, 2024.
Haddow, B., Bawden, R., Miceli-Barone, A. V., Helcl, J.,
and Birch, A. Survey of low-resource machine translation.
Computational Linguistics, 48(3):673–732, 2022.
Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., and
Johnson, M. Xtreme: A massively multilingual multi-task
benchmark for evaluating cross-lingual generalisation. In
International conference on machine learning, pp. 4411–
4421. PMLR, 2020.
Huang, Z., Zhu, W., Cheng, G., Li, L., and Yuan, F. Mind-
merger: Efficiently boosting llm reasoning in non-english
languages. Advances in Neural Information Processing
Systems, 37:34161–34187, 2024.
Kim, S., Ki, D., Kim, Y., and Lee, J. Cross-lingual qa: A
key to unlocking in-context cross-lingual performance,
2023.
Li, H. et al. Cross-lingual contextualized phrase retrieval.
In EMNLP Findings, 2024.
Lin, B. Y., Lee, S., Qiao, X., and Ren, X.

Chunk 24 · 1,998 chars

dvances in Neural Information Processing
Systems, 37:34161–34187, 2024.
Kim, S., Ki, D., Kim, Y., and Lee, J. Cross-lingual qa: A
key to unlocking in-context cross-lingual performance,
2023.
Li, H. et al. Cross-lingual contextualized phrase retrieval.
In EMNLP Findings, 2024.
Lin, B. Y., Lee, S., Qiao, X., and Ren, X. Common sense
beyond english: Evaluating and improving multilingual
language models for commonsense reasoning. arXiv
preprint arXiv:2106.06937, 2021.
Liu, W., Trenous, S., Ribeiro, L. F., Byrne, B., and Hieber,
F. Xrag: Cross-lingual retrieval-augmented generation.
arXiv preprint arXiv:2505.10089, 2025.
Liu, Y. et al. Improving chain-of-thought reasoning in llms
via chain of preference optimization. In NeurIPS, 2024.
Mahmoud, O., Semage, B. L., Karimpanal, T. G., and
Rana, S. Improving multilingual language models by
aligning representations through steering. arXiv preprint
arXiv:2505.12584, 2025.
Man, H., Ngo, N. T., Dac Lai, V., Rossi, R. A., Dernoncourt,
F., and Huu Nguyen, T. Lusifer: Language universal
space integration for enhanced representation in multilin-
gual text embedding models. In Proceedings of the 48th
International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 1360–1370,
2025.
Park, C., Kim, J., Lee, J., Bae, S., Choo, J., and Yoo, K. M.
Cross-lingual collapse: How language-centric foundation
models shape reasoning in large language models. arXiv
preprint arXiv:2506.05850, 2025.
Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S.,
Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D.,
et al. Language models are multilingual chain-of-thought
reasoners, 2022.
9

-- 9 of 26 --

Submission and Formatting Instructions for ICML 2026
Shioda, K., Komachi, M., Ikeya, R., and Mochihashi, D.
Suggesting sentences for ESL using kernel embeddings.
In Proceedings of the 4th Workshop on Natural Lan-
guage Processing Techniques for Educational Applica-
tions (NLPTEA 2017), pp. 64–68, Taipei, Taiwan, 2017.
Asian

Chunk 25 · 1,986 chars

sion and Formatting Instructions for ICML 2026
Shioda, K., Komachi, M., Ikeya, R., and Mochihashi, D.
Suggesting sentences for ESL using kernel embeddings.
In Proceedings of the 4th Workshop on Natural Lan-
guage Processing Techniques for Educational Applica-
tions (NLPTEA 2017), pp. 64–68, Taipei, Taiwan, 2017.
Asian Federation of Natural Language Processing.
Shubham, M. Breaking language barriers: Advancements in
machine translation for enhanced cross-lingual informa-
tion retrieval. J. Electrical Systems, 20(9s):2860–2875,
2024.
Singh, V., Krishna, A., NJ, K., and Ramakrishnan, G. A
three-pronged approach to cross-lingual adaptation with
multilingual llms. arXiv preprint arXiv:2406.17377,
2024.
Smola, A. J., Gretton, A., Song, L., and Sch ¨olkopf, B. A
hilbert space embedding for distributions. In Hutter, M.,
Servedio, R. A., and Takimoto, E. (eds.), Algorithmic
Learning Theory (ALT 2007), volume 4754 of Lecture
Notes in Computer Science, pp. 13–31. Springer, Berlin,
Heidelberg, 2007. doi: 10.1007/978-3-540-75225-7 5.
Sriperumbudur, B. K., Gretton, A., Fukumizu, K.,
Sch ¨olkopf, B., and Lanckriet, G. R. Hilbert space embed-
dings and metrics on probability measures. JMLR, 11:
1517–1561, 2010.
Thakur, N., Ni, J., ´Abrego, G. H., Wieting, J., Lin, J., and
Cer, D. Leveraging llms for synthesizing training data
across many languages in multilingual dense retrieval.
In Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies (Volume 1:
Long Papers), pp. 7699–7724, 2024.
Wu, L., Tan, X., Qin, T., Lai, J., and Liu, T.-Y. Beyond
error propagation: Language branching also affects the
accuracy of sequence generation. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing, 27
(12):1868–1879, 2019.
Xu, Y., Hu, L., Zhao, J., Qiu, Z., Xu, K., Ye, Y., and Gu, H.
A survey on multilingual large language models: Corpora,
alignment, and bias. Frontiers of Computer Science,

Chunk 26 · 1,996 chars

nching also affects the
accuracy of sequence generation. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing, 27
(12):1868–1879, 2019.
Xu, Y., Hu, L., Zhao, J., Qiu, Z., Xu, K., Ye, Y., and Gu, H.
A survey on multilingual large language models: Corpora,
alignment, and bias. Frontiers of Computer Science, 19
(11):1911362, 2025.
Yao, F., Zhuang, Y., Sun, Z., Xu, S., Kumar, A., and Shang,
J. Data contamination can cross language barriers. arXiv
preprint arXiv:2406.13236, 2024.
Yoshikawa, Y., Iwata, T., Sawada, H., and Yamada, T. Cross-
domain matching for bag-of-words data via kernel em-
beddings of latent distributions. Advances in Neural
Information Processing Systems, 28, 2015.
Zhang, M. et al. Chain of code: Reasoning with a language
model-augmented code interpreter. In ICML, 2024.
Zhao, Y., Zhang, W., Chen, G., Kawaguchi, K., and Bing, L.
How do large language models handle multilingualism?
Advances in Neural Information Processing Systems, 37:
15296–15319, 2024.
10

-- 10 of 26 --

Submission and Formatting Instructions for ICML 2026
A. Theoretic Details
A.1. Math Proof
Theorem 1 (Representation deviation bound). Under Assumptions 1–2 and Definitions 1–2, let the optimal English
representation be z⋆ = [ h(y⋆); h(y⋆) ] and the framework output be z = [ g(x); h(y) ]. Then
z − z⋆
2 ≤ ϵ1 + C √2 ϵ2 , (10)
where C > 0 is the kernel boundedness constant from Definition 1, i.e., sups k(s, s) ≤ C2. This constant reflects the
geometry of the semantic RKHS; smaller C indicates more stable embeddings. In practice, C can be estimated empirically
on a corpus (e.g., C ≈ 0.6867 in our experiments).
Proof. Let y⋆ be a ideal translation such that p(s | x) = p(s | y⋆). By block structure and the triangle inequality,
z − z⋆
2 =
g(x) − h(y⋆)
h(y) − h(y⋆)

2
≤ ∥g(x) − h(y⋆)∥2 + ∥h(y) − h(y⋆)∥2. (11)
By Assumption 1,
∥g(x) − h(y⋆)∥2 ≤ ∥g(x) − h(y)∥2 + ∥h(y) − h(y⋆)∥2 ≤ ϵ1 + ∥h(y) − h(y⋆)∥2. (12)
Next, by Assumption 2 and Pinsker’s inequality,
∥p(s | y) − p(s | y⋆)∥1

Chunk 27 · 1,991 chars

| x) = p(s | y⋆). By block structure and the triangle inequality,
z − z⋆
2 =
g(x) − h(y⋆)
h(y) − h(y⋆)

2
≤ ∥g(x) − h(y⋆)∥2 + ∥h(y) − h(y⋆)∥2. (11)
By Assumption 1,
∥g(x) − h(y⋆)∥2 ≤ ∥g(x) − h(y)∥2 + ∥h(y) − h(y⋆)∥2 ≤ ϵ1 + ∥h(y) − h(y⋆)∥2. (12)
Next, by Assumption 2 and Pinsker’s inequality,
∥p(s | y) − p(s | y⋆)∥1 ≤
q
2 DKL
p(s | y) ∥ p(s | y⋆) ≤ √2 ϵ2 . (13)
Using the RKHS mean-embedding view of h (Definition 1) and the bounded-kernel assumption (see, e.g., (Sriperumbudur
et al., 2010)),
∥h(y) − h(y⋆)∥2 = μp(s|y) − μp(s|y⋆) H
≤ C ∥ p(· | y) − p(· | y⋆) ∥TV
= C
2 ∥p(s | y) − p(s | y⋆)∥1
≤ C
q 1
2 DKL
p(s | y) ∥ p(s | y⋆) ≤ C
q ϵ2
2 . (14)
Plugging (14) into (12) and then into (11) yields
z − z⋆
2 ≤

ϵ1 + C
q ϵ2
2

+ C
q ϵ2
2 = ϵ1 + C√2 ϵ2 ,
which proves the claim.
Corollary 1. Downstream stability. Let fLLM denote the downstream scorer. If fLLM is locally Lipschitz around
[g(x); h(y)] with constant Lloc(y; δ) as in Definition 2, then
fLLM(z) − fLLM(z⋆) 2 ≤ Lloc(y; δ)

ϵ1 + C √2 ϵ2

. (15)
Proof. By the (local) Lipschitz property of fLLM and Theorem 3.2,
fLLM(z) − fLLM(z⋆) 2 ≤ Lloc(y; δ) ∥z − z⋆∥2 ≤ Lloc(y; δ)

ϵ1 + C√2 ϵ2

.
Instantiation. In our measurements we obtain L(0.95)(y; δ) ≈ 0.034 and C ≈ 0.6867 (representation dimension n = 4096).
A representative bound (reported in ℓ1 for readability) is
fLLM(z) − fLLM(z⋆) 1 ≤ 0.034 ·

ϵ1 + 1.9423 √ϵ2

. (16)
11

-- 11 of 26 --

Submission and Formatting Instructions for ICML 2026
A.2. About Definition
Definition 1 (RKHS representation) Let h : Y → Rd denote the English sentence encoder. We view h(y) as the kernel
mean embedding (KME) of the conditional semantic distribution p(s | y) in an RKHS (H, k):
h(y) = μp(s|y) = Es∼p(s|y)
φ(s), φ(s) = k(s, ·). (17)
The kernel k is assumed bounded on the (semantics) domain: 0 < k(s, s) = ⟨k(s, ·), k(s, ·)⟩H ≤ C2 for some constant
C > 0.
Remark 1 (On estimating the boundedness constant C). For any probability measure P on the input space with x, x′ i.i.d.
∼ P

Chunk 28 · 1,995 chars

an RKHS (H, k):
h(y) = μp(s|y) = Es∼p(s|y)
φ(s), φ(s) = k(s, ·). (17)
The kernel k is assumed bounded on the (semantics) domain: 0 < k(s, s) = ⟨k(s, ·), k(s, ·)⟩H ≤ C2 for some constant
C > 0.
Remark 1 (On estimating the boundedness constant C). For any probability measure P on the input space with x, x′ i.i.d.
∼ P ,
μP
2
H = Ex,x′
k(x, x′) ≤ Ex
k(x, x), (18)
where the inequality follows from k(x, x′) ≤ pk(x, x) k(x′, x′) for PSD kernels and Jensen. Thus ∥h(y)∥H = ∥μp(s|y)∥H
provides a lower-bound proxy for E[k(x, x)], but it does not identify the pointwise upper bound sups k(s, s) = C2. In
practice one may report empirical surrogates (e.g., corpus-wise maxima of ∥h(y)∥), while the theoretical C remains a
kernel-dependent constant. See Smola et al. (2007); Shioda et al. (2017); Yoshikawa et al. (2015) for background.
Estimator. Let E ∈ RV ×d be the model’s input embedding table and y = (w1, . . . , wT ) the tokenized sentence with
attention mask mt ∈ {0, 1}. We compute the unnormalized mean-pooled sentence vector
bh(y) = 1
PT
t=1 mt
T	X
t=1
mt E[wt, :] ∈ Rd, and its norm ∥bh(y)∥2. (19)
The corpus-level estimators are
b	Cmax = max
y∈Yprobe
∥bh(y)∥2, b	Cq = Quantile y∈Yprobe
∥bh(y)∥2, q, (20)
where q ∈ (0, 1) (e.g., q = 0.90, 0.95, 0.99) provides robust surrogates. By construction b	Cmax ≤ C (a lower bound on the
true C).
Implementation. We follow the released script compute C rkhs.py: (i) tokenize each sentence, (ii) fetch token
embeddings via get input embeddings(), (iii) mean-pool with the attention mask (no ℓ2 normalization), (iv) take
∥ · ∥2 and aggregate statistics (max, mean, std, median, p90, p95, p99, sample count). Unless otherwise noted, we probe on
STS22 (sentence1) with max length=8192 and report per-model results in Table 6.
Table 6. Estimates of the RKHS bound C on STS22 (field: sentence1) using mean-pooled unnormalized input embeddings (no ℓ2
post-normalization). bCmax = maxy ∥bh(y)∥2; bCq denotes the q-quantile.
Model bCmax Mean Std Median

Chunk 29 · 1,996 chars

oted, we probe on
STS22 (sentence1) with max length=8192 and report per-model results in Table 6.
Table 6. Estimates of the RKHS bound C on STS22 (field: sentence1) using mean-pooled unnormalized input embeddings (no ℓ2
post-normalization). bCmax = maxy ∥bh(y)∥2; bCq denotes the q-quantile.
Model bCmax Mean Std Median P90 P95 P99
Qwen3-Embedding-0.6B 0.5333 0.1944 0.0221 0.1878 0.2240 0.2376 0.2605
Qwen3-Embedding-4B 0.5457 0.2073 0.0324 0.1971 0.2465 0.2672 0.3302
Qwen3-Embedding-8B 0.6866 0.3988 0.0434 0.3900 0.4529 0.4752 0.5503
Analysis. Table 6 and Figure 4 shows that the empirical RKHS bound surrogates b	C increase moderately with model size
(0.6B→4B→8B), which tightens separation in representation space but enlarges the worst-case radius r(C) = ϵ1 + 2 C√2ϵ2
when plugging C into our bounds. Because b	Cmax ≤ C (Definition A.2), any bound instantiated with b	Cmax or b	Cq is optimistic
(it may understate the true worst case); therefore we recommend using (i) a high-probability bound using C = b	C0.95 together
with its empirical coverage on validation, and (ii) a worst-case bound using C = b	Cmax as a lower envelope for the true C.
For rigor, one can calibrate a multiplicative slack κ ≥ 1 by back-testing—choose the smallest κ such that the inequality with
C = κ b	C0.95 holds on at least 95% of held-out samples. Finally, b	C is sensitive to tokenization length and domain; we thus
compute b	C on the target corpus (STS22 sentence1 by default) with unnormalized mean pooling, and advise re-estimating it
in-domain when the deployment distribution shifts.
12

-- 12 of 26 --

Submission and Formatting Instructions for ICML 2026
(a) Qwen3-Embedding-0.6B
(b) Qwen3-Embedding-4B
(c) Qwen3-Embedding-8B
Figure 4. Empirical estimates of the RKHS bound C across model scales. Each subfigure shows (left) a violin plot of ∥bh(y)∥2, (middle) a
histogram with the 99th percentile marked, and (right) a scatter of norms by sentence index.
13

-- 13 of 26 --

Submission and Formatting

Chunk 30 · 1,997 chars

) Qwen3-Embedding-4B
(c) Qwen3-Embedding-8B
Figure 4. Empirical estimates of the RKHS bound C across model scales. Each subfigure shows (left) a violin plot of ∥bh(y)∥2, (middle) a
histogram with the 99th percentile marked, and (right) a scatter of norms by sentence index.
13

-- 13 of 26 --

Submission and Formatting Instructions for ICML 2026
Definition 2 (Data-local Lipschitz constant) “How fast can the encoder’s output change under small edit perturbations of a
sentence in real text data?” By standard Lipschitz-continuity arguments on finite discrete domains, any encoder admits a
Lipschitz constant. Hence, on the dataset Ydata the encoder satisfies
Lloc
h (y; δ) = max
y′∈Nδ (y)
fLLM(z) − fLLM(z⋆) 2
z − z⋆ 2
, (21)
where z = [ g(x); h(y) ]. We denote its q-quantile by L(q)
h , measured by the script described earlier (e.g., q = 0.95,
L(0.95)
h ≈ 0.05). Note. Nδ (y) is the neighborhood defined by token-level edit distance ≤ δ; in practice we use δ = 1.
Table 7. Empirical local Lipschitz estimates Lemp(δ) (percent). We report mean, std, median, and high quantiles (P90/P95/P99), plus max.
Model δ Mean Std Median P90 P95 P99 Max
Qwen3-Embedding-0.6B
1 6.8% 6.2% 5% 13.7% 18.1% 31.5% 58.6%
2 6.5% 5.4% 5.1% 12.7% 16.3% 27.2% 55.7%
3 5.7% 4.8% 4.4% 11.3% 14.6% 23.9% 44.9%
5 4.2% 3.8% 3.1% 8.5% 11.1% 18.8% 34%
8 2.7% 3.3% 1.7% 5.8% 8% 16% 77%
10 2.2% 2.9% 1.3% 4.9% 6.9% 14.6% 55.7%
Qwen3-Embedding-4B
1 5.5% 5% 4.1% 11.1% 14.5% 27% 54.9%
2 5.2% 4% 4.2% 10% 12.5% 21% 41.4%
3 4.6% 3.7% 3.7% 8.8% 11.3% 18% 38.2%
5 3.4% 3% 2.6% 7% 8.7% 14.5% 38.7%
8 2.2% 2.6% 1.5% 4.8% 6.7% 12.4% 29.4%
10 1.8% 2.4% 1.1% 4.1% 5.7% 11.9% 39.2%
Qwen3-Embedding-8B
1 3.4% 3% 2.5% 6.4% 8.6% 15.8% 30.3%
2 3.2% 2.6% 2.5% 6.1% 7.9% 14.3% 29.3%
3 2.9% 2.3% 2.2% 5.5% 7.1% 12.2% 22.7%
5 2.1% 1.9% 1.6% 4% 5.3% 9.5% 21.7%
8 1.4% 1.7% 0.9% 2.9% 4.1% 8.3% 20.4%
10 1.1% 1.6% 0.7% 2.4% 3.5% 7.9% 30%
Explanation. The data-local Lipschitz constant is defined w.r.t. the downstream scorer fLLM as
Lloc
h (y; δ) =

Chunk 31 · 1,994 chars

4% 8.6% 15.8% 30.3%
2 3.2% 2.6% 2.5% 6.1% 7.9% 14.3% 29.3%
3 2.9% 2.3% 2.2% 5.5% 7.1% 12.2% 22.7%
5 2.1% 1.9% 1.6% 4% 5.3% 9.5% 21.7%
8 1.4% 1.7% 0.9% 2.9% 4.1% 8.3% 20.4%
10 1.1% 1.6% 0.7% 2.4% 3.5% 7.9% 30%
Explanation. The data-local Lipschitz constant is defined w.r.t. the downstream scorer fLLM as
Lloc
h (y; δ) = max
y′∈Nδ (y)
fLLM(z) − fLLM(z⋆) 2
z − z⋆ 2
, (22)
is defined as follows.
• fLLM : is any fixed downstream model (e.g., Qwen-3, Qwen-3-Embedding, BERT).
• dtok(y, y′) is the token-level edit distance between two sentences (e.g., Levenshtein distance).
1) δ-neighborhood (in the corpus). For a finite corpus Ydata and any sentence y, define the radius-δ neighborhood
Nδ (y) =  y′ ∈ Ydata : 0 < dtok(y, y′) ≤ δ .
That is, Nδ (y) contains all sentences that differ from y by at most δ token edits (e.g., by exactly one token when δ = 1).
2) Pointwise local Lipschitz constant. The quantity Lloc
h (y; δ) measures the encoder’s rate of change within the data
neighborhood. As long as y has at least one neighbor, this value is finite.
3) Empirical quantile over the dataset. For a confidence level q ∈ (0, 1) (e.g., q = 0.95), define
L(q)
h (δ) = Quantile y∈Ydata

Lloc
h (y; δ), q

. (23)
In words, q · 100% of sentences in the corpus have local Lipschitz constants no greater than L(q)
h . Empirically, for δ between
1 and 10, the local Lipschitz ratios of Qwen3-Embedding are below 0.1 for 99% of samples; moreover, as δ increases the
ratios decrease and concentrate, indicating that the local Lipschitz constant both exists and is measurable.
14

-- 14 of 26 --

Submission and Formatting Instructions for ICML 2026
Experimental notes.
• Lipschitz Ratio (Lh). For an original sentence s and its perturbed version s′, the ratio is
Lh = Embed(s) − Embed(s′) 2
EditDistance(s, s′) .
Here ∥ · ∥2 is the Euclidean norm between embedding vectors, and EditDistance is the Levenshtein distance (minimum
number of single-character edits to transform s into s′). A small and stable

Chunk 32 · 1,998 chars

o (Lh). For an original sentence s and its perturbed version s′, the ratio is
Lh = Embed(s) − Embed(s′) 2
EditDistance(s, s′) .
Here ∥ · ∥2 is the Euclidean norm between embedding vectors, and EditDistance is the Levenshtein distance (minimum
number of single-character edits to transform s into s′). A small and stable ratio indicates local stability/robustness:
small input changes do not cause large embedding shifts. If the ratio grows markedly with δ, the model may vary
sharply in some regions.
• δ (Delta). The maximum number of edit operations allowed for the perturbation. The script evaluates multiple δ values
(e.g., 1, 2, 3, 5, etc.).
• Mean. The average of the ratios at a fixed δ, reflecting the typical sensitivity at that perturbation level.
• Standard Deviation. The dispersion of the ratios; larger values indicate greater variability across sen-
tences/perturbations.
• Quantiles. E.g., median (50%), 90%, 95%, and 99% quantiles. The 95% quantile means that 95% of ratios are no
greater than that value, useful for detecting rare but high-sensitivity cases.
• Sample Count. The number of valid ratios computed for a given δ (cases with no change after perturbation are
excluded).
By tracking these statistics as δ varies, we assess the encoder’s local Lipschitz characteristics and, in turn, its stability and
robustness.
Figure 5. Qwen3-Embedding-0.6B: Local Lipschitz analysis across δ.
15

-- 15 of 26 --

Submission and Formatting Instructions for ICML 2026
Figure 6. Qwen3-Embedding-4B: Local Lipschitz analysis across δ.
Figure 7. Qwen3-Embedding-8B: Local Lipschitz analysis across δ.
16

-- 16 of 26 --

Submission and Formatting Instructions for ICML 2026
Analysis. Across radii δ ∈ {1, 2, 3, 5, 8, 10}, the empirical local Lipschitz constant Lemp(δ) decreases as δ grows, indicating
smoother behavior for larger neighborhoods (finite-difference estimates are dominated by local saturation and curvature).
As shown in Table 7 and Figure 5, 6, 7, scaling the encoder from

Chunk 33 · 1,996 chars

2026
Analysis. Across radii δ ∈ {1, 2, 3, 5, 8, 10}, the empirical local Lipschitz constant Lemp(δ) decreases as δ grows, indicating
smoother behavior for larger neighborhoods (finite-difference estimates are dominated by local saturation and curvature).
As shown in Table 7 and Figure 5, 6, 7, scaling the encoder from 0.6B→4B→8B consistently reduces Lemp at all quantiles:
for example at δ=5, P95 drops from 0.1107 (0.6B) to 0.0874 (4B) to 0.0530 (8B). Tail risk also shrinks (P99 and Max),
with occasional outliers on 0.6B (e.g., δ=8) suggesting numeric spikes; hence, for theoretical bounds we recommend using
the high-probability constant Lloc = P95 at δ ∈ [3, 5] as a robust plug-in for the local bound.
A.3. Why Concatenate TWO Representation Paths?
Although feature concatenation introduces an additional error source compared to using a single feature vector, which
appears to increase the overall error term in the model, we provide an an information-theoretic analysis showing that feature
concatenation leads to more stable results. Model the two paths as noisy channels:
g(x) = s + ηg , h(y) = s + ηh,
with zero-mean, finite-covariance noises conditionally independent given s: p(ηg , ηh | s) = p(ηg | s) p(ηh | s), where ηg
and ηh are additive noises in the two channels, assumed zero-mean with finite covariance and conditionally independent
given s. σ2
s,k = Var(sk) is the variance of the k-th coordinate of the latent semantic vector s.
Define the information gain of concatenation:
∆I = Is; [g(x), h(y)] − Is; g(x),
where I(· ; ·) and H(· | ·) denote Shannon mutual information and conditional entropy, respectively. By the chain rule and
conditional independence,
Is; [g, h] = I(s; g) + I(s; h | g)
= I(s; g) + H(s | g) − H(s | g, h)
≥ I(s; g), (24)
with equality iff s ⊥⊥ h | g (i.e., h(y) provides no additional semantic information beyond what is already contained in g(x)).
In cross-lingual settings, translation noise ηh and anchoring noise ηg are complementary, hence

Chunk 34 · 1,998 chars

endence,
Is; [g, h] = I(s; g) + I(s; h | g)
= I(s; g) + H(s | g) − H(s | g, h)
≥ I(s; g), (24)
with equality iff s ⊥⊥ h | g (i.e., h(y) provides no additional semantic information beyond what is already contained in g(x)).
In cross-lingual settings, translation noise ηh and anchoring noise ηg are complementary, hence I(s; h | g) > 0 and ∆I > 0.
Therfore, if Var(ηg,k) → ∞ on some dimension k (e.g., severe LRL ambiguity), then
I(s; g) ≤ 1
2 log

1 + σ2
s,k
Var(ηg,k )

→ 0.
The quantity in parentheses is the per-coordinate signal-to-noise ratio (SNR), defined as SNRk = σ2
s,k/Var(ηg,k). For
the additive channel Gk = sk + ηg,k, the classical Gaussian-channel bound implies I(sk; Gk) ≤ 1
2 log1 + SNRk
.
while a stable English path (Var(ηh,k) < ∞) yields a strictly positive lower bound for ∆I on that dimension. Hence the
concatenation z = [g(x); h(y)] overcomes single-path bottlenecks, and the information gain offsets the apparent worst-case
bound increase.
B. Dataset
Dataset release. We release a de-identified cross-lingual e-commerce retrieval dataset, LazRetrieval, and its pretraining-
scale companion LazRetrieval-mega. The corpus spans seven languages across Southeast and South Asia: Vietnamese
(vi), Thai (th), Indonesian (id), Malay (ms), Urdu (ur), Bengali (bn), and Filipino/Tagalog (ph). LazRetrieval contains
10 k examples per language, while LazRetrieval-mega contains 1,000 k per language. Unless otherwise specified, our
experiments use LazRetrieval; the mega version is intended to support large-scale pretraining. We normalize the frequencies.
Splits and file structure. We split the data into train and test with a fixed 4:1 ratio. Each split consists of three JSON files:
• query.json: de-identified user queries from seven Lazada locales.
• item.json: product titles (landing-page headers) to be retrieved as candidate documents.
• pairs info.json: the set of positive query–item pairs (binary relevance).
17

-- 17 of 26 --

Submission and Formatting Instructions

Chunk 35 · 1,995 chars

ists of three JSON files:
• query.json: de-identified user queries from seven Lazada locales.
• item.json: product titles (landing-page headers) to be retrieved as candidate documents.
• pairs info.json: the set of positive query–item pairs (binary relevance).
17

-- 17 of 26 --

Submission and Formatting Instructions for ICML 2026
query：
{
“ID”:“Q1048”,
“nation”: “BD”,
"text": "সট চুইংগাম"
}
item：
{
"nation": "BD",
“item_id": "C34801",
"text": "ডুেভই 'ািসক পু-ষ /0সেলট গহনা মু7ট কবজ িবলাসব:ল
ম;া<াম জপমালা মিহলােদর জন; /0সেলট পালেসইরা ম;াস7িলনা /ফিমিননা
উপহার"
},
pairs_info:
{
"nation": "BD",
"query_id": "Q1048",
"query": "সট চুইংগাম",
"item_id": "C369",
"item": "Trident cinnamon -.ভার সুগার ি2 গাম x 14 সফট গাম",
"rscore": "1.0"
}
Figure 8. Rendered examples from the Bengali (BD) split of LazRetrieval. We render records as images to avoid Unicode rendering issues.
Example. A minimal example from the Bengali (Bangladesh) portion is shown below.
Dataset analysis. As shown in Table 8 and Figure 9 The corpus exhibits a pronounced length asymmetry between queries
and items: queries are short on average (mean 18.66, median 18), whereas item titles are substantially longer and more
dispersed (mean 97.42, std 42.59). This mismatch reflects real-world e-commerce behavior—concise user intents versus
verbose product titles—and implies (i) robust handling of extreme-length outliers and (ii) sensitivity to multilingual scripts
with different orthographic granularity. Our training objectives (queue-augmented correlation for sentence ranking and
listwise soft-nDCG@10 with safe negatives for retrieval) are designed to be stable under such length skew, while the
anchoring mechanism in LiRA mitigates representation drift caused by noisy or unusually long inputs.
C. Experimental Details
Unless stated otherwise, inputs are tokenized with max length=512, padding=true, and truncation enabled. For
consistency across backbones, document- and query-side representations are pooled before

Chunk 36 · 1,999 chars

horing mechanism in LiRA mitigates representation drift caused by noisy or unusually long inputs.
C. Experimental Details
Unless stated otherwise, inputs are tokenized with max length=512, padding=true, and truncation enabled. For
consistency across backbones, document- and query-side representations are pooled before similarity scoring. Training
updates three learnable components: the multilingual encoder (e.g., mT5-XL encoder), the cross-space Adaptor, and
18

-- 18 of 26 --

Submission and Formatting Instructions for ICML 2026
Table 8. Descriptive statistics of LazRetrieval (train+test). Lengths are measured as raw string lengths; counts denote unique entries.
Field Max Min Mean Median Std Count
Query 248 2 18.65 18.00 9.37 50,000
Item 255 6 97.42 95.00 42.59 50,000
Figure 9. Length’s distribution of LazRetrieval.
the Actor selector. All are optimized with Adam and gradient clipping at 1.0. Both single-process and DDP training are
supported; checkpointing and logging frequencies are configured per task (Tables. 9–10).
C.1. Arca Translation Selections-Equipped LiRA-Max
As shown in Figure 10. On MLQARETRIEVAL, MGSM, and BELEBELERETRIEVAL, Qwen3-32B is the most frequently
selected candidate (29.4%, 34.0%, and 33.8%, respectively); together with DeepSeek-R1-Distill-Qwen-32B
(27.0%, 31.8%, 21.4%), the Qwen-family accounts for the majority of selections (56.4%, 65.8%, and 55.2%). On
LAZRETRIEVAL, the distribution is more balanced with Llama-33-70B-Instruct at 27.2%, gemma-2-27b-it at
26.8%, and the Qwen-family totaling 46.0%. Across datasets, ARCA shows a consistent preference toward Qwen-family
candidates (Qwen3 or the Qwen-distilled DeepSeek variant), especially on MGSM where they comprise 65.8% of selections.
A plausible explanation is architectural alignment and feature-space compatibility with our frozen evaluator, which is
Qwen3-32B. Using the same family for evaluation can introduce a mild inductive bias that favors stylistic and semantic
choices characteristic

Chunk 37 · 1,995 chars

specially on MGSM where they comprise 65.8% of selections.
A plausible explanation is architectural alignment and feature-space compatibility with our frozen evaluator, which is
Qwen3-32B. Using the same family for evaluation can introduce a mild inductive bias that favors stylistic and semantic
choices characteristic of that architecture. We therefore report these breakdowns to make the potential bias explicit and to
encourage future work to cross-check with evaluators from different families.
Table 9. Core setup per experiment. All runs use PyTorch + Transformers; distributed training via DDP when --distributed is
enabled.
Task Encoder LLM Embedding Pool Size Batch Steps GPUs
STS22 mT5-XL Qwen3-Embedding-8B 8 1 10 4
BelebeleRetrieval mT5-XL Qwen3-Embedding-8B 8 1 5 8
MLQARetrieval mT5-XL Qwen3-Embedding-8B - 1 5 8
LazRetrieval mT5-XL Qwen3-Embedding-8B 4 1 120 8
MGSM mT5-XL Qwen3-8B 4 2 - 1
X-CSQA mT5-XL Qwen3-8B 4 2 - 1
19

-- 19 of 26 --

Submission and Formatting Instructions for ICML 2026
Figure 10. Model selection distribution of ARCA across four benchmarks.
C.2. Prompt Engineer
Our translation-aware training relies on a translator module and a critic module. In practice, however, off-the-shelf LLM
translators may produce non-translation artifacts (e.g., meta prefixes such as “Here is the translation:”, unnecessary
politeness, or extra explanations), which introduce avoidable noise into both the translation candidates and the downstream
scoring signals. To reduce such artifacts and to make the translation outputs more consistent across languages, we adopt a
simple but effective prompt-engineering strategy for both translation and evaluation.
Translation prompt. We instruct the model to act as a professional translator and output only the translated text without
any additional commentary:
You are a professional translator, proficient in various languages. Please translate the following phrase or sentence
into English: ’{text}’. Output only the translation

Chunk 38 · 1,991 chars

Translation prompt. We instruct the model to act as a professional translator and output only the translated text without
any additional commentary:
You are a professional translator, proficient in various languages. Please translate the following phrase or sentence
into English: ’{text}’. Output only the translation results without any other explanations.
Critic prompt. To compare multiple translation candidates, the critic evaluates each candidate on three dimensions—
semantic fidelity, emotional consistency, and pragmatic tone—and is required to output strict JSON to avoid verbose
judgments that are hard to parse:
Please strictly evaluate the following translation on three dimensions:
20

-- 20 of 26 --

Submission and Formatting Instructions for ICML 2026
Table 10. Optimization and reward-related hyperparameters. α, β, γ weight the three translation scores; δ scales the similarity term.
Logging and checkpoint intervals are task-specific.
Task Actor LR Encoder LR Adaptor LR α β γ δ
STS22 1×10−4 5×10−5 1×10−4 0.4 0.3 0.3 1.0
BelebeleRetrieval 1×10−4 5×10−5 1×10−4 0.4 0.3 0.3 1.0
MLQARetrieval 1×10−4 5×10−5 1×10−4 0.4 0.3 0.3 1.0
LazRetrieval 1×10−4 5×10−5 1×10−4 0.4 0.3 0.3 1.0
• Semantic Fidelity (1–10): Accuracy in conveying original meaning
• Emotional Consistency (1–10): Alignment with original emotional tone
• Pragmatic Tone (1–10): Appropriateness in contextual usage and style
You MUST only output JSON format with keys ’semantic’, ’emotional’, ’pragmatic’ and values
1–10, without any other explanations.
Origin text in {lang}: ’{sentence}’
Translation in En: ’{translation}’
This design serves two purposes: (i) it suppresses extraneous natural-language responses that otherwise contaminate
translation candidates and destabilize training, and (ii) it yields structured, dimension-wise scores that can be directly
integrated into the actor–critic learning signal without additional heuristics.
C.3. Bad Case Analysis
Despite strong translation ability, large

Chunk 39 · 1,997 chars

tural-language responses that otherwise contaminate
translation candidates and destabilize training, and (ii) it yields structured, dimension-wise scores that can be directly
integrated into the actor–critic learning signal without additional heuristics.
C.3. Bad Case Analysis
Despite strong translation ability, large language models frequently inject small but non-negligible noise into translation
outputs, especially when the input is long, contains news-style formatting, or includes imperative phrases and boilerplate (e.g.,
“Follow us on Telegram”). A common failure mode is meta-text leakage: the model prepends or appends non-translation
content such as “Sure, here is the translation:”, “Here is the translated sentence:”, or polite closing statements like “If you
need any other help, I’m glad to help you.” Although such artifacts are harmless for human readers, they are problematic
in our setting because they (i) change the token distribution of the “translation” string, (ii) inject irrelevant stylistic cues
that the encoder may latch onto, and (iii) distort critic scoring by mixing translation quality with adherence-to-instruction
behavior. We provide a representative example below. Given a long-form input, some LLM translators return a translation
wrapped with assistant-style meta prefixes and extra explanations, instead of outputting the translation alone:
Expected translation candidate:
Baku, Azerbaijan, Jan. 1 ... ‘‘2019 will go down in history as a year of
in-depth reforms ...’’
Noisy translation candidate:
Sure, here is the translated sentence:
‘‘Baku, Azerbaijan, January 1 ... ‘‘2019 will go down in history as a year of
significant reforms ...’’ ...’’
Such meta-text leakage constitutes translation noise under our theoretical framing (semantic drift and formatting artifacts),
and it can propagate into both representation learning and the actor–critic reward signal. Concretely, the added preambles
(e.g., “Sure, here is . . . ”) introduce irrelevant tokens

Chunk 40 · 1,996 chars

...’’ ...’’
Such meta-text leakage constitutes translation noise under our theoretical framing (semantic drift and formatting artifacts),
and it can propagate into both representation learning and the actor–critic reward signal. Concretely, the added preambles
(e.g., “Sure, here is . . . ”) introduce irrelevant tokens and style markers that are absent from genuine translations, making the
translation string less comparable across candidates and languages.
Our prompt-engineering strategy mitigates this issue by enforcing output-only translation for the translator and JSON-only
scoring for the critic. Empirically, this reduces the rate of meta-text leakage and yields more uniform translation candidates
in style and formatting, leading to a cleaner training signal and more stable translation-aware optimization, especially for
long-form inputs with news-style templates.
Reproducibility notes. All models use Adam, gradient clipping (1.0), and cosine similarity on L2-normalized vectors.
LLM-side token representations are extracted via get input embeddings(), and torch.nan to num is applied
defensively during scoring.
21

-- 21 of 26 --

Submission and Formatting Instructions for ICML 2026
D. Model Details
D.1. About Model
Multilingual encoder: mT5-XL (encoder only; decoder frozen). LLM Evaluator (frozen): Qwen/Qwen3-32B is used
only for evaluation/critique; all its parameters are kept frozen. LaSR (trainable): the LaSR module is trained end-to-end
during our experiments (gradients do not propagate into the frozen evaluator). When needed, token representations are read
via get input embeddings() (no gradient flow).
Pooling & shapes. The backbone outputs last hidden state, which is pooled to a sentence vector. On the LLM
side, raw token embeddings are adaptively average-pooled to a fixed temporal length (pool size, e.g., 32 or 4), then
flattened for similarity computation.
Adaptor. A two-layer MLP maps RdML Linear + ReLU + LayerNorm
−−−−−−−−−−−−−−−−−−→ R512 Linear
−−−−→

Chunk 41 · 1,995 chars

st hidden state, which is pooled to a sentence vector. On the LLM
side, raw token embeddings are adaptively average-pooled to a fixed temporal length (pool size, e.g., 32 or 4), then
flattened for similarity computation.
Adaptor. A two-layer MLP maps RdML Linear + ReLU + LayerNorm
−−−−−−−−−−−−−−−−−−→ R512 Linear
−−−−→ RdLLM , aligning multilingual features to
the LLM embedding space.
Actor (candidate selector). For each candidate, the Actor consumes a 4-D feature vector
[semantic, emotional, pragmatic, sim] with topology R4 → 16 → 1 (ReLU+LayerNorm). A softmax
over candidates defines π(a | x), and REINFORCE is used:
Lactor = − log π(a | x) · R(a),
where R(a) = 0.1α semantic + 0.1β emotional + 0.1γ pragmatic + δ · sim (weights in Table 10).
Encoder/adaptor objective. We maximize cosine similarity between the selected candidate and the aligned query vector
by minimizing
Lenc = − cos(pool(Adaptor(EncML(x))), pool(EmbLLM(y))) .
Three optimizers update Actor, the multilingual encoder, and the Adaptor; gradient clipping is applied uniformly (1.0).
D.2. Measuring ϵ1 and ϵ2
Our theory only assumes: (i) there exists a mapping error between representation vectors, and (ii) machine translation
may introduce noise. Both phenomena are widely observed in cross-lingual alignment and MT-based transfer settings.
We introduce an “ideal” representation vector z⋆ that corresponds to the noisy, error-prone representation z purely as a
mathematical construct, rather than as a strict requirement on real-world conditions. Importantly, the purpose of these
assumptions is to derive and justify our training objective: Assumptions 1–2 are directly aligned with our loss, and z⋆ serves
to emphasize that errors arise along two orthogonal dimensions—semantic drift (translation distortion) and vector mapping
mismatch (representation deviation).
To make these assumptions empirically verifiable, we estimate both error terms using observable proxies during training. For
the representation mapping

Chunk 42 · 1,992 chars

⋆ serves
to emphasize that errors arise along two orthogonal dimensions—semantic drift (translation distortion) and vector mapping
mismatch (representation deviation).
To make these assumptions empirically verifiable, we estimate both error terms using observable proxies during training. For
the representation mapping error ϵ1, we track the mismatch between the feature-path representation and the translation-
path representation:
ˆϵ1 := Ex
h
g(x) − h(ˆy(x)) 2
i
, (25)
where g(x) denotes the feature-path embedding and h(ˆy(x)) denotes the translation-path embedding produced by the
selected translation ˆy(x) (e.g., the actor/critic-chosen candidate). In addition, we report the cosine similarity cos(z, z⋆) as a
normalized proxy to visualize the contraction of representation deviation across epochs (higher is better).
For the translation distortion ϵ2, since the ideal translation y⋆ is not observable, we use a semantic-divergence proxy that
captures the degree of drift introduced by translation:
ˆϵ2 := Ex
h
1 − cos Emb(x), Emb(ˆy(x))i
, (26)
In Eq. (26), we instantiate Emb(·) using the frozen encoder states of fllm (i.e., the LLM representations before decoding).
Concretely, given an input text u (either the source sentence x or the selected translation ˆy(x)), we obtain token-level hidden
22

-- 22 of 26 --

Submission and Formatting Instructions for ICML 2026
states Hu = Encfllm (u) ∈ RT ×D, and compute a sentence-level embedding by masked mean pooling:
Efllm (u) = 1
P
t mt
T	X
t=1
mt Hu[t], (27)
where mt ∈ {0, 1} is the attention mask. We keep fllm frozen and use a fixed prompting/formatting template for all inputs in
this appendix to ensure Efllm (·) is a stable diagnostic signal.
Accordingly, our translation-distortion proxy is measured as
ˆϵ2 := Ex
h
1 − cos Efllm (x), Efllm (ˆy(x))i
. (28)
Lower ˆϵ2 indicates better semantic faithfulness and reduced translation noise (as measured in the representation space of
fllm).
Tab. 11 reports the evolution of the

Chunk 43 · 1,995 chars

lm (·) is a stable diagnostic signal.
Accordingly, our translation-distortion proxy is measured as
ˆϵ2 := Ex
h
1 − cos Efllm (x), Efllm (ˆy(x))i
. (28)
Lower ˆϵ2 indicates better semantic faithfulness and reduced translation noise (as measured in the representation space of
fllm).
Tab. 11 reports the evolution of the similarity between z and z⋆ (cosine similarity proxy) and the corresponding training loss
across epochs on three representative Arca training tasks. We observe a consistent increase in similarity and a monotonic
decrease in loss, which is consistent with the theoretical interpretation that the representation deviation contracts as training
proceeds. Tab. 12 summarizes the measured diagnostics ˆϵ1 and ˆϵ2 under different training settings.
Table 11. Similarity between z and z⋆ (cosine similarity proxy) and training loss across epochs (averaged over three Arca training tasks).
Higher similarity and lower loss indicate improved anchoring and reduced deviation.
Epoch 1 2 3 4 5 6 7 8 9 10
Sim. 0.01 0.14 0.34 0.45 0.54 0.60 0.64 0.67 0.69 0.71
Loss 1.94 1.23 0.97 0.74 0.62 0.54 0.42 0.35 0.29 0.21
Table 12. Diagnostics for representation mapping error and translation distortion. Lower is better for ˆϵ1 and ˆϵ2. Fill in with your measured
values (mean ± std if available).
Setting Task ˆϵ1 ↓ ˆϵ2 ↓ Sim. ↑ nDCG@10 ↑
Qwen3-Embedding-8B LazRetrieval 0.81 0.54 0.01 69.4
+Arca LazRetrieval 0.19 0.23 0.78 69.9
+Arca + LaSR LazRetrieval 0.19 0.23 0.78 71.1
D.3. A more complex training pipeline?
While LiRA introduces translation-aware training, our pipeline is designed to be budget-flexible and engineering-friendly.
First, translations can be prepared offline in advance, which substantially reduces online training overhead. Second, the
translator can be replaced by a smaller MT model when compute is constrained. We report resource usage for both training
and inference in Tab. 13 and Tab. 14, measured in single A100-80G GPU hours.1
Here, pass@k indicates that k LLM

Chunk 44 · 1,989 chars

offline in advance, which substantially reduces online training overhead. Second, the
translator can be replaced by a smaller MT model when compute is constrained. We report resource usage for both training
and inference in Tab. 13 and Tab. 14, measured in single A100-80G GPU hours.1
Here, pass@k indicates that k LLM translators are used along the forward path (i.e., k translation candidates are produced
by LLMs); pass@0 uses only MT models. In practical deployment, one may combine a lightweight MT model with an
LLM (or use MT only). Notably, even pass@0 already outperforms the original Qwen3-Embedding-8B baseline in our
experiments, indicating that LiRA provides a principled way to customize computation budgets and select translator capacity
according to available resources.
Table 13. Training resource usage (single A100-80G GPU hours). pass@k denotes using k LLM translators along the forward path.
Setting Pass@4 Arca Pass@4 LaSR Pass@2 Arca Pass@2 LaSR Pass@0 Arca Pass@0 LaSR
Offline translation 0.15h 0.20h 0.15h 0.20h 0.15h 0.20h
Online translation 1.50h 2.00h 0.80h 1.10h 0.30h 0.40h
1The reported “GPU hours” measure the end-to-end wall-clock GPU time under our implementation setup.
23

-- 23 of 26 --

Submission and Formatting Instructions for ICML 2026
Table 14. Inference resource usage (single A100-80G GPU hours) under different pass@k settings.
Setting pass@4 pass@2 pass@0
Offline translation 0.30h 0.30h 0.30h
Online translation 3.80h 2.10h 0.90h
Pass@k. We evaluate with a k-way budget: pass@k is the number of LLM translators used in one LiRA forward
pass (k=0 uses only MT; k>0 follows Table 15). Table 16 aggregates MLQA Retrieval, BelebeleRetrieval, and STS22.
Across k ∈ {0, 1, 2, 3, 4}, LiRA outperforms the baseline on all three datasets. Both models improve as k increases, with
diminishing gains beyond k=2 and a small dip at k=3 on MLQA. Even at k=0, LiRA beats the baseline, showing a tunable
budget–quality trade-off without large translators at

Chunk 45 · 1,996 chars

ebeleRetrieval, and STS22.
Across k ∈ {0, 1, 2, 3, 4}, LiRA outperforms the baseline on all three datasets. Both models improve as k increases, with
diminishing gains beyond k=2 and a small dip at k=3 on MLQA. Even at k=0, LiRA beats the baseline, showing a tunable
budget–quality trade-off without large translators at inference.
Table 15. Translator configurations for pass@k (abbreviations: OPUS = OPUS-MT; M2M = m2m100; N600 = nllb-200-600M; N3B =
nllb-200-3.3B; DS-R1 = Deepseek-R1-Distill-Qwen-32B).
PASS@K T1 T2 T3 T4
PASS@0 OPUS M2M N600 N3B
PASS@1 LLAMA OPUS M2M N3B
PASS@2 LLAMA GEMMA M2M N3B
PASS@3 LLAMA GEMMA QWEN N3B
PASS@4 LLAMA GEMMA QWEN DS-R1
Table 16. Combined pass@k results on three datasets (higher is better).
METHOD PASS@K MLQA-R BELE-R STS22
QWEN3-8B PASS@0 79.96 82.27 69.64
PASS@1 80.41 83.54 70.01
PASS@2 80.45 83.97 70.72
PASS@3 80.79 84.55 71.32
PASS@4 81.13 85.94 71.64
WITH LIRA PASS@0 81.15 86.00 73.01
PASS@1 81.56 86.15 73.54
PASS@2 81.79 86.67 74.11
PASS@3 81.53 86.69 74.39
PASS@4 82.01 87.03 75.00
D.4. Objective
Ranking objective. To improve the statistical stability of correlation targets (Pearson / Spearman) under small batches, we
maintain a FIFO history queue of length at most K. At each optimization step, we concatenate the history (prediction–label
pairs) to the current batch and compute the correlation losses jointly; the history is treated as a constant via stop-gradient
and never contributes gradients. Let the current batch size be B, the number of valid history items be m ≤ K, and the total
after concatenation be N = m+B. Denote the predicted similarities by p = (p1, . . . , pB )⊤ where
pi = cos(qi, di) = q⊤
i di ∈ [−1, 1] (29)
(the two vectors are L2-normalized sentence embeddings), and the gold scores by t = (t1, . . . , tB )⊤ (e.g., STS22
annotations). The detached history buffers are phist ∈ Rm and thist ∈ Rm. We concatenate
˜p =
phist
p

, ˜t =
thist
t

∈ RN . (30)
If N < Nmin (warm-up threshold), we skip the update. (1)

Chunk 46 · 1,995 chars

−1, 1] (29)
(the two vectors are L2-normalized sentence embeddings), and the gold scores by t = (t1, . . . , tB )⊤ (e.g., STS22
annotations). The detached history buffers are phist ∈ Rm and thist ∈ Rm. We concatenate
˜p =
phist
p

, ˜t =
thist
t

∈ RN . (30)
If N < Nmin (warm-up threshold), we skip the update. (1) Pearson correlation:
r(a, b) =
1
N
PN
i=1(ai − ¯a)(bi − ¯b)
q 1
N
PN
i=1(ai − ¯a)2 + ε
q 1
N
PN
i=1(bi − ¯b)2 + ε
, ε = 10−8, (31)
24

-- 24 of 26 --

Submission and Formatting Instructions for ICML 2026
applied to (˜p, ˜t). (2) Soft-Spearman (differentiable rank correlation): first compute soft ranks Ri for ˜p with temperature
τ > 0,
Ri(˜p; τ ) = 1 +
N	X
j=1
σ
 ˜pi − ˜pj
τ

, σ(x) = 1
1 + e−x . (32)
The label ranks ρi(˜t) use average ties (the standard statistical convention). Define Soft-Spearman as Pearson on ranks:
rs(˜p, ˜t) = r

R(˜p; τ ), ρ(˜t)

. (33)
Note that as τ → 0, R(·; τ ) approaches discrete ranks; smaller τ sharpens sorting but increases gradient variance. (3)
Combined loss (as implemented): for α ∈ [0, 1],
LCorrQ = α 1 − r(˜p, ˜t) + (1 − α) 1 − rs(˜p, ˜t). (34)
In practice, phist is passed via detach(), so gradients of LCorrQ flow only through the current p. The queue is up-
dated in FIFO fashion to keep at most K entries (module corr queue). The warm-up threshold Nmin (module
corr min effective) enqueues without backprop when data are insufficient, stabilizing early training.
Retrieval objective. Problem setup. For each query q, build a candidate set C = {d0, . . . , dC−1} where d0 is the positive
and the rest are online hard negatives (in-batch or mined from a same-language index). With qrels, obtain binary relevance
yi ∈ {0, 1}. Use inner-product/cosine scores
si = ⟨ˆq, ˆdi⟩, ˆq = q
∥q∥ , ˆdi = di
∥di∥ . (35)
Differentiable ranks. Use descending soft ranks (SoftRank) with temperature τ :
ri = 1 + X
j̸ =i
σ
 sj − si
τ

, σ(·) is the sigmoid. (36)
Larger si yields ri closer to 1. Soft nDCG@k. Define the discount and soft

Chunk 47 · 1,990 chars

elevance
yi ∈ {0, 1}. Use inner-product/cosine scores
si = ⟨ˆq, ˆdi⟩, ˆq = q
∥q∥ , ˆdi = di
∥di∥ . (35)
Differentiable ranks. Use descending soft ranks (SoftRank) with temperature τ :
ri = 1 + X
j̸ =i
σ
 sj − si
τ

, σ(·) is the sigmoid. (36)
Larger si yields ri closer to 1. Soft nDCG@k. Define the discount and soft top-k mask as
disci = 1
log2(1 + ri) , mi = σ
 k + 1
2 − ri
τk

. (37)
With gains gi = 2yi − 1,
DCG = X
i
gi disci mi, IDCG =
k	X
t=1
(2y↓
t − 1) 1
log2(1 + t) , nDCG@k = DCG
max(IDCG, ε) . (38)
The base loss is
Lndcg = 1 − nDCG@k. (39)
Safe negatives (near-negative safety gate). To avoid treating unlabeled relevant items as negatives, set
θ = s+ − δ, s+ = s0, (40)
and identify near negatives {i : yi = 0, si ≥ θ}. Apply continuous down-weighting
wi =



σ
 θ − si
β

, yi = 0
1, yi = 1
(41)
and update disci ← wi disci (or drop near negatives as a more aggressive variant). Stability terms. To mitigate collapse and
evaluation jitter, add two lightweight regularizers: (i) Top-1 hinge to enforce a margin between the positive and the hardest
negative,
Lhinge = max0, γ + max
yi=0 si − s+
; (42)
25

-- 25 of 26 --

Submission and Formatting Instructions for ICML 2026
(ii) Mean/variance regularization to control score centering and energy,
Lmv = (¯s)2 + Var(s) − ν , ¯s = 1
C
X
i
si. (43)
Final objective. The per-sample objective is
L = Lndcg + λh Lhinge + λr Lmv, (44)
and the training loss is the batch mean. In practice we keep only in-batch negatives and take the M hardest (top-M ) to
control complexity; when using external candidates (e.g., same-language queues/indices), we re-score with the current
model and then take top-M to reduce stale hard-negative artifacts. Implementation notes. We set τ ∈ [0.05, 0.2], τk ≈ 0.5,
δ ∈ [0.1, 0.3], β ∈ [0.01, 0.05], γ ≈ 0.05, ν ≈ 0.15, normalize scores, apply global gradient clipping, and use EMA for
evaluation. This objective directly maximizes differentiable nDCG@k while safe negatives and steady-state

Chunk 48 · 410 chars

reduce stale hard-negative artifacts. Implementation notes. We set τ ∈ [0.05, 0.2], τk ≈ 0.5,
δ ∈ [0.1, 0.3], β ∈ [0.01, 0.05], γ ≈ 0.05, ν ≈ 0.15, normalize scores, apply global gradient clipping, and use EMA for
evaluation. This objective directly maximizes differentiable nDCG@k while safe negatives and steady-state regularization
prevent periodic collapse due to score-field saturation.
26

-- 26 of 26 --