Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Summary
This paper addresses the perceived preference for English in multilingual Retrieval-Augmented Generation (mRAG) systems. Prior work attributed this to English-centric model capabilities, but the authors argue that structural biases in evaluation benchmarks distort this perception. They identify two key biases: **exposure bias** (English resources dominate retrieval) and **gold-availability bias** (ground-truth evidence is overwhelmingly in English). Cultural priors, where queries tied to specific regions favor local-language evidence, also skew results. To address these biases, the authors propose **DeLP (Debiased Language Preference)**, a calibrated metric that regresses out structural confounds. Using DeLP, they find that the apparent English preference largely disappears, revealing that retrievers fundamentally favor **monolingual alignment**—when the query and document languages match. Building on this insight, they introduce **DELTA (DEbiased Language preference–guided Text Augmentation)**, a lightweight query reformulation strategy. DELTA fuses global and local query segments, using repetition-based weighting to emphasize culturally relevant anchors. Experimental results show DELTA consistently outperforms English pivoting and other mRAG baselines across diverse languages, demonstrating the importance of accounting for true linguistic preference rather than biased environmental cues.
Enhancing Multilingual RAG Systems with Debiased Language
Preference-Guided Query Fusion
Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee†
Chung-Ang University, Seoul, Korea
{tom0365, michael97k, swiftie1230, hwanheelee}@cau.ac.kr
https://jeonghyunpark2002.github.io/DELTA_project_page
†Corresponding author.
Abstract
Multilingual Retrieval-Augmented Generation (mRAG) systems often exhibit a perceived preference for high-resource languages, particularly English, resulting in the widespread adoption of English pivoting. While prior studies attribute this advantage to the superior English-centric capabilities of Large Language Models (LLMs), we find that such measurements are significantly distorted by structural priors inherent in evaluation benchmarks. Specifically, we identify exposure bias and a gold-availability prior, both driven by the disproportionate concentration of resources in English, as well as cultural priors rooted in topic locality, as factors that hinder accurate assessment of genuine language preference. To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds. Our analysis using DeLP reveals that the previously reported English preference is largely a byproduct of evidence distribution rather than an inherent model bias. Instead, we find that retrievers fundamentally favor monolingual alignment between the query and the document language. Building on this insight, we introduce DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight and efficient mRAG framework that strategically leverages monolingual alignment to optimize cross-lingual retrieval and generation. Experimental results demonstrate that DELTA consistently outperforms English pivoting and mRAG baselines across diverse languages. The code is available at https://github.com/jeonghyunpark2002/DELTA.git
1 Introduction
Multilingual Retrieval-Augmented Generation (mRAG) (Chirkova et al., 2024) generalizes Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) by retrieving evidence from multilingual knowledge sources. This enables language models to produce responses that are not only factually grounded but also sensitive to the user's language and linguistic context (Rau et al., 2024). In this landscape, mRAG systems frequently exhibit a significant language preference for English (Zhang et al., 2023; Park and Lee, 2025). Consequently, English pivoting—the practice of translating a non-English query into English before retrieval—has emerged as a surprisingly strong heuristic that yields substantial gains across many languages (Chirkova et al., 2024; Ranaldi et al., 2025).

Figure 1: Common causes of language preference in mRAG: the gold-availability prior and the cultural prior.
Prior works have largely attributed this advantage to the "English-centric" competence of generators, such as superior reasoning in English or reduced translation noise, leading to research that primarily intervenes at the generation stage (Chirkova et al., 2024; Li et al., 2025; Moon et al., 2025).

However, we find that the perceived effectiveness of English pivoting is primarily driven by retrieval-side structural biases rather than any inherent linguistic preference of the model. As illustrated in Figure 1, we identify two major confounders: a gold-availability prior and cultural priors. Our analysis shows that ground-truth evidence in standard benchmarks is overwhelmingly concentrated in English resources, establishing a dominant gold-availability prior. This concentration not only makes English the sole or primary location where correct evidence exists, but also leads to exposure bias—the retrieval system's inherent tendency to surface documents from more prevalent languages regardless of the query's intended language—toward English documents during retrieval, further amplifying English's advantage. Together, these effects fundamentally distort measured language preference and inflate the apparent superiority of English. In addition, we identify cultural priors as an equally critical factor. Some benchmark questions are tied to specific geographic or cultural contexts and contain native surface forms (e.g., local titles, aliases, and scripts) that act as strong retrieval anchors. When benchmarks overrepresent such locale-specific topics, languages may appear "preferred" due to query–corpus alignment and environmental exposure rather than the model's intrinsic preference. Critically, these structural factors contaminate existing methods (Park and Lee, 2025) for measuring language preference.
To reveal the intrinsic preference of mRAG systems, we propose Debiased Language Preference (DeLP), a calibrated measurement that explicitly regresses out these structural confounds. DeLP uses a ridge regression framework to predict observed language preference from structural factors (e.g., corpus size, gold availability, cultural prior), treating the residual signal as the true, debiased preference of the model. By applying DeLP, we reveal a qualitatively different landscape: the previously inflated preference for English largely evaporates. Instead, our results show an increased preference for monolingual alignment, where the retriever performs most effectively when the query and the target document languages match.

Building on the discovery of monolingual alignment, we introduce DEbiased Language preference-guided Text Augmentation (DELTA), a lightweight query-level solution for mRAG. DELTA leverages the debiased preference signals from DeLP to dynamically identify the intrinsic model preference for a given query, effectively bridging the gap between the user's query and the languages where the model performs most reliably. By reformulating the query to include these preference-aligned multilingual anchors, DELTA preserves the native script's context while maximizing the benefits of monolingual alignment. DELTA is highly cost-effective, requiring no modifications to the underlying corpus or retriever architecture. Our experiments demonstrate that DELTA outperforms naive English pivoting, proving that accounting for the model's true linguistic preference, rather than following biased environmental cues, is the key to unlocking the true potential of mRAG systems.
2 The Myth of English Preference: Structural Priors in mRAG

A dominant mRAG strategy is English pivoting, where non-English queries are translated into English to exploit the perceived superiority of English-centric models. We hypothesize that these gains are not necessarily indicative of model preference but rather reflect a massive exposure bias rooted in the structural distribution of evidence.

2.1 Experimental Setup

Datasets. We conduct our analysis on MKQA (Longpre et al., 2021), which provides 10k professionally translated queries. To enable precise measurement of evidence location, we use a 2.7K-example subset that overlaps with KILT NQ*. Since MKQA does not provide standardized provenance for each translated instance, using KILT allows us to inherit document-level provenance (i.e., gold Wikipedia passage IDs), which is essential for quantifying gold availability across different linguistic corpora.

*https://huggingface.co/datasets/facebook/kilt_tasks

Models and Knowledge Sources. We employ BGE-m3 (Chen et al., 2024) as the multilingual retriever and re-ranker. For generation, we use three recently released robust multilingual LLMs: Qwen3-235B (Yang et al., 2025), DeepSeek-v3.1 (Liu et al., 2024), and Gemini-2.5-Flash (Comanici et al., 2025). We retrieve the top-50 candidate documents per query and apply re-ranking, using the top-5 documents as contexts for generation. In line with previous work (Chirkova et al., 2024; Park and Lee, 2025), we use the Wikipedia editions in English and the user's local language as the knowledge sources. Detailed corpus statistics are provided in Appendix D.
2.2 Linguistic Superiority or Data Imbalance?

Following the MKQA protocol anchored to KILT (Longpre et al., 2021), we identify the location of gold passages (WPIDs) within the multilingual Wikipedia datastore. We report this distribution as Gold Availability: for each query language, the number of queries whose gold-passage WPID is present in that language's Wikipedia corpus. This distribution reflects the extent of corpus-level coverage of gold evidence in each language within the benchmark. We then relate this to retrieval performance, measured by Recall@50, i.e., the fraction of queries whose gold passages appear within the top-50 retrieved candidates, under both native-language queries and English-translated queries (EN).

| lang | #q | Gold Avail. ratio | Recall@50 Base | Recall@50 EN | Qwen3-235B Base | Qwen3-235B EN | Gemini-2.5-Flash Base | Gemini-2.5-Flash EN | DeepSeek-v3.1 Base | DeepSeek-v3.1 EN |
|---|---|---|---|---|---|---|---|---|---|---|
| en | 26934 | 73.29% | – | – | 70.05 (EN) | – | 58.26 (EN) | – | 60.77 (EN) | – |
| ar | 214 | 0.58% | 13.36 | 23.57 | 47.79 | 55.14 | 40.79 | 48.44 | 43.64 | 50.97 |
| de | 435 | 1.18% | 21.62 | 26.40 | 63.81 | 60.72 | 53.52 | 55.17 | 54.16 | 56.92 |
| ja | 513 | 1.40% | 16.84 | 25.83 | 46.60 | 59.29 | 44.26 | 53.68 | 44.72 | 56.52 |
| ko | 306 | 0.83% | 15.62 | 24.81 | 40.14 | 54.57 | 35.97 | 47.67 | 34.21 | 50.49 |
| th | 187 | 0.51% | 21.90 | 27.55 | 40.73 | 60.46 | 31.65 | 54.86 | 36.80 | 57.52 |
| zh | 287 | 0.78% | 16.47 | 26.53 | 37.52 | 59.53 | 30.81 | 53.59 | 33.14 | 56.11 |

Table 1: Gold availability bias and its impact on multilingual RAG. Gold Availability measures gold-passage coverage per language, and Retriever Recall reports Recall@50. Model columns show end-to-end accuracy. Base denotes native-language queries, and EN denotes English-translated queries.
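To make this measurement concrete, the following is a minimal sketch of how gold availability and Recall@50 can be tallied, assuming each query record carries its gold Wikipedia page IDs (WPIDs) and the WPIDs of its top-50 retrieved candidates, and that each language's corpus is represented by a set of WPIDs; the data structures and names are illustrative, not from the released code.

```python
def gold_availability(queries, corpus_wpids):
    """Fraction of queries whose gold WPID exists in one language's Wikipedia corpus.

    queries: list of dicts, each with a 'gold_wpids' set
    corpus_wpids: set of WPIDs contained in that language's corpus
    """
    hits = sum(bool(q["gold_wpids"] & corpus_wpids) for q in queries)
    return hits / len(queries)


def recall_at_50(queries):
    """Recall@50: fraction of queries with a gold passage among the top-50 candidates.

    Each query dict carries 'gold_wpids' and 'retrieved_wpids' (ranked candidate list).
    """
    hits = sum(bool(q["gold_wpids"] & set(q["retrieved_wpids"][:50])) for q in queries)
    return hits / len(queries)
```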
We further evaluate end-to-end mRAG performance using character 3-gram recall between the generated and reference answers. Details are provided in Appendix H.

Our analysis in Table 1 reveals an extreme imbalance in the retrieval environment. English Wikipedia provides substantially higher document density and coverage, introducing a strong exposure bias. More critically, for a vast majority of queries, English Wikipedia also serves as the sole repository of ground truth, inducing a dominant gold-availability prior. Consequently, English pivoting appears effective not because models prefer English, but because of this structural skew, sustaining the long-standing "myth" of English preference.

2.3 Impact of Cultural Priors

Queries often carry cultural or regional context whose associated language can naturally align with local-language evidence and be conflated with language preference; we therefore examine where their gold documents reside across Wikipedia languages. We first isolate queries that involve cultural or regional contexts by instructing GPT-4o-mini (Hurst et al., 2024) to predict a single primary locale language Lloc, selecting the language that corresponds to the query's main referenced region or culture (details on the classifier are in Appendix J). Table 2 reports the local-gold rate p(gold WPID exists in local Wikipedia | Lloc) for each predicted language. Local evidence is not uniformly absent: across several predicted locale languages, about 20% of these queries have gold pages only in the corresponding local Wikipedia.

| Lloc | ar | de | es | fr | ja | ko | ru | zh |
|---|---|---|---|---|---|---|---|---|
| Local-gold rate | 21.43% | 16.67% | 18.52% | 21.74% | 14.29% | 12.50% | 25.00% | 6.67% |

Table 2: Local-gold coverage by predicted Lloc.
This distribution introduces a structural bias for retrieval to rely on locale-specific surface-form anchors (e.g., native titles, aliases, scripts) when they exist. As a result, observed language preference can be influenced by locale-tied queries and their local-gold presence, motivating an explicit cultural prior term pcult to avoid conflating topic locality with preference.

3 Measuring Language Preference via Bias Calibration

The structural priors identified in Section 2 suggest that existing metrics, such as MLRS (Park and Lee, 2025), fail to distinguish between a model's intent to use a language and the external necessity imposed by data distribution. To reveal the genuine preference, we introduce Debiased Language Preference (DeLP), a calibrated measurement framework that explicitly regresses out structural confounds.

Figure 2: Overview of DeLP, which measures intrinsic language preference in mRAG by regressing out exposure, gold-availability, and cultural priors from raw preference signals.

3.1 Decomposing Structural Bias in mRAG

To isolate intrinsic model preference, we decompose the confounding factors identified in our previous analysis into three primary priors.

Exposure prior (pret). As observed in the exposure bias of Section 2.2, high-resource corpora—particularly English—dominate the top retrieval results regardless of the encoder's linguistic intent. This prior captures the "popularity bias" of the datastore. A language that appears more frequently in the candidate pool is more likely to be retrieved, potentially leading to a false inflation of preference. Let Lq denote the query language and Ld the language of a retrieved document.
We estimate pret(Ld | Lq) by calculating the average proportion of document language Ld within the top-50 candidates for queries in Lq.

Gold-availability prior (pgold). Our findings in Table 1 (Section 2.2) demonstrate that retrieval is often forced into English because the gold evidence simply does not exist elsewhere. To prevent such uncontrollable circumstances from being mistaken for model preference, we explicitly model the availability of ground-truth passages. We estimate pgold(Lq, Ld) on the MKQA–KILT overlap as the empirical fraction of queries in Lq for which gold evidence is present in the Ld corpus.
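A minimal sketch of how these two priors could be estimated from retrieval logs and the gold-provenance data, assuming each query record stores its language, the languages of its top-50 candidates, and the languages whose corpora contain its gold WPID; the field names are illustrative assumptions.

```python
from collections import defaultdict


def exposure_prior(queries):
    """p_ret(Ld | Lq): average share of document language Ld among the top-50 candidates."""
    sums, counts = defaultdict(float), defaultdict(int)
    for q in queries:
        lq, doc_langs = q["lang"], q["candidate_langs"][:50]
        counts[lq] += 1
        for ld in set(doc_langs):
            sums[(lq, ld)] += doc_langs.count(ld) / len(doc_langs)
    return {(lq, ld): s / counts[lq] for (lq, ld), s in sums.items()}


def gold_availability_prior(queries):
    """p_gold(Lq, Ld): fraction of Lq queries whose gold evidence exists in the Ld corpus."""
    hits, counts = defaultdict(int), defaultdict(int)
    for q in queries:
        counts[q["lang"]] += 1
        for ld in q["gold_langs"]:  # languages whose Wikipedia contains the gold WPID
            hits[(q["lang"], ld)] += 1
    return {(lq, ld): h / counts[lq] for (lq, ld), h in hits.items()}
```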
Cultural prior (pcult). As discussed in Section 2.3, locale-tied queries contain native surface forms that act as structural anchors, naturally pulling retrieval toward the corresponding local-language evidence. This prior captures topic locality: a retriever might show high preference scores for a language simply because the topic is regional. We estimate pcult(Ld) by identifying the query's associated locale language Lloc using GPT-4o-mini and computing the fraction of queries where Lloc = Ld. The classifier selects the local language of the primary referenced place or culture (e.g., "When did Hong Kong go back to China?" → zh), reserving en for inherently English-speaking or genuinely global queries. For details on how we measure the cultural prior, please refer to Appendix J.

3.2 Calibration for Genuine Preference

We calibrate the raw language-preference scores with respect to the above priors to obtain the residual signal, i.e., the component not explained by the priors, which reflects the model's genuine language preference. In addition to the three main priors, we incorporate two auxiliary structural controls, namely a corpus-size prior pdb and a passage-length statistic ℓ, as additional covariates in the prior feature vector ϕ(Lq, Ld) (defined below) to account for language-dependent corpus scale and length effects. Let se(Lq, Ld) denote the observed language-preference score of encoder e for query language Lq and evidence language Ld. We instantiate se with MLRS (Park and Lee, 2025) (Table 10), which measures how often the retriever surfaces evidence in each Ld for a fixed Lq. For each language pair (Lq, Ld), we define a prior feature vector ϕ(Lq, Ld) ∈ R^7:

\[
\phi(L_q, L_d) =
\begin{bmatrix}
1 \\
\log\big(p_{\mathrm{ret}}(L_d \mid L_q) + \epsilon\big) \\
\log\big(p_{\mathrm{db}}(L_d) + \epsilon\big) \\
\log\big(\ell(L_d) + \epsilon\big) \\
\log\big(p_{\mathrm{gold}}(L_q, L_d) + \epsilon\big) \\
\log\big(p_{\mathrm{cult}}(L_d) + \epsilon\big) \\
\mathbb{I}[L_q = L_d]
\end{bmatrix}
\tag{1}
\]
where pret is the exposure prior, pdb is the corpus-size prior, ℓ is a passage-length statistic (e.g., median length), pgold is the gold-availability prior, pcult is the cultural prior, and ϵ > 0 is a small constant added for numerical stability in the log transformation. The vector ϕ(Lq, Ld) stacks interpretable covariates that predict se without invoking intrinsic model preference. We use log-transformed priors to compress heavy-tailed probabilities and corpus statistics and to make linear effects more reasonable across languages. The indicator I[Lq = Ld] allows the model to treat same-language retrieval as a special case, ensuring that monolingual matching is not forced to be explained solely by external priors.

Ridge calibration. We fit the regression separately for each encoder e to learn how much of its observed score se can be attributed to structural priors. We use ridge regularization to stabilize coefficients under the various priors, preventing any single feature from disproportionately absorbing the preference signal.
Let C be the set of all language pairs used for calibration. For each encoder e, we fit a ridge regression that predicts the raw score from the priors:

\[
\hat{\beta}_e = \arg\min_{\beta} J_e(\beta),
\qquad
J_e(\beta) = \sum_{(L_q, L_d) \in C} \big( s_e(L_q, L_d) - \phi(L_q, L_d)^{\top} \beta \big)^2 + \lambda \lVert \beta \rVert_2^2 ,
\tag{2}
\]

where λ is a regularization hyperparameter.

Debiased preference (DeLP). We define the debiased preference as the residual signal after removing the component explained by structural priors:

\[
r_e(L_q, L_d) = s_e(L_q, L_d) - \phi(L_q, L_d)^{\top} \hat{\beta}_e .
\tag{3}
\]

The residual re(Lq, Ld) represents the portion of the observed score that is independent of structural priors. To keep the overall scale comparable to the raw score, we re-center the residuals by the global mean of raw scores μe:

\[
\mathrm{DeLP}_e(L_q, L_d) = r_e(L_q, L_d) + \mu_e,
\qquad
\mu_e = \frac{1}{|C|} \sum_{(L_q, L_d) \in C} s_e(L_q, L_d).
\tag{4}
\]

By adding back μe, DeLP stays on a numeric scale comparable to standard MLRS tables while preserving the relative differences that define the model's intrinsic tendencies. To mitigate potential encoder-specific bias, we apply our calibration procedure independently to each retriever and report all debiased results for three multilingual encoders: BGE-m3 (Chen et al., 2024) and two Sentence-BERT variants (Reimers and Gurevych, 2019), paraphrase-multilingual-MiniLM-L12-v2 and paraphrase-multilingual-mpnet-base-v2. We denote them as p-mMiniLM and p-mMpNet for compactness in tables.
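To make the calibration concrete, here is a minimal sketch of Eqs. (1)–(4), assuming the priors and raw MLRS scores have already been computed and stored in dictionaries keyed by (Lq, Ld); the constants and names are illustrative assumptions rather than the paper's released implementation.

```python
import numpy as np
from itertools import product

EPS = 1e-6      # numerical stability for the log transform (assumed value)
LAMBDA = 1.0    # ridge strength; the paper's actual setting is in its appendix

def prior_vector(lq, ld, p_ret, p_db, p_len, p_gold, p_cult):
    """Eq. (1): 7-dim feature vector of structural priors for a language pair."""
    return np.array([
        1.0,
        np.log(p_ret[(lq, ld)] + EPS),
        np.log(p_db[ld] + EPS),
        np.log(p_len[ld] + EPS),
        np.log(p_gold[(lq, ld)] + EPS),
        np.log(p_cult[ld] + EPS),
        float(lq == ld),
    ])

def delp(raw_scores, langs, priors):
    """Eqs. (2)-(4): ridge-regress raw scores on priors, return re-centered residuals."""
    pairs = [(lq, ld) for lq, ld in product(langs, langs) if (lq, ld) in raw_scores]
    X = np.stack([prior_vector(lq, ld, *priors) for lq, ld in pairs])
    y = np.array([raw_scores[p] for p in pairs])
    # closed-form ridge solution: beta = (X^T X + lambda * I)^-1 X^T y
    beta = np.linalg.solve(X.T @ X + LAMBDA * np.eye(X.shape[1]), X.T @ y)
    residuals = y - X @ beta              # Eq. (3)
    delp_scores = residuals + y.mean()    # Eq. (4): re-center by the global mean
    return dict(zip(pairs, delp_scores))
```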
| Query Lang. | Lq = Ld | en | ko | zh | fr | ja | it | pt | es |
|---|---|---|---|---|---|---|---|---|---|
| en | 50.10 (56.79) | – | 36.67 (33.94) | 40.34 (33.99) | 35.57 (37.57) | 39.61 (34.18) | 35.49 (36.79) | 36.46 (36.54) | 36.02 (37.49) |
| ko | 43.38 (42.21) | 37.69 (44.36) | – | 41.75 (35.44) | 34.90 (36.84) | 43.59 (38.22) | 34.70 (36.00) | 35.65 (35.71) | 34.77 (36.24) |
| zh | 50.60 (45.81) | 38.68 (45.35) | 37.95 (35.06) | – | 34.73 (36.73) | 41.90 (36.51) | 34.87 (36.21) | 35.84 (35.91) | 35.22 (36.69) |
| fr | 40.05 (43.74) | 41.16 (47.84) | 36.77 (34.03) | 40.50 (34.16) | – | 39.92 (34.50) | 36.00 (37.31) | 36.69 (36.76) | 36.28 (37.76) |
| ja | 49.19 (45.50) | 38.70 (45.37) | 38.40 (35.69) | 41.56 (35.24) | 34.94 (36.94) | – | 34.99 (36.29) | 35.97 (36.04) | 35.23 (36.70) |
| it | 39.05 (41.72) | 40.63 (47.30) | 36.85 (34.12) | 40.59 (34.25) | 36.63 (38.64) | 39.86 (34.44) | – | 37.01 (37.09) | 36.91 (38.39) |
| pt | 46.08 (39.76) | 40.55 (47.23) | 36.98 (34.24) | 40.63 (34.29) | 36.50 (38.52) | 40.01 (34.59) | 36.48 (37.80) | – | 37.73 (39.21) |
| es | 38.19 (41.30) | 40.71 (47.39) | 36.86 (34.13) | 40.39 (34.04) | 36.30 (38.31) | 39.76 (34.34) | 36.45 (37.76) | 37.25 (37.32) | – |

Table 3: DeLP for query–document language pairs, averaged over three encoders (raw MLRS in parentheses); dashes denote Lq = Ld. The relatively stronger English and Chinese signals are attributable to encoder training-data language imbalance (e.g., BGE-m3 trains on 194 languages, with English 43.9% and Chinese 20.5%).

| Encoder | pret MLRS | pret DeLP | pgold MLRS | pgold DeLP | pcult MLRS | pcult DeLP |
|---|---|---|---|---|---|---|
| bge-m3 | 0.994 | 0.142 | 0.914 | 0.336 | 0.916 | 0.335 |
| p-mMiniLM | 0.997 | 0.145 | 0.915 | 0.321 | 0.917 | 0.320 |
| p-mMpNet | 0.996 | 0.131 | 0.917 | 0.311 | 0.920 | 0.310 |

Table 4: Pearson's r between preference scores and priors before (MLRS) and after (DeLP) calibration.

Emergence of Monolingual Alignment. After calibration, we find that the preference landscape shifts qualitatively from the raw preference, as shown in Table 3. The previously dominant English preference largely disappears, and the strongest signal consistently moves to the diagonal (Lq = Ld). This reveals that retrievers fundamentally favor monolingual alignment—the matching of query and document in the same language. We also observe that queries favor linguistically or regionally related languages, such as Korean with Japanese. Overall, the DeLP scores suggest that much of the apparent English preference in prior protocols was induced by structural priors, while the residual preference signal is dominated by query-language alignment and interpretable related-language effects. For more detailed DeLP scores, refer to Appendix G.

Correlation Analysis. To validate DeLP, we compute the correlation between preference scores and priors before and after calibration, as shown in Table 4. Raw scores (MLRS) are highly correlated with all three priors (exposure, gold-availability, and cultural), suggesting that existing language-preference measurements largely reflect prior-driven preference rather than intrinsic model preference. After calibration with DeLP, these correlations drop sharply. This confirms that DeLP effectively decouples intrinsic model tendencies from the structural signals.
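A minimal sketch of this validation step, assuming MLRS or DeLP scores and prior values are stored per language pair; it simply reports Pearson's r for each prior before and after calibration.

```python
import numpy as np

def prior_correlations(scores, priors):
    """Pearson's r between a preference-score table and each structural prior.

    scores: {(Lq, Ld): preference score}  (raw MLRS or calibrated DeLP)
    priors: {prior name: {(Lq, Ld): prior value}}
    """
    pairs = sorted(scores)
    s = np.array([scores[p] for p in pairs])
    return {name: float(np.corrcoef(s, [vals[p] for p in pairs])[0, 1])
            for name, vals in priors.items()}

# usage: run once with the raw MLRS table and once with the DeLP table,
# and compare the resulting correlations (high before, low after calibration).
```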
4 Debiasing mRAG through Preference-Aligned Augmentation

The monolingual alignment identified in Section 3 reveals that retrievers intrinsically perform best when the query language matches the document language. This suggests that while English pivoting provides coverage (due to gold availability), it often sacrifices the retrieval anchors present in the user's native tongue. Motivated by this observation, we propose DELTA (DEbiased Language preference-guided Text Augmentation), a lightweight query reformulation strategy that injects preference-aligned cues into a single fused query.

Figure 3: Overview of DELTA query fusion. DELTA fuses global and local query segments into a single preference-aligned query using lightweight repetition-based weighting.

4.1 Query Fusion with Native Anchors

DELTA aims to maximize the benefits of both global coverage and local discriminative matching. As illustrated in Figure 3, given a local question qlocal, DELTA constructs an English pivot qglob and extracts a set of cultural identifiers—canonical titles (tglob, tloc), aliases, and a regional hint—using a frozen LLM (instruction in Appendix N).
These elements are then concatenated into a single query Qfused composed of five segments, each optionally weighted by the cultural cue's confidence score:

• [GLOB]: the English pivot qglob.
• [LOCAL]: the original query qlocal, to leverage monolingual alignment.
• [TITLE_BRIDGE]: paired titles (tglob, tloc) to facilitate cross-lingual mapping.
• [ALIASES] and [LOCALE_HINT]: specific identifiers that serve as stable retrieval anchors.
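As a concrete illustration, here is a minimal sketch of how such a fused query string could be assembled from the extracted cues; the segment tags follow the paper, while the helper names and cue fields are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class CulturalCues:
    """Cues extracted by a frozen LLM for one query (fields are illustrative)."""
    q_glob: str          # English pivot of the question
    title_glob: str      # canonical English title
    title_loc: str       # canonical local-language title
    aliases: list[str]   # alternative surface forms
    locale_hint: str     # short regional hint, e.g. "Korea / Korean culture"

def build_fused_query(q_local: str, cues: CulturalCues) -> str:
    """Concatenate the five DELTA segments into a single retrieval query."""
    segments = [
        f"[GLOB] {cues.q_glob}",
        f"[LOCAL] {q_local}",
        f"[TITLE_BRIDGE] {cues.title_glob} | {cues.title_loc}",
        f"[ALIASES] {' ; '.join(cues.aliases)}",
        f"[LOCALE_HINT] {cues.locale_hint}",
    ]
    return " ".join(segments)
```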
4.2 Repetition-based Weighting

To implement the debiased preference control, DELTA utilizes a repetition-based weighting policy (Wang et al., 2023). We first predict a cultural cue (y, c, Lloc), where y ∈ {0, 1} indicates whether the query is culture-specific, c ∈ [0, 1] is the confidence score, and Lloc is the language of the key native identifiers. This cue determines how strongly we upweight locale-specific blocks versus the global English back-off when forming the fused query Qfused. We map the confidence score c into three discrete repetition levels using two thresholds τlow and τhigh, and set the repetition counts for the local block [LOCAL:Lloc] and the global pivot block [GLOB] as:

\[
r_{\mathrm{local}} =
\begin{cases}
1 + \mathbb{I}[c \ge \tau_{\mathrm{low}}] + \mathbb{I}[c \ge \tau_{\mathrm{high}}], & y = 1 \\
1, & y = 0
\end{cases}
\qquad
r_{\mathrm{glob}} =
\begin{cases}
1 + \mathbb{I}[c < \tau_{\mathrm{low}}], & y = 1 \\
2, & y = 0
\end{cases}
\tag{5}
\]

c < τlow triggers no additional upweighting, τlow ≤ c < τhigh adds one extra repetition, and c ≥ τhigh adds two, yielding rlocal ∈ {1, 2, 3} and rglob ∈ {1, 2}. Intuitively, we upweight high-confidence culture-specific queries toward the local expression to preserve culturally grounded identifiers, while non-culture-specific queries mildly favor the global pivot as a robust back-off. In addition, when y = 1 and c ≥ τboost, we duplicate the local-side disambiguation anchors (i.e., [TITLE_BRIDGE] and [ALIASES]) once more to further emphasize native surface-form anchors and reduce entity ambiguity. Overall, DELTA realizes preference control via text-only weighting, and all concrete hyperparameter values are reported in Appendix K.
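A minimal sketch of the repetition policy in Eq. (5), building on the build_fused_query helper sketched above; the threshold values are illustrative assumptions (the paper's actual settings are in its Appendix K).

```python
TAU_LOW, TAU_HIGH, TAU_BOOST = 0.4, 0.7, 0.9   # illustrative thresholds

def repetition_counts(y: int, c: float) -> tuple[int, int]:
    """Eq. (5): repetition counts for the [LOCAL] and [GLOB] blocks."""
    if y == 1:
        r_local = 1 + int(c >= TAU_LOW) + int(c >= TAU_HIGH)  # 1, 2, or 3
        r_glob = 1 + int(c < TAU_LOW)                          # extra pivot only at low confidence
    else:
        r_local, r_glob = 1, 2                                  # mildly favor the global pivot
    return r_local, r_glob

def weighted_fused_query(q_local: str, cues, y: int, c: float) -> str:
    """Assemble the fused query with repetition-based upweighting."""
    r_local, r_glob = repetition_counts(y, c)
    blocks = [f"[GLOB] {cues.q_glob}"] * r_glob + [f"[LOCAL] {q_local}"] * r_local
    anchors = [
        f"[TITLE_BRIDGE] {cues.title_glob} | {cues.title_loc}",
        f"[ALIASES] {' ; '.join(cues.aliases)}",
    ]
    # duplicate local-side disambiguation anchors for high-confidence cultural queries
    repeat_anchors = 2 if (y == 1 and c >= TAU_BOOST) else 1
    blocks += anchors * repeat_anchors + [f"[LOCALE_HINT] {cues.locale_hint}"]
    return " ".join(blocks)
```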
Qwen3-235b-a22b-2507

| Level | Method | en | ar | es | zh | ja | de | ko | th | AVG ↑ | Latency ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document | MultiRAG | 70.05 | 47.79 | 63.76 | 37.52 | 46.60 | 63.81 | 40.14 | 40.73 | 51.30 | 1.38 |
| Document | CrossRAG | 68.21 | 43.95 | 61.14 | 37.81 | 44.75 | 60.16 | 38.13 | 42.87 | 49.63 | 1.29 |
| Document | DKM-RAG | 69.13 | 42.69 | 62.12 | 35.13 | 43.90 | 61.13 | 39.49 | 38.88 | 49.06 | 3.80 |
| Document | QTT-RAG | 70.11 | 46.44 | 63.02 | 37.68 | 46.94 | 62.79 | 44.13 | 42.12 | 51.65 | 1.80 |
| Query | English Translation | – | 55.14 | 61.94 | 59.53 | 59.29 | 60.72 | 54.57 | 60.46 | 58.81 | 1.17 |
| Query | DELTA (ours) | 63.85 | 62.55 | 63.03 | 62.59 | 62.38 | 62.86 | 63.26 | 62.51 | 62.88 | 1.13 |

Gemini-2.5-flash

| Level | Method | en | ar | es | zh | ja | de | ko | th | AVG ↑ | Latency ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document | MultiRAG | 58.26 | 40.79 | 55.11 | 30.81 | 44.26 | 53.52 | 35.97 | 31.65 | 43.80 | 1.53 |
| Document | CrossRAG | 63.40 | 41.87 | 57.24 | 29.74 | 44.14 | 56.80 | 36.09 | 32.49 | 45.22 | 2.60 |
| Document | DKM-RAG | 64.21 | 39.41 | 59.26 | 31.34 | 43.45 | 57.74 | 37.26 | 33.64 | 45.79 | 5.63 |
| Document | QTT-RAG | 65.32 | 42.64 | 57.81 | 31.56 | 45.18 | 56.27 | 40.65 | 35.97 | 46.93 | 5.55 |
| Query | English Translation | – | 48.44 | 55.84 | 53.59 | 53.68 | 55.17 | 47.67 | 54.86 | 52.75 | 1.55 |
| Query | DELTA (ours) | 56.97 | 56.45 | 55.95 | 55.83 | 56.18 | 55.98 | 56.44 | 56.45 | 56.28 | 1.48 |

Deepseek-chat-v3.1

| Level | Method | en | ar | es | zh | ja | de | ko | th | AVG ↑ | Latency ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Document | MultiRAG | 60.77 | 43.64 | 56.22 | 33.14 | 44.72 | 54.16 | 34.21 | 36.80 | 45.46 | 2.56 |
| Document | CrossRAG | 67.83 | 48.34 | 62.24 | 39.05 | 49.27 | 61.33 | 39.85 | 45.70 | 51.70 | 2.64 |
| Document | DKM-RAG | 67.84 | 44.07 | 62.49 | 37.63 | 45.66 | 61.65 | 40.30 | 40.38 | 50.00 | 2.39 |
| Document | QTT-RAG | 68.28 | 46.13 | 61.81 | 37.24 | 47.36 | 60.48 | 41.06 | 41.29 | 50.46 | 1.93 |
| Query | English Translation | – | 50.97 | 58.32 | 56.11 | 56.52 | 56.92 | 50.49 | 57.52 | 55.26 | 2.05 |
| Query | DELTA (ours) | 59.85 | 59.46 | 58.61 | 59.67 | 59.02 | 59.25 | 53.51 | 56.45 | 58.23 | 1.13 |

Table 5: Main results (end-to-end mRAG performance). We use bge-m3 (Chen et al., 2024) for retrieval and evaluate with character 3-gram recall (Chirkova et al., 2024). AVG (mean) is computed per generator across languages. We report generation time as latency.

5 Experiments

5.1 Experimental Setup

Baselines. (1) MultiRAG: retrieve and re-rank from multilingual datastores using the original MKQA query, then generate the answer in the query language (Chirkova et al., 2024). (2) CrossRAG: run the same multilingual retrieval as MultiRAG, then translate the retrieved passages into a single pivot language, English (Ranaldi et al., 2025). (3) DKM-RAG: translate the retrieved passages into the query language and use an LLM to produce multiple refined passages (Park and Lee, 2025). (4) QTT-RAG: translate retrieved passages and attach translation-quality tags so the generator can decide which contexts to trust (Moon et al., 2025). (5) English Translation: translate the original query into the global pivot language.

5.2 Results and Analysis

Main Results. Table 5 shows that DELTA achieves the best average performance for each generator and is comparable to, or better than, document-level frameworks that incur substantially higher cost due to their long document contexts. The gains are particularly pronounced on non-English queries, indicating that preference-aligned query augmentation is more effective than relying on document-side transformations in multilingual settings.
DELTA provides little benefit on English queries (the en column in Table 5) because the local query and the global pivot become nearly identical when Lq = en. As a result, DELTA injects redundant segments with repeated, overlapping content, which unnecessarily lengthens the query and can dilute the useful signal for retrieval, resulting in no gain or even a slight degradation.

| Statistic | Value |
|---|---|
| Queries (N) | 16,828 |
| Newly recovered queries (Nnew) | 1,235 |
| Gold best-rank (mean) | 10.39 |
| Gold best-rank (median) | 5 |
| Top-10 rate | 66.23% |
| Rank in [10, 49] rate | 35.14% |
| Rank in [40, 50] rate | 4.62% |

Table 6: Retrieval rank analysis, reporting the rank distribution of gold passages newly recovered by DELTA relative to English pivoting.

Analyzing Gold Passage Recall and Ranking. To investigate how DELTA recovers missing gold evidence to improve the overall mRAG system, we compare its retrieval performance against English-pivot retrieval across seven languages (ar, de, es, ja, ko, th, zh), totaling 16,828 queries. As shown in Table 6, while English pivoting provides coverage due to English-heavy gold availability, it often degrades native surface-form anchors—such as titles, aliases, and original scripts—that are critical for precise entity matching. DELTA restores these anchors, facilitating better alignment with English gold documents. Among the 1,235 newly recovered queries, DELTA achieves a mean best rank of 10.39 (median 5) with a 66.2% Top-10 entry rate. This indicates that DELTA does not merely rescue missed gold pages near the cutoff but significantly elevates them to high-ranking, actionable positions.
| Method | ar | de | es | ja | ko | th | zh | Avg |
|---|---|---|---|---|---|---|---|---|
| Orig | 63.68 | 62.46 | 63.26 | 63.26 | 62.84 | 63.37 | 63.04 | 63.13 |
| +Global | 71.42 | 72.02 | 71.37 | 71.51 | 71.77 | 71.70 | 71.57 | 71.62 |
| +Title | 68.17 | 67.83 | 67.75 | 67.78 | 67.93 | 68.17 | 68.34 | 68.00 |
| +Aliases | 68.14 | 67.46 | 67.43 | 67.38 | 68.61 | 68.15 | 68.13 | 67.90 |
| +Locale | 67.57 | 67.63 | 67.78 | 67.89 | 67.81 | 67.65 | 67.44 | 67.68 |
| All cues | 72.99 | 73.01 | 72.48 | 73.26 | 72.88 | 72.93 | 72.70 | 72.89 |

Table 7: Cue ablations for DELTA with fixed evidence. We incrementally add query-side cues and report end-to-end generation accuracy.

Impact of Cues on Evidence Interpretation. To isolate generation-stage effects from retrieval-stage effects, we conduct a cue ablation study under a fixed-evidence setting, which is reported in Table 7. Specifically, we first retrieve passages and re-rank, then hold those retrieved passages constant while modifying only the query-side cues at generation time. This design ensures that any observed performance differences stem solely from how the generator interprets the fixed evidence under different query formulations, not from changes in retrieved content. Under this setup, the global pivot cue significantly outperforms the original query, indicating that concise global English paraphrasing aids the generator in aligning evidence. Bridge cues also provide independent gains, showing that even when the retrieved evidence context is held fixed, varying only the query-side cues at generation time improves the model's ability to select precise evidence spans. The best performance across all languages is achieved by combining all cues, suggesting that bridge cues offer critical disambiguation and entity grounding.
Latency Analysis. To assess the efficiency of DELTA, we report average end-to-end latency (wall-clock time in seconds per query, averaged over all test queries) in the rightmost column of Table 5. We provide detailed per-language latency measurements in Appendix I. DELTA maintains high efficiency by generating a single fused query and avoiding document translation. It can even be faster than English Translation: by incorporating local cues and disambiguation anchors, DELTA enables direct retrieval, reducing the overhead of processing overly generic English-only signals.

6 Related Works

6.1 Multilingual RAG

Prior work in mRAG has explored how performance varies with the query language (Ranaldi et al., 2025; Longpre et al., 2021) and with the language of relevant or irrelevant evidence (Qi et al., 2025; Wu et al., 2024), as well as document ordering and prompting strategies that affect how models consume multilingual contexts (Sharma et al., 2024; Wu et al., 2024; Shankar et al., 2024; Ki et al., 2025). A common and effective heuristic is pivot translation, where non-English queries are translated into English before retrieval, often producing large gains (Asai et al., 2021; Ranaldi et al., 2025). However, much of the existing analysis of why pivot translation helps centers on the generation stage (e.g., English-centric generation competence, translation noise, and cross-lingual drift), which motivates generator-side interventions such as translation-aware prompting or decoding-time control (Sharma et al., 2024; Moon et al., 2025). In contrast, our work focuses on a retrieval-side explanation: we empirically show that gold evidence is structurally skewed toward English corpora.
6.2 Language Preference

In mRAG, language preference is shown both in retrieval (over-retrieving high-resource languages) and in generation (differentially using evidence by language even under matched relevance, a setting where the gold passage or core supporting evidence is correctly retrieved across all compared conditions), degrading consistency and downstream quality (Park and Lee, 2025). Existing measurements of language preference in mRAG commonly rely on behavioral proxies, such as comparing outputs across query languages via information overlap (Sharma et al., 2024) or embedding similarity to references (Park and Lee, 2025), and, in more controlled settings, analyzing citation or attribution behavior as the evidence language varies while other variables are fixed (Ki et al., 2025; Qi et al., 2025). While prior approaches offer useful signals, they miss a key confound in mRAG: structural priors can dominate preference scores. We therefore debias preference by regressing out these priors and using the residual as the preference signal.

7 Conclusion

We demonstrate that gains from English pivoting in mRAG stem from retrieval-side evidence imbalance, which biases preference measurements. We address this with DeLP, a debiased metric that calibrates out structural priors and reveals a preference shift toward the query language. Leveraging DeLP, we introduce DELTA, a lightweight query reformulation strategy that fuses global and local cues into a single query, consistently outperforming baselines.

Limitations

First, our debiasing targets retriever-level preference, while generator-level preference can still remain. Therefore, extending debiasing to how generators consume multilingual evidence is an important direction for future work.
Second, our conclusions are drawn from a Wikipedia-based mRAG setup. Evaluating DeLP and DELTA on broader, domain-specific multilingual corpora is therefore necessary to assess their generalizability. Third, DELTA controls the balance between global and local signals using simple repetition, which is coarse. More precise and principled weighting or adaptive control logic could further improve effectiveness and stability.

Ethics Statement

We conduct our experiments using publicly available multilingual datasets, knowledge sources, and models that are widely used in the research community and released under established data-sharing and licensing guidelines. We follow the usage protocols and license agreements specified by the original providers. While these resources are designed to reduce harmful biases and inappropriate content, they may still contain artifacts of data imbalance and may not fully represent the diversity of languages, dialects, and cultural contexts. Our work analyzes and mitigates retrieval-side evidence imbalance and does not involve human subject data, user interaction logs, or the collection of personally identifiable information. We encourage future deployments to consider downstream risks such as uneven coverage across languages and potential disparities in answer quality for under-resourced communities.

Acknowledgments

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [RS-2021-II211341, Artificial Intelligence Graduate School Program (Chung-Ang University)] and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2026-25494299). This research was supported by the Chung-Ang University Graduate Research Scholarship in 2025.
References

Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. XOR QA: Cross-lingual open-retrieval question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564.

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Preprint, arXiv:2402.03216.

Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, and Vassilina Nikoulina. 2024. Retrieval-augmented generation in multilingual settings. In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024), pages 177–188, Bangkok, Thailand. Association for Computational Linguistics.

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.
Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, and Kevin Duh. 2025. Linguistic nepotism: Trading off quality for language preference in multilingual RAG. arXiv preprint arXiv:2509.13930.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA. Curran Associates Inc.

Bo Li, Zhenghua Xu, and Rui Xie. 2025. Language drift in multilingual retrieval-augmented generation: Characterization and decoding-time mitigation. arXiv preprint arXiv:2511.09984.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. Transactions of the Association for Computational Linguistics, 9:1389–1406.

Hoyeon Moon, Byeolhee Kim, and Nikhil Verma. 2025. Quality-aware translation tagging in multilingual RAG system. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 161–177.

Jeonghyun Park and Hwanhee Lee. 2025. Investigating language preference of multilingual RAG systems. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5647–5675, Vienna, Austria. Association for Computational Linguistics.
Jirui Qi, Raquel Fernández, and Arianna Bisazza. 2025. On the consistency of multilingual context utilization in retrieval-augmented generation. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 199–225, Suzhou, China. Association for Computational Linguistics.

Leonardo Ranaldi, Barry Haddow, and Alexandra Birch. 2025. Multilingual retrieval-augmented generation for knowledge-intensive task. arXiv preprint arXiv:2504.03616.

David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Stéphane Clinchant, and Vassilina Nikoulina. 2024. BERGEN: A benchmarking library for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7640–7663.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Bhavani Shankar, Preethi Jyothi, and Pushpak Bhattacharyya. 2024. In-context mixing (ICM): Code-mixed prompts for multilingual LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4162–4176, Bangkok, Thailand. Association for Computational Linguistics.

Nikhil Sharma, Kenton Murray, and Ziang Xiao. 2024. Faux polyglot: A study on information disparity in multilingual large language models. Preprint, arXiv:2407.05502.
Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query expansion with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9414–9423.

Suhang Wu, Jialong Tang, Baosong Yang, Ante Wang, Kaidi Jia, Jiawei Yu, Junfeng Yao, and Jinsong Su. 2024. Not all languages are equal: Insights into multilingual retrieval-augmented generation. Preprint, arXiv:2410.21970.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

Xiang Zhang, Senyu Li, Bradley Hauer, Ning Shi, and Grzegorz Kondrak. 2023. Don't trust ChatGPT when your question is not in English: A study of multilingual abilities and types of LLMs. arXiv preprint arXiv:2305.16339.

Appendix

A The Use of Large Language Models

We write the manuscript ourselves, and an LLM (ChatGPT-5.2) is used solely for refinement of style, clarity, and grammar. It is not used for ideation or content generation.

B Implementation Details

We adopt the multilingual retrieval baseline of Bergen (Chirkova et al., 2024), which retrieves evidence from a datastore spanning all languages. For generation, we follow Bergen's prompting setup and use the basic_translated_langspec template (Figure 4) to produce the final mRAG response. Building on this standardized pipeline, we conduct a series of experiments to quantify language preference in mRAG under the Bergen framework, which systematically examines the key components and practical adjustments necessary for a robust multilingual RAG baseline. We use a robust multilingual LLM, qwen3-235b-a22b-2507, to translate the user's local question into the global pivot language, English.
We instruct GPT-4o-mini (Hurst et al., 2024) to generate a compact lexical bundle that supplies candidate titles, aliases, and a short disambiguation hint for the global and local segments. All LLM calls are made using the OpenRouter API. We conduct our experiments using an AMD EPYC 7313 CPU (3.0 GHz) paired with four NVIDIA RTX 6000 Ada GPUs. We use Python 3.11.5 and PyTorch 2.3.1 for the software environment.

C Language Notation

We use standard ISO 639-1 language codes to denote the languages in our experiments: en denotes English, ar Arabic, es Spanish, zh Chinese (Simplified), ja Japanese, de German, ko Korean, and th Thai. These concise codes facilitate consistent identification and processing of language-specific data across datasets and models in multilingual NLP research.

D Dataset Details & Statistics

Wikipedia is a widely adopted knowledge source in both monolingual RAG and mRAG systems, as it provides broad topical coverage and is commonly used to benchmark RAG pipelines. In most experiments, we retrieve from two data sources: (i) the KILT snapshot of English Wikipedia† and (ii) the Wikipedia edition in the user's local language‡. This two-source design reflects a standard and practical mRAG setting where English serves as a high-coverage reference corpus while local-language Wikipedia captures language-specific evidence and terminology.

†https://huggingface.co/datasets/facebook/kilt_wikipedia
‡https://huggingface.co/datasets/wikimedia/wikipedia
We report summary statistics for the data resources used in our experiments in Table 14. MKQA is our primary evaluation dataset, and we provide the number of examples along with the median lengths of questions and answers. We also use Wikipedia as the external corpus for the retriever datastore; its statistics, including the number of passages and their median lengths, are likewise presented in Table 14. These statistics provide an overview of the datasets and corpora underlying our experimental setup.

E Raw Language Preference Score

MLRS. Following the standard MultiLingual-RankShift (MLRS) protocol, we quantify retriever-level language preference by measuring how much the ranks of non-query-language documents improve after being translated into the query language (Park and Lee, 2025). For each query q with language Lq, we first retrieve a ranked list Dq from a multilingual datastore, assigning each document d ∈ Dq an initial rank r^init_d. We then translate documents with Ld ≠ Lq into Lq and re-rank them using the same retriever, obtaining r^re-rank_d. The (non-negative) rank gain is computed as

\[
\Delta r_d = \max\big(r^{\mathrm{init}}_d - r^{\text{re-rank}}_d,\, 0\big), \qquad \Delta r_q = \sum_{d \in D_q} \Delta r_d ,
\]

and normalizing by the maximum possible gain \(\Delta r^{\max}_q = \sum_{d \in D_q} (r^{\mathrm{init}}_d - 1)\) yields the query-level score

\[
\mathrm{MLRS}_q = \frac{\Delta r_q}{\Delta r^{\max}_q} \times 100
\]

(or 0 if \(\Delta r^{\max}_q = 0\)); the final MLRS is the average over queries.
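A minimal sketch of the per-query MLRS computation described above, assuming the initial and re-ranked positions of the translated (non-query-language) documents have already been obtained from the retriever; the names are illustrative.

```python
def mlrs_query(init_ranks: dict, rerank_ranks: dict) -> float:
    """Query-level MLRS: normalized rank gain of translated documents (0-100).

    Both dicts map document id -> rank for the non-query-language documents.
    """
    gain = sum(max(init_ranks[d] - rerank_ranks[d], 0) for d in init_ranks)
    max_gain = sum(r - 1 for r in init_ranks.values())
    return 100.0 * gain / max_gain if max_gain > 0 else 0.0


def mlrs(per_query_ranks) -> float:
    """Final MLRS: average of the query-level scores over all queries."""
    scores = [mlrs_query(init, rerank) for init, rerank in per_query_ranks]
    return sum(scores) / len(scores) if scores else 0.0
```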
Results. Table 10 reports the retriever's language preference scores before calibration with DeLP. Overall, we observe three consistent patterns. First, cross-lingual retrieval (Lq ≠ Ld) generally yields lower MLRS than monolingual retrieval, indicating that cross-lingual matching is less preferred in most cases. Second, English emerges as a dominant target language: when the retrieved document language Ld is English, the retriever attains near-maximum preference scores and often even surpasses monolingual settings, consistent with English-heavy pretraining and stronger English representations. Third, cross-lingual preference is partly modulated by linguistic relatedness: closely related Romance languages (fr/it/pt/es) preserve relatively high cross-lingual scores, while East Asian pairs (ko/ja/zh) show moderate but noticeable drops compared to monolingual baselines.

Generator Level Method en ar es zh ja de ko th
Deepseek-chat-v3.1 Document MultiRAG 1.925 3.093 2.456 2.683 2.816 2.456 2.362 2.716
Deepseek-chat-v3.1 Document CrossRAG 2.005 2.897 2.706 2.822 2.968 2.328 2.778 2.653
Deepseek-chat-v3.1 Document DKM-RAG 1.955 2.710 2.262 2.572 2.841 2.227 1.733 2.843
Deepseek-chat-v3.1 Document QTT-RAG 1.671 2.347 2.054 1.348 2.475 1.663 1.383 2.525
Deepseek-chat-v3.1 Query English Translation – 2.170 1.992 1.800 1.857 2.224 2.179 2.110
Deepseek-chat-v3.1 Query Ours 0.851 0.988 1.496 1.288 1.004 0.853 1.637 0.955
Gemini Document MultiRAG 1.524 1.518 1.502 1.546 1.595 1.494 1.519 1.581
Gemini Document CrossRAG 1.688 4.821 0.780 1.511 2.632 3.234 3.883 2.277
Gemini Document DKM-RAG 4.308 7.293 5.301 0.765 6.261 5.559 6.187 9.346
Gemini Document QTT-RAG 2.511 7.355 5.306 1.177 6.581 5.639 6.274 9.553
Gemini Query English Translation – 1.484 1.510 1.528 1.802 1.481 1.478 1.539
Gemini Query Ours 1.483 1.485 1.469 1.484 1.499 1.473 1.490 1.481
Qwen Document MultiRAG 1.067 1.905 1.367 1.657 1.147 1.343 1.141 1.410
Qwen Document CrossRAG 1.082 1.533 1.357 2.475 1.033 0.630 0.693 1.543
Qwen Document DKM-RAG 4.573 1.368 3.794 5.171 5.721 0.764 0.627 8.364
Qwen Document QTT-RAG 1.401 1.813 1.698 3.263 1.494 1.350 1.037 2.352
Qwen Query English Translation – 1.086 1.347 1.141 1.095 1.108 1.214 1.226
Qwen Query Ours 1.069 1.036 1.272 1.128 1.045 1.111 1.123 1.225
Table 8: Average generation time (sec/query; lower is better). Bold = lowest, underline = second-lowest across both Document-Level and Query-Level rows for each (generator, language).

F Language Distribution of Retrieved Documents

Table 13 reports the language composition of the top-50 documents retrieved for each MKQA query-language split. Across nearly all query languages, the retrieved evidence is heavily concentrated in English, and English often remains the most frequently retrieved language even for non-English queries. This trend is also reflected in the aggregated distribution (mkqa_avg), indicating that the observed language preference in standard mRAG pipelines largely mirrors structural priors of the retrieval setup rather than purely intrinsic model preference.

G Calibration Details

Table 9 reports the detailed DeLP scores for each of the three encoders. After removing the variance explained by the structural priors, the calibrated matrices exhibit a consistent pattern across encoders: the strongest preference concentrates on the diagonal (Lq = Ld), indicating a robust shift toward query–document language alignment rather than an English-dominant bias. Residual cross-lingual preferences remain comparatively mild and structured, reflecting interpretable related-language effects instead of exposure- or coverage-driven artifacts.
Query Lang. Encoder Lq=Ld en ko zh fr ja it pt es
en bge-m3 49.25 – 35.87 (-13.37) 39.48 (-9.77) 34.58 (-14.67) 38.79 (-10.46) 34.59 (-14.66) 35.81 (-13.43) 35.13 (-14.12)
en p-mMiniLM 50.26 – 37.00 (-13.27) 40.94 (-9.32) 36.18 (-14.08) 39.96 (-10.31) 35.83 (-14.43) 36.62 (-13.64) 36.49 (-13.78)
en p-mMpNet 50.80 – 37.15 (-13.65) 40.61 (-10.19) 35.94 (-14.86) 40.09 (-10.71) 36.04 (-14.76) 36.96 (-13.84) 36.43 (-14.37)
ko bge-m3 42.34 36.76 (-5.58) – 40.69 (-1.65) 34.41 (-7.93) 42.45 (+0.11) 34.44 (-7.90) 35.29 (-7.05) 34.46 (-7.87)
ko p-mMiniLM 44.09 38.03 (-6.06) – 42.37 (-1.72) 35.09 (-9.00) 43.90 (-0.19) 34.75 (-9.34) 36.07 (-8.02) 34.98 (-9.11)
ko p-mMpNet 43.72 38.29 (-5.43) – 42.19 (-1.53) 35.20 (-8.52) 44.43 (+0.72) 34.91 (-8.80) 35.59 (-8.13) 34.87 (-8.84)
zh bge-m3 49.67 38.52 (-11.15) 37.32 (-12.36) – 34.33 (-15.34) 41.36 (-8.32) 34.58 (-15.09) 35.71 (-13.96) 34.98 (-14.69)
zh p-mMiniLM 51.01 38.80 (-12.21) 38.11 (-12.90) – 34.99 (-16.03) 42.20 (-8.81) 35.06 (-15.95) 35.94 (-15.07) 35.38 (-15.64)
zh p-mMpNet 51.11 38.72 (-12.39) 37.91 (-13.20) – 34.87 (-16.24) 42.13 (-8.98) 34.98 (-16.13) 35.88 (-15.23) 35.31 (-15.80)
fr bge-m3 39.45 40.48 (+1.02) 36.15 (-3.30) 39.95 (+0.49) – 39.47 (+0.01) 35.39 (-4.06) 36.25 (-3.20) 35.75 (-3.70)
fr p-mMiniLM 40.42 41.56 (+1.14) 37.20 (-3.22) 40.85 (+0.43) – 40.27 (-0.15) 36.33 (-4.09) 36.94 (-3.48) 36.56 (-3.86)
fr p-mMpNet 40.28 41.45 (+1.17) 36.95 (-3.33) 40.71 (+0.43) – 40.03 (-0.25) 36.29 (-3.98) 36.87 (-3.41) 36.54 (-3.74)
ja bge-m3 48.61 38.44 (-10.16) 38.21 (-10.39) 41.15 (-7.46) 34.70 (-13.91) – 34.83 (-13.78) 35.86 (-12.75) 35.09 (-13.52)
ja p-mMiniLM 49.56 38.95 (-10.61) 38.55 (-11.01) 41.90 (-7.66) 35.19 (-14.37) – 35.21 (-14.35) 36.14 (-13.42) 35.44 (-14.12)
ja p-mMpNet 49.41 38.70 (-10.72) 38.43 (-10.98) 41.64 (-7.78) 34.94 (-14.47) – 34.94 (-14.47) 35.92 (-13.49) 35.15 (-14.26)
it bge-m3 38.38 39.88 (+1.50) 36.15 (-2.23) 39.83 (+1.45) 35.87 (-2.51) 39.26 (+0.88) – 36.39 (-2.00) 36.17 (-2.21)
it p-mMiniLM 39.44 41.10 (+1.67) 37.23 (-2.21) 40.92 (+1.48) 37.08 (-2.36) 40.24 (+0.80) – 37.44 (-2.00) 37.36 (-2.08)
it p-mMpNet 39.33 40.90 (+1.57) 37.18 (-2.15) 41.02 (+1.69) 36.94 (-2.39) 40.09 (+0.77) – 37.21 (-2.12) 37.20 (-2.12)
pt bge-m3 45.38 39.89 (-5.49) 36.22 (-9.17) 39.82 (-5.56) 35.78 (-9.60) 39.41 (-5.97) 35.81 (-9.57) – 37.09 (-8.29)
pt p-mMiniLM 46.52 41.16 (-5.36) 37.33 (-9.19) 41.24 (-5.28) 37.03 (-9.49) 40.47 (-6.05) 36.93 (-9.59) – 38.21 (-8.31)
pt p-mMpNet 46.33 40.61 (-5.72) 37.38 (-8.95) 40.84 (-5.49) 36.70 (-9.63) 40.14 (-6.19) 36.71 (-9.61) – 37.88 (-8.45)
es bge-m3 37.64 40.18 (+2.54) 36.21 (-1.43) 39.78 (+2.15) 35.68 (-1.95) 39.28 (+1.64) 35.90 (-1.74) 36.82 (-0.82) –
es p-mMiniLM 38.71 41.31 (+2.60) 37.29 (-1.42) 40.85 (+2.14) 36.87 (-1.84) 40.20 (+1.49) 37.01 (-1.70) 37.73 (-0.98) –
es p-mMpNet 38.23 40.65 (+2.42) 37.09 (-1.14) 40.53 (+2.30) 36.34 (-1.89) 39.81 (+1.58) 36.43 (-1.80) 37.19 (-1.04) –
Table 9: Language preference measured by DeLP. Each cell reports the debiased preference score and its delta from the matching-language baseline (Lq = Ld). Background shading is row-wise min–max scaled (including the diagonal cell); darker cells indicate a stronger preference for the document language.

H Gold Passage Counting Protocol

We compute Table 1 on the 2,827-question subset that overlaps with KILT NQ provenance. Because prior work (Park and Lee, 2025) provides the same underlying question translated into 13 query languages, the unit counted in Table 1 is not the number of unique questions but the number of question × query-language instances. Hence the total number of instances is 2,827 × 13 = 36,751, and values such as "#q = 26,934 (73.29%)" can legitimately exceed 2,827; the ratio is computed as 26,934/36,751. Gold labels originate from KILT's provenance, which is anchored to English Wikipedia page IDs (WPIDs). All 13 translations of the same question share the same gold WPID set. We then assess gold availability in each Wikipedia language edition by mapping each English WPID to a corresponding page in language ℓ using Wikipedia/Wikidata interlanguage links (sitelinks) and checking whether the mapped page exists in the Wikipedia dump used to build our corpus.

A key source of confusion is that our corpus is passage-based: each Wikipedia page is split into multiple chunks, so a single WPID may correspond to multiple passages. Moreover, for a given question, KILT may provide multiple gold provenances that map to different passages within the same WPID. In Table 1, we use a WPID-level convention: when multiple gold passages correspond to the same WPID for a query, we treat them as a single gold item rather than counting them multiple times. Out of the 2,827 questions in our KILT-overlap subset, 2,404 questions have at least one available gold WPID (i.e., a mapped gold page exists in our Wikipedia dumps). Accordingly, the number of questions with gold evidence satisfies only_en + both = 2,404 in Table 1.
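The per-language availability check can be sketched as follows. We assume a precomputed sitelink mapping from English WPIDs to local page identifiers (e.g., extracted from a Wikidata dump) and a set of page identifiers present in our passage corpus; both inputs and all names are placeholders rather than the released artifacts.

```python
def gold_available(gold_wpids, sitelinks, corpus_pages, lang):
    """Return True if at least one gold WPID maps to a page that exists in the
    `lang` Wikipedia dump underlying our passage corpus.

    gold_wpids:   set of English Wikipedia page IDs from KILT provenance
    sitelinks:    dict {english_wpid: {lang_code: local_page_id}} from interlanguage links
    corpus_pages: set of local page IDs present in the dump used to build the datastore
    """
    for wpid in gold_wpids:                       # WPID-level convention: one item per page
        local_page = sitelinks.get(wpid, {}).get(lang)
        if local_page is not None and local_page in corpus_pages:
            return True
    return False

# Per-language Gold Availability is then the fraction of question x query-language
# instances (2,827 x 13 = 36,751 in total) for which gold_available(...) is True.
```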
I Detailed Latency

Table 8 reports detailed latency, measured as average generation time (sec/query; lower is better), for each generator across languages. DELTA remains consistently efficient because it produces a single fused query and avoids document-translation overhead; in several settings, it is even faster than English Translation, since the fused query retains local cues and disambiguation anchors that help retrieval focus earlier and reduce wasted computation on overly generic English-only signals.

J Cultural Prior Measurement

To model whether a query is intrinsically tied to a particular cultural or regional context (independent of corpus size or retrieval exposure), we construct a cultural prior using an LLM-based classifier. Starting from the English version of each query (MKQA-en), we assign exactly one cultural database language from a fixed set of 13 languages (en, ar, es, de, ja, ko, th, zh, fr, it, pt, ru, fi). We use GPT-4o mini (via OpenRouter) with constrained JSON output to enforce a single-label decision. We instruct the classifier to choose the local language of the primary place/culture the query is about (e.g., France → fr, Hong Kong/China → zh), and to select en only when the cultural context is inherently English-speaking (e.g., US/UK-specific) or when the query is genuinely global/multi-country and not tied to a single locale.

In addition to the cultural-language label, we record lightweight cultural metadata for analysis and filtering: (i) country_or_region (a single primary place/region), (ii) is_culture_specific (whether the question is judged to be culture/locale-specific), (iii) confidence (0–1), and (iv) a short rationale. These fields are used only to characterize the dataset and to support qualitative inspection; our core metric relies on the language label.
Finally, we define the cultural prior pcult(ℓ) as the empirical probability that a query's predicted cultural language equals ℓ, i.e., the normalized frequency of the single-label assignments over the evaluation set. This prior captures where evidence should exist in a fair localized setting, and it is incorporated as a structural factor alongside the other priors (e.g., exposure and gold availability) in our calibration analysis. We use this cultural prior and its metadata both for calibration and in DELTA.
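Given the per-query labels, the prior itself is just a normalized histogram. The sketch below assumes a list of classifier outputs with a `cultural_language` field; the function and variable names are illustrative.

```python
from collections import Counter

def cultural_prior(annotations, languages):
    """Empirical probability p_cult(l) that a query's predicted cultural language is l."""
    counts = Counter(a["cultural_language"] for a in annotations)
    total = sum(counts.get(l, 0) for l in languages) or 1   # guard against an empty set
    return {l: counts.get(l, 0) / total for l in languages}

langs = ["en", "ar", "es", "de", "ja", "ko", "th", "zh", "fr", "it", "pt", "ru", "fi"]
# p_cult = cultural_prior(classifier_outputs, langs)  # one entry per language, summing to 1
```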
K Repetition-based weighting for DELTA
Query construction with repetition. To control cue influence without changing the retriever or learning parameters, we apply a deterministic repetition policy while constructing the fused query string Qfused. We use the same notation as in Eq. 5: y ∈ {0, 1} indicates whether the query is culture-specific and c ∈ [0, 1] is the confidence score. We repeat the local block [LOCAL:Lq] rlocal times and the global pivot block [GLOB] rglob times, and concatenate all segments with a delimiter (" | ") to form a single retrieval query.
Length control. To keep retrieval budgets comparable across methods, we truncate the final Qfused to a fixed maximum length (e.g., 900 characters) after concatenation.
Deduplication. We apply conservative deduplication to avoid redundant anchors: (i) if the global and local titles are identical, we keep only a single [TITLE_BRIDGE]; (ii) if the alias sets match across languages, we keep only the global alias block; and (iii) when the query language is not English, we always include [LOCAL:Lq] at least once.
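Putting the repetition, length-control, and deduplication rules together, a minimal assembly sketch is shown below. The segment labels follow the paper, while the helper signature, the 900-character cap, and the string-level deduplication checks are simplified assumptions rather than the exact released code.

```python
def build_fused_query(seg, r_local, r_glob, query_lang, max_len=900):
    """Assemble Q_fused from labeled segments with repetition-based weighting.

    `seg` maps block names to text: GLOB, LOCAL, TITLE_BRIDGE, ALIASES_GLOB,
    ALIASES_LOCAL, LOCALE_HINT (missing/None blocks are skipped).
    """
    if query_lang != "en":
        r_local = max(r_local, 1)                    # rule (iii): keep at least one local block

    parts = [f"[GLOB] {seg['GLOB']}"] * r_glob
    parts += [f"[LOCAL:{query_lang}] {seg['LOCAL']}"] * r_local

    if seg.get("TITLE_BRIDGE"):                      # rule (i): a single title bridge
        parts.append(f"[TITLE_BRIDGE] {seg['TITLE_BRIDGE']}")
    if seg.get("ALIASES_GLOB"):
        parts.append(f"[ALIASES:GLOB] {seg['ALIASES_GLOB']}")
    if seg.get("ALIASES_LOCAL") and seg["ALIASES_LOCAL"] != seg.get("ALIASES_GLOB"):
        parts.append(f"[ALIASES:{query_lang}] {seg['ALIASES_LOCAL']}")  # rule (ii)
    if seg.get("LOCALE_HINT"):
        parts.append(f"[LOCALE_HINT] {seg['LOCALE_HINT']}")

    return " | ".join(parts)[:max_len]               # length control: truncate after concatenation
```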
L Thresholds and fixed hyperparameters.
We do not exhaustively tune (τlow, τhigh, τboost) because a full sweep is combinatorial and would couple these knobs to expensive end-to-end RAG runs.
Instead, we instantiate the confidence thresholds with three goals: (i) discretize the continuous confidence c into a small number of stable intervals, (ii) keep the query-length increase bounded, and (iii) reserve upweighting for only the most reliable culture-specific cases. Concretely, we set two cutoffs τlow < τhigh to map c into three repetition levels for the local block, rlocal ∈ {1, 2, 3}, where τlow marks the onset of reliable culture-specificity and τhigh indicates high-confidence cases that warrant the strongest local emphasis. In our implementation, we use τlow = 0.6 and τhigh = 0.85, which empirically balance coverage (triggering local upweighting for sufficiently confident cases) and conservativeness (avoiding frequent over-repetition under noisy cue predictions).
For auxiliary local boosting, we use a separate threshold τboost that applies only to the disambiguation anchors ([TITLE_BRIDGE] and [ALIASES]), not the full [LOCAL] query text. Specifically, for culture-specific queries (y = 1), we set b = I[c ≥ τboost] and, when b = 1, duplicate [TITLE_BRIDGE] and [ALIASES] once to strengthen culturally grounded anchoring and reduce entity ambiguity.
We set τboost = 0.7 so that anchor duplication is enabled for moderately-to-high-confidence culture-specific queries, providing extra entity anchoring and disambiguation without incurring the larger length increase of repeating the entire local block.
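A compact sketch of this mapping, using the thresholds above (τlow = 0.6, τhigh = 0.85, τboost = 0.7), is given below. It mirrors the policy as described in this appendix; the global-side rule is only partially specified in the text, so keeping rglob = 1 for confident culture-specific queries (as in the case study) is an assumption.

```python
TAU_LOW, TAU_HIGH, TAU_BOOST = 0.6, 0.85, 0.7

def repetition_policy(y, c):
    """Map the culture-specific flag y and confidence c to repetition counts.

    Returns (r_local, r_glob, boost), where `boost` duplicates only the
    [TITLE_BRIDGE] / [ALIASES] anchors, not the full [LOCAL] block.
    """
    if y == 1 and c >= TAU_HIGH:
        r_local = 3          # high-confidence culture-specific: strongest local emphasis
    elif y == 1 and c >= TAU_LOW:
        r_local = 2          # reliably culture-specific: moderate local emphasis
    else:
        r_local = 1          # default: a single local block
    r_glob = 1               # global pivot kept as a back-off (full rule follows Eq. 5)
    boost = int(y == 1 and c >= TAU_BOOST)
    return r_local, r_glob, boost

# Case-study query: y=1, c=0.93 -> r_local=3, r_glob=1, boost=1.
```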
For ridge calibration, we likewise keep a single regularization strength λ across all encoders; this choice is motivated by the small calibration design (|C| language pairs with a low-dimensional prior vector), where ridge mainly stabilizes the coefficients against correlated priors rather than serving as a performance-tuned knob.
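For completeness, the calibration step can be sketched as a single ridge regression of the raw preference scores on the stacked prior features, keeping what the priors cannot explain as the debiased score. The feature construction below is schematic, and adding the intercept back is one plausible way to keep the output on the original scale; it is not the exact released implementation.

```python
import numpy as np

def delp_calibrate(raw_scores, priors, lam=1.0):
    """Schematic DeLP-style calibration: ridge-regress raw preference scores on
    structural priors and keep the part the priors cannot explain.

    raw_scores: (n,) raw MLRS-style scores, one per (L_q, L_d) pair
    priors:     (n, k) structural prior features (exposure, gold availability,
                cultural prior) for the same pairs
    lam:        shared ridge strength lambda (kept fixed across encoders)
    """
    X = np.hstack([np.ones((priors.shape[0], 1)), priors])  # intercept + priors
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                                      # do not penalize the intercept
    beta = np.linalg.solve(X.T @ X + penalty, X.T @ raw_scores)
    residual = raw_scores - X @ beta
    return residual + beta[0]                                # residual shifted back to the baseline level
```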
Query Lang. Encoder Lq=Ld en ko zh fr ja it pt es
en bge-m3 56.03 – 33.02 (-23.01) 33.10 (-22.93) 36.61 (-19.42) 33.36 (-22.67) 35.89 (-20.14) 35.86 (-20.17) 36.62 (-19.41)
en p-mMiniLM 56.85 – 34.34 (-22.51) 34.61 (-22.24) 38.17 (-18.68) 34.52 (-22.33) 37.15 (-19.70) 36.73 (-20.12) 37.96 (-18.89)
en p-mMpNet 57.49 – 34.45 (-23.04) 34.27 (-23.22) 37.94 (-19.55) 34.67 (-22.82) 37.34 (-20.15) 37.02 (-20.47) 37.90 (-19.59)
ko bge-m3 41.15 43.49 (+2.34) – 34.42 (-6.73) 36.42 (-4.73) 37.18 (-3.97) 35.72 (-5.43) 35.30 (-5.85) 35.93 (-5.22)
ko p-mMiniLM 42.95 44.62 (+1.67) – 36.04 (-6.91) 37.08 (-5.87) 38.47 (-4.48) 36.07 (-6.88) 36.18 (-6.77) 36.45 (-6.50)
ko p-mMpNet 42.53 44.98 (+2.45) – 35.85 (-6.68) 37.20 (-5.33) 39.01 (-3.52) 36.21 (-6.32) 35.65 (-6.88) 36.34 (-6.19)
zh bge-m3 44.98 45.26 (+0.28) 34.52 (-10.46) – 36.34 (-8.64) 36.05 (-8.93) 35.86 (-9.12) 35.73 (-9.25) 36.45 (-8.53)
zh p-mMiniLM 46.18 45.39 (-0.79) 35.46 (-10.72) – 36.98 (-9.20) 36.77 (-9.41) 36.38 (-9.80) 36.05 (-10.13) 36.85 (-9.33)
zh p-mMpNet 46.27 45.41 (-0.86) 35.21 (-11.06) – 36.87 (-9.40) 36.71 (-9.56) 36.28 (-9.99) 35.94 (-10.33) 36.78 (-9.49)
fr bge-m3 43.18 47.23 (+4.05) 33.29 (-9.89) 33.58 (-9.60) – 34.07 (-9.11) 36.70 (-6.48) 36.30 (-6.88) 37.25 (-5.93)
fr p-mMiniLM 44.09 48.15 (+4.06) 34.54 (-9.55) 34.52 (-9.57) – 34.83 (-9.26) 37.65 (-6.44) 37.05 (-7.04) 38.03 (-6.06)
fr p-mMpNet 43.96 48.14 (+4.18) 34.25 (-9.71) 34.37 (-9.59) – 34.61 (-9.35) 37.59 (-6.37) 36.93 (-7.03) 38.01 (-5.95)
ja bge-m3 45.03 45.18 (+0.15) 35.45 (-9.58) 34.86 (-10.17) 36.71 (-8.32) – 36.11 (-8.92) 35.88 (-9.15) 36.56 (-8.47)
ja p-mMiniLM 45.80 45.54 (-0.26) 35.90 (-9.90) 35.57 (-10.23) 37.18 (-8.62) – 36.53 (-9.27) 36.25 (-9.55) 36.91 (-8.89)
ja p-mMpNet 45.67 45.39 (-0.28) 35.73 (-9.94) 35.30 (-10.37) 36.94 (-8.73) – 36.24 (-9.43) 35.98 (-9.69) 36.62 (-9.05)
it bge-m3 41.06 46.63 (+5.57) 33.30 (-7.76) 33.47 (-7.59) 37.92 (-3.14) 33.86 (-7.20) – 36.44 (-4.62) 37.68 (-3.38)
it p-mMiniLM 42.11 47.69 (+5.58) 34.57 (-7.54) 34.59 (-7.52) 39.07 (-3.04) 34.80 (-7.31) – 37.55 (-4.56) 38.83 (-3.28)
it p-mMpNet 41.98 47.59 (+5.61) 34.48 (-7.50) 34.68 (-7.30) 38.94 (-3.04) 34.67 (-7.31) – 37.27 (-4.71) 38.67 (-3.31)
pt bge-m3 39.19 46.64 (+7.45) 33.37 (-5.82) 33.46 (-5.73) 37.83 (-1.36) 34.02 (-5.17) 37.13 (-2.06) – 38.61 (-0.58)
pt p-mMiniLM 40.17 47.75 (+7.58) 34.67 (-5.50) 34.91 (-5.26) 39.02 (-1.15) 35.03 (-5.14) 38.25 (-1.92) – 39.68 (-0.49)
pt p-mMpNet 39.91 47.30 (+7.39) 34.68 (-5.23) 34.50 (-5.41) 38.70 (-1.21) 34.72 (-5.19) 38.01 (-1.90) – 39.35 (-0.56)
es bge-m3 40.76 46.93 (+6.17) 33.36 (-7.40) 33.42 (-7.34) 37.73 (-3.03) 33.87 (-6.89) 37.22 (-3.54) 36.88 (-3.88) –
es p-mMiniLM 41.81 47.90 (+6.09) 34.63 (-7.18) 34.52 (-7.29) 38.86 (-2.95) 34.76 (-7.05) 38.33 (-3.48) 37.84 (-3.97) –
es p-mMpNet 41.33 47.34 (+6.01) 34.39 (-6.94) 34.19 (-7.14) 38.34 (-2.99) 34.39 (-6.94) 37.73 (-3.60) 37.25 (-4.08) –
Table 10: Raw language preference measured by MLRS with different re-ranking encoders for various query–document language pairs. The Lq = Ld column shows scores for matching query and document languages, while the remaining columns represent cross-lingual scenarios. Parentheses indicate the change from the Lq = Ld column (positive for improvement, negative for decline). The highest score per row is in bold, and the second highest is underlined.

M Case Study

DELTA. Table 15 illustrates DELTA on a Korean query asking "when was the last time South Korea had the Olympics." DELTA forms the global pivot qglob and emits it as [GLOB], and places the original Korean surface form as [LOCAL:ko]. It then injects multilingual anchors: [TITLE_BRIDGE] contains paired Wikipedia-style titles, while [ALIASES:GLOB] and [ALIASES:ko] provide short alias cues in the global and local languages, respectively. Finally, [LOCALE_HINT] adds a brief region hint with minimal disambiguation to bias retrieval toward region-appropriate evidence.

Crucially, DELTA controls the balance between global and local signals purely through repetition. Because this query is labeled culture-specific (y = 1) with high confidence c = 0.93, the policy sets rlocal = 3 (since c ≥ 0.85) while keeping rglob = 1 (since c ≥ 0.6), yielding three copies of [LOCAL:ko] but only one copy of [GLOB]. Moreover, the auxiliary local-boost flag triggers at c ≥ 0.7, duplicating the local-side anchors once more, which explains why [TITLE_BRIDGE] and [ALIASES:ko] appear twice, whereas [ALIASES:GLOB] remains single-copy. Overall, this design realizes a global back-off ([GLOB]) with preference-aligned local emphasis ([LOCAL], [TITLE_BRIDGE], [ALIASES:ko]) within a single Qfused, without modifying the retriever or adding model parameters.
Success Case. Table 16 presents a representative top-1 retrieval example comparing DELTA with a simple English-translation query for the question "언제 마지막으로 대한민국이 올림픽을 했었나요. (When was the last time South Korea had the Olympics?)". Although both methods use the same retriever and multilingual datastore, the retrieved evidence differs markedly: DELTA's fused query contains explicit host-oriented cues (local surface form, title/alias anchors, and a locale hint), which increases lexical alignment with passages that describe Olympics held in Korea (e.g., 개최, 서울 1988, 평창 2018). In contrast, the English-translation query is more underspecified and can drift to participation-centric passages that match broad entities ("South Korea", "Olympics") but do not emphasize hosting-related facts. As a result, the DELTA top-1 passage provides the necessary host evidence for inferring the most recent domestically held Olympics, enabling the generator to produce the correct answer, while the translation-based pipeline is more likely to miss the hosting signal and return an incorrect year/event.

Failure Case. Table 17 shows a representative failure where the question "who is the president during the Korean War" is underspecified: "president" can plausibly refer to the U.S. president overseeing U.S. involvement (gold: Harry S. Truman and Dwight D. Eisenhower) or to the South Korean president during the same period (Syngman Rhee). In this example, DELTA's cultural/locale cues and title bridge steer the query toward South Korean leadership, effectively resolving the ambiguity in the wrong direction. Consequently, the top-1 retrieved passage focuses on Syngman Rhee and contains strong lexical overlap with the localized cues (e.g., "대한민국 대통령", "이승만", "한국 전쟁"), making the generator likely to output Syngman Rhee despite the dataset's gold reference targeting U.S. presidents. This failure highlights a limitation of repetition- and cue-based weighting: when the underlying intent is ambiguous, aggressively injecting locale-specific anchors can over-localize retrieval and suppress globally relevant evidence, suggesting the need for ambiguity-aware safeguards (e.g., intent disambiguation or controlled locale injection) for such queries.

N Prompts

As shown in Figure 4, we provide the exact prompt templates used throughout our pipeline. Prompt (A) specifies the RAG answer-generation instruction, with two variants depending on whether retrieved documents are provided, enforcing concise English outputs and (when available) conditioning answers on the supplied evidence. Prompt (B) defines our cultural-context annotation step, where an LLM assigns a single cultural database language from a fixed set under strict locality-oriented rules and returns lightweight metadata (region, culture-specificity, confidence, and a brief rationale) in a structured JSON format. Prompt (C) is used by DELTA to produce retrieval anchors—English and local Wikipedia-style titles, alias lists, and a short disambiguation hint—which are then assembled into a fused query; this prompt enforces a fixed JSON schema and language constraints to keep the generated anchors consistent and directly usable for retrieval. Finally, in Prompt (D), we provide the prompts used for English translation.
O Validation of DeLP

We validate DeLP by constructing a controlled experiment that directly tests whether the metric remains stable under artificial shifts in the gold-answer distribution—a scenario in which a robust metric should reflect consistent model preference regardless of structural changes in the corpus.

Experimental Setup. We fix the underlying model preference and artificially vary the gold-answer language ratio between Korean and English from 0 (all gold answers in Korean) to 1 (all gold answers in English) in incremental steps. Under this setup, a metric that disentangles structural priors from intrinsic preference should remain stable across the spectrum, whereas a metric that conflates the two should exhibit high sensitivity.

Results. Table 11 reports the range and standard deviation of each metric across all ratio configurations. We observe that MLRS consistently exhibits larger variability (range 10.35, std 3.27) as the gold-language ratio shifts, confirming that raw MLRS scores are heavily confounded by gold-distribution bias. In contrast, DeLP remains more stable across the same configurations (range 9.56, std 3.02). We further confirm this difference via paired permutation tests, obtaining p = 0.00016 for the range and p = 0.00010 for the standard deviation, both indicating statistical significance. In terms of effect size, DeLP reduces variability by 7.6% relative to MLRS, supporting its robustness against structural biases caused by gold-language fluctuations.
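The significance check can be reproduced with a standard label-swap permutation test over the per-ratio scores of the two metrics; the sketch below is one plausible instantiation (array names and the test statistic are illustrative), not the exact procedure used for the reported p-values.

```python
import numpy as np

def paired_permutation_test(a, b, stat=np.ptp, n_perm=100_000, seed=0):
    """Paired permutation (label-swap) test for a difference in a variability
    statistic (default: range via np.ptp) between two metrics measured on the
    same gold-ratio configurations.
    """
    rng = np.random.default_rng(seed)
    observed = stat(a) - stat(b)
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(a)) < 0.5            # swap metric labels per configuration
        a_p = np.where(swap, b, a)
        b_p = np.where(swap, a, b)
        if abs(stat(a_p) - stat(b_p)) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)

# e.g. p_range = paired_permutation_test(mlrs_per_ratio, delp_per_ratio, stat=np.ptp)
#      p_std   = paired_permutation_test(mlrs_per_ratio, delp_per_ratio, stat=np.std)
```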
Furthermore, as reported in Table 4, we validate DeLP by measuring the correlation between preference scores and structural priors before and after calibration. Raw MLRS scores exhibit near-perfect correlation with the exposure prior (r > 0.99), gold-availability prior (r > 0.91), and cultural prior (r > 0.91) across all encoders. After applying DeLP calibration, these correlations drop sharply, confirming that DeLP effectively decouples intrinsic model preference from the structural signals that contaminate standard benchmarks.

Metric Range Std
MLRS (raw) 10.35 3.27
DeLP (ours) 9.56 3.02
Table 11: Sensitivity of MLRS and DeLP to gold-language ratio shifts (Korean → English). Lower range and std indicate greater robustness against gold-language distribution shifts.

P RAG Utility and Sanity Check

We clarify that Gold Availability in Table 1 measures whether the KILT gold provenance page exists in the target language's Wikipedia via interlanguage mapping—it does not directly measure answerability. Consequently, a Gold Availability of ∼1% does not imply that 99% of questions are unanswerable or that the Base column primarily reflects parametric memory.

To empirically verify the actual utility of retrieval under these conditions, we conduct a sanity check by comparing mRAG performance with and without retrieval augmentation across multiple non-English query languages and models. As shown in Table 12, we observe consistent performance gains from RAG across all tested languages and models. Notably, the improvements are substantial in several cases—for example, Qwen3-235B gains 14.97 points on Arabic and 14.75 points on Korean when retrieval is enabled. These results demonstrate that even when the exact gold provenance page is absent in the local Wikipedia, retrieval still surfaces relevant supporting evidence distributed across multilingual Wikipedia corpora, yielding consistent and meaningful performance improvements over relying solely on parametric memory.
Model Lang. RAG No RAG
Gemini-2.5-Flash ar 43.60 38.42
Gemini-2.5-Flash es 58.76 57.13
Gemini-2.5-Flash ko 38.83 34.66
Gemini-2.5-Flash th 31.55 30.20
Gemini-2.5-Flash zh 30.87 30.65
DeepSeek-Chat-v3.1 ar 48.37 38.38
DeepSeek-Chat-v3.1 es 61.82 59.35
DeepSeek-Chat-v3.1 ko 41.56 32.81
DeepSeek-Chat-v3.1 th 35.42 33.28
DeepSeek-Chat-v3.1 zh 38.99 37.89
Qwen3-235B ar 45.58 30.61
Qwen3-235B es 64.21 61.92
Qwen3-235B ko 42.29 27.54
Qwen3-235B th 43.86 32.37
Qwen3-235B zh 38.67 33.85
Table 12: mRAG performance comparison with and without retrieval augmentation across non-English query languages. RAG consistently outperforms No RAG, demonstrating that multilingual Wikipedia provides relevant evidence beyond the exact gold provenance page.

Q Explanation of Low Gold Availability

We clarify that the >99% figure does not indicate that non-English questions are unanswerable. Gold Availability measures whether the KILT gold provenance page exists in the target language's Wikipedia via interlanguage mapping—it does not directly measure whether a question can be answered from that language's corpus.

Why does Gold Availability appear so low? As detailed in Appendix H, KILT provenance is constructed on English Wikipedia and mapped to other languages via interlanguage links. If the corresponding page is absent or the interlanguage mapping is incomplete, Gold Availability becomes zero even when partially relevant content exists in that language. Furthermore, we aggregate at the WPID level, treating multiple gold passages from the same page as a single item, which further reduces the reported ratio. For instance, 17,667 Korean gold passages correspond to a smaller number of distinct page IDs, making the percentage appear lower than the raw passage count would suggest.
The absolute scale is not negligible. We compute statistics over 36,751 expanded samples (2,827 questions × 13 query languages). Even at 1%, this corresponds to approximately 367 gold page IDs—an absolute scale that is by no means trivial. We also note that this analysis is conducted on the widely adopted MKQA/KILT benchmark, so the observed distribution reflects a standard setting rather than an artifact of an unusual corpus.

What does this result actually tell us? The key implication of this finding is not about QA difficulty. Rather, it demonstrates that standard benchmarks carry a systematic English-centric provenance prior, which causes gold evidence to concentrate overwhelmingly in English corpora. We argue that this structural skew—not any intrinsic linguistic superiority of English—drives the apparent advantage of English pivoting in mRAG systems, which is the central motivation for our DeLP calibration framework.

Query Lang. en ko ar zh fi fr de ja it pt ru es th
mkqa_en 44.12 1.60 1.19 1.30 2.54 10.03 6.90 1.44 8.32 7.67 4.85 9.90 0.13
mkqa_ko 23.07 17.35 1.99 4.81 2.04 7.90 5.96 10.36 6.16 5.06 6.85 6.85 1.58
mkqa_ar 24.93 3.30 15.29 4.07 2.10 8.30 6.53 6.64 6.80 5.71 7.78 7.65 0.89
mkqa_zh 24.70 3.17 1.76 23.22 2.01 7.47 6.17 6.27 6.08 5.24 6.37 7.27 0.27
mkqa_fi 30.32 2.27 1.63 2.33 7.92 11.11 8.20 3.78 8.77 7.18 6.51 9.42 0.58
mkqa_fr 29.90 1.48 1.25 1.55 2.50 21.44 6.96 2.06 9.40 7.96 4.77 10.55 0.19
mkqa_de 32.54 1.46 1.17 1.44 2.96 11.40 15.12 1.89 9.09 7.69 4.83 10.17 0.24
mkqa_ja 24.56 4.80 1.69 3.99 2.19 7.97 5.99 22.55 6.38 5.66 6.49 7.45 0.28
mkqa_it 28.72 1.59 1.30 1.58 2.52 12.30 6.97 1.95 17.46 8.47 5.26 11.70 0.17
mkqa_pt 28.82 1.71 1.40 1.63 2.60 11.92 6.74 2.23 10.24 13.78 5.38 13.33 0.24
mkqa_ru 27.02 2.53 1.92 1.98 2.45 8.83 6.44 2.71 7.36 6.24 23.83 8.43 0.26
mkqa_es 29.45 1.73 1.27 1.60 2.66 11.85 6.93 1.83 10.55 9.33 5.27 17.36 0.16
mkqa_th 32.39 3.10 2.10 2.96 2.53 10.00 7.40 4.43 8.06 7.43 6.80 9.70 3.10
mkqa_avg 29.27 3.55 2.61 4.04 2.85 10.81 7.41 5.24 8.82 7.49 7.31 9.98 0.62
Table 13: Language distribution of retrieved documents for each MKQA query-language split. Each row corresponds to the query language (dataset), and each column indicates the language of the retrieved passages; values are percentages (without the % sign). The final row (mkqa_avg) reports the average retrieved-language distribution across all query languages.

Dataset en ar es fi fr de ja it ko pt ru zh th
MKQA # examples 2827 2827 2827 2827 2827 2827 2827 2827 2827 2827 2827 2827 2827
MKQA median question length 43 38 48 46 49 47 26 48 22 45 42 16 41
MKQA median answer length 11 10 11 11 11 11 8 11 6 11 12 6 12
Wikipedia # passages (M) 25 3.3 10 1.5 13 14 27 8.2 1.6 4.7 8.6 11 3.7
Wikipedia median passage length 624 585 619 833 627 720 208 650 431 619 721 206 217
Table 14: Statistics of the datasets used in our experiments. MKQA: number of examples and median lengths of questions and answers (in Unicode characters). Wikipedia: number of passages (in millions) and their median lengths.

DELTA segment Instantiated content (case study) Rep.
[GLOB] when was the last time south korea had the olympics 1
[LOCAL:ko] 언제 마지막으로 대한민국이 올림픽을 했었나요 3
[TITLE_BRIDGE] South Korea at the Olympics / 대한민국의 올림픽 2
[ALIASES:ko] 대한민국 올림픽, 한국 올림픽, 한국의 올림픽 역사 2
[ALIASES:GLOB] Olympics in South Korea, South Korean Olympic Games, History of South Korea Olympics 1
[LOCALE_HINT] South Korea + Last Olympic Games in South Korea 1
Table 15: DELTA case study. A Korean culture-specific query (c = 0.93) is converted into a single fused query Qfused by concatenating labeled segments. Repetition counts follow Eq. 5, which upweights local cues while maintaining a global back-off.

Item Content
DELTA [GLOB] when was the last time south korea had the olympics | [LOCAL:ko] 언제 마지막으로 대한민국이 올림픽을 했었나요 | [TITLE_BRIDGE] South Korea at the Olympics / 대한민국의 올림픽 | [ALIASES:ko] 대한민국 올림픽, 한국 올림픽, 한국의 올림픽 역사 | [ALIASES:GLOB] Olympics in South Korea, South Korean Olympic Games, History of South Korea Olympics | [LOCALE_HINT] South Korea Last Olympic Games in South Korea
English Translation When was the last time south korea had the olympics
Top-1 passage (DELTA) 대한민국에서 열린 올림픽으로는 1988년 서울 하계 올림픽과 2018년 평창 동계 올림픽이 널리 알려져 있다. 서울 대회는 20세기 후반 대한민국의 국제 스포츠 행사 유치와 관련해 자주 언급되며, 주요 경기장은 서울 및 인근 지역에 분산되어 운영되었다. 평창 대회는 강원 지역을 중심으로 동계 종목이 진행되었고, 개/폐회식과 일부 경기장이 평창 및 주변 권역에 배치되었다. 두 대회 모두 대한민국 내에서 개최된 사례로 정리되며, 대회의 성격(하계/동계)과 개최 지역(서울/평창)이 함께 기술되는 경우가 많다.
Top-1 passage (English Translation) 대한민국(South Korea)은 근대 올림픽(Olympics)에 지속적으로 참가해 왔으며, 여러 종목에서 의미 있는 성과를 거두었다. 이 문서는 연도별 참가 개요, 선수단 규모, 주요 종목에서의 메달 기록과 같은 정보를 중심으로 구성된다. 예를 들어 양궁, 태권도, 쇼트트랙 등에서의 성과가 요약되고, 대회별 대표 선수나 주목할 만한 기록이 덧붙여지기도 한다. 또한 특정 대회에서의 종합 순위 변화나 메달 수 추이처럼 참가 및 성과를 설명하는 통계적 서술이 포함될 수 있다.
Gold answer 2018년 평창 동계 올림픽. (The 2018 PyeongChang Winter Olympics.)
English translation answer 1988년 서울 하계 올림픽. (The 1988 Seoul Summer Olympics.)
DELTA answer 대한민국에서 개최된 올림픽으로 1988년 서울 하계 올림픽과 2018년 평창 동계 올림픽이 언급되므로, 질문에서 묻는 "마지막으로" 개최된 올림픽은 2018년 평창 동계 올림픽이다. (The 2018 PyeongChang Winter Olympics.)
Table 16: Case study: DELTA vs. English translation (top-1 retrieval).
Item Content
DELTA (misled) [GLOB] who is the president during the korean war | [TITLE_BRIDGE] President of South Korea during the Korean War / 한국 전쟁 중 대한민국 대통령 | [ALIASES:ko] 이승만, 이승만 대통령, 대통령 이승만 | [ALIASES:GLOB] Syngman Rhee, Rhee Syngman, President Rhee, Rhee | [LOCALE_HINT] Korea (Korean Peninsula) President during Korean War era
Top-1 passage (DELTA) 이승만(Syngman Rhee)은 1948년부터 1960년까지 대한민국의 대통령으로 재임한 정치인이다. 대한민국 정부 수립 이후 초대 대통령으로 선출되었으며, 냉전 초기 한반도의 분단 체제 속에서 정부 운영을 주도했다. 재임 기간에는 한국 전쟁(1950–1953) 시기가 포함되며, 전쟁 전후의 정치적 갈등과 대외 관계가 함께 언급된다. 관련 문서들은 대체로 이승만의 생애, 대통령 재임 기간, 당대의 국내 정치 상황과 외교적 맥락을 중심으로 개괄한다.
Gold answer 해리 S. 트루먼; 드와이트 D. 아이젠하워 (Harry S. Truman; Dwight D. Eisenhower)
DELTA answer 이승만 (Syngman Rhee)
Table 17: Failure case (top-1 retrieval).
(A) RAG Answer Generation
Goal: Answer as concisely as possible in {lang}.
With Documents:
System: Extract relevant information from provided documents and answer briefly. Reply in {lang}.
User: Background: {docs} \n\nQuestion: {question}
Without Documents:
System: Answer briefly. Reply in {lang}.
User: Question: {question}
(B) Cultural Language Classifier
System (instruction):
You are annotating a FAIR multilingual retrieval setup.
Given an English query, decide the SINGLE most appropriate "cultural database language"
where the relevant evidence SHOULD exist in a fair, localized setting.
CRITICAL RULES:
- You MUST choose exactly ONE language from this fixed set:
{en, ar, es, de, ja, ko, th, zh, fr, it, pt, ru, fi}
- Prefer the LOCAL language of the primary place/culture the query is about.
- Do NOT choose ’en’ just because the query text is English.
- Choose ’en’ only if the query’s primary cultural context is inherently English-speaking
(e.g., US/UK-specific) OR the query is truly global / multi-country / not place-specific.
- If the query mentions a place that maps to one of the non-English languages,
pick that non-English language.
Examples:
- "when did hong kong go back to china" -> cultural_language="zh"
- "what is the capital of france" -> cultural_language="fr"
- "who was the first president of the united states" -> cultural_language="en"
- "compare gdp of france and germany" -> cultural_language="en" (multi-country/global)
Output (JSON only; no extra text):
{
"country_or_region": string (SINGLE primary place/region),
"cultural_language": string (exactly one from the set),
"is_culture_specific": boolean,
"confidence": number in [0,1],
"rationale": short string
}
Input:
User: Query: {query_en}
Figure 4: Prompt templates used in our pipelines: RAG generation and cultural-language classification.
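As an illustration of how Prompt (B) might be invoked, the snippet below sends the system instruction and an English query through an OpenAI-compatible client pointed at OpenRouter and parses the JSON-only reply. The client setup, the model slug, and the lack of retry/validation logic are simplifying assumptions rather than the exact pipeline code.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder key

def classify_cultural_language(query_en, system_prompt):
    """Run the Prompt (B) cultural-language classifier and return the parsed JSON label."""
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",                        # OpenRouter model slug (assumed)
        messages=[
            {"role": "system", "content": system_prompt},  # Prompt (B) instruction text
            {"role": "user", "content": f"Query: {query_en}"},
        ],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)     # expects the JSON-only output
```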
(C) DELTA Bundle Generator
Goal: Produce title/alias anchors and a short disambiguation hint for fused query construction.
Return Format:
- SINGLE-LINE JSON object only (no markdown, no explanation).
- Keys must be EXACTLY:
en_title, local_title, aliases_en, aliases_local, extra_disambig
Constraints:
- aliases_en / aliases_local: 0..K items each
- Titles: plausible Wikipedia page titles; use null if unsure
- extra_disambig: <= 8 words
- local_title & aliases_local MUST be in {query_lang}; English fields MUST be English
- Do not add new keys
Input (User JSON):
{
"q_en": "{q_en}",
"q_orig": "{q_orig}",
"query_lang": "{query_lang}",
"country_or_region":Chunk 49 · 1,384 chars
ems each
- Titles: plausible Wikipedia page titles; use null if unsure
- extra_disambig: <= 8 words
- local_title & aliases_local MUST be in {query_lang}; English fields MUST be English
- Do not add new keys
Input (User JSON):
{
"q_en": "{q_en}",
"q_orig": "{q_orig}",
"query_lang": "{query_lang}",
"country_or_region": "{country_or_region}",
"cultural_language": "{cultural_language}",
"is_culture_specific": {is_culture_specific},
"confidence": {confidence}
}
(D) English Translation
Goal: Translate the question from {lang_name} to fluent, natural English while preserving
the original meaning as much as possible.
Rules:
- Keep named entities as appropriate English forms.
- Do not add explanations or extra information.
- Return STRICT JSON with a single key "translation".
System:
You are a professional translator from {lang_name} to English.
You receive a question in the source language and must translate it into fluent,
natural English while preserving the original meaning as much as possible.
- Keep named entities as appropriate English forms.
- Do not add explanations or extra information.
Return STRICT JSON with a single key "translation".
User:
Question in {lang_name}:
{query}
Return only:
{"translation": "<the question translated into English>"}
Figure 5: Prompt templates used in our pipeline: DELTA bundle generation and English translation.