All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG
Summary
This paper investigates language bias in multilingual Retrieval-Augmented Generation (mRAG) systems, where rerankers systematically favor English and the query's native language, suppressing critical multilingual evidence. Using an estimated oracle evidence analysis, the authors quantify a significant performance gap between current rerankers and the theoretical upper bound. They find that existing systems fail to reliably identify relevant evidence across languages, with over 70% of top-5 documents often coming from English or the query language. To address this, they propose LAURA (Language-Agnostic Utility-driven Reranker Alignment), a framework that aligns reranking with downstream generation quality by using answer utility as supervision. Experiments show LAURA effectively mitigates language bias, improves reranker performance metrics like Precision@5 and NDCG@5, and enhances generation quality across diverse languages and models. The method outperforms baselines like self-training and mMARCO fine-tuning, demonstrating its effectiveness in promoting language fairness and improving mRAG system performance.
PDF viewer
Chunks(35)
Chunk 0 · 1,997 chars
All Languages Matter: Understanding and Mitigating
Language Bias in Multilingual RAG
Dan Wang1,2,∗, Guozhao Mo1,2,∗, Yafei Shi3, Cheng Zhang3, Bo Zheng3, Boxi Cao1,†,
Xuanang Chen1,†, Yaojie Lu1, Hongyu Lin1, Ben He1,2, Xianpei Han1,2, Le Sun1,2
1Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3MYbank, AntGroup
{wangdan2023,moguozhao2024,caoboxi,chenxuanang}@iscas.ac.cn
{shiyafei.syf,zc481262,guangyuan}@mybank.cn benhe@ucas.ac.cn
{luyaojie,hongyu,sunle,xianpei}@iscas.ac.cn
Abstract
Multilingual Retrieval-Augmented Generation
(mRAG) leverages cross-lingual evidence to
ground Large Language Models (LLMs) in
global knowledge. However, we show that
current mRAG systems suffer from a lan-
guage bias during reranking, systematically
favoring English and the query’s native lan-
guage. By introducing an estimated oracle ev-
idence analysis, we quantify a substantial per-
formance gap between existing rerankers and
the achievable upper bound. Further analy-
sis reveals a critical distributional mismatch:
while optimal predictions require evidence
scattered across multiple languages, current
systems systematically suppress such “answer-
critical” documents, thereby limiting down-
stream generation performance. To bridge this
gap, we propose Language-Agnostic Utility-
driven Reranker Alignment (LAURA), which
aligns multilingual evidence ranking with
downstream generative utility. Experiments
across diverse languages and generation mod-
els show that LAURA effectively mitigates lan-
guage bias and consistently improves mRAG
performance.
1 Introduction
Retrieval-Augmented Generation (RAG), which
incorporates external documentary evidence into
the generation process, has emerged as a core
technique for improving the factual consistency,
knowledge coverage, and controllability of large
language models (LLMs) (Lewis et al., 2020;
Ram et al., 2023). In these cases, multilingual
RAG (mRAG) hasChunk 1 · 1,990 chars
eration (RAG), which incorporates external documentary evidence into the generation process, has emerged as a core technique for improving the factual consistency, knowledge coverage, and controllability of large language models (LLMs) (Lewis et al., 2020; Ram et al., 2023). In these cases, multilingual RAG (mRAG) has become a critical technology to address the needs of a global user base for LLMs (Asai et al., 2021b; Li et al., 2024). In ∗These authors contributed equally. †Corresponding authors. Q: 《第五元素》中饰演蓝色女士的是谁? Q: Who plays the blue lady in The Fifth Element? Retrieve Rerank Evidence Doc Top-50 Retrieved Documents A: 米拉·乔沃维奇 A: Milla Jovovich Top-5 Reranked Documents Corpus Wasted Evidence Figure 1: Illustration of failures induced by reranker lan- guage bias. real-world settings, knowledge is not uniformly dis- tributed across languages. Instead, it exhibits in- herently cross-lingual and complementary struc- tures. Many region-specific facts, cultural con- texts, policy details, and technical knowledge are systematically documented only in particular lan- guages. Therefore, an effective multilingual RAG system should go beyond merely supporting mul- tilingual input and output. Its objective should be to select and integrate documents across various languages, thereby providing the generation model with an evidence set that maximizes informational value. Despite this ideal objective, prior studies have reported the existence of bias in current mRAG sys- tems (Park and Lee, 2025; Amiraz et al., 2025; Qi et al., 2025). Motivated by these observations, we present a systematic analysis of language bias in mRAG. Crucially, departing from previous stud- ies that primarily focus on characterizing the pres- ence of bias, we move beyond mere description to investigate the underlying causes of such biases and their significant impact on downstream predic- arXiv:2604.20199v1 [cs.CL] 22 Apr 2026 -- 1 of 15 -- tions. Specifically, based on MKQA dataset, we
Chunk 2 · 1,996 chars
previous stud- ies that primarily focus on characterizing the pres- ence of bias, we move beyond mere description to investigate the underlying causes of such biases and their significant impact on downstream predic- arXiv:2604.20199v1 [cs.CL] 22 Apr 2026 -- 1 of 15 -- tions. Specifically, based on MKQA dataset, we per- form a comprehensive evaluation across multiple rerankers and 13 languages. We first construct multilingual candidate document pools and apply standard multilingual retrieval and reranking pro- cedures, after which we analyze the language com- position of top-ranked documents. Our analysis reveals a consistent pattern: current mRAG sys- tems exhibit a pronounced language preference bias during the reranking stage, systematically fa- voring English and the original query language. For instance, when using the widely adopted BGE reranker, more than 70% of the top-5 retrieved doc- uments, averaged across 13 languages, originate from English and the query language alone. Such a pronounced bias motivated us to dive into its root causes and practical consequences. Conceptually, such language preference bias may stem from two distinct factors. First, it is pos- sible that more accurate or richer information is in- herently concentrated in certain languages for spe- cific queries. Second, the bias may arise from the limited multilingual capability of reranking mod- els, which struggle to accurately identify relevant evidence expressed in other languages. Disentan- gling these two factors is essential for diagnosing the core limitations of current mRAG systems. To this end, we propose a novel multilingual evidence estimation method that approximates the oracle distribution of evidence required to achieve opti- mal downstream predictions, independent of the reranker’s language preferences. By comparing estimated oracle evidence dis- tributions, we find that existing multilingual rerankers exhibit limited cross-lingual capability and often fail to provide
Chunk 3 · 1,999 chars
roximates the oracle distribution of evidence required to achieve opti- mal downstream predictions, independent of the reranker’s language preferences. By comparing estimated oracle evidence dis- tributions, we find that existing multilingual rerankers exhibit limited cross-lingual capability and often fail to provide sufficiently reliable evi- dence for LLM generation. On the MKQA bench- mark, standard rerankers underperform the ora- cle by nearly 20%, revealing a large performance gap. Further analysis indicates that this gap is not caused by language concentration: oracle ev- idence is distributed across multiple languages rather than dominated by any single one. Al- though high-quality evidence already exists in di- verse languages within the candidate set, it is systematically downweighted by language-biased rerankers, which substantially limits downstream performance. To address this misalignment, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), a training framework that mitigates language bias in multilingual reranking by aligning evidence selection with downstream generation quality. Rather than relying solely on semantic relevance signals, which often favor the query language or high-resource languages, LAURA derives supervision from multilingual documents that lead to better generation outcomes in practice. It then trains the reranker to prioritize answer-critical evidence regardless of language. This utility-driven alignment reduces systematic language preferences in evidence selection and yields consistent improvements in generation performance. Our major contributions are summarized as fol- lows: • We systematically investigate and quantify language bias in mRAG. We further intro- duce an estimated oracle evidence analysis framework, revealing that such bias substan- tially constrains the generation performance of mRAG systems. • We propose LAURA, an answer-utility- driven reranking framework that leverages generation outcomes as
Chunk 4 · 1,993 chars
investigate and quantify language bias in mRAG. We further intro- duce an estimated oracle evidence analysis framework, revealing that such bias substan- tially constrains the generation performance of mRAG systems. • We propose LAURA, an answer-utility- driven reranking framework that leverages generation outcomes as supervision signals. LAURA effectively mitigates language bias while consistently improving downstream task performance. 2 Reranking Bias in mRAG Systems While prior work has identified performance degra- dation in mRAG, it largely focuses on pipeline- level optimizations, such as translation-based strategies, without rigorously quantifying the sys- tem’s theoretical upper bound or identifying the underlying causes. A key unresolved question is whether current bottlenecks arise from insufficient relevant information in the retrieval pool or from the selection mechanism’s inability to identify ac- curate multilingual evidence. To bridge this gap, we present a systematic analysis comparing stan- dard retrieval pipelines against an oracle evidence estimating setting, aiming to reveal the misalign- ment between relevance-based selection and actual answer utility. 2.1 Language Distribution Analysis To quantify the limitations of current mRAG pipelines, we define two contrasting settings and a method for analyzing language distribution. -- 2 of 15 -- Who plays the blue lady in The Fifth Element? Milla Jovovich (×) Final Answer: Maïwenn EN docs ZH docs DE docs Rerank Multilingual Corpus (13 languages) Retrieve Top-50 Retrieved Documents Group by language Milla Jovovich (×) Rerank Maïwenn (√) Rerank (wrong info) (wrong info) (contain evidence doc) (Evidence Doc) Figure 2: Illustration of the oracle evidence estimation strategy, where candidate documnents are grouped by lan- guage and reranked independently to select the top-5 documents within each language group, and multilingual evidence documents are selected based on correctness of the generated
Chunk 5 · 1,994 chars
vidence doc) (Evidence Doc) Figure 2: Illustration of the oracle evidence estimation strategy, where candidate documnents are grouped by lan- guage and reranked independently to select the top-5 documents within each language group, and multilingual evidence documents are selected based on correctness of the generated answer. Vanilla Document Reranking. Following the standard multilingual RAG setup adopted in previ- ous work (Chirkova et al., 2024), for each query q ∈ Q, we retrieve documents from a uni- fied multilingual corpus that contains documents from all evaluation languages (13 languages in to- tal). The pipeline consists of two stages: first, a multilingual retriever BGE-M3 (Chen et al.) fetches the top-50 candidate passages across all lan- guages; second, a multilingual reranker, such as BGE-Reranker-V2-M3 (Chen et al.) and Qwen3- Reranker-0.6B (Zhang et al., 2025), selects the top- 5 most relevant passages. These passages are con- catenated to form the context for the generator. The quality of the generated answers is evaluated using the metrics defined below. Oracle Evidence Estimating. As show in Fig- ure 2, to estimate the performance upper bound given the retrieved candidates, we adopt a language-wise reranking strategy. For a query q ∈ Q, the pool of 50 retrieved candidates is partitioned by document language. Within each language group, we select the top-5 documents (or fewer if insufficient candidates exist) to gen- erate a language-specific answer. The final per- formance for query q is defined as the maximum score achieved across all language groups, serving as an estimated upper limit for language selection. We use BGE-M3 embeddings for retrieval and the BGE-Reranker-V2-M3 for reranking. Language Distribution Computation. To un- derstand the linguistic composition of selected ev- idence, we calculate distribution metrics for both settings: • Vanilla Distribution. For each query, we calculate the proportion of each language within the final
Chunk 6 · 1,998 chars
ddings for retrieval and the BGE-Reranker-V2-M3 for reranking. Language Distribution Computation. To un- derstand the linguistic composition of selected ev- idence, we calculate distribution metrics for both settings: • Vanilla Distribution. For each query, we calculate the proportion of each language within the final top-5 documents chosen by the reranker (e.g., three English and two Chi- nese documents yield a distribution of 0.6 and 0.4, respectively). These per-query distribu- tions are then averaged over all queries in a specific query language to obtain the overall context language distribution. • Oracle Distribution. For each query, we identify the document language(s) that pro- duce the best-performing answer. We as- sign an importance weight to languages based on answer performance: if a single language achieves the best score, it receives a weight of 1; if multiple languages tie for the best, the weight is uniformly distributed among them (e.g., a tie between English and Chinese re- sults in 0.5 for each). Similar to the vanilla setting, these per-query weights are averaged across all queries for each query language. 2.2 Experimental Setups Datasets. For the multilingual document corpus, we use English Wikipedia1 and Wikipedia in the corresponding user languages2. Following the pre- processing strategy of (Chirkova et al., 2024), we split each Wikipedia article into chunks of 100 words. For languages without explicit whites- pace segmentation, namely Chinese, Japanese, and Thai, we instead split articles into chunks of 100 Unicode characters. The article title is prepended to each chunk. For multilingual question answering, we use the MKQA (Longpre et al., 2021) dataset, following the setup of Chirkova et al. (2024). MKQA is a multilingual open-domain QA benchmark consist- ing of 10,000 questions from the Natural Questions 1https://huggingface.co/datasets/facebook/ kilt_wikipedia 2https://huggingface.co/datasets/wikimedia/ wikipedia -- 3 of 15 -- ar de
Chunk 7 · 1,993 chars
Longpre et al., 2021) dataset, following the setup of Chirkova et al. (2024). MKQA is a multilingual open-domain QA benchmark consist- ing of 10,000 questions from the Natural Questions 1https://huggingface.co/datasets/facebook/ kilt_wikipedia 2https://huggingface.co/datasets/wikimedia/ wikipedia -- 3 of 15 -- ar de en es fi fr it ja ko pt ru th zh Query Language ar de en es fi fr it ja ko pt ru th zh Doc Language (a) BGE-Reranker-V2-M3 ar de en es fi fr it ja ko pt ru th zh Query Language (b) Qwen3-Reanker-0.6B ar de en es fi fr it ja ko pt ru th zh Query Language (c) Oracle 0.0 0.1 0.2 0.3 0.4 0.5 Proportion Figure 3: Heatmaps showing the proportion of selected document languages (y-axis) for each query language (x- axis). (a): Distribution from the BGE-Reranker-V2-M3 reranker. (b): Distribution from the Qwen3-Reranker-0.6B reranker. (c): The oracle evidence distribution derived from our estimation strategy. Results for other reranking models are detailed in Appendix E. (NQ) dataset (Kwiatkowski et al., 2019), translated into 25 languages. In our experiments, we focus on a subset of languages used for evaluation. Specifi- cally, we select 2.7K samples that overlap between MKQA and the KILT NQ dataset3, enabling access to corresponding document-level relevance infor- mation for the selected test languages. Models. For retrieval, we use BGE-M3 (Chen et al.), a strong and publicly available multilin- gual embedding model capable of encoding all lan- guages considered in our experiments. For reranking, we adopt BGE-Reranker-V2-M3 and Qwen3-Reranker-0.6B (Zhang et al., 2025) as representatives of mainstream encoder-only rerankers and LLM-based rerankers, respectively. For answer generation, we evaluate two multi- lingual large language models, including Qwen2.5- 7B-Instruct (Qwen et al., 2025), and Llama-3.1- 8B-Instruct (Grattafiori et al., 2024). Evaluation Metric. Following Chirkova et al. (2024), we evaluate model outputs using the character-level
Chunk 8 · 1,999 chars
based rerankers, respectively. For answer generation, we evaluate two multi- lingual large language models, including Qwen2.5- 7B-Instruct (Qwen et al., 2025), and Llama-3.1- 8B-Instruct (Grattafiori et al., 2024). Evaluation Metric. Following Chirkova et al. (2024), we evaluate model outputs using the character-level 3-gram recall metric. The details are shown in Appendix A. 2.3 Analysis Results 2.3.1 Multilingual Rerankers Exhibit Systemic Language Bias Conclusion 1. Current multilingual RAG sys- tems exhibit a pronounced language preference bias during the reranking stage, systematically fa- voring English and the original query language. 3https://huggingface.co/datasets/facebook/ kilt_tasks To understand the linguistic preferences of cur- rent mRAG systems, we analyze the language dis- tribution of the documents selected for genera- tion. As illustrated in Figure 3, the heatmaps display two dominant patterns: a strong diagonal alignment reflecting a bias toward the query lan- guage, and a pronounced horizontal alignment in- dicating a systemic preference for English. Tak- ing BGE-Reranker as a example, around 60% of candidate documents are concentrated in English and the query language. This distribution con- firms that current rerankers heavily prioritize doc- uments based on surface-level language matching or dominant language priors (predominantly En- glish), rather than assessing semantic relevance eq- uitably across all candidate languages. 2.3.2 Reranking Bias as a Primary Performance Bottleneck Conclusion 2. These reranking biases consti- tute a primary performance bottleneck in multilin- gual RAG by causing the model to overlook gen- uinely relevant evidence within the candidate pool, thereby hindering the retrieval of optimal informa- tion. To determine whether this pronounced language bias stems from an intrinsic concentration of high- quality information in dominant languages or a fun- damental lack of multilingual capability in current rerankers, we
Chunk 9 · 1,993 chars
relevant evidence within the candidate pool, thereby hindering the retrieval of optimal informa- tion. To determine whether this pronounced language bias stems from an intrinsic concentration of high- quality information in dominant languages or a fun- damental lack of multilingual capability in current rerankers, we conduct a decoupled analysis by con- trasting the standard pipeline with an Oracle Evi- dence Estimating setting. This comparison allows us to isolate the model’s selection bias from the quality of the candidate pool, thereby identifying the core defect in the current evidence selection -- 4 of 15 -- mechanism. First, to quantify the extent to which reranking limits system performance, we evaluated the gen- eration quality under both settings and computed the correlation between reranking scores and an- swer utility. As shown in Table 1, simply select- ing the correct documents from the existing re- trieval pool yields substantial improvements rang- ing from +12.9 to +20 points. This result con- firms that the retrieval stage successfully recalls the necessary information, but the reranker fails to surface it. Furthermore, quantitative analysis reveals a weak correlation between reranker rele- vance scores and downstream answer quality, with Pearson coefficients consistently below 0.2 across all models (Table 2). These indicate that cur- rent multilingual rerankers fail to provide suffi- ciently accurate and effective evidence, creating a bottleneck that strictly limits the generation potential of LLMs. Next, to understand why valid evidence is over- looked, we analyzed the language distribution of estimated oracle evidence and conducted a case study to observe model behavior. By analyzing the language distribution under the Oracle Evi- dence Estimating setting (Figure 3, c), we find that true answer-critical evidence is broadly distributed across diverse, non-query languages, rather than being concentrated in the query language. How- ever, the
Chunk 10 · 1,988 chars
ducted a case study to observe model behavior. By analyzing the language distribution under the Oracle Evi- dence Estimating setting (Figure 3, c), we find that true answer-critical evidence is broadly distributed across diverse, non-query languages, rather than being concentrated in the query language. How- ever, the systemic bias identified in the previous section filters these optimal documents out. This phenomenon is exemplified in the Case Study (Ta- ble 11): for the query ”Who plays the blue lady in The Fifth Element?”, the reranker prioritizes non- informative query-language documents (ranks 1- 5) leading to hallucination, while suppressing the decisive multilingual evidence to rank 10. Thus, while genuinely relevant evidence is already present within candidate documents across di- verse languages, it is consistently marginalized by the systemic language preferences of cur- rent rerankers, thereby significantly constrain- ing downstream performance. 3 Language-Agnostic Utility-driven Reranker Alignment In this section, we aim to mitigate language bias in multilingual rerankers. Such bias leads models to disproportionately favor documents in English or the query language, even when higher quality evidence exists in other languages. We hypothe- Lang Llama-8B-Instruct Qwen2.5-7B-Instruct BGE Qwen3 Oracle BGE Qwen3 Oracle ar 32.7 28.9 53.6 33.8 31.4 51.4 de 62.8 60.9 76.5 59.6 58.0 73.7 en 70.1 67.8 79.3 65.4 63.3 76.2 es 63.0 62.7 76.8 62.3 61.3 75.6 fi 58.1 54.2 73.4 55.5 52.8 71.5 fr 64.4 63.7 76.7 56.6 54.4 71.6 it 63.9 62.2 77.0 60.2 57.9 73.8 ja 29.2 28.2 47.9 28.0 27.2 44.9 ko 25.5 23.3 41.0 26.5 24.7 38.9 pt 66.4 66.3 78.4 60.5 59.3 73.8 ru 51.9 47.4 68.0 45.7 42.3 63.1 th 26.4 24.8 44.1 23.5 22.3 39.0 zh 21.7 21.9 33.8 29.0 28.7 42.8 AVG 48.9 47.1 63.6 46.7 44.9 61.3 Table 1: Performance comparison (Recall@3-gram) of vanilla reranking and oracle evidence estimating. ‘BGE’ and ‘Qwen3’ refer to BGE-Reranker-V2-M3 and Qwen3-Reranker-0.6B models,
Chunk 11 · 1,978 chars
.3 73.8 ru 51.9 47.4 68.0 45.7 42.3 63.1 th 26.4 24.8 44.1 23.5 22.3 39.0 zh 21.7 21.9 33.8 29.0 28.7 42.8 AVG 48.9 47.1 63.6 46.7 44.9 61.3 Table 1: Performance comparison (Recall@3-gram) of vanilla reranking and oracle evidence estimating. ‘BGE’ and ‘Qwen3’ refer to BGE-Reranker-V2-M3 and Qwen3-Reranker-0.6B models, respectively. ‘Oracle’ denotes the performance achieved under the estimated oracle evidence. Reranker Model Pearson p-value BGE-Reranker Llama3-8B-Instruct 0.188 3.8 × 10−290 Qwen2.5-7B-Instruct 0.198 1.0 × 10−320 Qwen-Reranker Llama3-8B-Instruct 0.129 1.2 × 10−135 Qwen2.5-7B-Instruct 0.127 2.3 × 10−135 Table 2: Correlation between relevance scores (mean top-5) and downstream answer performance (Recall@3- gram) under different rerankers and generators. size that skewed training data, in which high qual- ity query and document annotations are scarce for low resource languages, is a key factor driving this disparity. To address this issue, we propose a language ag- nostic utility driven reranker alignment framework (LAURA). This framework reduces language bias by grounding reranker supervision in answer utility instead of relying on language dependent relevance signals. Instead of defining positives based on lexi- cal overlap or language matching, LAURA selects documents according to their contribution to down- stream answer quality, thereby reducing reliance on language specific surface features. Specifi- cally, LAURA uses a two stage data construction pipeline (Figure 4) to generate language agnostic supervision signals, followed by listwise reranker fine tuning. This design promotes balanced cross lingual supervision and aligns reranker preferences with answer correctness, thereby mitigating the over preference for high resource languages. 3.1 Answer Utility-driven Data Construction Although many RAG QA datasets are publicly available, most only provide annotations for the -- 5 of 15 -- correctness of the final answer, without
Chunk 12 · 1,994 chars
n and aligns reranker preferences with answer correctness, thereby mitigating the over preference for high resource languages. 3.1 Answer Utility-driven Data Construction Although many RAG QA datasets are publicly available, most only provide annotations for the -- 5 of 15 -- correctness of the final answer, without explicit query-document relevance labels. This absence leads to language bias in reranker training. We aim to automatically generate such annotations while maintaining balanced multilingual coverage. Given a query q, we retrieve a candidate docu- ment set D from a multilingual corpus. Our objec- tive is to select a positive subset Dpos ⊂ D consist- ing of documents that genuinely support answering the query, free from language-specific bias. We use the average answer quality produced by mul- tiple generators conditioned on a document as a proxy for its answer utility. Stage 1: Language-Debiased Subset Selection. Directly estimating answer utility on top-ranked retrieved documents can amplify the inherent lan- guage bias of multilingual rerankers, which of- ten favor documents in high-resource or query- matched languages. To mitigate this effect, we pro- pose a candidate debiasing stage that filters the can- didate set before utility estimation while preserv- ing overall utility. Given a retrieved document set D, we partition documents into disjoint subsets according to their language and apply the same reranker to rank doc- uments within each subset independently. From each subset, we retain up to five top-ranked doc- uments as utility candidates. This procedure does not assume language-specific relevance. Instead, it enforces equal exposure across linguistic subsets, preventing the candidate pool from being domi- nated by documents favored due to language priors rather than informational content. The retained documents are then evaluated by multiple generators to estimate their average an- swer quality. Documents achieving the highest generation
Chunk 13 · 1,991 chars
exposure across linguistic subsets, preventing the candidate pool from being domi- nated by documents favored due to language priors rather than informational content. The retained documents are then evaluated by multiple generators to estimate their average an- swer quality. Documents achieving the highest generation utility are selected for subsequent super- vision construction. In cases where multiple sub- sets yield identical maximal utility (e.g., all gener- ators produce correct answers), all corresponding documents are preserved. The resulting candidate set is denoted as Dbalanced. By decoupling candidate selection from global reranker scores and restricting comparisons to within each subset, this stage reduces language- induced ranking bias while retaining documents that are useful for downstream answer generation. Stage 2: Document-Level Utility Estimation. While Stage 1 ensures cross-lingual coverage, doc- uments in Dbalanced may still vary in their actual usefulness. We therefore perform fine-grained document-level utility estimation by evaluating each document independently via generation. To avoid introducing an implicit language bias through relative ranking alone, we apply an ab- solute utility threshold θ and retain only docu- ments whose average generation performance ex- ceeds this threshold. The final positive set Dpos thus consists of documents that demonstrably con- tribute to answer correctness, independent of lan- guage. Overall, this two-stage procedure yields high- quality, language-debiased training data that grounds reranker supervision in answer utility rather than language preference. 3.2 Listwise Reranker Fine-Tuning Using the constructed training data, we fine-tune the reranker with a listwise learning objective. Given a query q and a candidate set D, docu- ments not selected into Dpos are treated as nega- tives, forming Dneg. During training, we construct training instances consisting of one positive document and k neg- ative
Chunk 14 · 1,996 chars
ng
Using the constructed training data, we fine-tune
the reranker with a listwise learning objective.
Given a query q and a candidate set D, docu-
ments not selected into Dpos are treated as nega-
tives, forming Dneg.
During training, we construct training instances
consisting of one positive document and k neg-
ative documents, i.e., (q, dpos, {d(i)
neg}k
i=1), where
dpos ∈ Dpos and d(i)
neg ∈ Dneg. The reranker pro-
duces a relevance score s(q, d) for each document
d ∈ Dq. Training encourages the model to assign
the highest score to the positive document within
the list. We adopt a softmax cross-entropy loss:
L = −s(q, dpos) + log ∑
d∈Dq
exp(s(q, d)) (1)
For encoder-only rerankers, s(q, d) is produced
directly as a scalar logit. For LLM-based rerankers,
the score is derived from the relative logits of pre-
defined positive_token and negative_token,
which represent the model’s preference over rele-
vance labels.
3.3 Experimental Setups
LAURA Dataset. We use data from the MKQA
benchmark, selecting only samples that are disjoint
from the evaluation test set to avoid data leakage.
In stage 1, for each question, we retrieve the
top-100 candidate documents from a multilingual
Wikipedia corpus using the BGE-M3 retriever.
Within each language group, we apply the multilin-
gual BGE reranker to select the top-5 documents,
yielding a language-debiased candidate set.
During evaluation, we prompt multiple genera-
tion models to answer the question conditioned on
-- 6 of 15 --
Rerank
Candidate
Documents
Generate Eval Group
1.00
1.00
0.85
Stage 1: Language-Balanced Subset Selection
Selected
Subset
1.00
EN.1
0.75
EN.2
0.80
ZH.1
Stage 2: Document-Level Utility Estimation
MAX
Postive
Documents
Split Generate Eval Threshold
Figure 4: Two-stage data construction pipeline in the LAURA framework.
each document independently and measure answer
quality using character-level 3-gram recall. To re-
duce model-specific bias, we compute each docu-
ment’s utility score as the averageChunk 15 · 1,997 chars
Estimation MAX Postive Documents Split Generate Eval Threshold Figure 4: Two-stage data construction pipeline in the LAURA framework. each document independently and measure answer quality using character-level 3-gram recall. To re- duce model-specific bias, we compute each docu- ment’s utility score as the average generation per- formance across a diverse set of four generation models, including Qwen2.5-7B-Instruct, Qwen2.5- 14B-Instruct, Llama3-8B-Instruct, and DeepSeek- R1-Distill-Qwen-7B (DeepSeek-AI et al., 2025). The threshold θ is set to 0.8, ensuring that the doc- uments retain high utility. Finally, we construct a total of 18,360 query– positive documents pairs. Among them, 1,000 are randomly sampled as the dev set. Detailed statis- tics of the constructed fine-tuning dataset are re- ported in Appendix B. Evaluation Metric. To evaluate the effective- ness of LAURA, we adopt Precision@k and NDCG@k to assess the rerank performance on positive documents in the dev set. In addition, we use the PEER (Yang et al., 2024) metric to measure whether the reranker exhibits language- specific bias. PEER is based on the assump- tion that documents with equal relevance should have similar average rankings across different lan- guages. Higher PEER scores indicate weaker lan- guage preference. The detailed definitions of the evaluation metrics are provided in Appendix A. Training Details. We fine-tune BGE-Reranker- V2-M3 using the implementation provided by FlagEmbedding4, and Qwen3-Reranker-0.6B us- ing SWIFT (Zhao et al., 2025). For each query, BGE is trained with 1 negative document, whereas Qwen uses 7 negative documents, reflecting the stronger capacity of the LLM-based reranker to handle larger candidate lists. Both models are optimized with AdamW (Loshchilov and Hutter, 2019), using a learning rate of 6 × 10−6, and are trained for five epochs. 4https://github.com/FlagOpen/FlagEmbedding Setting Precision NDCG PEER @5 @10 @5 @10 BGE-Reranker 0.3400 0.2712 0.4666
Chunk 16 · 1,993 chars
ity of the LLM-based reranker to handle larger candidate lists. Both models are optimized with AdamW (Loshchilov and Hutter, 2019), using a learning rate of 6 × 10−6, and are trained for five epochs. 4https://github.com/FlagOpen/FlagEmbedding Setting Precision NDCG PEER @5 @10 @5 @10 BGE-Reranker 0.3400 0.2712 0.4666 0.4904 0.5941 + LAURA 0.3830 0.3149 0.5531 0.5925 0.6627 Qwen-Reranker 0.2702 0.2206 0.3695 0.3921 0.6606 + LAURA 0.3546 0.2847 0.5214 0.5496 0.6720 Table 3: Reranking results of BGE-Reranker-V2-M3 (BGE-Reranker) and Qwen3-Reranker-0.6B (Qwen- Reranker) on the dev set before and after LAURA train- ing. PEER measures language bias in the reranker, with higher values indicating weaker language preference. 3.4 Results of LAURA LAURA improves multilingual rerankers’ abil- ity to identify relevant documents. To evaluate whether rerankers can better identify positive can- didates under the LAURA, we assess Precision, NDCG on the dev set before and after training. These metrics directly reflect the rerankers’ ability to rank relevant candidates higher. As shown in Ta- ble 3, both BGE and Qwen rerankers exhibit con- sistent improvements after being trained within the LAURA. In particular, Precision@5 increases by approximately 6 points, while NDCG@5 improves by around 13 points across both model families, in- dicating a stronger capability to place positive can- didates at higher ranks. LAURA improves multilingual rerankers’ lan- guage fairness. We observe that LAURA leads to consistent improvements in language fairness. Beyond the quantitative gains on the dev set mea- sured by the PEER metric, we analyze the language distribution of reranker outputs on the MKQA test set after LAURA training. As shown in Table 4, the JS divergence and KL divergence between the post-training distribution and the estimated ora- cle evidence distribution are substantially reduced, demonstrating that the learned distribution moves closer to the desired target distribution.
Chunk 17 · 1,997 chars
ker outputs on the MKQA test set after LAURA training. As shown in Table 4, the JS divergence and KL divergence between the post-training distribution and the estimated ora- cle evidence distribution are substantially reduced, demonstrating that the learned distribution moves closer to the desired target distribution. More- over, we observe a consistent decrease in the pro- portion of documents written in English and the -- 7 of 15 -- Setting JS KL Entropy BGE-Reranker 0.203 0.186 2.03 + LAURA 0.090 0.041 2.27 Qwen-Reranker 0.141 0.122 2.13 + LAURA 0.129 0.094 2.14 Table 4: Language distributional metrics before and af- ter LAURA training on the MKQA test set. JS and KL denote the average distances between vanilla distribu- tion and the estimated oracle distribution. Entropy indi- cates the average entropy of the vanilla distribution of each query language. query language, suggesting that LAURA mitigates the over-preference for dominant languages and en- courages a more balanced multilingual ranking be- havior. This indicates that LAURA effectively re- duces the original language skew of rerankers. In terms of PEER, LAURA yields an about +7 points for the BGE reranker and about +0.5 points for the Qwen reranker, suggesting that the method system- atically mitigates language biases and promotes more equitable performance across languages. LAURA improves downstream generation per- formance and ranking utility. LAURA is de- signed to enhance reranking quality and improve the alignment between reranking scores and down- stream generation. To investigate to what ex- tent the improved reranking capability learned un- der the LAURA transfers to downstream genera- tion performance, we conduct experiments on the MKQA test set using the setup in Section 2.1. The results are reported in Table 5. On the 3- gram recall metric, incorporating LAURA leads to an average improvement of 1.95 points for the Qwen reranker and 1.0 points for the BGE reranker. These results indicate
Chunk 18 · 1,997 chars
m genera- tion performance, we conduct experiments on the MKQA test set using the setup in Section 2.1. The results are reported in Table 5. On the 3- gram recall metric, incorporating LAURA leads to an average improvement of 1.95 points for the Qwen reranker and 1.0 points for the BGE reranker. These results indicate that improving the rerankers’ ability to select higher-quality candidates can trans- late into better downstream generate quality. To quantitatively assess the change in the rela- tionship between ranking quality and generation performance, we compute the Pearson correlation between the average reranking score of the top-5 documents and the corresponding 3-gram recall scores. After training with LAURA, the Pearson correlation increases by approximately 25% for the BGE reranker and by about 108% for the Qwen reranker. This demonstrates that LAURA substan- tially strengthens the correlation between rerank- ing scores and generation performance, thereby im- proving the practical utility of the reranking scores for downstream generation. Setting Llama Qwen 3-gram Pearson 3-gram Pearson BGE-Reranker 48.9 0.198 46.7 0.188 + LAURA 49.9 0.236 47.7 0.247 Qwen-Reranker 47.1 0.129 44.9 0.127 + LAURA 49.2 0.269 46.7 0.264 Table 5: Generation performance and Pearson corre- lation of rerankers before and after LAURA training on the MKQA test set. Pearson correlations are com- puted between the average reranker scores of the top-5 reranked documents and character 3-gram recall perfor- mance. All Pearson correlations are statistically signif- icant with p-values < 0.001. Setting Llama Qwen 3-gram Pearson 3-gram Pearson BGE-Reranker 48.9 0.198 46.7 0.188 Self-Training 48.9 0.188 46.7 0.202 mMARCO 48.7 0.132 46.3 0.137 LAURA 49.9 0.236 47.7 0.247 Table 6: Performance comparison of LAURA against al- ternative fine-tuning strategies, including Self-Training (naive supervision using top-5 retrieved candidates) and mMARCO fine-tuning (general-purpose multilingual ranking
Chunk 19 · 1,998 chars
8 Self-Training 48.9 0.188 46.7 0.202 mMARCO 48.7 0.132 46.3 0.137 LAURA 49.9 0.236 47.7 0.247 Table 6: Performance comparison of LAURA against al- ternative fine-tuning strategies, including Self-Training (naive supervision using top-5 retrieved candidates) and mMARCO fine-tuning (general-purpose multilingual ranking data). 3.5 Comparison Against Fine-tuning Baselines We provide additional analysis on two alternative fine-tuning strategies to further validate the effec- tiveness of LAURA’s data construction pipeline. Self-Training Baseline. The first baseline fine- tunes the reranker solely on its own top-ranked out- puts as pseudo-positive supervision, directly treat- ing the top-5 re-ranked documents as relevant and all remaining candidates as non-relevant, without any additional filtering or refinement. This setting corresponds to the starting point of LAURA’s data construction pipeline and serves as an empty con- trol to verify whether LAURA’s additional filter- ing and refinement steps contribute beyond naive supervision. Under this paradigm, the model’s ex- isting ranking preferences may be progressively re- inforced, as no mechanism is introduced to correct noisy or biased pseudo labels. Fine-tuning on mMARCO. The second base- line fine-tunes the reranker using mMARCO (Boni- facio et al., 2022), a widely-used multilingual dataset, to examine whether general-purpose train- ing data can address the specific distribution imbal- ance in mRAG. We randomly sample 20k queries -- 8 of 15 -- from mMARCO for training, comparable to the 17,360 queries used by LAURA, ensuring a fair comparison in terms of training scale. For both baselines, we use the same hyperparam- eters as in our main experiments. In addition, we ensure that LAURA and the Self-Training baseline are trained on the exact same set of queries, iso- lating the effect of the data construction strategy rather than the training queries themselves. As shown in Table 6, LAURA consistently outperforms both
Chunk 20 · 1,999 chars
ame hyperparam- eters as in our main experiments. In addition, we ensure that LAURA and the Self-Training baseline are trained on the exact same set of queries, iso- lating the effect of the data construction strategy rather than the training queries themselves. As shown in Table 6, LAURA consistently outperforms both baselines across all settings. The Self-Training baseline fails to surpass BGE- Reranker on certain metrics, indicating that naive pseudo-label supervision can reinforce existing bi- ases rather than correct them. The mMARCO base- line also leads to a slight performance drop com- pared to BGE-Reranker, suggesting that general relevance signals cannot resolve the specific distri- bution imbalance in mRAG. These results collec- tively demonstrate that LAURA’s filtering and re- finement steps are essential for effective reranker adaptation in mRAG settings. 4 Related Work mRAG is pivotal for bridging global informa- tion gaps and ensuring equitable knowledge ac- cess across linguistic barriers. To advance this ca- pability, the community has established a robust foundation spanning diverse benchmarks (Asai et al., 2021a,c; Liu et al., 2025) and retrieval ar- chitectures (Gao et al., 2022; Zhang et al., 2023; Chirkova et al., 2024). Previous studies have conducted preliminary analyses of language preference phenomena in mRAG systems. For instance, Amiraz et al. (2025) investigates multilingual retrieval biases over Ara- bic–English corpora. Park and Lee (2025) eval- uate language bias in multilingual RAG by mea- suring retrieval ranking shifts. In comparison, our work moves beyond merely characterizing these preferences to systematically quantify the substan- tial performance gap resulting from this linguistic misalignment. To mitigate these biases, prior research has largely relied on translation-centric strategies, such as mapping queries or documents to a shared pivot language (Moon et al., 2025; Amiraz et al., 2025; Park and Lee, 2025). However, these
Chunk 21 · 1,998 chars
tify the substan- tial performance gap resulting from this linguistic misalignment. To mitigate these biases, prior research has largely relied on translation-centric strategies, such as mapping queries or documents to a shared pivot language (Moon et al., 2025; Amiraz et al., 2025; Park and Lee, 2025). However, these pipeline- level heuristics depend heavily on the capability of translation models and do not fundamentally cor- rect the ranking objective. In comparison, we pro- pose to align rerankers directly with generation util- ity, training the model to prioritize answer-critical evidence regardless of the source language. 5 Conclusion This work analyzes language bias in multilingual retrieval-augmented generation (mRAG) systems, showing that conventional rerankers favor English and the query’s original language, suppressing crit- ical multilingual evidence. Using estimated oracle evidence, we reveal the resulting performance gap and cross-lingual distribution of answer-relevant documents. To address this, we propose LAURA, a language-agnostic utility-driven reranker that aligns evidence ranking with downstream genera- tion, mitigating bias and improving performance across languages and models. Limitations This work focuses on analyzing the alignment be- tween reranker relevance and downstream answer quality in multilingual RAG systems. Accordingly, our study is limited to the reranking stage and does not consider modifications to the retriever or the generator, whose interactions with reranking re- main an important direction for future work. In addition, our evaluation relies on automatic, task-specific metrics that may not fully capture all aspects of generation utility, such as factual completeness or cross-lingual reasoning. Finally, while our experiments cover diverse multilingual settings, the generalizability of our findings to other architectures, domains, and low-resource lan- guages warrants further investigation. Acknowledgments We sincerely thank
Chunk 22 · 1,999 chars
pects of generation utility, such as factual completeness or cross-lingual reasoning. Finally, while our experiments cover diverse multilingual settings, the generalizability of our findings to other architectures, domains, and low-resource lan- guages warrants further investigation. Acknowledgments We sincerely thank the reviewers for their insight- ful comments and valuable suggestions. This work was supported by Beijing Natural Science Foun- dation (L243006), the Natural Science Foundation of China (No. 62536008, 62506354), the Post- doctoral Fellowship Program of CPSF under Grant Number GZC20251041, and MYbank, AntGroup. References Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zo- har Karnin, and Liane Lewin-Eytan. 2025. The cross-lingual cost: Retrieval biases in RAG over Arabic-English corpora. In Proceedings of The -- 9 of 15 -- Third Arabic Natural Language Processing Confer- ence, pages 69–83, Suzhou, China. Association for Computational Linguistics. Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021a. XOR QA: Cross-lingual open-retrieval question answer- ing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technolo- gies, pages 547–564, Online. Association for Com- putational Linguistics. Akari Asai, Xinyan Yu, Jungo Kasai, and Hanna Ha- jishirzi. 2021b. One question answering model for many languages with cross-lingual dense passage re- trieval. Advances in Neural Information Processing Systems, 34:7547–7560. Akari Asai, Xinyan Yu, Jungo Kasai, and Hannaneh Ha- jishirzi. 2021c. One question answering model for many languages with cross-lingual dense passage re- trieval. In Proceedings of the 35th International Con- ference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA. Curran Associates Inc. Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and
Chunk 23 · 1,997 chars
r many languages with cross-lingual dense passage re- trieval. In Proceedings of the 35th International Con- ference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA. Curran Associates Inc. Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2022. mmarco: A multilingual version of the ms marco pas- sage ranking dataset. Preprint, arXiv:2108.13897. Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, and Vassilina Nikoulina. 2024. Retrieval-augmented generation in multilingual settings. In Proceedings of the 1st Workshop on Towards Knowledgeable Lan- guage Models (KnowLLM 2024), pages 177–188, Bangkok, Thailand. Association for Computational Linguistics. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, and others. 2025. Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning. Preprint, arXiv:2501.12948. Yifan Gao, Qingyu Yin, Zheng Li, Rui Meng, Tong Zhao, Bing Yin, Irwin King, and Michael Lyu. 2022. Retrieval-augmented multilingual keyphrase gener- ation with retriever-generator iterative training. In Findings of the Association for Computational Lin- guistics: NAACL 2022, pages 1233–1246, Seattle, United States. Association for Computational Lin- guistics. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- field, Michael Collins, Ankur
Chunk 24 · 1,983 chars
bhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- field, Michael Collins, Ankur Parikh, Chris Al- berti, Danielle Epstein, Illia Polosukhin, Jacob De- vlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answer- ing research. Transactions of the Association for Computational Linguistics, page 453–466. Patrick Lewis, Ethan Perez, Aleksandara Piktus, Filippo Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv: Computation and Language,arXiv: Computation and Language. Bryan Li, Samar Haider, Fiona Luo, Adwait Agashe, and Chris Callison-Burch. 2024. BordIRlines: A dataset for evaluating cross-lingual retrieval aug- mented generation. In Proceedings of the First Work- shop on Advancing Natural Language Processing for Wikipedia, pages 1–13, Miami, Florida, USA. Asso- ciation for Computational Linguistics. Wei Liu, Sony Trenous, Leonardo F. R. Ribeiro, Bill Byrne, and Felix Hieber. 2025. XRAG: Cross- lingual retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15669–15690, Suzhou, China. Association for Computational Linguistics. Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. Mkqa: A linguistically diverse benchmark for mul- tilingual open domain question answering. Transac- tions of the Association for Computational Linguis- tics, page 1389–1406. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Con- ference on Learning
Chunk 25 · 1,993 chars
d Joachim Daiber. 2021. Mkqa: A linguistically diverse benchmark for mul- tilingual open domain question answering. Transac- tions of the Association for Computational Linguis- tics, page 1389–1406. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Con- ference on Learning Representations. Hoyeon Moon, Byeolhee Kim, and Nikhil Verma. 2025. Quality-aware translation tagging in multilin- gual RAG system. In Proceedings of the 5th Work- shop on Multilingual Representation Learning (MRL 2025), pages 161–177, Suzhuo, China. Association for Computational Linguistics. Jeonghyun Park and Hwanhee Lee. 2025. Investigat- ing language preference of multilingual RAG sys- tems. In Findings of the Association for Compu- tational Linguistics: ACL 2025, pages 5647–5675, Vienna, Austria. Association for Computational Lin- guistics. -- 10 of 15 -- Jirui Qi, Raquel Fernández, and Arianna Bisazza. 2025. On the consistency of multilingual context utiliza- tion in retrieval-augmented generation. In Proceed- ings of the 5th Workshop on Multilingual Representa- tion Learning (MRL 2025), pages 199–225, Suzhuo, China. Association for Computational Linguistics. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 oth- ers. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115. Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented lan- guage models. Transactions of the Association for Computational Linguistics, 11:1316–1331. Eugene Yang, Thomas Jänich, James Mayfield, and Dawn Lawrie. 2024. Language fairness in multilin- gual information retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Re- search and Development in Information Retrieval, SIGIR ’24, page
Chunk 26 · 1,991 chars
he Association for Computational Linguistics, 11:1316–1331. Eugene Yang, Thomas Jänich, James Mayfield, and Dawn Lawrie. 2024. Language fairness in multilin- gual information retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Re- search and Development in Information Retrieval, SIGIR ’24, page 2487–2491, New York, NY, USA. Association for Computing Machinery. Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, and Jimmy Lin. 2023. Toward best practices for training multilingual dense retrieval models. ACM Trans. Inf. Syst., 42(2). Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 embedding: Ad- vancing text embedding and reranking through foun- dation models. arXiv preprint arXiv:2506.05176. Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2025. Swift: A scalable lightweight infrastructure for fine-tuning. Proceed- ings of the AAAI Conference on Artificial Intelli- gence, 39(28):29733–29735. -- 11 of 15 -- A Metric Implementation Details We report some evaluation metrics in our exper- iments: character 3-gram Recall, Precision@k, NDCG@k, and PEER. Below, we describe their implementations in detail. Character 3-gram Recall. Character 3-gram Recall measures the lexical coverage between the generated content and the reference text at the char- acter level. We extract all contiguous character 3- grams from both the reference text and the gener- ated text. Let Cref denote the multiset of character 3-grams from the reference, and Cgen denote those from the generated text. The character 3-gram Re- call score is defined as: Recallchar-3 = |Cgen ∩ Cref| |Cref| (2) This metric is robust to tokenization differences and is particularly suitable for multilingual evalu- ation. Precision@k. Precision@k measures the pro- portion of relevant
Chunk 27 · 1,989 chars
ference, and Cgen denote those
from the generated text. The character 3-gram Re-
call score is defined as:
Recallchar-3 = |Cgen ∩ Cref|
|Cref| (2)
This metric is robust to tokenization differences
and is particularly suitable for multilingual evalu-
ation.
Precision@k. Precision@k measures the pro-
portion of relevant documents among the top-k
reranked results. Formally, given a ranked list of
documents Rk of length k and a binary relevance
function rel(·), Precision@k is defined as:
Precision@k = 1
k
k ∑
i=1
rel(Ri) (3)
where rel(Ri) = 1 if the document at rank i is rel-
evant, and 0 otherwise.
NDCG@k. Normalized Discounted Cumulative
Gain (NDCG@k) takes into both the relevance and
the ranking position of documents. We first com-
pute DCG@k as:
DCG@k =
k ∑
i=1
2rel(Ri) − 1
log2(i + 1) (4)
where rel(Ri) denotes the relevance score of
the document at rank i. NDCG@k is obtained
by normalizing DCG@k with the ideal DCG@k
(IDCG@k), which corresponds to the optimal
ranking:
NDCG@k = DCG@k
IDCG@k (5)
This normalization ensures that NDCG@k ranges
between 0 and 1.
17.1% en
13.2% es
12.7% it
12.4% de
12.0% fr
11.6% pt
8.8% fi
3.5% ru
2.4% zh
1.8% th
1.7% ja
1.4% ko
1.3% ar
Figure 5: Language distribution of queries (inner ring)
and positive documents (outer ring).
PEER. We compute PEER (Probability of Equal
Expected Rank) following Yang et al. (2024), with
a task-specific adaptation: we only use positive
documents in the fairness test. Intuitively, PEER
evaluates whether relevant documents written in
different languages receive systematically different
ranks.
For each query q, we collect all retrieved doc-
uments labeled as positive and record their rank
positions in the final ranked list. We then parti-
tion these ranks by the document language ℓ ∈ L,
yielding groups {Rq,ℓ}ℓ∈L, where Rq,ℓ is the mul-
tiset of rank positions of positive documents in lan-
guage ℓ.
We apply the Kruskal–Wallis H test (KW) on
these rank groups, with the null hypothesis that the
rankChunk 28 · 1,989 chars
rank
positions in the final ranked list. We then parti-
tion these ranks by the document language ℓ ∈ L,
yielding groups {Rq,ℓ}ℓ∈L, where Rq,ℓ is the mul-
tiset of rank positions of positive documents in lan-
guage ℓ.
We apply the Kruskal–Wallis H test (KW) on
these rank groups, with the null hypothesis that the
rank distributions of positive documents are iden-
tical across languages (i.e., equal expected ranks).
We define PEER for query q as the resulting p-
value:
PEER(q) = p(KW({Rq,ℓ}ℓ∈L)) (6)
where higher values (closer to 1) indicate that we
cannot reject the hypothesis of equal expected rank,
suggesting better language fairness. We report the
final PEER score as the mean of PEER(q) over all
queries.
B Statistics of the LAURA Dataset
As shown in Table 7, we report the number of
queries and positive documents in the constructed
-- 12 of 15 --
Train Set Dev Set
Queries 17,360 1,000
Positive Documents 114,867 6,762
Avg. Languages per Query 2.90 2.90
Table 7: Statistics of the LAURA dataset. Avg. Lan-
guages per Query indicates the average number of
distinct languages among the positive candidate docu-
ments associated with each query.
Language Baseline LAURA Δ% p
Portuguese 63.1 65.9 +4.4 1.24e-24***
Finnish 55.1 57.8 +4.7 1.15e-17***
French 59.8 62.3 +4.3 1.26e-19***
Spanish 62.3 64.8 +4.0 2.94e-20***
Italian 61.1 63.5 +4.1 1.70e-18***
German 60.3 62.1 +2.9 1.46e-10***
Arabic 31.7 32.9 +3.8 5.22e-04***
English 66.6 67.8 +1.7 8.68e-06***
Russian 46.8 47.5 +1.5 4.62e-02*
Thai 24.2 24.9 +2.9 1.20e-02*
Japanese 28.1 28.6 +1.8 1.02e-01
Chinese 25.3 25.8 +1.8 8.04e-02
Korean 25.0 24.8 -0.8 4.98e-01
Overall 46.9 48.4 +3.1 8.62e-74***
Table 8: Per-language paired t-test results on 3-gram
scores.
LAURA dataset, as well as the average number of
languages per query among the candidate positive
documents. Figure 5 illustrates the language distri-
butions of queries and positive documents.
C Detailed MKQA test Results
To facilitate comparison, Table 10 reportsChunk 29 · 1,996 chars
: Per-language paired t-test results on 3-gram scores. LAURA dataset, as well as the average number of languages per query among the candidate positive documents. Figure 5 illustrates the language distri- butions of queries and positive documents. C Detailed MKQA test Results To facilitate comparison, Table 10 reports detailed per-language results on the MKQA test set used in the main experiments. D Case Study Table 11 presents a case study illustrating the limi- tations of relevance-based reranking in the vanilla multilingual RAG setting. E Language Distribution and Model Performance Figure 6 and Figure 7 show the document lan- guage distribution and generation performance un- der the vanilla and upper-limit settings, respec- tively, across different rerankers. Setting Llama Qwen 3-gram Pearson 3-gram Pearson BGE-Reranker 48.9 0.198 46.7 0.188 Stage 1 48.7 0.272 46.5 0.269 LAURA 49.9 0.236 47.7 0.247 Table 9: Ablation study comparing the full pipeline against using only Stage 1 training data. F Statistical Significance of LAURA Improvements To assess whether the RAG performance gains from LAURA are statistically reliable, we compute per-query 3-gram scores and construct paired sam- ples between LAURA and the corresponding base- line under identical configurations (2 rerankers × 2 generators), yielding approximately 4,000 paired observations per language. We apply two-tailed paired t-tests on the per-query score differences. Results are reported in Table 8. LAURA achieves statistically significant gains (p < 0.05) in 10 out of 13 languages, and the overall improvement is highly significant (p = 8.62 × 10−74). All conclu- sions hold under Bonferroni correction, confirm- ing that the improvements reflect a systematic ef- fect rather than sampling variation. G Ablation of LAURA To further validate the necessity of both Stage 1 and Stage 2 in our pipeline, we conduct an ablation study using the training data obtained from Stage 1 alone. As shown in Table 9, training
Chunk 30 · 1,996 chars
correction, confirm- ing that the improvements reflect a systematic ef- fect rather than sampling variation. G Ablation of LAURA To further validate the necessity of both Stage 1 and Stage 2 in our pipeline, we conduct an ablation study using the training data obtained from Stage 1 alone. As shown in Table 9, training solely with the data from Stage 1 improves the correlation coef- ficient. However, due to the lack of filtering, a substantial number of false-positive documents are included, making it difficult to achieve meaning- ful improvements in downstream performance. In contrast, Stage 2 performs document-level evalu- ation and filtering, which substantially improves data quality and consequently leads to further gains in the final generation performance. -- 13 of 15 -- Setting ar de en es fi fr it ja ko pt ru th zh Avg. Llama3-8B-Instruct Upper-limit 53.6 76.5 79.3 76.8 73.4 76.7 77.0 47.9 41.0 78.4 68.0 44.1 33.8 63.6 BGE-Reranker 32.7 62.8 70.1 63.0 58.1 64.4 63.9 29.2 25.5 66.4 51.9 26.4 21.7 48.9 + LAURA 33.2 64.8 69.8 65.7 59.9 66.7 65.5 28.9 24.9 69.1 50.8 26.9 22.2 49.9 Qwen-Reranker 28.9 60.9 67.8 62.7 54.2 63.7 62.2 28.2 23.3 66.3 47.4 24.8 21.9 47.1 + LAURA 31.6 63.3 70.1 65.5 57.3 66.1 65.0 28.2 23.8 68.4 50.5 26.7 22.7 49.2 Qwen2.5-7B-Instruct Upper-limit 51.4 73.7 76.2 75.6 71.5 71.6 73.8 44.9 38.9 73.8 63.1 39.0 42.8 61.3 BGE-Reranker 33.8 59.6 65.4 62.3 55.5 56.6 60.2 28.0 26.5 60.5 45.7 23.5 29.0 46.7 + LAURA 33.9 61.0 66.0 64.2 58.4 58.8 61.8 29.4 25.7 63.4 44.8 23.4 29.5 47.7 Qwen-Reranker 31.4 58.0 63.3 61.3 52.8 54.4 57.9 27.2 24.7 59.3 42.3 22.3 28.7 44.9 + LAURA 33.1 59.2 65.2 63.9 55.3 57.8 61.9 28.1 25.0 62.6 44.1 22.7 28.7 46.7 Table 10: Performance comparison between the vanilla document reranking and the oracle evidence estimation set- tings on MKQA. All results are reported using character 3-gram recall. Bolded results denote the best performance among all non–upper-limit settings. ar de en es fi fr it ja ko pt ru th
Chunk 31 · 1,991 chars
2.7 28.7 46.7 Table 10: Performance comparison between the vanilla document reranking and the oracle evidence estimation set- tings on MKQA. All results are reported using character 3-gram recall. Bolded results denote the best performance among all non–upper-limit settings. ar de en es fi fr it ja ko pt ru th zh Query Language ar de en es fi fr it ja ko pt ru th zh Doc Language 0.00 0.25 0.50 0.75 1.00 Performance 0.33 0.60 0.66 0.64 0.57 0.58 0.61 0.29 0.25 0.63 0.46 0.24 0.29 (a) BGE-Gemma ar de en es fi fr it ja ko pt ru th zh Query Language 0.31 0.60 0.67 0.64 0.55 0.58 0.60 0.28 0.25 0.62 0.45 0.21 0.30 (b) BGE-Minicpm ar de en es fi fr it ja ko pt ru th zh Query Language 0.31 0.58 0.63 0.61 0.53 0.54 0.58 0.27 0.25 0.59 0.42 0.22 0.29 (c) Qwen3-Reranker-0.6B 0.0 0.1 0.2 0.3 0.4 0.5 Proportion Figure 6: Vanilla document reranking with BGE-emma, BGE-Minicpm and Qwen3-Reranker-0.6B rerankers. The heatmap shows the language distribution, while the bar chart reports Recall@3-gram of Qwen2.5-7B-Instruct. ar de en es fi fr it ja ko pt ru th zh Query Language ar de en es fi fr it ja ko pt ru th zh Doc Language 0.00 0.25 0.50 0.75 1.00 Performance 0.52 0.74 0.76 0.76 0.71 0.72 0.74 0.45 0.39 0.74 0.63 0.39 0.43 (a) BGE-Gemma ar de en es fi fr it ja ko pt ru th zh Query Language 0.52 0.74 0.77 0.75 0.72 0.72 0.74 0.45 0.38 0.74 0.63 0.38 0.44 (b) BGE-Minicpm ar de en es fi fr it ja ko pt ru th zh Query Language 0.52 0.73 0.76 0.76 0.71 0.71 0.74 0.45 0.39 0.74 0.62 0.38 0.43 (c) Qwen3-Reranker-0.6B 0.0 0.1 0.2 0.3 0.4 0.5 Proportion Figure 7: Oracle evidence estimation with BGE-Gemma, BGE-Minicpm and Qwen3-Reranker-0.6B rerankers. The heatmap shows the language distribution, while the bar chart reports Recall@3-gram of Qwen2.5-7B-Instruct. -- 14 of 15 -- Query Who plays the blue lady in The Fifth Element? Label Maïwenn Vanilla Top-5 [1] (es) El quinto elemento. El quinto elemento (en francés: Le Cinquième
Chunk 32 · 1,996 chars
pm and Qwen3-Reranker-0.6B rerankers. The heatmap shows the language distribution, while the bar chart reports Recall@3-gram of Qwen2.5-7B-Instruct. -- 14 of 15 -- Query Who plays the blue lady in The Fifth Element? Label Maïwenn Vanilla Top-5 [1] (es) El quinto elemento. El quinto elemento (en francés: Le Cinquième Élément) es una película francesa (con coproducción de EE.UU.) de ciencia ficción y acción de 1997 dirigida por Luc Besson, con Bruce Willis, Milla Jovovich y Gary Oldman en los papeles principales. Principalmente ambientada en el , la trama central de la película involucra la supervivencia del planeta Tierra, que se convierte en responsabilidad de Korben Dallas (Willis)... [2] (en) The Fifth Element. The Fifth Element The Fifth Element () is a 1997 French science fiction action film directed and co-written by Luc Besson. It stars Bruce Willis, Gary Oldman and Milla Jovovich. Primarily set in the 23rd century, the film’s central plot involves the survival of planet Earth, which becomes the responsibility of Korben Dallas (Willis), a taxicab driver and former special forces major, after a young woman (Jovovich) falls into his cab... [3] (it) Il quinto elemento. Il quinto elemento (Le Cinquième Élément) è un film del 1997 diretto da Luc Besson. Di produzione francese (benché girato in lingua inglese), fu la pellicola più costosa mai prodotta in Europa allépoca della sua uscita. Il film, che ha per protagonisti Bruce Willis, Milla Jovovich e Gary Oldman, venne presentato fuori concorso al 50º Festival di Cannes... [4] (ru) Пятый элемент (фильм). того, для создания костюмов был привлечён известный модельер Жан-Поль Готье. Он разработал все 900 костюмов, использованных в сценах на корабле «Флостон Парадайз». Костюм Лилу из белых полос ткани Готье создал, вдохновившись картиной Фриды Кало «Сломанная колонна». В течение года команда создала более 8000 рисунков. В это время Бессон предложил на главную роль Брюса Уиллиса и Мела Гибсона, а также рассматривал
Chunk 33 · 1,984 chars
900 костюмов, использованных в сценах на корабле «Флостон Парадайз». Костюм Лилу из белых полос ткани Готье создал, вдохновившись картиной Фриды Кало «Сломанная колонна». В течение года команда создала более 8000 рисунков. В это время Бессон предложил на главную роль Брюса Уиллиса и Мела Гибсона, а также рассматривал Джулию Робертс на роль Лилу... [5] (zh) 第五元素 (電影). 米拉·乔沃维奇饰)的人形女性。莉露对周围的一切深感恐惧,逃出实验 室后,她从楼层的外沿跳了下去,正好掉进前特种部队少校科本·达拉斯(布鲁斯·威利斯饰)所 开的出租车裡... Model Answer Milla Jovovich plays the blue lady, Leeloo, in The Fifth Element. Wrong Oracle Top-5 [1] (de) Das fünfte Element. Das fünfte Element (Originaltitel: Le Cinquième Élément) ist ein Science- Fiction-Film von Luc Besson mit Bruce Willis und Milla Jovovich aus dem Jahr 1997. Das fünfte El- ement ist aufgrund seiner hohen Einspielergebnisse von über 260 Millionen US-Dollar einer der bisher kommerziell erfolgreichsten europäischen Filme. Handlung Der Film beginnt im Jahr 1914 in Ägypten, in einem verfallenen Tempel, wo der Archäologe Professor Pacoli, begleitet vom Reporter Billy und einem Priester, Inschriften über das unfassbar Böse findet... [2*] (de) (rank 10 in baseline) Das fünfte Element. der Antagonist Zorg begegnen sich im Film kein einziges Mal. Die Kostüme und Accessoires wurden von dem französischen Modeschöpfer Jean Paul Gaultier entworfen. Als sich der Archäologe zu Beginn des Films plötzlich riesigen Mondoshawan-Aliens gegenübersieht, fragt er in der deutschen Fassung: „Sind Sie hier von der Erde?“, während es im Original heißt: „Are you German?“(dt. „Sind Sie Deutsche(r)?“). Der erste Teil der Arie der Diva ist aus der Oper Lucia di Lammermoor von Gaetano Donizetti und wird hier von Inva Mula gesungen. Als Darstellerin der Diva agierte jedoch Maïwenn, mit der Regisseur Besson zum Zeitpunkt der Dreharbeiten zusam- menlebte und... [3] (de) Milla Jovovich. Milica „Milla“Jovovich (* 17. Dezember 1975 in Kiew, Ukrainische SSR, Sowjetunion, ukrainisch Милиця Богданівна Йовович) ist eine
Chunk 34 · 1,816 chars
d wird hier von Inva Mula gesungen. Als Darstellerin der Diva agierte jedoch Maïwenn, mit der Regisseur Besson zum Zeitpunkt der Dreharbeiten zusam- menlebte und... [3] (de) Milla Jovovich. Milica „Milla“Jovovich (* 17. Dezember 1975 in Kiew, Ukrainische SSR, Sowjetunion, ukrainisch Милиця Богданівна Йовович) ist eine US-amerikanische Schauspielerin und Model serbisch-russischer Herkunft. Bekannt wurde sie nach Erfolgen in den Filmen Das fünfte Element und Johanna von Orleans, aber besonders für ihre Hauptrolle in der Filmreihe Resident Evil... [4] (de) Das fünfte Element. ins Weltall geschossen wird. In einer abschließenden Szene will sich der Präsident bei den beiden „Helden“bedanken, die sich aber leidenschaftlich lieben und deshalb unabkömm- lich sind. Auszeichnungen Der Film wurde im Jahr 1998 für den Oscar in der Kategorie Bester Tonschnitt nominiert. Er wurde 1998 in den Kategorien Bester Science-Fiction-Film, Beste Spezialeffekte, Beste Kostüme und Beste Nebendarstellerin (Milla Jovovich) für den Saturn Award nominiert... [5] (de) Das fünfte Element. den Kategorien Bester Film, Beste Kostüme, Bester Schnitt, Beste Filmmusik und Bester Ton für den gleichen Preis nominiert. Milla Jovovich wurde 1998 für den Blockbuster Enter- tainment Award und (für die Beste Kampfszene) den MTV Movie Award nominiert. Der Film gewann 1997 die Goldene Leinwand, den Bogey Award in Silber und wurde für den Europäischen Filmpreis no- miniert... Model Answer The role of the blue alien diva Plavalaguna was played by Maïwenn. True Table 11: A case study revealing a limitation of relevance-based reranking in multilingual RAG. The answer-critical document (marked as [2*]) is retrieved but ranked only 10th under the baseline, causing relevance-based reranking to produce an incorrect answer. -- 15 of 15 --