Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation

arXiv:2410.21970

Summary

This study introduces Futurepedia, a new benchmark for evaluating multilingual Retrieval-Augmented Language Models (RALMs). Unlike existing datasets, Futurepedia provides time-insensitive, parallel documents and ground-truth answers across eight languages, enabling fair comparison. The authors evaluate six models on three tasks: monolingual knowledge extraction, cross-lingual knowledge transfer, and multilingual knowledge selection. Results reveal significant linguistic inequalities. In monolingual extraction, high-resource languages outperform low-resource ones, though scaling model size reduces this gap. Translating low-resource documents to high-resource languages often fails due to cascading translation errors. In cross-lingual transfer, Indo-European languages facilitate better performance by allowing models to quote documents directly rather than translating answers. Finally, in knowledge selection, models exhibit strong bias toward English, often ignoring non-English documents even when they contain correct information. To mitigate these issues, the paper suggests specific strategies. For monolingual tasks, caution is advised when translating low-resource content. For cross-lingual tasks, prompting models to extract answers from source documents before translating improves accuracy. To reduce selection bias, the authors recommend increasing the volume of non-English documents and placing English documents in non-initial positions within the context. These findings highlight the complexities of multilingual RAG and offer actionable insights for future development.

Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented
Generation
Suhang Wu1*, Jialong Tang2†, Baosong Yang2, Ante Wang1,
Kaidi Jia1, Jiawei Yu1, Junfeng Yao1, Jinsong Su1
1Xiamen University
2Tongyi Lab, Alibaba Group
wusuhang@xmu.stu.edu.cn
Abstract
RALMs (Retrieval-Augmented Language Models) broaden
their knowledge scope by incorporating external textual re-
sources. However, the multilingual nature of global knowl-
edge necessitates RALMs to handle diverse languages, a
topic that has received limited research focus. In this work,
we propose Futurepedia, a carefully crafted benchmark con-
taining parallel texts across eight representative languages.
We evaluate six multilingual RALMs using our benchmark
to explore the challenges of multilingual RALMs. Experi-
mental results reveal linguistic inequalities: 1) high-resource
languages stand out in Monolingual Knowledge Extraction;
2) Indo-European languages lead RALMs to provide answers
directly from documents, alleviating the challenge of ex-
pressing answers across languages; 3) English benefits from
RALMs’ selection bias and speaks louder in multilingual
knowledge selection. Based on these findings, we offer ad-
vice for improving multilingual Retrieval Augmented Gen-
eration. For monolingual knowledge extraction, careful at-
tention must be paid to cascading errors from translating
low-resource languages into high-resource ones. In cross-
lingual knowledge transfer, encouraging RALMs to provide
answers within documents in different languages can improve
transfer performance. For multilingual knowledge selection,
incorporating more non-English documents and reposition-
ing English documents can help mitigate RALMs’ selection
bias. Through comprehensive experiments, we underscore the
complexities inherent in multilingual RALMs and offer valu-
able insights for future research.
1 Introduction
Retrieval-Augmented Generation (RAG) aims to allevi-
ate the knowledge limitations of Large Language Mod-
els (LLMs), leading to the development of Retrieval-
Augmented Language Models (RALMs) (Chen et al. 2023a;
Yu et al. 2023; Gao et al. 2024; Asai et al. 2024; Lin et al.
2024). Particularly, since the knowledge encapsulated in dif-
ferent languages may vary significantly, multilingual RAG
has emerged as a critical research direction for effectively
utilizing multilingual texts.
*Work done during internship at Tongyi Lab.
†Contributed equally.
Benchmark	#Lang	Time-Insensitive	Ground Truth Answer	Multilingual Parallel
Non-multilingual RAG Benchmark
RECALL (Liu et al. 2023)	1	✓	✓	✗
RGB (Chen et al. 2023a)	2	✗	✓	✗
CRUD (Lyu et al. 2024)	1	✗	✓	✗
Multilingual RAG Benchmark
MKQA (Longpre, Lu, and Daiber 2021)	26	✗	✓	✓
XOR-TyDi QA (Asai et al. 2021)	8	✗	✓	✗
NoMIRACL (Thakur et al. 2024)	18	✗	✗	✗
Ours	8	✓	✓	✓
Table 1: Comparison of ours and other RAG benchmarks.
However, multilingual RAG is in its infancy and faces
challenges due to the lack of effective benchmarks. The
commonly used RAG benchmarks such as CRUD (Lyu et al.
2024), RECALL (Liu et al. 2023), and RGB (Chen et al.
2023a) are limited to English or Chinese. Although
some multilingual benchmarks exist, such as MKQA (Long-
pre, Lu, and Daiber 2021), XOR-TyDi QA (Asai et al.
2021), and NoMIRACL (Thakur et al. 2024), they also have
their shortcomings. MKQA and XOR-TyDi QA are time-
sensitive and face the risk of information leakage, while
NoMIRACL lacks ground truth answers, hindering the as-
sessment of RALMs in comprehension and response gen-
eration. Besides, most benchmarks fail to provide multilin-
gual parallel data, preventing fair comparison across differ-
ent languages.
To address these issues, we first propose Futurepedia1,
a carefully crafted multilingual RAG benchmark based on
Wikipedia2. The differences between our benchmark and
previous ones are shown in Table 1. Our benchmark includes
197 parallel documents and the corresponding QA pairs in
eight languages. Particularly, it also introduces three tasks
to investigate multilingual RAG: 1) Monolingual Knowl-
edge Extraction, which requires RALMs to extract knowl-
edge from documents and resolve questions in the same lan-
guage; 2) Cross-lingual Knowledge Transfer, which chal-
lenges RALMs to resolve questions using documents in dif-
ferent languages; and 3) Multilingual Knowledge Selection,
which examines RALMs’ bias toward languages when se-
lecting answers from documents in different languages.
1We will release our data at https://github.com/H-shw/futurepedia/
2https://www.wikipedia.org/
arXiv:2410.21970v1 [cs.CL] 29 Oct 2024
We then conduct experiments to evaluate several commonly used RALMs, revealing significant linguistic inequal-
ity in multilingual RAG. Specifically, in the task of mono-
lingual knowledge extraction, high-resource languages stand
out, with RALMs exhibiting superior performance in these
languages. Meanwhile, as the model size grows, RALMs not
only show enhanced performance but also alleviate linguis-
tic inequality. In the task of cross-lingual knowledge trans-
fer, Indo-European languages3 lead RALMs to provide an-
swers directly from documents in different languages, allevi-
ating the challenge of expressing answers across languages.
In the multilingual knowledge selection task, English bene-
fits from RALMs’ selection bias and speaks louder in multi-
lingual contexts. Even a small number of English documents
can exert a dominant influence, overshadowing a larger num-
ber of documents in other languages.
Based on the findings above, we further explore several
strategies to improve multilingual RAG: 1) In the task of
monolingual knowledge extraction, translating documents
from low-resource languages into high-resource ones is a di-
rect way to improve performance in low-resource language
tasks. However, careful attention must be paid to cascad-
ing errors during the translation; 2) in the task of cross-
lingual knowledge transfer, encouraging RALMs to provide
answers from documents in different languages can enhance
their cross-lingual entity understanding and response gen-
eration; 3) in the task of multilingual knowledge selection,
incorporating more non-English documents and reposition-
ing English documents can help mitigate RALMs’ selection
bias.
2 Our Benchmark
In this section, we describe the details of our benchmark, in-
cluding its data construction process (§2.1), three evaluation
tasks (§2.2), and evaluation metrics (§2.3).
2.1 Data Construction
As mentioned above, existing benchmarks often lack par-
allel data, fail to provide ground truth answers, and are hin-
dered by time sensitivity. To provide parallel data, we collect
Wikipedia documents created from January 2018 to April
2024 in eight languages: English (en), French (fr), Span-
ish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Korean
(ko), and Arabic (ar). Meanwhile, we also gather the parallel
factual triples from WikiData4 and utilize GPT-4o to con-
vert them into natural language QA pairs, so as to provide
ground truth answers alongside the corresponding questions.
Then, as shown in Figure 1, we construct time-insensitive in-
stances via three-stage operations in sequence. In the step of
Timestamp Modification, we adjust the timestamps of doc-
uments and QA pairs to random years between 2124 and
2200. During the subsequent Entities Modification step, we
prompt GPT-4o to generate similar and reasonable entities
for the subjects and objects in the original factual triples. In
the final Translation & Update step, we utilize GPT-4o to
translate the fictional entities into eight languages and then
update documents and QA pairs in all languages.
3For the languages discussed in this work, English, French,
Spanish, and Portuguese are all Indo-European languages.
4https://www.wikidata.org/
Example from Figure 1 (English instance; the French and Chinese documents are processed in parallel):
Timestamp Modification: 2019 → 2146
Entities Modification: Chernobyl → Pripyat; HBO → Neo Cinema Network
Translation & Update: en: Pripyat; zh: 普里皮亚季; fr: ...
Original QA (en): Q: Which company produced the 2019 miniseries Chernobyl? A: HBO
Original Document (en): Chernobyl is a 2019 historical drama television miniseries that revolves around the … The series was produced by HBO, known for its high-quality …
Final QA (en): Q: Which company produced the 2146 miniseries Pripyat? A: Neo Cinema Network
Final Document (en): Pripyat is a 2146 historical drama television miniseries that revolves around the … The series was produced by Neo Cinema Network, known for its high-quality …
Figure 1: The refinement process on the collected data.
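The Timestamp Modification step can be illustrated with a minimal sketch. It assumes that timestamps are four-digit years, and that a document and its QA pair should receive the same new year; the helper name `shift_years`, the regular expression, and the seed are hypothetical and not taken from the authors' pipeline. Only the target range 2124 to 2200 comes from the text.

```python
import random
import re

# Matches four-digit years from 1900 to 2099 (an assumption about what counts as a timestamp).
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def shift_years(texts, seed=0, low=2124, high=2200):
    """Replace each distinct year found in `texts` with one random year in [low, high].

    The same original year maps to the same new year in every text, so a
    document and its QA pair stay consistent. Hypothetical helper, not the
    authors' code.
    """
    rng = random.Random(seed)
    mapping = {}

    def replace(match):
        year = match.group(0)
        if year not in mapping:
            mapping[year] = str(rng.randint(low, high))
        return mapping[year]

    return [YEAR_PATTERN.sub(replace, text) for text in texts], mapping


document = "Chernobyl is a 2019 historical drama television miniseries."
question = "Which company produced the 2019 miniseries Chernobyl?"
(new_document, new_question), year_map = shift_years([document, question])
print(new_document)
print(new_question)
```

In this simplified version two different original years could be mapped to the same future year; a fuller pipeline would probably need to preserve the ordering of dates within a document.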
Besides, we ensure the quality of the benchmark through
rigorous manual review. For each language, we hire three
experts proficient in both English and the target language
to evaluate the accuracy and coherence of the generated
documents and QA pairs at a pace of 20 entries per day.
Only documents that can resolve their corresponding QA
pairs without logical errors are directly retained. Those that
do not satisfy these criteria are revised until consensus is
reached among experts. Ultimately, we obtain 197 parallel
documents and corresponding QA pairs in eight languages.
2.2 Three Evaluation Tasks
As shown in Figure 2, our benchmark provides three tasks to
comprehensively evaluate RALMs from different perspec-
tives.
Monolingual Knowledge Extraction This task requires
the RALMs to extract answers from documents and resolve
questions within the same language, as shown in Figure 2(a),
intending to assess the fundamental RAG capabilities in dif-
ferent languages.
Cross-lingual Knowledge Transfer As depicted in Fig-
ure 2(b), we challenge RALMs with documents and QA
pairs written in different languages within this evaluation
task. This task imitates the real-world scenario where there
are no high-quality documents in the query language, allow-
ing for the evaluation of the cross-lingual transfer ability of
RALMs.
[Figure 2 panels: (a) each QA pair (en, zh, fr) is paired with a document in the same language; (b) each QA pair is paired with a document in another language, e.g. Document (zh) with QA (en); (c) documents in en, zh, and fr give conflicting answers to a single QA (en).]
Figure 2: Three evaluation tasks in our benchmark: (a) Monolingual knowledge extraction, which requires RALMs to extract
knowledge from documents and resolve questions within the same language; (b) Cross-lingual knowledge transfer, which
challenges RALMs to handle documents and QA pairs in different languages; (c) Multilingual knowledge selection, which
presents documents in various languages that contain different answers, allowing for the evaluation of RALMs’ selection
bias. Note that we use three of the eight languages: English (en), Chinese (zh), and French (fr) to illustrate these tasks, and we
provide the English translations in parentheses.
Multilingual Knowledge Selection For each question, we
provide RALMs with documents in eight languages that
contain different answers, creating a scenario of knowl-
edge conflict. Back to Figure 2(c), for the English ques-
tion “Which production company was behind the miniseries
Pripyat”, the English, Chinese, and French documents pro-
vide three different answers: “Metro Film Alliance”, “Epic
Story Channel”, “Starlight Film Studios”, respectively. In
this context, we regard all these answers as potentially cor-
rect and assess the RALMs’ bias toward specific languages
by examining which answers are selected.
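The knowledge-conflict setup described above can be sketched as follows. The three documents are the ones shown in Figure 2(c); the function name `build_conflict_context`, the "Document i:" layout, and the seed are hypothetical, and the shuffling mirrors the random ordering mentioned in §3.5 rather than any released code.

```python
import random


def build_conflict_context(docs_by_lang, seed=0):
    """Concatenate one document per language in random order.

    Returns the context string and the language order that was used, so the
    position of each language can be analysed later. Hypothetical helper.
    """
    rng = random.Random(seed)
    order = list(docs_by_lang)
    rng.shuffle(order)
    blocks = [
        f"Document {i}:\n{docs_by_lang[lang]}"
        for i, lang in enumerate(order, start=1)
    ]
    return "\n\n".join(blocks), order


# Conflicting documents taken from Figure 2(c); each names a different producer.
docs = {
    "en": "Pripyat was produced by Metro Film Alliance",
    "zh": "史诗故事频道制作了迷你剧《普里皮亚特》",
    "fr": "Pripiat Sociétés de production : Studios de cinéma Starlight",
}
context, order = build_conflict_context(docs)
print(order)
print(context)
```

Keeping the language order alongside the context makes it possible to check afterwards whether a model's choice correlates with document language or with document position.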
2.3 Evaluation Metrics
The common practices of RAG often use Accuracy to eval-
uate whether the ground truth answer is fully contained in
the prediction (Lewis et al. 2020; Chen et al. 2023a; Saad-
Falcon et al. 2024). However, as analyzed in (Chirkova et al.
2024), one answer may have diverse expressions in multi-
lingual RAG, and thus Accuracy fails to capture similar-
ity in such cases. To deal with this issue, Chirkova et al.
(2024) propose Character 3-gram Recall, which measures
the proportion of 3-grams of ground truth answers that ap-
pear in the predictions. In this work, we use Character 3-
gram Recall as our primary evaluation metric. Additionally,
we also use an LLM for evaluation and report the results in Ap-
pendix A due to page limitations; they show a similar trend
to Character 3-gram Recall.
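A minimal sketch of Character 3-gram Recall as defined above, i.e. the proportion of the ground-truth answer's character 3-grams that appear in the prediction. Several details are assumptions rather than the reference implementation of Chirkova et al. (2024): the 3-grams are treated as a set, both strings are lowercased and stripped, and answers shorter than three characters fall back to a substring check.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def char_3gram_recall(answer, prediction, n=3):
    """Share of the answer's character n-grams that also occur in the prediction."""
    answer = answer.lower().strip()
    prediction = prediction.lower().strip()
    gold = char_ngrams(answer, n)
    if not gold:
        # Answer shorter than n characters: fall back to a substring match (assumption).
        return float(bool(answer) and answer in prediction)
    predicted = char_ngrams(prediction, n)
    return len(gold & predicted) / len(gold)


print(char_3gram_recall("Neo Cinema Network", "It was produced by the Neo Cinema Network."))
print(char_3gram_recall("Neo Cinema Network", "It was produced by Réseau Cinéma Néo."))
```

Unlike exact-match Accuracy, this score gives partial credit when the prediction contains a spelling variant or a partially translated form of the answer, which is the motivation given in the text.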
Furthermore, based on Character 3-gram Recall, we set
additional metrics for three evaluation tasks. For the tasks of
monolingual knowledge extraction and cross-lingual knowl-
edge transfer, we report the average Character 3-gram Recall
values across languages (AVG) and their variance (VAR) to
assess performance differences among languages. In the task
of multilingual knowledge selection, we introduce Selection
Entropy (SE) to evaluate the selection bias of RALMs across
languages. To accomplish this, we first normalize the recall
scores to derive a distribution, followed by the calculation of
its entropy. This process can be formally expressed as
$$\mathrm{SE} = -\sum_{i=1}^{n} p(i)\,\log p(i), \qquad p(i) = \frac{f(i)}{\sum_{j=1}^{n} f(j)},$$
where f (i) represents the Character 3-gram Recall for the
answer from the i-th language in a total of n languages.
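The Selection Entropy formula can be sketched directly from this definition. The logarithm base is not stated in the text, so `base` is a parameter here and the default of 10 is an assumption; with base 10, a perfectly uniform selection over eight languages yields log10(8) ≈ 0.903, which would be in line with the SE values reported in Table 2, but this is an inference and not something the paper confirms. The zero-recall guard is also an added convention.

```python
import math


def selection_entropy(recalls, base=10.0):
    """Entropy of the normalised per-language recall scores f(1), ..., f(n)."""
    total = sum(recalls)
    if total == 0:
        # No answer from any language was matched; define SE as 0 (assumption).
        return 0.0
    probs = [r / total for r in recalls]
    # Terms with p(i) = 0 contribute nothing to the sum.
    return -sum(p * math.log(p, base) for p in probs if p > 0)


uniform = [50.0] * 8            # every language's answer is recalled equally
english_only = [80.0] + [0.0] * 7  # only the English answer is ever recalled
print(selection_entropy(uniform))
print(selection_entropy(english_only))
```

A higher value means the model spreads its selections more evenly across languages, while a value near zero means one language dominates.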
3 Experiment
In this section, we first outline our experimental settings
(§3.1) and then present the RALMs’ overall performance
on our benchmark (§3.2). Next, we provide detailed anal-
yses of RALMs on the task of monolingual knowledge ex-
traction (§3.3), cross-lingual knowledge transfer (§3.4), and
multilingual knowledge selection (§3.5), respectively. Based
on these analyses, we also try several strategies to improve
RALMs’ performance on these tasks.
3.1 Settings
We choose six representative multilingual LLMs as RALMs,
including: 1) open-source LLMs: Aya-23-8B, Aya-23-35B
(Aryabumi et al. 2024), Qwen2-7B-Instruct,

Chunk 9 · 1,996 chars

ilingual knowledge selection (§3.5), respectively. Based
on these analyses, we also try several strategies to improve
RALMs’ performance on these tasks.
3.1 Settings
We choose six representative multilingual LLMs as RALMs,
including: 1) open-source LLMs: Aya-23-8B, Aya-23-35B
(Aryabumi et al. 2024), Qwen2-7B-Instruct, Qwen2-72B-
Instruct (Yang et al. 2024); and 2) closed-source LLMs:
GPT-3.5-Turbo, and GPT-4o5. During evaluation, RALMs
are instructed to respond based on the provided documents.
Other implementation details regarding the RAG process
can be found in Appendix B.
Model	Mono. AVG ↑	Mono. VAR ↓	Cross. AVG ↑	Cross. VAR ↓	Multi. SE ↑
Aya-23-8B	67.50	12.53	56.00	15.14	0.84
Aya-23-35B	77.69	4.84	58.91	16.84	0.86
Qwen2-7B-Instruct	78.45	3.55	53.62	16.90	0.81
Qwen2-72B-Instruct	84.51	4.31	66.82	15.00	0.89
GPT-3.5-Turbo	63.78	13.41	55.52	18.58	0.83
GPT-4o	82.93	6.05	68.46	8.48	0.90
Table 2: Performance of RALMs on our benchmark. We use
Mono., Cross., and Multi. to represent the tasks of monolin-
gual knowledge extraction, cross-lingual knowledge trans-
fer, and multilingual knowledge selection, respectively.
3.2 Overall Performance
Experimental results are reported in Table 2. Qwen2-72B-Instruct
achieves the best performance in monolingual knowledge extraction
and remains well balanced across languages. In cross-lingual
knowledge transfer, GPT-4o achieves the best performance and the
most consistent results across languages. Besides, the AVG values
are lower and the VAR values are higher for all RALMs in the
cross-lingual task than in the monolingual task, showing that
cross-lingual knowledge transfer is more challenging. In the
multilingual knowledge selection task, GPT-4o obtains the highest
selection entropy, indicating less selection bias among languages.
3.3 Experiments on Monolingual Knowledge
Extraction
This evaluation task presents RALMs with questions and
documents in the same language. Experimental results are
illustrated in Figure 3, where we can obtain the following
conclusions:
RALMs exhibit better knowledge extraction capa-
bilities in high-resource languages, while demonstrating
less satisfactory performance in relatively low-resource lan-
guages. For instance, GPT-3.5-Turbo achieves a Character
3-gram Recall of 72.80 in English and 72.87 in Chinese, but
only 32.65 in relatively low-resource Arabic, highlighting
significant disparities among languages.
Furthermore, we also observe that scaling RALMs within
the same series not only improves overall performance but
also narrows the performance gaps between languages. For
example, when the model size is increased from 8B to 35B,
the performance of Aya-23 is significantly boosted in the
previously underperforming languages such as Arabic, with
scores rising from 72.33 to 80.29. Meanwhile, the VAR
score of Aya-23 drops from 12.53 to 4.84, as shown in Table 2.
5We use gpt3.5-turbo-0125 and gpt-4o-2024-05-13 in this work.
[Figure 3 plot: axes en, fr, es, pt, zh, ko, ja, ar; ticks from 30 to 85; series Aya-23-8B, Aya-23-35B, Qwen2-7B-Instruct, Qwen2-72B-Instruct, GPT-3.5-Turbo, GPT-4o.]
Figure 3: Performance of RALMs in monolingual knowledge extraction. Note that Chinese and English are relatively
high-resource languages, while Arabic is a relatively low-resource language.
Data	Qwen2-7B-Instruct	GPT-3.5-Turbo
Arabic	79.21	32.65
Arabic → English	42.46	27.91
Arabic → Chinese	32.46	19.91
Table 3: Performance comparison of RALMs on original
Arabic data and its English and Chinese translations.
Based on the above experimental results, we naturally
pose one question: Can we enhance RALMs’ performance
by translating low-resource languages into high-resource
ones? To answer this question, we employ GPT-4o to trans-
late Arabic documents and QA pairs into English and Chi-
nese, and then assess the performance of Qwen2-7B-Instruct
and GPT-3.5-Turbo on the translations. The results shown in
Table 3 indicate that translation does not benefit RALMs.
For these results, we speculate that misinterpretations of key
entities during translation may lead to cascading errors, ultimately
resulting in inaccurate predictions of RALMs. Therefore,
although RALMs perform well in high-resource languages,
we recommend paying attention to the potential cascading
errors when translating from low-resource languages into
high-resource ones.
3.4 Experiments on Cross-lingual Knowledge
Transfer
This task requires RALMs to understand documents in dif-
ferent languages and then generate responses. In this group
of experiments, we consider two settings: 1) Strict Language
Setting, which requires correct answers and responses in the
query language, and 2) Flexible Language Setting, which
focuses solely on answer accuracy. From the experimental
results shown in Figure 4, we can reach the following conclusions:
Figure 4: The performance of RALMs on cross-lingual knowledge transfer. The x-axis (columns) represents the document language,
and the y-axis (rows) represents the query language. The first/second values represent the RALM performance in the Flexible/Strict
Language Setting. The colors indicate the performance of the strict language setting, with deeper blues representing stronger
performance. Note that results for other RALMs can be found in Appendix A.
Aya-23-8B
Query \ Doc	en	fr	es	pt	zh	ko	ja	ar
en	60.9	77.7/23.4	80.7/25.4	77.7/23.9	59.4/8.1	66.5/12.7	65.0/16.2	68.5/9.1
fr	78.2/24.4	82.2	67.5/24.9	68.5/24.9	36.0/8.6	34.0/8.6	39.1/7.1	46.2/7.6
es	79.7/18.8	70.6/22.8	40.6	75.6/31.5	32.0/7.1	38.1/9.6	25.9/9.6	34.5/9.6
pt	82.7/24.9	75.6/29.9	76.1/35.5	70.6	29.9/9.1	24.4/9.1	23.9/9.1	40.1/11.2
zh	54.8/3.0	55.3/7.1	45.7/6.1	44.7/11.2	94.4	53.8/8.1	43.6/11.7	44.2/8.6
ko	58.4/12.7	56.9/10.2	62.9/11.2	54.8/10.2	31.0/11.7	85.8	33.0/14.2	28.9/7.6
ja	57.9/8.6	53.3/12.2	52.8/11.2	49.8/10.7	33.5/13.2	29.9/14.7	81.7	18.3/12.2
ar	67.5/1.5	59.9/3.0	59.9/7.1	55.8/5.1	14.7/5.6	28.9/3.0	26.4/7.6	68.5
Qwen2-7B-Instruct
Query \ Doc	en	fr	es	pt	zh	ko	ja	ar
en	81.9	78.4/48.2	80.5/42.5	80.8/44.8	50.3/30.6	53.0/37.4	62.1/41.0	47.5/30.4
fr	80.4/45.4	81.1	77.8/41.2	71.6/44.0	44.6/25.4	42.7/30.2	49.1/30.7	36.6/29.6
es	77.7/39.7	78.7/41.9	79.2	77.1/61.1	43.8/24.2	39.7/23.5	51.0/25.1	38.0/25.3
pt	80.9/42.3	79.1/42.1	78.3/57.8	80.0	52.8/22.4	44.5/22.7	49.4/25.4	41.4/27.5
zh	62.6/16.0	57.2/11.7	61.7/10.0	57.2/12.6	77.1	29.4/19.1	41.4/17.8	21.0/11.8
ko	61.9/15.4	57.7/12.0	59.5/9.0	49.8/11.1	45.2/9.9	69.8	37.3/21.2	25.7/10.1
ja	59.8/18.8	58.9/13.2	57.7/11.2	49.0/12.9	48.4/15.9	27.6/24.0	79.2	15.0/9.4
ar	62.1/18.2	61.4/16.4	57.9/14.9	53.6/16.7	31.3/18.5	28.9/18.4	35.7/18.4	79.2
GPT-4o
Query \ Doc	en	fr	es	pt	zh	ko	ja	ar
en	77.0	80.2/46.7	81.6/39.1	81.7/43.0	74.5/3.6	69.1/5.7	80.3/10.8	63.9/8.8
fr	77.8/41.9	81.3	78.0/38.7	77.2/39.2	70.9/12.3	55.8/24.0	70.6/17.7	59.4/20.7
es	76.7/36.6	76.7/37.7	78.0	77.6/58.4	73.1/7.0	58.0/14.9	71.7/10.6	64.6/11.1
pt	75.7/37.5	76.1/37.7	79.4/57.4	80.5	72.8/6.9	59.5/12.2	71.0/14.4	61.1/17.8
zh	73.0/9.7	74.8/6.7	70.8/4.3	76.9/6.0	78.7	67.2/4.8	77.2/5.8	63.5/4.0
ko	74.2/14.7	71.1/10.6	69.0/11.7	72.4/13.7	61.8/21.9	73.7	66.2/25.9	51.2/21.5
ja	69.8/10.0	69.2/6.4	69.5/5.3	66.0/7.3	63.7/10.4	59.6/17.8	83.5	48.4/8.6
ar	59.2/14.5	57.8/13.2	61.6/13.5	58.5/19.1	53.4/14.0	51.6/17.2	60.8/11.2	68.0
Indo-European languages lead RALMs to provide an-
swers directly from documents, alleviating the challenge
of expressing answers across languages. Although we
instruct RALMs to respond in the query language, we find
that they perform poorly in the strict language setting (see
the second values in Figure 4). Conversely, under the flexible
language setting, their performance improves significantly,
as shown by the first values. This phenomenon suggests
that RALMs face difficulties in expressing answers across
languages, while directly quoting answers from documents in
other languages is comparatively easier. Additionally, the values
in the left half of the heatmap are significantly higher than
those on the right under the flexible language setting. For
example, when Qwen2-7B-Instruct uses Arabic as the query
language, the average Character 3-gram Recall score for
Indo-European languages is 58.75, markedly surpassing the
31.94 average for other languages. This indicates that Indo-
European languages lead RALMs to provide answers di-
rectly from documents in different languages, thus enhanc-
ing performance by avoiding the challenge of expressing an-
swers across languages.
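The two averages quoted above can be reproduced approximately from the Arabic-query row of the Qwen2-7B-Instruct panel in Figure 4 (flexible-setting values only). This is a worked check under the assumption that the averages are plain means over the displayed cell values; the variable names are illustrative.

```python
# Flexible-setting scores for Arabic queries, read from the Qwen2-7B-Instruct
# panel of Figure 4 (document language -> Character 3-gram Recall).
arabic_query_flexible = {
    "en": 62.1, "fr": 61.4, "es": 57.9, "pt": 53.6,
    "zh": 31.3, "ko": 28.9, "ja": 35.7,
}
indo_european = ["en", "fr", "es", "pt"]
others = [lang for lang in arabic_query_flexible if lang not in indo_european]

ie_avg = sum(arabic_query_flexible[lang] for lang in indo_european) / len(indo_european)
other_avg = sum(arabic_query_flexible[lang] for lang in others) / len(others)

print(round(ie_avg, 2))
print(round(other_avg, 2))
```

The Indo-European mean matches the reported 58.75. The mean over the remaining languages comes out slightly above the reported 31.94, presumably because the figure shows values rounded to one decimal place.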
Inspired by these findings, we pose the following ques-
tion: Can we enhance the cross-lingual performance by en-
couraging RALMs to provide answers from documents

Chunk 14 · 1,996 chars

answers di-
rectly from documents in different languages, thus enhanc-
ing performance by avoiding the challenge of expressing an-
swers across languages.
Inspired by these findings, we pose the following ques-
tion: Can we enhance the cross-lingual performance by en-
couraging RALMs to provide answers from documents in
different languages? To answer this, we refine the prompts
to require RALMs to provide answers within documents and
then respond in the query language. In contrast, the vanilla
prompt only instructs RALMs to respond in the query
language. Table 4 displays the performance of Qwen2-
7B-Instruct and GPT-3.5-Turbo using Vanilla and Refined
prompts across different document languages. Under the
flexible language setting, our refined prompts significantly
improve RALMs’ performance, particularly for GPT-3.5-
Turbo, which achieves an average Character 3-gram Recall
score of 59.05. This demonstrates the potential of RALMs
in cross-lingual document understanding.
Setting	Model	Prompt	en	pt	zh	ar	avg
Flexible	Qwen2-7B-Instruct	Vanilla	69.35	62.74	45.21	32.15	52.36
Flexible	Qwen2-7B-Instruct	Refined	68.89	64.87	51.55	32.66	54.49
Flexible	GPT-3.5-Turbo	Vanilla	65.14	61.43	34.86	32.38	48.46
Flexible	GPT-3.5-Turbo	Refined	73.78	72.56	50.85	38.99	59.05
Strict	Qwen2-7B-Instruct	Vanilla	27.98	29.03	20.99	20.60	24.65
Strict	Qwen2-7B-Instruct	Refined	29.04	29.76	23.30	19.97	25.52
Strict	GPT-3.5-Turbo	Vanilla	21.45	22.40	8.43	7.16	14.86
Strict	GPT-3.5-Turbo	Refined	26.83	26.09	12.86	10.45	19.06
Table 4: Comparison of performance between using the
Vanilla prompt and the Refined prompt.
Additionally, both Qwen2-7B-Instruct and GPT-3.5-Turbo show improvements
in the strict language setting, with GPT-3.5-Turbo’s
score increasing from 14.86 to 19.06. We believe this im-
provement stems from the refined prompt, which may split
the knowledge transfer process into two steps: first extract-
ing answers from the documents, and then translating them
into the query language. This process can be viewed as a
chain-of-thought (Wei et al. 2022), thereby alleviating diffi-
culties in cross-lingual transfer. Therefore, we advise that
encouraging RALMs to provide answers within docu-
ments in different languages can enhance their cross-
lingual performance.
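
The difference between the two prompting styles compared in Table 4 can be sketched as follows. This is a minimal illustration, not the prompts used in the experiments: the function name `build_prompt`, the document numbering, and the exact instruction wording are all assumptions.

```python
def build_prompt(query, documents, query_language, refined=False):
    """Build a RAG prompt (hypothetical wording, for illustration only).

    refined=False: only ask for an answer in the query language.
    refined=True:  first ask for the answer as it appears in the documents,
                   then ask for a response in the query language.
    """
    context = "\n\n".join(
        f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)
    )
    if refined:
        instruction = (
            "First quote the answer exactly as it appears in the documents. "
            f"Then respond in {query_language}."
        )
    else:
        instruction = f"Respond in {query_language}."
    return f"{context}\n\nQuestion: {query}\n{instruction}"


docs = ["La capitale de Futuria est Novaville."]
vanilla = build_prompt("What is the capital of Futuria?", docs, "English")
refined = build_prompt("What is the capital of Futuria?", docs, "English", refined=True)
```

The refined variant makes the two steps explicit (extract, then translate), which is the decomposition the paragraph above likens to a chain-of-thought.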
3.5 Experiments on Multilingual Knowledge
Selection
This task assesses RALMs’ selection bias toward languages
within a knowledge conflict scenario. In this group of ex-
periments, we present RALMs with eight randomly ordered
documents in different languages, each providing a distinct
answer6. By comparing the Character 3-gram Recall for an-
swers from different languages, we can assess RALMs’ lan-
guage preference. The results shown in Figure 5 lead us to
the following conclusions:
6 Since our data is fictional, we consider all these answers as
potentially correct.
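
To make the comparison concrete, the sketch below scores one model output against the candidate answer from each document language. The metric definition used here is an assumption (the share of the reference answer's character 3-grams that also occur in the prediction, scaled to 0-100); the helper names `char_ngrams`, `char_3gram_recall`, and `language_preference` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Multiset of character n-grams in text (empty if text is shorter than n)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_3gram_recall(prediction, reference, n=3):
    """Assumed definition: percentage of reference n-grams found in the prediction."""
    ref = char_ngrams(reference, n)
    if not ref:
        return 0.0
    pred = char_ngrams(prediction, n)
    overlap = sum(min(count, pred[gram]) for gram, count in ref.items())
    return 100.0 * overlap / sum(ref.values())


def language_preference(prediction, answers_by_lang):
    """Recall of each language's candidate answer within one model output."""
    return {
        lang: char_3gram_recall(prediction, answer)
        for lang, answer in answers_by_lang.items()
    }


scores = language_preference(
    "The answer is Paris.", {"en": "Paris", "zh": "巴黎市中心"}
)
```

A higher score for one language's answer indicates that the output drew on the document in that language.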

       en     fr     es     pt     zh     ko     ja     ar
en  17.89  10.90  11.89  12.02  10.61   7.73  11.31   9.32
fr  27.59  12.26  11.94  18.34  13.72  10.49  12.42  11.22
es  25.70  15.63  15.63  20.89  15.83  11.32  14.08  11.98
pt  29.32  16.31  15.42  22.39  14.93  11.14  14.38  13.27
zh  21.95  12.78   9.86  12.32  26.59  11.20  10.96   9.74
ko  21.52  12.27  11.86  17.00  14.67  15.39  15.85   9.62
ja  18.15   9.65   8.62  12.51  17.54  10.84  18.10   6.94
ar  23.68  12.81  12.37  13.52  13.31  10.22  16.31  20.56
Aya-23-8B

       en     fr     es     pt     zh     ko     ja     ar
en  35.18  17.73  18.36  17.89  28.63  15.51  13.95  12.67
fr  31.24  19.86  17.79  21.96  27.86  12.75  12.92  12.88
es  28.96  19.95  22.14  17.85  24.64  14.60  12.52  12.69
pt  34.49  22.82  24.44  26.95  23.01  14.80  15.93  16.53
zh  25.73  10.17  11.31  14.19  38.05   9.21   9.30   7.76
ko  37.11  18.86  18.90  18.15  26.16  17.30  17.26  15.59
ja  35.44  21.47  18.61  21.90  32.29  17.27  20.70  12.99
ar  28.37  18.40  18.30  16.29  15.47  14.59  16.29  23.71
Qwen2-7B-Instruct

       en     fr     es     pt     zh     ko     ja     ar
en  43.98  38.16  39.45  40.28  44.78  37.48  41.07  39.06
fr  35.20  42.53  27.66  28.25  29.37  23.51  27.12  25.34
es  24.54  20.22  24.74  21.71  23.37  16.83  20.39  21.67
pt  35.77  29.52  29.79  48.09  26.80  27.34  25.64  28.78
zh  36.21  28.49  29.59  28.51  50.95  26.67  28.92  31.21
ko  41.60  30.40  31.72  32.68  31.74  40.51  29.01  30.15
ja  51.56  40.54  43.85  43.37  42.83  37.44  45.50  40.26
ar  26.38  19.19  22.84  22.53  22.43  18.37  19.61  49.87
GPT-4o
Figure 5: The performance of RALMs on the task of multilingual knowledge selection. The x-axis and y-axis
represent the answer and query languages, respectively. Note that results of other RALMs can be found in Appendix A.
[Figure 6 plot: stacked bar chart. x-axis: languages (en, fr, es, pt, zh, ko, ja, ar); y-axis: Character 3-gram Recall; series: Qwen2-7B-Instruct, Qwen2-72B-Instruct, GPT-3.5-Turbo, GPT-4o, Aya-23-8B, Aya-23-35B.]
Figure 6: The stacked bar chart of Character 3-gram Recall
scores for answers from different languages in the multilingual
knowledge selection task, with higher scores indicating a
stronger preference for information from that language.
English benefits from selection bias and speaks louder
in multilingual contexts. To better analyze the RALMs’
preference for different languages, we aggregate the Char-
acter 3-gram Recall for answers from various languages and
present the cumulative results in Figure 6. Results imply
that English consistently ranks as the most preferred lan-
guage for providing answers, and other Indo-European lan-
guages (such as French, Spanish, and Portuguese) follow
for most RALMs. An exception is Qwen2, which exhibits a
greater preference for Chinese even though English remains
the primary choice in most cases. We believe this is because
Qwen2 has been specifically enhanced for Chinese.
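
One plausible way to aggregate preference per answer language, assuming a simple sum over query languages for a single model, is sketched below using the Aya-23-8B values from Figure 5. The exact aggregation behind Figure 6 is not spelled out here, so the function `total_recall_by_answer_language` is an illustrative assumption.

```python
LANGS = ["en", "fr", "es", "pt", "zh", "ko", "ja", "ar"]

# Aya-23-8B values from Figure 5 (rows: query language, columns: answer language).
AYA_23_8B = [
    [17.89, 10.90, 11.89, 12.02, 10.61, 7.73, 11.31, 9.32],
    [27.59, 12.26, 11.94, 18.34, 13.72, 10.49, 12.42, 11.22],
    [25.70, 15.63, 15.63, 20.89, 15.83, 11.32, 14.08, 11.98],
    [29.32, 16.31, 15.42, 22.39, 14.93, 11.14, 14.38, 13.27],
    [21.95, 12.78, 9.86, 12.32, 26.59, 11.20, 10.96, 9.74],
    [21.52, 12.27, 11.86, 17.00, 14.67, 15.39, 15.85, 9.62],
    [18.15, 9.65, 8.62, 12.51, 17.54, 10.84, 18.10, 6.94],
    [23.68, 12.81, 12.37, 13.52, 13.31, 10.22, 16.31, 20.56],
]


def total_recall_by_answer_language(matrix, langs=LANGS):
    """Sum each column, i.e. accumulate recall per answer language."""
    return {
        lang: sum(row[col] for row in matrix) for col, lang in enumerate(langs)
    }


totals = total_recall_by_answer_language(AYA_23_8B)
ranking = sorted(totals, key=totals.get, reverse=True)
```

For this model, `ranking[0]` is the answer language with the largest accumulated recall.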
Meanwhile, we observe that RALMs favor query languages.
As shown in Figure 5, the values on the diagonals,
which measure RALMs’ preference for the query language,
are typically higher. This trend is particularly pronounced
for GPT-4o, whose preference for the query language
can even exceed that for English. For example, when
the query language is Portuguese, GPT-4o shows a signifi-
cantly higher recall score for Portuguese (48.09) compared
to English (35.77).
[Figure 7 plots: (a) Qwen2-7B-Instruct, (b) GPT-3.5-Turbo. x-axis: k (1 to 5); y-axis: Character 3-gram Recall; series: en query: English answer, en query: k-voting answer, zh query: English answer, zh query: k-voting answer.]
Figure 7: The impact of increasing the number of Non-
English documents (denoted as k) on RALMs.

Based on the experimental results, we find that RALMs
show a significant selection bias and a marked preference
for English. This tendency may result in the neglect of non-
English documents, which is harmful to multilingual RAG.
To address this issue, we aim to explore the following ques-
tion: How can we mitigate the selection bias of RALMs
toward English? Inspired by previous studies, we explore
leveraging two well-known characteristics of RAG: the ma-
jority rule and its sensitivity to document positions to tackle
this issue. As in the prior experiment, we still use Qwen2-
7B-Instruct and GPT-3.5-Turbo as our RALMs.

[Figure 8 plot: x-axis: Position of English Document (1 to 8); y-axis: Character 3-gram Recall; series: Qwen2-7B-Instruct: en, Qwen2-7B-Instruct: sum_w/o_en, GPT-3.5-Turbo: en, GPT-3.5-Turbo: sum_w/o_en.]
Figure 8: The impact of altering the position of English
documents on RALMs.
The majority rule suggests that RALMs favor answers
that appear more frequently in the documents (Jin et al.
2024). Based on this characteristic, we construct an experimental
setup that includes k non-English documents containing
the same answer, referred to as the k-voting answer, while
the remaining documents still present different answers. By
comparing the Character 3-gram Recall for English answers
and the k-voting answer, we can assess how many non-
English documents are needed to alleviate the RALMs’ pref-
erence for English.
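
A minimal sketch of how such a k-voting document set could be assembled is given below. It assumes one candidate answer per language, picks k non-English documents at random, and reuses the first picked document's answer as the shared k-voting answer; the function name `make_k_voting_set` and these choices are illustrative assumptions rather than the authors' procedure.

```python
import random


def make_k_voting_set(answers_by_lang, k, seed=0):
    """Return (documents, voting_answer) for a k-voting setup.

    answers_by_lang maps a language code to that document's original answer.
    k non-English documents are rewritten to share one answer; the remaining
    documents, including the English one, keep their own distinct answers.
    """
    rng = random.Random(seed)
    non_english = [lang for lang in answers_by_lang if lang != "en"]
    if not 1 <= k <= len(non_english):
        raise ValueError("k must be between 1 and the number of non-English documents")
    voters = rng.sample(non_english, k)
    voting_answer = answers_by_lang[voters[0]]
    documents = [
        {"lang": lang, "answer": voting_answer if lang in voters else answer}
        for lang, answer in answers_by_lang.items()
    ]
    rng.shuffle(documents)  # documents are presented in random order
    return documents, voting_answer


answers = {lang: f"answer-{lang}" for lang in ["en", "fr", "es", "pt", "zh", "ko", "ja", "ar"]}
docs, voting_answer = make_k_voting_set(answers, k=3)
```

Scoring a model output against both `voting_answer` and the English answer then gives the two curves compared for each value of k.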
Figure 7 shows the experimental results when using Chinese
and English queries. The results indicate that when
k is low, the Character 3-gram Recall for English answers
is significantly higher than that for k-voting answers. However,
as k increases, the scores for k-voting answers rise while
those for the English answers decline. When k reaches 3, the
scores for English answers and k-voting answers are comparable
for Qwen2-7B-Instruct. We also see the same trend with
GPT-3.5-Turbo when k is 4. Based on these observations,
we recommend introducing more non-English documents to
mitigate the selection bias of RALMs toward English.
Moreover, RALMs are known for their sensitivity to document
positions. Liu et al. (2024) reveal that the position
of the gold documents within the context can significantly
impact RALMs’ performance. To investigate this, we alter
the positions of English documents in the context and conduct
experiments with English queries. The results presented
in Figure 8 show that when English documents
are placed in the first position, RALMs achieve the
highest Character 3-gram Recall score for English answers
(indicated by en), while the score for non-English answers
(indicated by sum_w/o_en) is typically low. In contrast,
when English documents occupy non-initial positions, the
scores exhibit an opposite trend, indicating a decrease in
RALMs’ selection bias towards English and a heightened
emphasis on non-English documents. Thus, we advise po-
sitioning English documents in non-initial positions to help
mitigate RALMs’ selection bias toward English.
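
The positioning advice can be applied when assembling the context, as in the sketch below. It assumes documents are given as (language, text) pairs and that positions are 1-based, as on the x-axis of Figure 8; the helper name `place_english` is hypothetical.

```python
def place_english(documents, position):
    """Reorder documents so English ones start at the given 1-based position.

    The relative order of the non-English documents is preserved.
    """
    english = [doc for doc in documents if doc[0] == "en"]
    others = [doc for doc in documents if doc[0] != "en"]
    index = max(0, min(position - 1, len(others)))
    return others[:index] + english + others[index:]


documents = [(lang, f"text-{lang}") for lang in ["en", "fr", "es", "pt", "zh", "ko", "ja", "ar"]]
first = place_english(documents, 1)   # English document leads the context
middle = place_english(documents, 4)  # English document in a non-initial position
last = place_english(documents, 8)    # English document closes the context
```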
4 Related Work
Retrieval-Augmented Generation. RAG enhances LLMs
by integrating relevant texts from external knowledge resources.
Most current RALMs focus on query refinement
or better document utilization. For example,
Rewrite-Retrieve-Read (Ma et al. 2023) trains a small language
model as the rewriter to better align the query to the retriever
and the LLM reader. Chain of Note (Yu et al. 2023) gener-
ates reading notes for retrieved documents to improve the
robustness of RALMs in facing noisy, irrelevant documents
and in handling unknown scenarios. However, these studies
primarily focus on English, lacking exploration in multilin-
gual RAG scenarios.
Although many benchmarks (Chen et al. 2023a; Lyu et al.
2024; Liu et al. 2023; Thakur et al. 2024) have been proposed
to evaluate RALMs’ performance, multilingual benchmarks
like MKQA (Longpre, Lu, and Daiber 2021) and XOR-TyDi
QA (Asai et al. 2021) are time-sensitive and have potential
leakage risk, while NoMIRACL (Thakur et al. 2024) lacks
ground truth answers, making it unable to assess RALMs in
comprehension and response generation.
Multilingualism in LLM. Multilingual Large Language
Models (MLLMs) achieve remarkable success thanks to
multilingual datasets such as mC4 (Xue et al. 2021), Cul-
turaX (Nguyen et al. 2023), Aya Dataset (Singh et al. 2024),
and MultilingualSIFT (Chen et al. 2023b). Along with the
developments of MLLMs, related analyses have also been
carried out (Shi et al. 2022; Yuan et al. 2024; Yang et al.
2023; Qin et al. 2024; Xu et al. 2024). For instance, Yuan
et al. (2024) analyze the multilingual capability of LLMs
from a vocabulary-sharing perspective, while Qin et al.
(2024) study alignment methods from parameter-tuning and
parameter-frozen aspects.
Most recently, Chirkova et al. (2024) conduct multilingual
RAG experiments on MKQA (Longpre, Lu, and
Daiber 2021) and XOR-TyDi QA (Asai et al. 2021), discovering
code-switching phenomena. Sharma, Murray, and
Xiao (2024) create a machine-translated dataset and find
that RALMs tend to prefer documents in the query
language. Compared to these studies, we conduct a comprehensive
analysis based on the proposed benchmark with three
evaluation tasks: monolingual knowledge extraction, cross-lingual
knowledge transfer, and multilingual knowledge selection.
We also offer three pieces of advice based on the observations
for better multilingual RAG.
5 Conclusion
In this paper, we propose Futurepedia, a new time-insensitive
benchmark for multilingual RAG. Our
benchmark includes parallel documents and corresponding
QA in eight languages, where three evaluation tasks are in-
troduced: monolingual knowledge extraction, cross-lingual
knowledge transfer, and multilingual knowledge selection.
Then, we conduct experiments to evaluate the commonly
used multilingual RALMs. Our findings reveal significant
linguistic inequality: 1) high-resource languages stand out
in the task of monolingual knowledge extraction; 2) Indo-
European languages lead RALMs to provide answers di-
rectly from documents in cross-lingual knowledge transfer,
alleviating the challenge of expressing answers across lan-
guages; 3) English benefits from RALMs’ selection bias and
speaks louder in multilingual knowledge selection. Based on
these findings, we explore several strategies to improve multilingual
RALMs. In the future, we will explore ways to mitigate
or leverage the linguistic inequality of multilingual RAG.
References
Aryabumi, V.; Dang, J.; Talupuru, D.; Dash, S.; Cairuz,
D.; Lin, H.; Venkitesh, B.; Smith, M.; Campos, J. A.;
Tan, Y. C.; Marchisio, K.; Bartolo, M.; Ruder, S.; Lo-
catelli, A.; Kreutzer, J.; Frosst, N.; Gomez, A.; Blunsom,
P.; Fadaee, M.; Üstün, A.; and Hooker, S. 2024. Aya 23:
Open Weight Releases to Further Multilingual Progress.
arXiv:2405.15032.
Asai, A.; Kasai, J.; Clark, J.; Lee, K.; Choi, E.; and Ha-
jishirzi, H. 2021. XOR QA: Cross-lingual Open-Retrieval
Question Answering. In NAACL 2021.
Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H.
2024. Self-RAG: Learning to Retrieve, Generate, and Cri-
tique through Self-Reflection. In ICLR 2024.
Chen, J.; Lin, H.; Han, X.; and Sun, L. 2023a. Benchmark-
ing Large Language Models in Retrieval-Augmented Gen-
eration. In AAAI 2024.
Chen, Z.; Yan, S.; Liang, J.; Jiang, F.; Wu, X.; Yu, F.; Chen,
G. H.; Chen, J.; Zhang, H.; Jianquan, L.; Xiang, W.; and
Wang, B. 2023b. MultilingualSIFT: Multilingual Super-
vised Instruction Fine-tuning.
Chirkova, N.; Rau, D.; Déjean, H.; Formal, T.; Clinchant, S.;
and Nikoulina, V. 2024. Retrieval-augmented generation in
multilingual settings. arXiv:2407.01463.
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai,
Y.; Sun, J.; Wang, M.; and Wang, H. 2024. Retrieval-
Augmented Generation for Large Language Models: A Sur-
vey. arXiv:2312.10997.
Jin, Z.; Cao, P.; Chen, Y.; Liu, K.; Jiang, X.; Xu, J.; Qiuxia,
L.; and Zhao, J. 2024. Tug-of-War between Knowledge:
Exploring and Resolving Knowledge Conflicts in Retrieval-
Augmented Language Models. In LREC-COLING 2024.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.;
Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel,
T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented
Generation for Knowledge-Intensive NLP Tasks. In
NeurIPS 2020.
Lin, X. V.; Chen, X.; Chen, M.; Shi, W.; Lomeli, M.; James,
R.; Rodriguez, P.; Kahn, J.; Szilvasy, G.; Lewis, M.; Zettlemoyer,
L.; and Yih, W.-t. 2024. RA-DIT: Retrieval-Augmented
Dual Instruction Tuning. In ICLR 2024.
Liu, N. F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua,
M.; Petroni, F.; and Liang, P. 2024. Lost in the Middle: How
Language Models Use Long Contexts. TACL.
Liu, Y.; Huang, L.; Li, S.; Chen, S.; Zhou, H.; Meng, F.;
Zhou, J.; and Sun, X. 2023. RECALL: A Benchmark for
LLMs Robustness against External Counterfactual Knowl-
edge. arXiv:2311.08147.
Longpre, S.; Lu, Y.; and Daiber, J. 2021. MKQA: A Linguis-
tically Diverse Benchmark for Multilingual Open Domain
Question Answering. arXiv:2007.15207.
Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang,
W.; Wu, H.; Liu, H.; Xu, T.; and Chen, E. 2024.
CRUD-RAG: A Comprehensive Chinese Benchmark for
Retrieval-Augmented Generation of Large Language Mod-
els. arXiv:2401.17043.
Ma, X.; Gong, Y.; He, P.; Zhao, H.; and Duan, N. 2023.
Query Rewriting for Retrieval-Augmented Large Language
Models. In EMNLP 2023.
Nguyen, T.; Nguyen, C. V.; Lai, V. D.; Man, H.; Ngo,
N. T.; Dernoncourt, F.; Rossi, R. A.; and Nguyen, T. H.
2023. CulturaX: A Cleaned, Enormous, and Multilingual
Dataset for Large Language Models in 167 Languages.
arXiv:2309.09400.
Qin, L.; Chen, Q.; Zhou, Y.; Chen, Z.; Li, Y.; Liao, L.; Li,
M.; Che, W.; and Yu, P. S. 2024. Multilingual Large Lan-
guage Model: A Survey of Resources, Taxonomy and Fron-
tiers. arXiv:2404.04925.
Saad-Falcon, J.; Khattab, O.; Potts, C.; and Zaharia,
M. 2024. ARES: An Automated Evaluation Frame-
work for Retrieval-Augmented Generation Systems.
arXiv:2311.09476.
Sharma, N.; Murray, K.; and Xiao, Z. 2024. Faux Polyglot:
A Study on Information Disparity in Multilingual Large
Language Models. arXiv:2407.05502.
Shi, F.; Suzgun, M.; Freitag, M.; Wang, X.; Srivats, S.;
Vosoughi, S.; Chung, H. W.; Tay, Y.; Ruder, S.; Zhou, D.;
Das, D.; and Wei, J. 2022. Language Models are Multilin-
gual Chain-of-Thought Reasoners. arXiv:2210.03057.
Singh, S.; Vargus, F.; Dsouza, D.; Karlsson, B. F.; Mahendiran,
A.; Ko, W.-Y.; Shandilya, H.; Patel, J.; Mataciunas,
D.; O’Mahony, L.; Zhang, M.; Hettiarachchi, R.; Wilson,
J.; Machado, M.; Moura, L. S.; Krzemiński, D.; Fadaei,
H.; Ergün, I.; Okoh, I.; Alaagib, A.; Mudannayake, O.;
Alyafeai, Z.; Chien, V. M.; Ruder, S.; Guthikonda, S.; Alghamdi,
E. A.; Gehrmann, S.; Muennighoff, N.; Bartolo, M.;
Kreutzer, J.; Üstün, A.; Fadaee, M.; and Hooker, S. 2024.
Aya Dataset: An Open-Access Collection for Multilingual
Instruction Tuning. arXiv:2402.06619.
Thakur, N.; Bonifacio, L.; Zhang, X.; Ogundepo, O.; Ka-
malloo, E.; Alfonso-Hermelo, D.; Li, X.; Liu, Q.; Chen,
B.; Rezagholizadeh, M.; and Lin, J. 2024. NoMIRACL:
Knowing When You Don’t Know for Robust Multilingual
Retrieval-Augmented Generation. arXiv:2312.11361.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi,
E. H.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-Thought
Prompting Elicits Reasoning in Large Language Models. In
NeurIPS 2022.
Xu, Y.; Hu, L.; Zhao, J.; Qiu, Z.; Ye, Y.; and Gu, H. 2024. A
Survey on Multilingual Large Language Models: Corpora,
Alignment, and Bias. arXiv:2404.00929.
Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.;
Siddhant, A.; Barua, A.; and Raffel, C. 2021. mT5: A mas-
sively multilingual pre-trained text-to-text transformer. In
NAACL 2021.
Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.;
Li, C.; Li, C.; Liu, D.; Huang, F.; Dong, G.; Wei, H.; Lin,
H.; Tang, J.; Wang, J.; Yang, J.; Tu, J.; Zhang, J.; Ma, J.;
Yang, J.; Xu, J.; Zhou, J.; Bai, J.; He, J.; Lin, J.; Dang, K.;
Lu, K.; Chen, K.; Yang, K.; Li, M.; Xue, M.; Ni, N.; Zhang,
P.; Wang, P.; Peng, R.; Men, R.; Gao, R.; Lin, R.; Wang, S.;
Bai, S.; Tan, S.; Zhu, T.; Li, T.; Liu, T.; Ge, W.; Deng, X.;
Zhou, X.; Ren, X.; Zhang, X.; Wei, X.; Ren, X.; Liu, X.;
Fan, Y.; Yao, Y.; Zhang, Y.; Wan, Y.; Chu, Y.; Liu, Y.; Cui,
Z.; Zhang, Z.; Guo, Z.; and Fan, Z. 2024. Qwen2 Technical
Report. arXiv:2407.10671.
Yang, W.; Li, C.; Zhang, J.; and Zong, C. 2023. Big-
Translate: Augmenting Large Language Models with
Multilingual Translation Capability over 100 Languages.
arXiv:2305.18098.
Yu, W.; Zhang, H.; Pan, X.; Ma, K.; Wang, H.; and Yu, D.
2023. Chain-of-Note: Enhancing Robustness in Retrieval-
Augmented Language Models. arXiv:2311.09210.
Yuan, F.; Yuan, S.; Wu, Z.; and Li, L. 2024. How Vocabulary
Sharing Facilitates Multilingualism in LLaMA? In Findings
of ACL 2024.

A Supplementary Experimental Results
A.1 Cross-Lingual Knowledge Transfer
    en          fr          es          pt          zh          ko          ja          ar
en  60.9        77.7/23.4   80.7/25.4   77.7/23.9   59.4/8.1    66.5/12.7   65.0/16.2   68.5/9.1
fr  78.2/24.4   82.2        67.5/24.9   68.5/24.9   36.0/8.6    34.0/8.6    39.1/7.1    46.2/7.6
es  79.7/18.8   70.6/22.8   40.6        75.6/31.5   32.0/7.1    38.1/9.6    25.9/9.6    34.5/9.6
pt  82.7/24.9   75.6/29.9   76.1/35.5   70.6        29.9/9.1    24.4/9.1    23.9/9.1    40.1/11.2
zh  54.8/3.0    55.3/7.1    45.7/6.1    44.7/11.2   94.4        53.8/8.1    43.6/11.7   44.2/8.6
ko  58.4/12.7   56.9/10.2   62.9/11.2   54.8/10.2   31.0/11.7   85.8        33.0/14.2   28.9/7.6
ja  57.9/8.6    53.3/12.2   52.8/11.2   49.8/10.7   33.5/13.2   29.9/14.7   81.7        18.3/12.2
ar  67.5/1.5    59.9/3.0    59.9/7.1    55.8/5.1    14.7/5.6    28.9/3.0    26.4/7.6    68.5
Aya-23-8B

    en          fr          es          pt          zh          ko          ja          ar
en  74.6        75.2/43.4   75.5/37.2   77.5/42.6   71.5/9.7    64.9/16.1   78.6/11.2   56.6/12.6
fr  79.9/43.3   79.6        79.2/40.2   74.2/39.9   48.6/20.6   46.5/27.4   52.8/27.9   42.1/25.9
es  80.6/37.1   75.9/40.7   69.0        75.5/58.5   39.6/20.3   37.3/24.4   42.0/28.8   32.2/25.2
pt  79.5/37.4   73.8/41.0   77.3/59.0   79.5        40.0/24.4   38.9/31.4   45.3/30.0   34.0/32.1
zh  77.9/4.9    66.4/6.6    66.8/5.4    69.6/7.3    81.6        35.5/22.5   55.7/14.5   28.1/11.8
ko  72.6/9.6    66.5/11.7   69.5/10.0   61.1/12.5   48.3/16.9   72.5        56.6/25.8   24.3/18.2
ja  78.9/7.6    66.8/10.8   65.4/12.1   61.6/11.2   51.2/20.2   44.8/36.8   84.5        24.4/19.2
ar  76.0/16.2   69.0/15.2   72.9/17.0   68.7/24.5   39.7/24.9   42.5/30.9   42.9/28.4   80.3
Aya-23-35B

    en          fr          es          pt          zh          ko          ja          ar
en  81.9        78.4/48.2   80.5/42.5   80.8/44.8   50.3/30.6   53.0/37.4   62.1/41.0   47.5/30.4
fr  80.4/45.4   81.1        77.8/41.2   71.6/44.0   44.6/25.4   42.7/30.2   49.1/30.7   36.6/29.6
es  77.7/39.7   78.7/41.9   79.2        77.1/61.1   43.8/24.2   39.7/23.5   51.0/25.1   38.0/25.3
pt  80.9/42.3   79.1/42.1   78.3/57.8   80.0        52.8/22.4   44.5/22.7   49.4/25.4   41.4/27.5
zh  62.6/16.0   57.2/11.7   61.7/10.0   57.2/12.6   77.1        29.4/19.1   41.4/17.8   21.0/11.8
ko  61.9/15.4   57.7/12.0   59.5/9.0    49.8/11.1   45.2/9.9    69.8        37.3/21.2   25.7/10.1
ja  59.8/18.8   58.9/13.2   57.7/11.2   49.0/12.9   48.4/15.9   27.6/24.0   79.2        15.0/9.4
ar  62.1/18.2   61.4/16.4   57.9/14.9   53.6/16.7   31.3/18.5   28.9/18.4   35.7/18.4   79.2
Qwen2-7B-Instruct

    en          fr          es          pt          zh          ko          ja          ar
en  87.8        83.1/51.1   82.8/42.2   84.8/46.6   71.5/28.9   67.4/32.9   73.4/38.7   64.7/23.9
fr  82.0/46.2   86.6        81.4/41.5   80.7/45.4   57.3/34.2   52.9/36.5   56.7/37.1   48.4/35.9
es  83.5/42.6   82.2/46.2   86.1        83.8/64.9   55.4/33.7   50.4/37.6   58.4/39.7   48.4/37.0
pt  83.9/42.6   81.8/42.0   80.2/59.2   86.3        63.5/31.6   54.2/29.7   60.0/35.6   52.4/34.1
zh  78.6/3      81.1/2.6    79.3/3.5    80.6/4.2    84.8        58.4/10.7   74.1/6.2    55.0/8.9
ko  77.7/16.5   77.4/15.3   76.3/12.1   76.0/12.5   42.1/23.4   73.7        53.8/24.9   33.4/18.9
ja  77.4/14.2   78.5/13.9   76.8/12.3   74.5/14.3   49.6/21.8   51.5/30.0   87.6        42.8/25.5
ar  77.8/23.9   78.6/35.9   76.0/37.0   73.4/34.1   37.3/8.9    35.3/18.9   43.1/21.9   83.1
Qwen2-72B-Instruct

    en          fr          es          pt          zh          ko          ja          ar
en  72.8        65.5/39.6   69.9/35.2   67.5/38.6   40.7/5.9    45.6/7.9    58.2/15.3   48.9/3.2
fr  73.1/41.7   56.0        74.0/35.3   72.0/36.0   50.2/13.1   52.1/12.7   63.6/17.8   46.7/10.1
es  63.9/32.8   64.7/33.3   61.6        66.6/50.7   30.8/10.9   25.8/10.4   39.7/15.9   31/10.6
pt  63.1/30.3   57.1/30.9   65.3/45.9   72.9        36.9/8.1    35.7/5.8    35.4/9.6    36/4.1
zh  62.9/7.0    55.1/3.4    61.1/5.0    63.3/6.8    71.3        40.3/4.4    52.1/8.9    15.8/3.0
ko  72.5/13.8   61.1/6.8    64.8/6.9    67.7/8.9    39.4/11.5   66.7        42.0/25.2   30.5/7.3
ja  74.6/16.0   62.6/9.7    64.4/10.0   65/9.7      41.8/6.1    35.2/13.9   77.0        17.7/7.0
ar  45.9/8.6    26.8/4.3    31.6/4.5    28.1/6.1    4.3/3.4     6.2/3.7     7.5/3.7     32.6
GPT-3.5-Turbo

    en          fr          es          pt          zh          ko          ja          ar
en  77.0        80.2/46.7   81.6/39.1   81.7/43.0   74.5/3.6    69.1/5.7    80.3/10.8   63.9/8.8
fr  77.8/41.9   81.3        78.0/38.7   77.2/39.2   70.9/12.3   55.8/24.0   70.6/17.7   59.4/20.7
es  76.7/36.6   76.7/37.7   78.0        77.6/58.4   73.1/7.0    58.0/14.9   71.7/10.6   64.6/11.1
pt  75.7/37.5   76.1/37.7   79.4/57.4   80.5        72.8/6.9    59.5/12.2   71.0/14.4   61.1/17.8
zh  73.0/9.7    74.8/6.7    70.8/4.3    76.9/6.0    78.7        67.2/4.8    77.2/5.8    63.5/4.0
ko  74.2/14.7   71.1/10.6   69.0/11.7   72.4/13.7   61.8/21.9   73.7        66.2/25.9   51.2/21.5
ja  69.8/10.0   69.2/6.4    69.5/5.3    66.0/7.3    63.7/10.4   59.6/17.8   83.5        48.4/8.6
ar  59.2/14.5   57.8/13.2   61.6/13.5   58.5/19.1   53.4/14.0   51.6/17.2   60.8/11.2   68.0
GPT-4o

Figure 9: Performance of RALMs on cross-lingual knowledge transfer using Character 3-gram Recall as metric. Note the
first/second values represent the RALM performance in the Flexible/Strict Language Setting. The colors indicate the performance
of the strict language setting, with deeper blues representing a stronger performance.

    en          fr          es          pt          zh          ko          ja          ar
en  63.9        83.2/11.7   82.2/17.3   82.7/6.1    68.0/12.7   72.1/16.2   74.1/24.9   73.6/17.3
fr  80.7/16.2   82.6        61.4/15.2   64.0/9.6    39.6/14.2   46.7/10.7   44.2/11.7   57.9/19.3
es  83.2/12.2   77.2/8.6    47.6        82.7/9.1    43.1/9.6    55.8/11.7   49.2/11.7   49.8/17.8
pt  59.9/17.3   72.1/11.7   60.9/29.4   76.5        28.4/12.7   38.6/12.2   36.0/14.7   41.6/24.4
zh  49.8/4.1    41.6/6.6    36.5/11.7   45.2/4.6    91.1        34.5/9.1    38.1/14.2   14.2/15.2
ko  53.3/8.1    55.8/8.1    59.4/14.7   50.2/7.6    53.8/14.7   83.8        51.3/23.4   29.4/19.8
ja  63.5/8.1    55.3/10.2   56.9/16.2   55.3/6.6    51.3/11.2   45.2/25.9   81.4        20.3/19.8
ar  79.7/1.5    69.0/3.0    75.6/11.2   72.6/4.6    55.3/6.1    50.8/6.6    43.1/7.6    70.5
Aya-23-8B

    en          fr          es          pt          zh          ko          ja          ar
en  78.7        80.2/22.3   81.7/29.9   84.8/9.6    80.2/17.8   77.2/26.4   84.3/27.9   58.9/33.0
fr  85.8/31.5   85.8        71.1/28.4   65.5/19.3   40.1/22.3   49.2/20.3   57.4/21.8   48.2/35.5
es  86.8/27.4   80.7/16.8   74.1        83.8/15.7   45.7/14.7   53.8/19.8   58.4/22.3   39.6/33.5
pt  69.0/34.5   70.0/20.8   65.0/52.8   83.8        28.4/20.8   44.7/21.8   41.1/23.4   30.5/45.2
zh  74.6/9.1    59.4/13.2   54.8/22.3   64.0/10.7   92.9        43.1/22.3   53.8/28.4   28.9/35.0
ko  65.5/16.2   64.0/18.8   66.0/22.8   57.9/15.7   44.7/34.5   87.8        58.9/54.8   23.4/49.2
ja  71.1/17.8   62.4/22.3   63.5/33.5   58.4/14.7   51.8/25.4   50.2/52.3   85.3        24.9/44.2
ar  83.8/8.6    79.2/15.2   80.2/24.9   74.6/11.7   42.6/18.3   54.8/19.3   53.8/22.3   81.2
Aya-23-35B

    en          fr          es          pt          zh          ko          ja          ar
en  89.3        83.2/79.7   85.8/93.9   86.8/39.6   48.7/58.4   53.8/111.2  64.0/98.0   53.3/112.2
fr  87.3/78.2   85.3        61.9/88.8   53.3/51.8   27.4/52.3   33.5/74.1   38.6/60.9   35.0/90.4
es  84.3/70.6   82.7/46.2   82.2        83.2/56.4   42.6/42.1   38.6/74.6   55.8/61.4   34.0/91.4
pt  59.4/88.8   72.1/64.5   60.9/146.2  85.8        32.0/56.9   38.6/71.6   43.6/64.0   27.9/108.1
zh  59.4/22.8   52.3/33.5   55.8/44.7   53.3/24.4   86.3        35.5/72.6   40.1/76.7   23.4/68.0
ko  48.2/41.1   48.2/44.7   49.2/43.1   39.1/28.9   38.1/61.4   79.2        39.6/118.3  16.2/86.3
ja  51.3/52.8   51.8/52.8   49.8/60.9   40.1/34.0   53.8/60.4   29.9/142.1  75.6        11.7/78.7
ar  72.6/33.5   67.5/49.2   69.0/48.7   64.0/25.9   33.5/37.6   36.5/53.3   46.2/46.7   75.6
Qwen2-7B-Instruct

    en          fr          es          pt          zh          ko          ja          ar
en  90.9        87.8/107.1  88.3/133.5  92.4/54.8   72.6/62.9   79.7/145.7  78.7/138.6  71.6/165.5
fr  89.8/97.5   90.3        86.3/126.9  75.6/66.0   41.1/55.8   42.1/94.9   40.1/79.7   49.8/125.4
es  92.4/86.8   88.8/77.2   91.9        90.9/75.1   52.3/45.2   49.8/92.9   59.4/82.7   47.2/136.0
pt  74.6/108.6  56.9/92.4   63.5/202.0  91.4        49.2/59.9   47.2/87.8   55.8/86.8   43.6/151.8
zh  70.6/38.1   68.0/50.2   70.6/68.5   65.0/42.1   95.4        53.3/91.4   72.1/98.0   49.8/102.5
ko  68.0/57.9   61.4/63.5   57.4/67.0   63.5/39.1   39.6/71.6   87.3        55.3/155.3  25.4/122.8
ja  76.1/76.7   60.4/71.1   68.5/88.3   69.0/52.8   51.3/67.0   61.4/175.1  89.8        36.0/117.8
ar  88.3/49.2   81.7/75.6   85.8/70.0   81.7/40.1   47.7/43.1   52.3/67.0   60.9/64.5   80.2
Qwen2-72B-Instruct

    en          fr          es          pt          zh          ko          ja          ar
en  81.7        69.0/38.6   75.6/52.8   72.1/19.8   42.1/31.0   53.8/62.4   58.9/54.8   47.2/48.2
fr  79.2/45.2   58.4        77.7/49.2   77.7/28.4   52.3/33.0   56.9/35.5   66.0/35.5   46.2/42.6
es  69.5/41.1   68.0/25.4   68.0        70.6/31.5   31.0/27.4   27.4/39.1   38.1/36.0   27.9/39.6
pt  69.5/54.8   60.4/37.1   71.1/84.8   75.1        41.1/35.0   41.1/40.1   38.1/37.6   33.5/56.9
zh  67.0/10.2   60.9/19.8   66.5/28.4   68.5/16.2   78.2        50.2/39.1   56.9/38.1   17.8/35.5
ko  78.2/19.8   65.5/26.9   72.6/28.4   74.6/17.3   47.2/40.6   76.1        50.8/71.1   32.5/50.2
ja  80.2/25.9   67.5/33.5   70.6/43.1   73.1/19.3   51.8/37.1   40.6/83.8   74.6        21.3/45.2
ar  49.8/9.6    27.4/23.9   33.5/32.0   29.9/12.2   2.0/22.8    4.6/27.9    5.1/31.0    29.4
GPT-3.5-Turbo

    en          fr          es          pt          zh          ko          ja          ar
en  81.2        86.8/45.2   86.8/58.9   89.8/25.9   87.3/42.6   83.8/92.9   88.3/68.0   66.5/73.6
fr  84.3/58.9   86.8        84.8/57.9   84.8/34.0   82.2/43.6   69.5/58.4   82.7/44.2   60.9/57.9
es  84.8/50.8   82.7/31.0   83.2        85.8/40.6   83.2/34.5   76.7/60.9   79.2/43.6   69.5/56.9
pt  81.7/69.5   81.2/45.7   85.3/100.5  83.2        84.3/43.1   77.7/60.4   83.8/48.7   68.5/76.7
zh  81.7/11.7   85.3/26.4   79.7/32.5   85.3/18.8   88.3        83.2/63.5   85.8/58.4   67.5/49.8
ko  80.2/22.3   76.7/33.0   76.1/31.0   80.2/20.8   71.1/47.7   85.8        78.2/95.4   53.3/66.5
ja  76.1/34.0   73.6/39.6   76.7/47.7   75.1/24.4   79.7/44.7   77.2/121.8  86.3        54.3/55.8
ar  65.0/16.2   62.4/32.0   68.5/37.1   65.0/19.8   60.4/26.9   59.9/46.2   64.5/41.1   70.0
GPT-4o

Figure 10: Performance of RALMs on cross-lingual knowledge transfer when using LLM for evaluation. Specifically, we
employ GPT-4-Turbo to assess the predictions of RALMs.

A.2 Multilingual Knowledge Selection
       en     fr     es     pt     zh     ko     ja     ar
en  17.89  10.90  11.89  12.02  10.61   7.73  11.31   9.32
fr  27.59  12.26  11.94  18.34  13.72  10.49  12.42  11.22
es  25.70  15.63  15.63  20.89  15.83  11.32  14.08  11.98
pt  29.32  16.31  15.42  22.39  14.93  11.14  14.38  13.27
zh  21.95  12.78   9.86  12.32  26.59  11.20  10.96   9.74
ko  21.52  12.27  11.86  17.00  14.67  15.39  15.85   9.62
ja  18.15   9.65   8.62  12.51  17.54  10.84  18.10   6.94
ar  23.68  12.81  12.37  13.52  13.31  10.22  16.31  20.56
Aya-23-8B

       en     fr     es     pt     zh     ko     ja     ar
en  53.00  38.77  39.04  36.40  23.87  18.83  20.72  17.94
fr  48.96  36.98  37.00  34.20  22.07  16.73  18.45  15.30
es  49.14  32.08  39.03  29.72  23.47  20.08  20.47  14.20
pt  45.65  32.33  33.66  36.19  25.07  17.19  18.47  16.08
zh  46.60  30.40  26.72  26.98  30.35  15.94  15.15  13.94
ko  39.34  25.86  23.63  22.81  18.87  26.77  15.41  12.66
ja  52.07  38.77  38.19  34.89  22.88  21.89  30.09  16.26
ar  50.23  32.66  32.86  28.91  22.81  16.36  17.37  26.28
Aya-23-35B
       en     fr     es     pt     zh     ko     ja     ar
en  35.18  17.73  18.36  17.89  28.63  15.51  13.95  12.67
fr  31.24  19.86  17.79  21.96  27.86  12.75  12.92  12.88
es  28.96  19.95  22.14  17.85  24.64  14.60  12.52  12.69
pt  34.49  22.82  24.44  26.95  23.01  14.80  15.93  16.53
zh  25.73  10.17  11.31  14.19  38.05   9.21   9.30   7.76
ko  37.11  18.86  18.90  18.15  26.16  17.30  17.26  15.59
ja  35.44  21.47  18.61  21.90  32.29  17.27  20.70  12.99
ar  28.37  18.40  18.30  16.29  15.47  14.59  16.29  23.71
Qwen2-7B-Instruct
       en     fr     es     pt     zh     ko     ja     ar
en  58.36  49.89  49.63  50.50  48.34  43.64  53.06  45.17
fr  54.28  42.55  44.70  40.21  39.52  34.61  42.45  36.51
es  53.68  41.98  46.93  42.87  37.86  37.84  36.77  36.35
pt  52.74  40.94  42.75  42.55  32.05  36.65  41.02  34.50
zh  58.13  49.59  50.04  50.56  49.82  43.13  49.44  41.67
ko  54.18  39.47  39.32  39.45  37.43  38.89  38.71  33.55
ja  58.49  48.81  48.93  51.08  42.37  40.92  47.77  40.27
ar  54.60  37.24  38.99  39.88  35.58  32.91  35.54  42.75
Qwen2-72B-Instruct
       en     fr     es     pt     zh     ko     ja     ar
en  45.89  28.51  31.48  29.81  18.59  17.27  22.28  17.18
fr  45.98  25.85  30.00  29.10  22.06  20.35  22.89  20.17
es  15.94   8.93  12.13   9.63   6.10   6.21   6.85   7.21
pt  20.45  13.14  15.09  17.16   9.77   8.22   9.25  10.28
zh   9.20   4.45   5.49   7.63  11.78   4.78   5.38   3.92
ko  25.83  12.77  14.98  13.60  13.19  17.13  11.26  11.23
ja  26.02  16.41  16.51  16.22  13.80  12.61  10.72  10.77
ar  11.48   6.28   6.74   5.55   3.77   3.43   6.16   6.10
GPT-3.5-Turbo
       en     fr     es     pt     zh     ko     ja     ar
en  43.98  38.16  39.45  40.28  44.78  37.48  41.07  39.06
fr  35.20  42.53  27.66  28.25  29.37  23.51  27.12  25.34
es  24.54  20.22  24.74  21.71  23.37  16.83  20.39  21.67
pt  35.77  29.52  29.79  48.09  26.80  27.34  25.64  28.78
zh  36.21  28.49  29.59  28.51  50.95  26.67  28.92  31.21
ko  41.60  30.40  31.72  32.68  31.74  40.51  29.01  30.15
ja  51.56  40.54  43.85  43.37  42.83  37.44  45.50  40.26
ar  26.38  19.19  22.84  22.53  22.43  18.37  19.61  49.87
GPT-4o
Figure 11: Performance of RALMs on multilingual knowledge selection using Character 3-gram Recall as the metric. The colors
indicate performance in the strict language setting, with deeper blues representing stronger performance.
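
As a rough illustration of the metric named in the caption above, one common way to compute a character 3-gram recall is the fraction of the reference answer's character 3-grams that also occur in the prediction. The exact normalization and multiset handling behind the reported numbers are not given in this section, so the helper names and the clipped-count convention below are assumptions rather than the evaluation code itself.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in text (whitespace is kept as-is)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_3gram_recall(prediction: str, reference: str) -> float:
    """Share of the reference's character 3-grams that appear in the prediction.

    Hypothetical helper: clipped multiset overlap divided by the number of
    reference 3-grams. Returns 0.0 when the reference has fewer than 3 characters.
    """
    ref = char_ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    pred = char_ngrams(prediction)
    matched = sum(min(count, pred[gram]) for gram, count in ref.items())
    return matched / total
```

Working on characters rather than words means the same function applies to languages written without spaces, such as Chinese and Japanese.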


       en     fr     es     pt     zh     ko     ja     ar
en  18.78   9.14   8.63   7.11   6.09   3.05   6.09   3.05
fr  31.98   9.64   8.12  13.71   9.64   2.54   5.58   4.57
es  26.40  10.15  11.17  15.74   9.14   3.55   6.09   2.54
pt  29.44  10.66  11.68  18.27   7.11   1.52   4.06   4.57
zh  23.35  10.66   4.06   6.09  24.87   7.11   9.64   4.06
ko  28.43  11.68   7.61  13.20  11.68  12.69   9.64   6.60
ja  21.32   5.58   5.58   8.63  15.74   6.09  14.72   5.58
ar  23.86   7.61   6.60   5.08   5.08   1.52   6.60  13.20
Aya-23-8B
       en     fr     es     pt     zh     ko     ja     ar
en  63.45  40.10  39.59  34.01  12.69   9.14   7.61   7.11
fr  57.87  44.10  40.10  35.03  10.66   7.11   8.12   4.57
es  57.36  33.50  45.53  28.43  14.72  10.15  11.17   5.08
pt  54.31  36.04  33.50  41.98  15.23   4.06   9.14  10.15
zh  55.33  31.98  25.89  26.40  34.69  10.58   9.60   8.61
ko  44.67  24.87  21.83  17.26  13.71  22.60   7.61   7.11
ja  64.47  43.65  39.59  35.53  10.66   9.64  35.63   4.57
ar  61.42  36.55  35.03  28.93  13.20   5.58   7.61  27.61
Aya-23-35B
       en     fr     es     pt     zh     ko     ja     ar
en  44.16  16.24  11.68  13.20  26.40   6.09   7.61   4.57
fr  35.03  16.24  13.20  16.24  20.81   4.57   7.11   3.05
es  31.98  14.72  18.27  14.21  18.27   6.09   4.57   4.06
pt  34.01  16.24  18.27  23.35  13.71   4.57   6.09   4.57
zh  30.96   9.14   8.63  10.66  39.09   4.06   5.58   2.54
ko  40.10  13.71  14.21  10.66  17.26  10.66  10.66   2.03
ja  38.58  17.26  14.21  16.24  26.90   7.61  18.27   2.03
ar  30.96  15.74  15.74  11.17   6.60   4.06   6.09  18.27
Qwen2-7B-Instruct
       en     fr     es     pt     zh     ko     ja     ar
en  72.08  55.33  50.76  51.27  39.09  29.95  34.01  24.87
fr  62.44  43.15  43.65  38.07  29.44  22.34  27.41  28.93
es  59.90  40.10  46.70  36.55  25.89  23.35  18.27  20.30
pt  63.96  40.61  42.64  46.19  24.37  25.89  23.86  21.83
zh  73.10  56.85  57.36  56.35  39.59  34.01  34.52  24.37
ko  59.90  32.99  34.52  32.49  28.43  27.41  22.34  18.27
ja  67.01  51.78  50.25  51.27  30.96  26.90  34.52  21.83
ar  58.88  33.50  36.55  32.49  24.87  17.77  25.38  36.04
Qwen2-72B-Instruct
       en     fr     es     pt     zh     ko     ja     ar
en  57.36  26.90  33.50  28.43   8.63   6.60  12.69   5.58
fr  56.85  25.38  30.96  26.90  13.20   7.61  13.20   8.63
es  19.80   7.11  11.68   9.14   2.03   2.03   3.05   4.06
pt  25.89  13.71  15.23  20.30   3.55   4.06   5.58   6.60
zh  12.69   5.58   5.08   6.60   9.64   3.05   5.08   1.02
ko  31.47   6.09  12.18  10.66   7.11  11.68   5.08   4.57
ja  29.95  15.74  16.24  16.75  10.66   8.12   5.08   5.58
ar  13.20   6.60   6.60   4.57   2.54   1.02   4.06   4.06
GPT-3.5-Turbo
       en     fr     es     pt     zh     ko     ja     ar
en  54.82  40.61  43.15  42.64  39.59  23.35  29.95  29.44
fr  44.67  52.79  30.96  28.43  19.80  15.23  18.78  16.75
es  29.44  21.32  29.44  24.37  18.27  10.15  15.74  14.72
pt  46.70  30.96  32.49  53.30  18.27  13.71  12.69  15.74
zh  47.72  34.01  34.01  28.93  49.75  22.34  27.41  22.34
ko  53.30  31.47  32.49  31.98  26.40  36.04  17.26  19.80
ja  64.97  48.22  50.76  48.22  35.53  29.95  34.52  27.41
ar  26.90  17.77  22.34  17.77  13.71   8.63  11.17  44.67
GPT-4o
Figure 12: Performance of RALMs on multilingual knowledge selection when an LLM is used for evaluation. Specifically, we
employ GPT-4-Turbo to assess the predictions of RALMs.


A.3 Experiments on the Refined Prompt
Model              Prompt      en     fr     es     pt     zh     ko     ja     ar    avg
Flexible Language Setting
Qwen2-7B-Instruct  Vanilla  69.35  67.34  67.62  62.74  45.21  37.98  46.57  32.15  52.36
Qwen2-7B-Instruct  Refined  68.89  69.08  68.63  64.87  51.55  42.22  50.53  32.66  56.05
GPT-3.5-Turbo      Vanilla  65.14  56.13  61.58  61.43  34.86  34.43  42.67  32.38  48.58
GPT-3.5-Turbo      Refined  73.78  64.54  71.68  72.56  50.85  44.77  55.67  38.99  59.11
Strict Language Setting
Qwen2-7B-Instruct  Vanilla  27.98  26.51  26.64  29.03  20.99  25.05  25.65  20.60  25.31
Qwen2-7B-Instruct  Refined  29.04  26.76  26.47  29.76  23.30  25.91  25.52  19.97  25.84
GPT-3.5-Turbo      Vanilla  21.45  18.30  20.40  22.40   8.43   8.40  13.78   7.16  15.04
GPT-3.5-Turbo      Refined  26.83  20.10  26.17  26.09  12.86  13.85  20.31  10.45  19.58
Table 5: Comparison of performance between using the Vanilla prompt and the Refined prompt across all languages.
As mentioned in §3.4, we refine the original prompt, requiring
RALMs to first provide answers from documents in various
languages, and then translate them into the query language.
This method leverages chain-of-thought reasoning to enhance
RALMs’ performance in cross-lingual knowledge transfer.
Table 5 presents the experimental results of Qwen2-7B-Instruct
and GPT-3.5-Turbo across all languages: the Refined prompt
raises the average score of both models in both the flexible
and the strict language settings, confirming the effectiveness
of our method.
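
To make the size of the improvement concrete, the short sketch below computes the per-language gain of the Refined prompt over the Vanilla prompt for GPT-3.5-Turbo in the flexible language setting. The scores are copied from Table 5; the variable names are ours and purely illustrative.

```python
LANGS = ["en", "fr", "es", "pt", "zh", "ko", "ja", "ar"]

# GPT-3.5-Turbo, flexible language setting, copied from Table 5.
vanilla = [65.14, 56.13, 61.58, 61.43, 34.86, 34.43, 42.67, 32.38]
refined = [73.78, 64.54, 71.68, 72.56, 50.85, 44.77, 55.67, 38.99]

# Per-language gain in points (Refined minus Vanilla), rounded to 2 decimals.
gains = {lang: round(r - v, 2) for lang, v, r in zip(LANGS, vanilla, refined)}

# Language that benefits most, and the mean gain over the eight languages.
largest = max(gains, key=gains.get)
mean_gain = sum(gains.values()) / len(gains)

for lang, gain in gains.items():
    print(f"{lang}: +{gain:.2f}")
print(f"largest gain: {largest}, mean gain: {mean_gain:.2f}")
```

For this row every language improves, and the largest gain is on Chinese (50.85 versus 34.86).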
B Implementation Details
When collecting data, we use the MediaWiki Action API⁷ to
gather Wikipedia data. During evaluation, we simplify the
retrieval process by directly providing the gold documents.
Each document is divided into 200-token chunks, and we
employ mcontriever-msmarco⁸ as the retriever to select the
5 chunks most relevant to the QA pair as context, which keeps
the input length under control. We limit the RALMs’ context
input to 8,192 tokens and instruct RALMs that the current
year is 2200 to ensure that the timestamps in the QA pairs
are reasonable. RALMs are also instructed to respond based
on the provided context or reply that the information is
insufficient.
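
The chunk-and-select step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our pipeline: whitespace-separated words stand in for tokenizer tokens, and a toy word-overlap score stands in for the mcontriever-msmarco dense retriever. All function names are hypothetical.

```python
from typing import Callable, List


def split_into_chunks(document: str, chunk_size: int = 200) -> List[str]:
    """Split a document into consecutive chunks of at most chunk_size tokens.

    Assumption: whitespace-separated words stand in for real tokenizer tokens.
    """
    tokens = document.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: number of distinct query words present in the chunk.

    Placeholder only; a dense retriever would score embeddings instead.
    """
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def select_context(
    query: str,
    document: str,
    top_k: int = 5,
    chunk_size: int = 200,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Return the top_k chunks ranked by score(query, chunk), best first."""
    chunks = split_into_chunks(document, chunk_size)
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_k]
```

Passing the scorer as a parameter keeps the chunking logic independent of the retriever, so a dense scoring function can be swapped in without changing the rest of the sketch.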
C Instructions in the Experiments
In this section, we provide the English instructions used for
the data construction and experiments. Instructions in other
languages were derived through translation from the English
version.
⁷ https://www.mediawiki.org/wiki/API:Main_page
⁸ https://huggingface.co/facebook/mcontriever-msmarco
Instruction for Entities Modification
Given the original entity {ENTITY}, you should provide 8
unique and reasonable entities of the same type as the
original entity. If the given entity is a person’s name, a
movie title, a game title, etc., return a fictional but
reasonable entity. If the given entity is a country or a city,
provide a real and comparatively similar entity. Return the
result in the form of Python lists, with no additional context.
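
Because the instruction above asks for the result as a Python list, a reply can be checked mechanically before it is used. The sketch below is one possible validation step, assuming the reply is a single list literal of strings; the function name and the strictness of the checks are our own illustrative choices.

```python
import ast
from typing import List


def parse_entity_list(reply: str, expected: int = 8) -> List[str]:
    """Parse a reply that should be a Python list of `expected` unique strings.

    ast.literal_eval only evaluates literals, so it never runs code from the
    reply; it raises ValueError or SyntaxError on malformed input.
    """
    value = ast.literal_eval(reply.strip())
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("reply is not a list of strings")
    if len(value) != expected or len(set(value)) != expected:
        raise ValueError(f"expected {expected} unique entities, got {value!r}")
    return value
```

Replies that fail the check can simply be regenerated.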
Instruction for Document Update
Your task is to modify a given document by replacing certain
entities within it to ensure logical coherence and consistency.
Please adhere to the following instructions:
1. The modified document must be relevant and capable
of answering the question: {QUERY} with the answer
{ANSWER}.
2. Ensure all mentioned entities are thoroughly replaced.
For example, if replacing a person’s name, ensure both the
first and the last name are replaced everywhere.
3. For entities with aliases (e.g., “United States” and
“America” in English), replace all variations.
4. After replacing an entity (e.g., changing “USA” to “UK”),
adjust related content accordingly (e.g., changing “American”
to “British”).
Instruction for Monolingual Knowledge Extraction
You are an accurate and reliable AI assistant that can answer
queries with the help of external documents, and you need to
use the same language as the query to give your answer. If
the information in the documents does not contain the answer,
you will generate “The document contains insufficient
information, so I cannot answer the query based on the
document.” If the information in the documents contains the
correct answer, you will succinctly and directly give all the
answers without including any context, and if there are
multiple answers, separate them by “, ”. Note that the current
time is the year 2200, and the temporal information, names,
and other entity information mentioned in the text are all
correct. Now the Document is :{DOCS} ...
the query is:{QUERY}”


Instruction for Cross-lingual Knowledge Transfer
You are an accurate and reliable AI assistant that can answer
queries with the help of external documents in different
languages, and you need to use the same language as the query
to give your answer. If the information in the documents does
not contain the answer, you will generate “The document
contains insufficient information, so I cannot answer the
query based on the document.” If the information in the
documents contains the correct answer, you will succinctly
and directly give all the answers without including any
context, and if there are multiple answers, separate them by
“, ”. Note that the current time is the year 2200, and the
temporal information, names, and other entity information
mentioned in the text are all correct. Now
the Document is :{DOCS} ... the query is:{QUERY}”
Refined Instruction for Cross-lingual Knowledge Transfer
You are an accurate and reliable AI assistant that can answer
queries with the help of external documents in different
languages, and you need to use the same language as the query
to give your answer. If the information in the documents does
not contain the answer, you will generate “The document
contains insufficient information, so I cannot answer the
query based on the document.” If the information in the
documents contains the correct answer, you will succinctly
and directly give all the answers without including any
context, and if there are multiple answers, separate them by
“, ”. Note you need to provide the answer from the original
text and then transfer it into the query language. The format
is in the format [answer in query language (answer in the
original text), ...]. Note that the current time is the year
2200, and the temporal information, names, and other entity
information mentioned in the text are all correct. Now the
Document is :{DOCS} ...
the query is:{QUERY}”
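
The refined instruction asks for replies of the form [answer in query language (answer in the original text), ...]. As an illustration of how such a reply could be split back into pairs, the sketch below uses a regular expression. It is a hypothetical post-processing helper, and it assumes that answers contain no commas or nested parentheses and that the model uses ASCII parentheses.

```python
import re
from typing import List, Tuple

# One "translated (source)" pair; commas, brackets and parentheses end a field.
PAIR = re.compile(r"([^,()\[\]]+?)\s*\(([^()]*)\)")


def parse_refined_answer(reply: str) -> List[Tuple[str, str]]:
    """Extract (answer in query language, answer in original text) pairs.

    Returns an empty list when the reply contains no such pair, for example
    when the model reports that the document contains insufficient information.
    """
    return [(target.strip(), source.strip()) for target, source in PAIR.findall(reply)]
```

Keeping the quoted source span next to the translated answer makes it possible to check the extraction and the translation separately.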
Instruction for Multilingual Knowledge Selection
You are an accurate and reliable AI assistant that can answer
queries with the help of multiple external documents in
various languages with different answers, and you need to use
the same language as the query to give your answer. If the
information in the documents does not contain the answer, you
will generate “The document contains insufficient information,
so I cannot answer the query based on the document.”
Otherwise, you will succinctly and directly give all the
answers you think are correct without including any context,
and if there are multiple answers, separate them by “, ”. Note
that the current time is the year 2200, and the temporal
information, names, and other entity information mentioned in
the text are all correct. Now the Document is
:{DOCS} ... the query is:{QUERY}”
D Question Construction Guidelines
Below are the annotation guidelines for manually reviewing
and modifying data to ensure quality.
This task requires reviewing the provided documents and
QA to ensure they meet the following requirements; if they
do not, modifications must be made:
• The document must contain the knowledge mentioned
in the QA and be able to answer the QA.
• The document must not contain conflicting information.
• If the QA in the target language differs from the English
QA, modify it according to the English version.
• Ensure that the timestamps in the document and QA
are consistent, with all dates falling between the years
2124 and 2200.
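
The date-range requirement lends itself to an automatic pre-check before manual review. The sketch below is a simple heuristic, not part of the annotation procedure: it treats every standalone four-digit number as a year and reports those outside 2124 to 2200. The function name and the regular expression are illustrative assumptions.

```python
import re
from typing import List

# A four-digit number that is not part of a longer digit sequence.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def years_out_of_range(text: str, first: int = 2124, last: int = 2200) -> List[int]:
    """Return four-digit numbers in text that fall outside [first, last].

    Heuristic: any standalone four-digit number is treated as a year, so
    non-date numbers may be flagged and still need a human decision.
    """
    return [int(y) for y in YEAR.findall(text) if not first <= int(y) <= last]
```

An empty result does not guarantee temporal consistency; it only rules out years outside the allowed window.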
How to modify the QA or the document:
• If there is a significant difference between the QA and
the English QA, a re-translation is required.
• If the document lacks information to answer the QA,
add or revise sentences to include the necessary details.
• Ensure all instances of names, last names, and aliases
are thoroughly consistent. If the main character is
named Michael Johnson, it is essential to maintain
consistency throughout. Use pronouns such as “he” (e.g.,
“He went to the store”), refer to him by his last name
“Johnson” (e.g., “Johnson attended the meeting”), or
use his full name.
• Resolve any conflicts present in the document.
– Adjust the document to reflect a future time without
logical inconsistencies. For instance, if the document
states that a film was shot in 2021 but is set to be
released in 2038, the filming date should be changed
to 2038 to maintain consistency.
– Geographical Conflicts: These typically involve
discrepancies between city and country names. For
example, “New York City in Italy.”
