Not All Languages are Equal: Insights into Multilingual Retrieval-Augmented Generation
Summary
This study introduces Futurepedia, a new benchmark for evaluating multilingual Retrieval-Augmented Language Models (RALMs). Unlike existing datasets, Futurepedia provides time-insensitive, parallel documents and ground-truth answers across eight languages, enabling fair comparison. The authors evaluate six models on three tasks: monolingual knowledge extraction, cross-lingual knowledge transfer, and multilingual knowledge selection. Results reveal significant linguistic inequalities. In monolingual extraction, high-resource languages outperform low-resource ones, though scaling model size reduces this gap. Translating low-resource documents to high-resource languages often fails due to cascading translation errors. In cross-lingual transfer, Indo-European languages facilitate better performance by allowing models to quote documents directly rather than translating answers. Finally, in knowledge selection, models exhibit strong bias toward English, often ignoring non-English documents even when they contain correct information. To mitigate these issues, the paper suggests specific strategies. For monolingual tasks, caution is advised when translating low-resource content. For cross-lingual tasks, prompting models to extract answers from source documents before translating improves accuracy. To reduce selection bias, the authors recommend increasing the volume of non-English documents and placing English documents in non-initial positions within the context. These findings highlight the complexities of multilingual RAG and offer actionable insights for future development.
Suhang Wu1*, Jialong Tang2†, Baosong Yang2, Ante Wang1, Kaidi Jia1, Jiawei Yu1, Junfeng Yao1, Jinsong Su1
1Xiamen University 2Tongyi Lab, Alibaba Group
wusuhang@xmu.stu.edu.cn

Abstract
RALMs (Retrieval-Augmented Language Models) broaden their knowledge scope by incorporating external textual resources. However, the multilingual nature of global knowledge requires RALMs to handle diverse languages, a topic that has received limited research focus. In this work, we propose Futurepedia, a carefully crafted benchmark containing parallel texts across eight representative languages. We evaluate six multilingual RALMs using our benchmark to explore the challenges of multilingual RALMs. Experimental results reveal linguistic inequalities: 1) high-resource languages stand out in Monolingual Knowledge Extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we offer advice for improving multilingual Retrieval-Augmented Generation. For monolingual knowledge extraction, careful attention must be paid to cascading errors from translating low-resource languages into high-resource ones. In cross-lingual knowledge transfer, encouraging RALMs to provide answers within documents in different languages can improve transfer performance. For multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs' selection bias. Through comprehensive experiments, we underscore the complexities inherent in multilingual RALMs and offer valuable insights for future research.
1 Introduction
Retrieval-Augmented Generation (RAG) aims to alleviate the knowledge limitations of Large Language Models (LLMs), leading to the development of Retrieval-Augmented Language Models (RALMs) (Chen et al. 2023a; Yu et al. 2023; Gao et al. 2024; Asai et al. 2024; Lin et al. 2024). Particularly, since the knowledge encapsulated in different languages may vary significantly, multilingual RAG has emerged as a critical research direction for effectively utilizing multilingual texts.

However, multilingual RAG is in its infancy and faces challenges due to the lack of effective benchmarks. The commonly used RAG benchmarks such as CRUD (Lyu et al. 2024), RECALL (Liu et al. 2023), and RGB (Chen et al. 2023a) are limited to English or Chinese. Although some multilingual benchmarks exist, such as MKQA (Longpre, Lu, and Daiber 2021), XOR-TyDi QA (Asai et al. 2021), and NoMIRACL (Thakur et al. 2024), they also have their shortcomings. MKQA and XOR-TyDi QA are time-sensitive and face the risk of information leakage, while NoMIRACL lacks ground truth answers, hindering the assessment of RALMs in comprehension and response generation.

*Work done during internship at Tongyi Lab.
†Contributed equally.

Table 1: Comparison of ours and other RAG benchmarks. The original table also has Time-Insensitive, Ground-truth Answer, and Multilingual Parallel columns, whose check and cross marks are illegible in this copy.

Benchmark                                 #Lang
Non-multilingual RAG benchmarks
  RECALL (Liu et al. 2023)                1
  RGB (Chen et al. 2023a)                 2
  CRUD (Lyu et al. 2024)                  1
Multilingual RAG benchmarks
  MKQA (Longpre, Lu, and Daiber 2021)     26
  XOR-TyDi QA (Asai et al. 2021)          8
  NoMIRACL (Thakur et al. 2024)           18
  Ours                                    8
Besides, most benchmarks fail to provide multilingual parallel data, preventing fair comparison across different languages.

To address these issues, we first propose Futurepedia¹, a carefully crafted multilingual RAG benchmark based on Wikipedia². The differences between our benchmark and previous ones are shown in Table 1. Our benchmark includes 197 parallel documents and the corresponding QA pairs in eight languages. Particularly, it also introduces three tasks to investigate multilingual RAG: 1) Monolingual Knowledge Extraction, which requires RALMs to extract knowledge from documents and resolve questions in the same language; 2) Cross-lingual Knowledge Transfer, which challenges RALMs to resolve questions using documents in different languages; and 3) Multilingual Knowledge Selection, which examines RALMs' bias toward languages when selecting answers from documents in different languages.

We then conduct experiments to evaluate several commonly used RALMs, revealing significant linguistic inequality in multilingual RAG. Specifically, in the task of monolingual knowledge extraction, high-resource languages stand out, with RALMs exhibiting superior performance in these languages. Meanwhile, as the model size grows, RALMs not only show enhanced performance but also alleviate linguistic inequality.

¹ We will release our data at https://github.com/H-shw/futurepedia/
² https://www.wikipedia.org/

arXiv:2410.21970v1 [cs.CL] 29 Oct 2024
In the task of cross-lingual knowledge transfer, Indo-European languages³ lead RALMs to provide answers directly from documents in different languages, alleviating the challenge of expressing answers across languages. In the multilingual knowledge selection task, English benefits from RALMs' selection bias and speaks louder in multilingual contexts. Even a small number of English documents can exert a dominant influence, overshadowing a larger number of documents in other languages.

Based on the findings above, we further explore several strategies to improve multilingual RAG: 1) in the task of monolingual knowledge extraction, translating documents from low-resource languages into high-resource ones is a direct way to improve performance in low-resource language tasks. However, careful attention must be paid to cascading errors during the translation; 2) in the task of cross-lingual knowledge transfer, encouraging RALMs to provide answers from documents in different languages can enhance their cross-lingual entity understanding and response generation; 3) in the task of multilingual knowledge selection, incorporating more non-English documents and repositioning English documents can help mitigate RALMs' selection bias.

2 Our Benchmark
In this section, we describe the details of our benchmark, including its data construction process (§2.1), three evaluation tasks (§2.2), and evaluation metrics (§2.3).

2.1 Data Construction
As mentioned above, existing benchmarks often lack parallel data, fail to provide ground truth answers, and are hindered by time sensitivity. To provide parallel data, we collect Wikipedia documents created from January 2018 to April 2024 in eight languages: English (en), French (fr), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Korean (ko), and Arabic (ar).
Meanwhile, we also gather the parallel factual triples from WikiData⁴ and utilize GPT-4o to convert them into natural language QA pairs, so as to provide ground truth answers alongside the corresponding questions.

Then, as shown in Figure 1, we construct time-insensitive instances via three operations applied in sequence. In the Timestamp Modification step, we adjust the timestamps of documents and QA pairs to random years between 2124 and 2200. During the subsequent Entities Modification step, we prompt GPT-4o to generate similar and reasonable entities for the subjects and objects in the original factual triples. In the final Translation & Update step, we utilize GPT-4o to translate the fictional entities into eight languages and then update documents and QA pairs in all languages.

³ For the languages discussed in this work, English, French, Spanish, and Portuguese are all Indo-European languages.
⁴ https://www.wikidata.org/

[Figure 1: The refinement process on the collected data. Stages: Timestamp Modification (2019 → 2146), Entities Modification (Chernobyl → Pripyat; HBO → Neo Cinema Network), Translation & Update (the fictional entities are rendered in each language). Example QA before refinement: "Q: Which company produced the 2019 miniseries Chernobyl? A: HBO"; after refinement: "Q: Which company produced the 2146 miniseries Pripyat? A: Neo Cinema Network".]
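The Timestamp Modification step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the regular expression, the function name `shift_timestamps`, and the choice to map each distinct year independently (which does not preserve chronological order between different years) are all assumptions made for illustration. Only the target range 2124 to 2200 comes from the text.

```python
import random
import re

# Assumption: a "timestamp" is any four-digit year from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def shift_timestamps(texts, rng, low=2124, high=2200):
    """Replace every year in `texts` with a random year in [low, high].

    The same original year always maps to the same new year, so a document
    and its QA pair stay consistent with each other.
    """
    mapping = {}

    def replace(match):
        year = match.group(0)
        if year not in mapping:
            mapping[year] = str(rng.randint(low, high))
        return mapping[year]

    shifted = [YEAR_PATTERN.sub(replace, text) for text in texts]
    return shifted, mapping


# Hypothetical example modeled on Figure 1.
document = "Chernobyl is a 2019 historical drama television miniseries."
question = "Which company produced the 2019 miniseries Chernobyl?"
shifted, mapping = shift_timestamps([document, question], random.Random(0))
```

Sharing one mapping across the document and its question is the point of the design: if the two were shifted independently, the question could refer to a year the document never mentions.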
Besides, we ensure the quality of the benchmark through rigorous manual review. For each language, we hire three experts proficient in both English and the target language to evaluate the accuracy and coherence of the generated documents and QA pairs at a pace of 20 entries per day. Only documents that can resolve their corresponding QA pairs without logical errors are directly retained. Those that do not satisfy these criteria are revised until consensus is reached among experts. Ultimately, we obtain 197 parallel documents and corresponding QA pairs in eight languages.

2.2 Three Evaluation Tasks
As shown in Figure 2, our benchmark provides three tasks to comprehensively evaluate RALMs from different perspectives.

Monolingual Knowledge Extraction This task requires the RALMs to extract answers from documents and resolve questions within the same language, as shown in Figure 2(a), intending to assess the fundamental RAG capabilities in different languages.

Cross-lingual Knowledge Transfer As depicted in Figure 2(b), we challenge RALMs with documents and QA pairs written in different languages within this evaluation task. This task imitates the real-world scenario where there are no high-quality documents in the query language, allowing for the evaluation of the cross-lingual transfer ability of RALMs.

Multilingual Knowledge Selection For each question, we provide RALMs with documents in eight languages that
contain different answers, creating a scenario of knowledge conflict. Returning to Figure 2(c), for the English question "Which production company was behind the miniseries Pripyat", the English, Chinese, and French documents provide three different answers: "Metro Film Alliance", "Epic Story Channel", and "Starlight Film Studios", respectively. In this context, we regard all these answers as potentially correct and assess the RALMs' bias toward specific languages by examining which answers are selected.

[Figure 2: Three evaluation tasks in our benchmark: (a) Monolingual Knowledge Extraction, which requires RALMs to extract knowledge from documents and resolve questions within the same language; (b) Cross-lingual Knowledge Transfer, which challenges RALMs to handle documents and QA pairs in different languages; (c) Multilingual Knowledge Selection, which presents documents in various languages that contain different answers, allowing for the evaluation of RALMs' selection bias. Note that we use three of the eight languages, English (en), Chinese (zh), and French (fr), to illustrate these tasks, and we provide the English translations in parentheses.]

2.3 Evaluation Metrics
Common practice in RAG evaluation uses Accuracy, which checks whether the ground truth answer is fully contained in the prediction (Lewis et al. 2020; Chen et al. 2023a; Saad-Falcon et al. 2024). However, as analyzed by Chirkova et al. (2024), one answer may have diverse expressions in multilingual RAG, and thus Accuracy fails to capture similarity in such cases.
To deal with this issue, Chirkova et al. (2024) propose Character 3-gram Recall, which measures the proportion of character 3-grams of the ground truth answer that appear in the prediction. In this work, we use Character 3-gram Recall as our primary evaluation metric. We additionally use an LLM as evaluator; the results, reported in Appendix A due to page limits, show a similar trend to Character 3-gram Recall.

Furthermore, based on Character 3-gram Recall, we set additional metrics for the three evaluation tasks. For the tasks of monolingual knowledge extraction and cross-lingual knowledge transfer, we report the average Character 3-gram Recall values across languages (AVG) and their variance (VAR) to assess performance differences among languages. In the task of multilingual knowledge selection, we introduce Selection Entropy (SE) to evaluate the selection bias of RALMs across languages. To accomplish this, we first normalize the recall scores to derive a distribution, followed by the calculation of its entropy. This process can be formally expressed as

\[
\mathrm{SE} = -\sum_{i=1}^{n} p(i)\,\log p(i), \qquad p(i) = \frac{f(i)}{\sum_{j=1}^{n} f(j)},
\]

where f(i) represents the Character 3-gram Recall for the answer from the i-th language in a total of n languages.

3 Experiment
In this section, we first outline our experimental settings (§3.1) and then present the RALMs' overall performance on our benchmark (§3.2). Next, we provide detailed analyses of RALMs on the tasks of monolingual knowledge extraction (§3.3), cross-lingual knowledge transfer (§3.4), and multilingual knowledge selection (§3.5), respectively.
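The two metrics defined in §2.3 can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the authors' evaluation code. Lowercasing, whitespace normalization, treating the 3-grams as a set, and the natural logarithm are all assumptions, so absolute values need not match the SE column of Table 2, which may use a different base or normalization.

```python
import math


def normalize(text):
    """Lowercase and collapse whitespace (an assumed preprocessing step)."""
    return " ".join(text.lower().split())


def char_3gram_recall(answer, prediction, n=3):
    """Fraction of the answer's distinct character n-grams found in the prediction."""
    ans, pred = normalize(answer), normalize(prediction)
    if not ans:
        return 0.0
    # Answers shorter than n characters are treated as one single gram.
    grams = {ans[i:i + n] for i in range(max(len(ans) - n + 1, 1))}
    return sum(gram in pred for gram in grams) / len(grams)


def selection_entropy(recalls):
    """Entropy of the per-language recall scores f(i) after normalization to p(i)."""
    total = sum(recalls)
    if total == 0:
        return 0.0
    probs = [f / total for f in recalls if f > 0]
    return -sum(p * math.log(p) for p in probs)
```

Under this reading, a model that spreads its recall evenly over all languages reaches the maximum entropy log(n), while a model that only ever reproduces one language's answer scores zero.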
Based on these analyses, we also try several strategies to improve RALMs' performance on these tasks.

3.1 Settings
We choose six representative multilingual LLMs as RALMs, including: 1) open-source LLMs: Aya-23-8B, Aya-23-35B (Aryabumi et al. 2024), Qwen2-7B-Instruct, Qwen2-72B-Instruct (Yang et al. 2024); and 2) closed-source LLMs: GPT-3.5-Turbo and GPT-4o⁵. During evaluation, RALMs are instructed to respond based on the provided documents. Other implementation details regarding the RAG process can be found in Appendix B.

Table 2: Performance of RALMs on our benchmark. We use Mono., Cross., and Multi. to represent the tasks of monolingual knowledge extraction, cross-lingual knowledge transfer, and multilingual knowledge selection, respectively.

                      Mono.            Cross.           Multi.
Model                 AVG ↑   VAR ↓    AVG ↑   VAR ↓    SE ↑
Aya-23-8B             67.50   12.53    56.00   15.14    0.84
Aya-23-35B            77.69    4.84    58.91   16.84    0.86
Qwen2-7B-Instruct     78.45    3.55    53.62   16.90    0.81
Qwen2-72B-Instruct    84.51    4.31    66.82   15.00    0.89
GPT-3.5-Turbo         63.78   13.41    55.52   18.58    0.83
GPT-4o                82.93    6.05    68.46    8.48    0.90

3.2 Overall Performance
Experimental results are reported in Table 2. Qwen2-72B-Instruct achieves the highest average score in monolingual knowledge extraction and exhibits well-balanced performance across languages. In cross-lingual knowledge transfer, GPT-4o achieves the best transfer performance and the most consistent results across languages. Besides, we note that the AVG values are lower and the VAR values are higher for all RALMs in the cross-lingual task compared to the monolingual task, showing that cross-lingual knowledge transfer is more challenging. For the multilingual knowledge selection task, GPT-4o obtains the highest selection entropy, indicating less selection bias among different languages.
3.3 Experiments on Monolingual Knowledge Extraction
This evaluation task presents RALMs with questions and documents in the same language. Experimental results are illustrated in Figure 3, from which we can draw the following conclusions:

RALMs exhibit better knowledge extraction capabilities in high-resource languages, while demonstrating less satisfactory performance in relatively low-resource languages. For instance, GPT-3.5-Turbo achieves a Character 3-gram Recall of 72.80 in English and 72.87 in Chinese, but only 32.65 in relatively low-resource Arabic, highlighting significant disparities among languages.

Furthermore, we also observe that scaling RALMs within the same series not only improves overall performance but also narrows the performance gaps between languages. For example, when the model size is increased from 8B to 35B, the performance of Aya-23 is significantly boosted in the previously underperforming languages such as Arabic, with scores rising from 72.33 to 80.29. Meanwhile, the VAR score of Aya-23 reduces from 12.53 to 4.84 as shown in Table 2.

⁵ We use gpt-3.5-turbo-0125 and gpt-4o-2024-05-13 in this work.

[Figure 3: Performance of RALMs in monolingual knowledge extraction, plotted per language (en, fr, es, pt, zh, ko, ja, ar) for Aya-23-8B, Aya-23-35B, Qwen2-7B-Instruct, Qwen2-72B-Instruct, GPT-3.5-Turbo, and GPT-4o. Note that Chinese and English are relatively high-resource languages, while Arabic is a relatively low-resource language.]

Table 3: Performance comparison of RALMs on original Arabic data and its English and Chinese translations.

                    Qwen2-7B-Instruct   GPT-3.5-Turbo
Arabic              79.21               32.65
Arabic → English    42.46               27.91
Arabic → Chinese    32.46               19.91
Based on the above experimental results, we naturally pose one question: Can we enhance RALMs' performance by translating low-resource languages into high-resource ones? To answer this question, we employ GPT-4o to translate Arabic documents and QA pairs into English and Chinese, and then assess the performance of Qwen2-7B-Instruct and GPT-3.5-Turbo on the translations. The results in Table 3 indicate that translation does not benefit RALMs: both models score lower on the translated data than on the original Arabic. We speculate that misinterpretations of key entities during translation may lead to cascading errors, ultimately resulting in inaccurate predictions. Therefore, we recommend that although RALMs perform well in high-resource languages, attention should be paid to the potential cascading errors when translating from low-resource to high-resource ones.

3.4 Experiments on Cross-lingual Knowledge Transfer
This task requires RALMs to understand documents in different languages and then generate responses. In this group of experiments, we consider two settings: 1) Strict Language Setting, which requires correct answers and responses in the query language, and 2) Flexible Language Setting, which focuses solely on answer accuracy. From the experimental results shown in Figure 4, we can reach the following conclusions:

[Figure 4: The performance of RALMs on cross-lingual knowledge transfer, shown as heatmaps for Aya-23-8B, Qwen2-7B-Instruct, and GPT-4o. The x-axis represents the document language, and the y-axis represents the query language (en, fr, es, pt, zh, ko, ja, ar). The first/second values represent the RALM performance in the Flexible/Strict Language Setting. The colors indicate the performance of the strict language setting, with deeper blues representing stronger performance. Note that results for other RALMs can be found in Appendix A.]

Indo-European languages lead RALMs to provide answers directly from documents, alleviating the challenge of expressing answers across languages. Although we instruct RALMs to respond in the query language, we find that they perform poorly in the strict language setting (see the second values in Figure 4). Conversely, under the flexible language setting, their performance improves significantly, as shown by the first values. This phenomenon suggests that RALMs face difficulties in expressing answers across languages, while directly providing answers from documents in different languages is comparatively easier. Additionally, the values in the left half of the heatmap are significantly higher than those on the right under the flexible language setting. For example, when Qwen2-7B-Instruct uses Arabic as the query language, the average Character 3-gram Recall score for Indo-European document languages is 58.75, markedly surpassing the 31.94 average for other languages. This indicates that Indo-European languages lead RALMs to provide answers directly from documents in different languages, thus enhancing performance by avoiding the challenge of expressing answers across languages.
Inspired by these findings, we pose the following question: Can we enhance the cross-lingual performance by encouraging RALMs to provide answers from documents in different languages? To answer this, we refine the prompts to require RALMs to provide answers within documents and then respond in the query language. In contrast, the vanilla prompt only instructs RALMs to respond in the query language. Table 4 displays the performance of Qwen2-7B-Instruct and GPT-3.5-Turbo using Vanilla and Refined prompts across different document languages. Under the flexible language setting, our refined prompts significantly improve RALMs' performance, particularly for GPT-3.5-Turbo, which achieves an average Character 3-gram Recall score of 59.05. This demonstrates the potential of RALMs in cross-lingual document understanding. Additionally, both Qwen2-7B-Instruct and GPT-3.5-Turbo also show improvements in the strict language setting, with GPT-3.5-Turbo's score increasing from 14.86 to 19.06.

Table 4: Comparison of performance between using the Vanilla prompt and the Refined prompt.

Model               Prompt    en      pt      zh      ar      avg
Flexible Language Setting
Qwen2-7B-Instruct   Vanilla   69.35   62.74   45.21   32.15   52.36
Qwen2-7B-Instruct   Refined   68.89   64.87   51.55   32.66   54.49
GPT-3.5-Turbo       Vanilla   65.14   61.43   34.86   32.38   48.46
GPT-3.5-Turbo       Refined   73.78   72.56   50.85   38.99   59.05
Strict Language Setting
Qwen2-7B-Instruct   Vanilla   27.98   29.03   20.99   20.60   24.65
Qwen2-7B-Instruct   Refined   29.04   29.76   23.30   19.97   25.52
GPT-3.5-Turbo       Vanilla   21.45   22.40    8.43    7.16   14.86
GPT-3.5-Turbo       Refined   26.83   26.09   12.86   10.45   19.06
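To make the difference between the two prompting styles concrete, here is a small Python sketch. The template wording, the names `VANILLA_TEMPLATE`, `REFINED_TEMPLATE`, and `build_prompt`, and the document numbering are hypothetical; the text above only states that the vanilla prompt asks for a response in the query language, while the refined prompt first asks for the answer as found in the documents.

```python
# Hypothetical wording: the exact prompts used in the paper are not given here.
VANILLA_TEMPLATE = (
    "Documents:\n{documents}\n\n"
    "Question: {question}\n"
    "Answer the question in {query_language}."
)

REFINED_TEMPLATE = (
    "Documents:\n{documents}\n\n"
    "Question: {question}\n"
    "Step 1: Quote the answer exactly as it appears in the documents.\n"
    "Step 2: Give the final answer in {query_language}."
)


def build_prompt(question, documents, query_language, refined=False):
    """Fill the vanilla or refined template with numbered documents."""
    template = REFINED_TEMPLATE if refined else VANILLA_TEMPLATE
    joined = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return template.format(
        documents=joined, question=question, query_language=query_language
    )


# Example: an English question answered from a French document.
prompt = build_prompt(
    "Which company produced the 2146 miniseries Pripyat?",
    ["Pripiat. Sociétés de production : Réseau Cinéma Néo"],
    "English",
    refined=True,
)
```

Separating extraction from translation keeps the quoted span available even when the translated answer is imperfect, which fits the larger gains reported under the flexible language setting.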
We believe this improvement stems from the refined prompt, which may split the knowledge transfer process into two steps: first extracting answers from the documents, and then translating them into the query language. This process can be viewed as a chain-of-thought (Wei et al. 2022), thereby alleviating difficulties in cross-lingual transfer. Therefore, we advise that encouraging RALMs to provide answers within documents in different languages can enhance their cross-lingual performance.

3.5 Experiments on Multilingual Knowledge Selection

This task assesses RALMs' selection bias toward languages within a knowledge conflict scenario. In this group of experiments, we present RALMs with eight randomly ordered documents in different languages, each providing a distinct answer^6. By comparing the Character 3-gram Recall for answers from different languages, we can assess RALMs' language preference. The results shown in Figure 5 lead us to the following conclusions:

^6 Since our data is fictional, we consider all these answers as potentially correct.

Figure 5: The performance of RALMs on the task of multilingual knowledge selection. The x-axis and y-axis represent the answer and query languages, respectively. Note that results of other RALMs can be found in Appendix A. (Heat-map panels: Aya-23-8B, Qwen2-7B-Instruct, GPT-4o; both axes: en, fr, es, pt, zh, ko, ja, ar.)

Figure 6: The stacked bar chart of Character 3-gram Recall scores for answers from different languages in multilingual knowledge selection, with higher scores indicating a stronger preference for information from that language. (x-axis: languages en, fr, es, pt, zh, ko, ja, ar; y-axis: Character 3-gram Recall; series: Qwen2-7B-Instruct, Qwen2-72B-Instruct, GPT-3.5-Turbo, GPT-4o, Aya-23-8B, Aya-23-35B.)

English benefits from selection bias and speaks louder in multilingual contexts. To better analyze the RALMs' preference for different languages, we aggregate the Character 3-gram Recall for answers from various languages and present the cumulative results in Figure 6.
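The aggregation behind Figure 6 can be sketched roughly as follows. The text only states that recall is aggregated per answer language; averaging over query languages before stacking the models is an assumption of this sketch, and the function names are hypothetical. The two sample rows are the GPT-4o values for English and Portuguese queries taken from Figure 5.

```python
LANGS = ["en", "fr", "es", "pt", "zh", "ko", "ja", "ar"]


def mean_recall_per_answer_language(matrix):
    """Average each column of a (query language x answer language) matrix.

    Assumption: one model's preference for an answer language is its mean
    recall over all query languages.
    """
    n_rows = len(matrix)
    return [sum(row[j] for row in matrix) / n_rows for j in range(len(LANGS))]


def stack_models(per_model_means):
    """Sum the per-language means over models, i.e. the stacked bar heights."""
    totals = [0.0] * len(LANGS)
    for means in per_model_means.values():
        for j, value in enumerate(means):
            totals[j] += value
    return dict(zip(LANGS, totals))


# GPT-4o rows for en and pt queries (Figure 5); the other rows are omitted.
gpt4o_rows = [
    [43.98, 38.16, 39.45, 40.28, 44.78, 37.48, 41.07, 39.06],
    [35.77, 29.52, 29.79, 48.09, 26.80, 27.34, 25.64, 28.78],
]
gpt4o_means = mean_recall_per_answer_language(gpt4o_rows)
stacked = stack_models({"GPT-4o": gpt4o_means})
```

With all eight query rows and all six models supplied, `stacked` would hold one cumulative score per answer language, which is the quantity a stacked bar chart of this kind displays.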
Results imply that English consistently ranks as the most preferred language for providing answers, and other Indo-European languages (such as French, Spanish, and Portuguese) follow for most RALMs. An exception is Qwen2, which exhibits a greater preference for Chinese even though English remains the primary choice in most cases. We believe this is because Qwen2 has been specifically enhanced for Chinese.

Meanwhile, we observe that RALMs favor query languages. As shown in Figure 5, the values on the diagonal, which measure RALMs' preference for the query language, are typically higher. This trend is particularly pronounced for GPT-4o, whose preference for the query language can even exceed that for English. For example, when the query language is Portuguese, GPT-4o shows a significantly higher recall score for Portuguese (48.09) than for English (35.77).

Figure 7: The impact of increasing the number of non-English documents (denoted as k) on RALMs. (Panels: (a) Qwen2-7B-Instruct, (b) GPT-3.5-Turbo; x-axis: k from 1 to 5; y-axis: Character 3-gram Recall; series: English answer and k-voting answer, each for en and zh queries.)

Based on the experimental results, we find that RALMs show a significant selection bias and a marked preference for English. This tendency may result in the neglect of non-English documents, which is harmful to multilingual RAG.
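The selection-bias measurement discussed in this section can be sketched as below. This is a minimal sketch under stated assumptions: the recall is computed over the set of unique character 3-grams of the reference answer, which may differ in detail from the metric definition used in the experiments, and `build_conflict_context` and `language_preference` are hypothetical helper names.

```python
import random


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def char_3gram_recall(prediction, reference):
    """Share (0-100) of the reference's character 3-grams found in the prediction.

    Assumption: set-based recall over unique 3-grams.
    """
    ref = char_ngrams(reference)
    if not ref:
        return 0.0
    return 100.0 * len(ref & char_ngrams(prediction)) / len(ref)


def build_conflict_context(docs_by_lang, seed=0):
    """Shuffle one document per language into a single context string."""
    order = list(docs_by_lang)
    random.Random(seed).shuffle(order)
    context = "\n\n".join(docs_by_lang[lang] for lang in order)
    return order, context


def language_preference(response, answers_by_lang):
    """Score one model response against the answer carried by each language."""
    return {
        lang: char_3gram_recall(response, answer)
        for lang, answer in answers_by_lang.items()
    }
```

Averaging `language_preference` over many questions for a fixed query language would give one row of a query-by-answer matrix like those in Figure 5.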
To address this issue, we aim to explore the following question: How can we mitigate the selection bias of RALMs toward English? Inspired by previous studies, we explore leveraging two well-known characteristics of RAG to tackle this issue: the majority rule and the sensitivity to document positions. As in the prior experiment, we still use Qwen2-7B-Instruct and GPT-3.5-Turbo as our RALMs.

Figure 8: The impact of altering the position of English documents on RALMs. (x-axis: position of English document, 1 to 8; y-axis: Character 3-gram Recall; series: en and sum_w/o_en for Qwen2-7B-Instruct and GPT-3.5-Turbo.)

The majority rule suggests that RALMs favor answers that appear more frequently in the documents (Jin et al. 2024). Based on this characteristic, we construct an experimental setup that includes k non-English documents containing the same answer, referred to as the k-voting answer, while the remaining documents still present different answers. By comparing the Character 3-gram Recall for English answers and the k-voting answer, we can assess how many non-English documents are needed to alleviate the RALMs' preference for English.

Figure 7 shows the experimental results when using Chinese and English queries. The results indicate that when k is low, the Character 3-gram Recall for English answers is significantly higher than that for k-voting answers. However, as k increases, the scores for k-voting answers rise while those for the English answers decline. When k reaches 3, the scores for English answers and k-voting answers are comparable for Qwen2-7B-Instruct. We also see the same trend with GPT-3.5-Turbo when k is 4.
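The two probes named above (the majority rule and the document position) can be sketched as small data-preparation helpers. Everything here is an illustrative assumption: answers are represented by language-independent identifiers, rendering the shared answer into each document's language is left out, and the function names `make_k_voting` and `place_english_at` are hypothetical.

```python
import random


def make_k_voting(answers_by_lang, k, seed=0):
    """Give k randomly chosen non-English languages one shared answer.

    Returns the updated language-to-answer mapping and the chosen languages.
    The English answer and the remaining answers are left untouched.
    """
    non_english = [lang for lang in answers_by_lang if lang != "en"]
    if not 1 <= k <= len(non_english):
        raise ValueError("k must be between 1 and the number of non-English languages")
    voters = random.Random(seed).sample(non_english, k)
    shared_answer = answers_by_lang[voters[0]]
    updated = dict(answers_by_lang)
    for lang in voters:
        updated[lang] = shared_answer
    return updated, voters


def place_english_at(order, position):
    """Return a document order with 'en' moved to the given 1-based position."""
    rest = [lang for lang in order if lang != "en"]
    rest.insert(position - 1, "en")
    return rest
```

Sweeping `k` from 1 upward, or `position` from 1 to 8, and scoring each response against the English answer and the competing answers would reproduce the kind of curves summarized in Figures 7 and 8.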
Based on these observations, we recommend introducing more non-English documents to mitigate the selection bias of RALMs toward English.

Moreover, RALMs are known for their sensitivity to document positions. Liu et al. (2024) reveal that the position of the gold documents within the context can significantly impact RALMs' performance. To investigate this, we alter the positions of English documents in the context and conduct experiments with English queries. From the results presented in Figure 8, we find that when English documents are placed in the first position, RALMs achieve the highest Character 3-gram Recall score for English answers (indicated by en), while the score for non-English answers (indicated by sum_w/o_en) is typically low. In contrast, when English documents occupy non-initial positions, the scores exhibit the opposite trend, indicating a decrease in RALMs' selection bias toward English and a heightened emphasis on non-English documents. Thus, we advise positioning English documents in non-initial positions to help mitigate RALMs' selection bias toward English.

4 Related Work

Retrieval-Augmented Generation RAG enhances LLMs by integrating relevant texts from external knowledge resources. Most current RALMs focus on query refinement or better document utilization. For example, Rewrite-Retrieve-Read (Ma et al. 2023) trains a small language model as the rewriter to better align the query to the retriever and the LLM reader.
Chain of Note (Yu et al. 2023) generates reading notes for retrieved documents to improve the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios. However, these studies primarily focus on English, lacking exploration in multilingual RAG scenarios. Although many benchmarks (Chen et al. 2023a; Lyu et al. 2024; Liu et al. 2023; Thakur et al. 2024) have been proposed to evaluate RALMs' performance, multilingual benchmarks like MKQA (Longpre, Lu, and Daiber 2021) and XOR-TyDi QA (Asai et al. 2021) are time-sensitive and have potential leakage risk, while NoMIRACL (Thakur et al. 2024) lacks ground-truth answers, making it unable to assess RALMs in comprehension and response generation.

Multilingualism in LLM Multilingual Large Language Models (MLLMs) achieve remarkable success thanks to multilingual datasets such as mC4 (Xue et al. 2021), CulturaX (Nguyen et al. 2023), Aya Dataset (Singh et al. 2024), and MultilingualSIFT (Chen et al. 2023b). Along with the development of MLLMs, related analyses have also been carried out (Shi et al. 2022; Yuan et al. 2024; Yang et al. 2023; Qin et al. 2024; Xu et al. 2024). For instance, Yuan et al. (2024) analyze the multilingual capability of LLMs from a vocabulary-sharing perspective, while Qin et al. (2024) study alignment methods from parameter-tuning and parameter-frozen aspects.

Most recently, Chirkova et al. (2024) conduct multilingual RAG experiments on MKQA (Longpre, Lu, and Daiber 2021) and XOR-TyDi QA (Asai et al. 2021), discovering code-switching phenomena. Sharma, Murray, and Xiao (2024) create a machine-translated dataset, finding that RALMs tend to prefer documents in the query language.
Compared to these studies, we conduct a comprehensive analysis based on the proposed benchmark with three evaluation tasks: monolingual knowledge extraction, cross-lingual knowledge transfer, and multilingual knowledge selection. We also offer three pieces of advice based on the observations for better multilingual RAG.

5 Conclusion

In this paper, we propose Futurepedia, a new time-insensitive benchmark for multilingual RAG. Our benchmark includes parallel documents and corresponding QA in eight languages, on which three evaluation tasks are introduced: monolingual knowledge extraction, cross-lingual knowledge transfer, and multilingual knowledge selection. Then, we conduct experiments to evaluate commonly used multilingual RALMs. Our findings reveal significant linguistic inequality: 1) high-resource languages stand out in the task of monolingual knowledge extraction; 2) Indo-European languages lead RALMs to provide answers directly from documents in cross-lingual knowledge transfer, alleviating the challenge of expressing answers across languages; 3) English benefits from RALMs' selection bias and speaks louder in multilingual knowledge selection. Based on these findings, we try some strategies to improve multilingual RALMs. In the future, we will explore ways to mitigate or leverage the linguistic inequality of multilingual RAG.

References

Aryabumi, V.; Dang, J.; Talupuru, D.; Dash, S.; Cairuz, D.; Lin, H.; Venkitesh, B.; Smith, M.; Campos, J. A.; Tan, Y. C.; Marchisio, K.; Bartolo, M.; Ruder, S.; Locatelli, A.; Kreutzer, J.; Frosst, N.; Gomez, A.; Blunsom, P.; Fadaee, M.; Üstün, A.; and Hooker, S. 2024. Aya 23: Open Weight Releases to Further Multilingual Progress. arXiv:2405.15032.

Asai, A.; Kasai, J.; Clark, J.; Lee, K.; Choi, E.; and Hajishirzi, H. 2021. XOR QA: Cross-lingual Open-Retrieval Question Answering. In NAACL 2021.
Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In ICLR 2024.

Chen, J.; Lin, H.; Han, X.; and Sun, L. 2023a. Benchmarking Large Language Models in Retrieval-Augmented Generation. In AAAI 2024.

Chen, Z.; Yan, S.; Liang, J.; Jiang, F.; Wu, X.; Yu, F.; Chen, G. H.; Chen, J.; Zhang, H.; Jianquan, L.; Xiang, W.; and Wang, B. 2023b. MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning.

Chirkova, N.; Rau, D.; Déjean, H.; Formal, T.; Clinchant, S.; and Nikoulina, V. 2024. Retrieval-augmented generation in multilingual settings. arXiv:2407.01463.

Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; and Wang, H. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.

Jin, Z.; Cao, P.; Chen, Y.; Liu, K.; Jiang, X.; Xu, J.; Qiuxia, L.; and Zhao, J. 2024. Tug-of-War between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models. In LREC-COLING 2024.

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS 2020.

Lin, X. V.; Chen, X.; Chen, M.; Shi, W.; Lomeli, M.; James, R.; Rodriguez, P.; Kahn, J.; Szilvasy, G.; Lewis, M.; Zettlemoyer, L.; and Yih, W.-t. 2024. RA-DIT: Retrieval-Augmented Dual Instruction Tuning. In ICLR 2024.

Liu, N. F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; and Liang, P. 2024. Lost in the Middle: How Language Models Use Long Contexts. TACL.

Liu, Y.; Huang, L.; Li, S.; Chen, S.; Zhou, H.; Meng, F.; Zhou, J.; and Sun, X. 2023. RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge. arXiv:2311.08147.

Longpre, S.; Lu, Y.; and Daiber, J. 2021. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering. arXiv:2007.15207.

Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; and Chen, E. 2024. CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. arXiv:2401.17043.

Ma, X.; Gong, Y.; He, P.; Zhao, H.; and Duan, N. 2023. Query Rewriting for Retrieval-Augmented Large Language Models. In EMNLP 2023.

Nguyen, T.; Nguyen, C. V.; Lai, V. D.; Man, H.; Ngo, N. T.; Dernoncourt, F.; Rossi, R. A.; and Nguyen, T. H. 2023. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. arXiv:2309.09400.

Qin, L.; Chen, Q.; Zhou, Y.; Chen, Z.; Li, Y.; Liao, L.; Li, M.; Che, W.; and Yu, P. S. 2024. Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers. arXiv:2404.04925.

Saad-Falcon, J.; Khattab, O.; Potts, C.; and Zaharia, M. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. arXiv:2311.09476.

Sharma, N.; Murray, K.; and Xiao, Z. 2024. Faux Polyglot: A Study on Information Disparity in Multilingual Large Language Models. arXiv:2407.05502.

Shi, F.; Suzgun, M.; Freitag, M.; Wang, X.; Srivats, S.; Vosoughi, S.; Chung, H. W.; Tay, Y.; Ruder, S.; Zhou, D.; Das, D.; and Wei, J. 2022. Language Models are Multilingual Chain-of-Thought Reasoners. arXiv:2210.03057.

Singh, S.; Vargus, F.; Dsouza, D.; Karlsson, B. F.; Mahendiran, A.; Ko, W.-Y.; Shandilya, H.; Patel, J.; Mataciunas, D.; O'Mahony, L.; Zhang, M.; Hettiarachchi, R.; Wilson, J.; Machado, M.; Moura, L. S.; Krzemiński, D.; Fadaei, H.; Ergün, I.; Okoh, I.; Alaagib, A.; Mudannayake, O.; Alyafeai, Z.; Chien, V. M.; Ruder, S.; Guthikonda, S.; Alghamdi, E. A.; Gehrmann, S.; Muennighoff, N.; Bartolo, M.; Kreutzer, J.; Üstün, A.; Fadaee, M.; and Hooker, S. 2024. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. arXiv:2402.06619.

Thakur, N.; Bonifacio, L.; Zhang, X.; Ogundepo, O.; Kamalloo, E.; Alfonso-Hermelo, D.; Li, X.; Liu, Q.; Chen, B.; Rezagholizadeh, M.; and Lin, J. 2024. NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation. arXiv:2312.11361.

Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E. H.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS 2022.

Xu, Y.; Hu, L.; Zhao, J.; Qiu, Z.; Ye, Y.; and Gu, H. 2024. A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias. arXiv:2404.00929.

Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; and Raffel, C. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In NAACL 2021.

Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; Dong, G.; Wei, H.; Lin, H.; Tang, J.; Wang, J.; Yang, J.; Tu, J.; Zhang, J.; Ma, J.; Yang, J.; Xu, J.; Zhou, J.; Bai, J.; He, J.; Lin, J.; Dang, K.; Lu, K.; Chen, K.; Yang, K.; Li, M.; Xue, M.; Ni, N.; Zhang, P.; Wang, P.; Peng, R.; Men, R.; Gao, R.; Lin, R.; Wang, S.; Bai, S.; Tan, S.; Zhu, T.; Li, T.; Liu, T.; Ge, W.; Deng, X.; Zhou, X.; Ren, X.; Zhang, X.; Wei, X.; Ren, X.; Liu, X.; Fan, Y.; Yao, Y.; Zhang, Y.; Wan, Y.; Chu, Y.; Liu, Y.; Cui, Z.; Zhang, Z.; Guo, Z.; and Fan, Z. 2024. Qwen2 Technical Report. arXiv:2407.10671.

Yang, W.; Li, C.; Zhang, J.; and Zong, C. 2023. BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages. arXiv:2305.18098.

Yu, W.; Zhang, H.; Pan, X.; Ma, K.; Wang, H.; and Yu, D. 2023. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. arXiv:2311.09210.

Yuan, F.; Yuan, S.; Wu, Z.; and Li, L. 2024. How Vocabulary Sharing Facilitates Multilingualism in LLaMA? In Findings of ACL 2024.

A Supplementary Experimental Results

A.1 Cross-Lingual Knowledge Transfer
Chunk 26 Ā· 1,984 chars
.2 31.0/11.7 85.8 33.0/14.2 28.9/7.6 57.9/8.6 53.3/12.2 52.8/11.2 49.8/10.7 33.5/13.2 29.9/14.7 81.7 18.3/12.2 67.5/1.5 59.9/3.0 59.9/7.1 55.8/5.1 14.7/5.6 28.9/3.0 26.4/7.6 68.5 Aya-23-8B en fr es pt zh ko ja ar en fr es pt zh ko ja ar 74.6 75.2/43.4 75.5/37.2 77.5/42.6 71.5/9.7 64.9/16.1 78.6/11.2 56.6/12.6 79.9/43.3 79.6 79.2/40.2 74.2/39.9 48.6/20.6 46.5/27.4 52.8/27.9 42.1/25.9 80.6/37.1 75.9/40.7 69.0 75.5/58.5 39.6/20.3 37.3/24.4 42.0/28.8 32.2/25.2 79.5/37.4 73.8/41.0 77.3/59.0 79.5 40.0/24.4 38.9/31.4 45.3/30.0 34.0/32.1 77.9/4.9 66.4/6.6 66.8/5.4 69.6/7.3 81.6 35.5/22.5 55.7/14.5 28.1/11.8 72.6/9.6 66.5/11.7 69.5/10.0 61.1/12.5 48.3/16.9 72.5 56.6/25.8 24.3/18.2 78.9/7.6 66.8/10.8 65.4/12.1 61.6/11.2 51.2/20.2 44.8/36.8 84.5 24.4/19.2 76.0/16.2 69.0/15.2 72.9/17.0 68.7/24.5 39.7/24.9 42.5/30.9 42.9/28.4 80.3 Aya-23-35B en fr es pt zh ko ja ar en fr es pt zh ko ja ar 81.9 78.4/48.2 80.5/42.5 80.8/44.8 50.3/30.6 53.0/37.4 62.1/41.0 47.5/30.4 80.4/45.4 81.1 77.8/41.2 71.6/44.0 44.6/25.4 42.7/30.2 49.1/30.7 36.6/29.6 77.7/39.7 78.7/41.9 79.2 77.1/61.1 43.8/24.2 39.7/23.5 51.0/25.1 38.0/25.3 80.9/42.3 79.1/42.1 78.3/57.8 80.0 52.8/22.4 44.5/22.7 49.4/25.4 41.4/27.5 62.6/16.0 57.2/11.7 61.7/10.0 57.2/12.6 77.1 29.4/19.1 41.4/17.8 21.0/11.8 61.9/15.4 57.7/12.0 59.5/9.0 49.8/11.1 45.2/9.9 69.8 37.3/21.2 25.7/10.1 59.8/18.8 58.9/13.2 57.7/11.2 49.0/12.9 48.4/15.9 27.6/24.0 79.2 15.0/9.4 62.1/18.2 61.4/16.4 57.9/14.9 53.6/16.7 31.3/18.5 28.9/18.4 35.7/18.4 79.2 Qwen2-7B-Instruct en fr es pt zh ko ja ar en fr es pt zh ko ja ar 87.8 83.1/51.1 82.8/42.2 84.8/46.6 71.5/28.9 67.4/32.9 73.4/38.7 64.7/23.9 82.0/46.2 86.6 81.4/41.5 80.7/45.4 57.3/34.2 52.9/36.5 56.7/37.1 48.4/35.9 83.5/42.6 82.2/46.2 86.1 83.8/64.9 55.4/33.7 50.4/37.6 58.4/39.7 48.4/37.0 83.9/42.6 81.8/42.0 80.2/59.2 86.3 63.5/31.6 54.2/29.7 60.0/35.6 52.4/34.1 78.6/3 81.1/2.6 79.3/3.5 80.6/4.2 84.8 58.4/10.7 74.1/6.2
Chunk 27 Ā· 1,993 chars
/32.9 73.4/38.7 64.7/23.9 82.0/46.2 86.6 81.4/41.5 80.7/45.4 57.3/34.2 52.9/36.5 56.7/37.1 48.4/35.9 83.5/42.6 82.2/46.2 86.1 83.8/64.9 55.4/33.7 50.4/37.6 58.4/39.7 48.4/37.0 83.9/42.6 81.8/42.0 80.2/59.2 86.3 63.5/31.6 54.2/29.7 60.0/35.6 52.4/34.1 78.6/3 81.1/2.6 79.3/3.5 80.6/4.2 84.8 58.4/10.7 74.1/6.2 55.0/8.9 77.7/16.5 77.4/15.3 76.3/12.1 76.0/12.5 42.1/23.4 73.7 53.8/24.9 33.4/18.9 77.4/14.2 78.5/13.9 76.8/12.3 74.5/14.3 49.6/21.8 51.5/30.0 87.6 42.8/25.5 77.8/23.9 78.6/35.9 76.0/37.0 73.4/34.1 37.3/8.9 35.3/18.9 43.1/21.9 83.1 Qwen2-72B-Instruct en fr es pt zh ko ja ar en fr es pt zh ko ja ar 72.8 65.5/39.6 69.9/35.2 67.5/38.6 40.7/5.9 45.6/7.9 58.2/15.3 48.9/3.2 73.1/41.7 56.0 74.0/35.3 72.0/36.0 50.2/13.1 52.1/12.7 63.6/17.8 46.7/10.1 63.9/32.8 64.7/33.3 61.6 66.6/50.7 30.8/10.9 25.8/10.4 39.7/15.9 31/10.6 63.1/30.3 57.1/30.9 65.3/45.9 72.9 36.9/8.1 35.7/5.8 35.4/9.6 36/4.1 62.9/7.0 55.1/3.4 61.1/5.0 63.3/6.8 71.3 40.3/4.4 52.1/8.9 15.8/3.0 72.5/13.8 61.1/6.8 64.8/6.9 67.7/8.9 39.4/11.5 66.7 42.0/25.2 30.5/7.3 74.6/16.0 62.6/9.7 64.4/10.0 65/9.7 41.8/6.1 35.2/13.9 77.0 17.7/7.0 45.9/8.6 26.8/4.3 31.6/4.5 28.1/6.1 4.3/3.4 6.2/3.7 7.5/3.7 32.6 GPT-3.5-Turbo en fr es pt zh ko ja ar en fr es pt zh ko ja ar 77.0 80.2/46.7 81.6/39.1 81.7/43.0 74.5/3.6 69.1/5.7 80.3/10.8 63.9/8.8 77.8/41.9 81.3 78.0/38.7 77.2/39.2 70.9/12.3 55.8/24.0 70.6/17.7 59.4/20.7 76.7/36.6 76.7/37.7 78.0 77.6/58.4 73.1/7.0 58.0/14.9 71.7/10.6 64.6/11.1 75.7/37.5 76.1/37.7 79.4/57.4 80.5 72.8/6.9 59.5/12.2 71.0/14.4 61.1/17.8 73.0/9.7 74.8/6.7 70.8/4.3 76.9/6.0 78.7 67.2/4.8 77.2/5.8 63.5/4.0 74.2/14.7 71.1/10.6 69.0/11.7 72.4/13.7 61.8/21.9 73.7 66.2/25.9 51.2/21.5 69.8/10.0 69.2/6.4 69.5/5.3 66.0/7.3 63.7/10.4 59.6/17.8 83.5 48.4/8.6 59.2/14.5 57.8/13.2 61.6/13.5 58.5/19.1 53.4/14.0 51.6/17.2 60.8/11.2 68.0 GPT-4o Figure 9: Performance of RALMs on cross-lingual knowledge transfer using Character
Chunk 28 Ā· 1,990 chars
0 74.2/14.7 71.1/10.6 69.0/11.7 72.4/13.7 61.8/21.9 73.7 66.2/25.9 51.2/21.5 69.8/10.0 69.2/6.4 69.5/5.3 66.0/7.3 63.7/10.4 59.6/17.8 83.5 48.4/8.6 59.2/14.5 57.8/13.2 61.6/13.5 58.5/19.1 53.4/14.0 51.6/17.2 60.8/11.2 68.0 GPT-4o Figure 9: Performance of RALMs on cross-lingual knowledge transfer using Character 3-gram Recall as metric. Note the first/second values represent the RALM performance in the Flexible/Strict Language Setting. The colors indicate the perfor- mance of the strict language setting, with deeper blues representing a stronger performance. -- 10 of 15 -- en fr es pt zh ko ja ar en fr es pt zh ko ja ar 63.9 83.2/11.7 82.2/17.3 82.7/6.1 68.0/12.7 72.1/16.2 74.1/24.9 73.6/17.3 80.7/16.2 82.6 61.4/15.2 64.0/9.6 39.6/14.2 46.7/10.7 44.2/11.7 57.9/19.3 83.2/12.2 77.2/8.6 47.6 82.7/9.1 43.1/9.6 55.8/11.7 49.2/11.7 49.8/17.8 59.9/17.3 72.1/11.7 60.9/29.4 76.5 28.4/12.7 38.6/12.2 36.0/14.7 41.6/24.4 49.8/4.1 41.6/6.6 36.5/11.7 45.2/4.6 91.1 34.5/9.1 38.1/14.2 14.2/15.2 53.3/8.1 55.8/8.1 59.4/14.7 50.2/7.6 53.8/14.7 83.8 51.3/23.4 29.4/19.8 63.5/8.1 55.3/10.2 56.9/16.2 55.3/6.6 51.3/11.2 45.2/25.9 81.4 20.3/19.8 79.7/1.5 69.0/3.0 75.6/11.2 72.6/4.6 55.3/6.1 50.8/6.6 43.1/7.6 70.5 Aya-23-8B en fr es pt zh ko ja ar en fr es pt zh ko ja ar 78.7 80.2/22.3 81.7/29.9 84.8/9.6 80.2/17.8 77.2/26.4 84.3/27.9 58.9/33.0 85.8/31.5 85.8 71.1/28.4 65.5/19.3 40.1/22.3 49.2/20.3 57.4/21.8 48.2/35.5 86.8/27.4 80.7/16.8 74.1 83.8/15.7 45.7/14.7 53.8/19.8 58.4/22.3 39.6/33.5 69.0/34.5 70.0/20.8 65.0/52.8 83.8 28.4/20.8 44.7/21.8 41.1/23.4 30.5/45.2 74.6/9.1 59.4/13.2 54.8/22.3 64.0/10.7 92.9 43.1/22.3 53.8/28.4 28.9/35.0 65.5/16.2 64.0/18.8 66.0/22.8 57.9/15.7 44.7/34.5 87.8 58.9/54.8 23.4/49.2 71.1/17.8 62.4/22.3 63.5/33.5 58.4/14.7 51.8/25.4 50.2/52.3 85.3 24.9/44.2 83.8/8.6 79.2/15.2 80.2/24.9 74.6/11.7 42.6/18.3 54.8/19.3 53.8/22.3 81.2 Aya-23-35B en fr es pt zh ko ja ar en fr es pt zh ko ja ar 89.3
Chunk 29 Ā· 1,994 chars
53.8/28.4 28.9/35.0 65.5/16.2 64.0/18.8 66.0/22.8 57.9/15.7 44.7/34.5 87.8 58.9/54.8 23.4/49.2 71.1/17.8 62.4/22.3 63.5/33.5 58.4/14.7 51.8/25.4 50.2/52.3 85.3 24.9/44.2 83.8/8.6 79.2/15.2 80.2/24.9 74.6/11.7 42.6/18.3 54.8/19.3 53.8/22.3 81.2 Aya-23-35B en fr es pt zh ko ja ar en fr es pt zh ko ja ar 89.3 83.2/79.7 85.8/93.9 86.8/39.6 48.7/58.4 53.8/111.2 64.0/98.0 53.3/112.2 87.3/78.2 85.3 61.9/88.8 53.3/51.8 27.4/52.3 33.5/74.1 38.6/60.9 35.0/90.4 84.3/70.6 82.7/46.2 82.2 83.2/56.4 42.6/42.1 38.6/74.6 55.8/61.4 34.0/91.4 59.4/88.8 72.1/64.5 60.9/146.2 85.8 32.0/56.9 38.6/71.6 43.6/64.0 27.9/108.1 59.4/22.8 52.3/33.5 55.8/44.7 53.3/24.4 86.3 35.5/72.6 40.1/76.7 23.4/68.0 48.2/41.1 48.2/44.7 49.2/43.1 39.1/28.9 38.1/61.4 79.2 39.6/118.3 16.2/86.3 51.3/52.8 51.8/52.8 49.8/60.9 40.1/34.0 53.8/60.4 29.9/142.1 75.6 11.7/78.7 72.6/33.5 67.5/49.2 69.0/48.7 64.0/25.9 33.5/37.6 36.5/53.3 46.2/46.7 75.6 Qwen2-7B-Instruct en fr es pt zh ko ja ar en fr es pt zh ko ja ar 90.9 87.8/107.188.3/133.5 92.4/54.8 72.6/62.9 79.7/145.778.7/138.671.6/165.5 89.8/97.5 90.3 86.3/126.9 75.6/66.0 41.1/55.8 42.1/94.9 40.1/79.7 49.8/125.4 92.4/86.8 88.8/77.2 91.9 90.9/75.1 52.3/45.2 49.8/92.9 59.4/82.7 47.2/136.0 74.6/108.6 56.9/92.4 63.5/202.0 91.4 49.2/59.9 47.2/87.8 55.8/86.8 43.6/151.8 70.6/38.1 68.0/50.2 70.6/68.5 65.0/42.1 95.4 53.3/91.4 72.1/98.0 49.8/102.5 68.0/57.9 61.4/63.5 57.4/67.0 63.5/39.1 39.6/71.6 87.3 55.3/155.325.4/122.8 76.1/76.7 60.4/71.1 68.5/88.3 69.0/52.8 51.3/67.0 61.4/175.1 89.8 36.0/117.8 88.3/49.2 81.7/75.6 85.8/70.0 81.7/40.1 47.7/43.1 52.3/67.0 60.9/64.5 80.2 Qwen2-72B-Instruct en fr es pt zh ko ja ar en fr es pt zh ko ja ar 81.7 69.0/38.6 75.6/52.8 72.1/19.8 42.1/31.0 53.8/62.4 58.9/54.8 47.2/48.2 79.2/45.2 58.4 77.7/49.2 77.7/28.4 52.3/33.0 56.9/35.5 66.0/35.5 46.2/42.6 69.5/41.1 68.0/25.4 68.0 70.6/31.5 31.0/27.4 27.4/39.1 38.1/36.0 27.9/39.6 69.5/54.8 60.4/37.1 71.1/84.8 75.1 41.1/35.0 41.1/40.1
Chunk 30 Ā· 1,995 chars
[Figure 10 heatmaps omitted. Panels recoverable in this span: GPT-3.5-Turbo, GPT-4o. Both axes list the languages en, fr, es, pt, zh, ko, ja, ar.]
Figure 10: Performance of RALMs on cross-lingual knowledge transfer when using LLM for evaluation. Specifically, we employ GPT-4-Turbo to assess the predictions of RALMs.
A.2 Multilingual Knowledge Selection
[Figure 11 heatmaps omitted. Panels: Aya-23-8B, Aya-23-35B, Qwen2-7B-Instruct, Qwen2-72B-Instruct, GPT-3.5-Turbo, GPT-4o. Both axes list the languages en, fr, es, pt, zh, ko, ja, ar.]
Figure 11: Performance of RALMs on multilingual knowledge selection using Character 3-gram Recall as metric. The colors indicate the performance of the strict language setting, with deeper blues representing a stronger performance.
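To make the metric named in the Figure 11 caption concrete, the following is a minimal sketch of character 3-gram recall. It assumes the common formulation (clipped overlap of character n-grams divided by the number of n-grams in the reference); the exact formulation used for the figures is not given in this appendix, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_recall(prediction: str, reference: str, n: int = 3) -> float:
    """Share of the reference's character n-grams that the prediction covers."""
    ref = char_ngrams(reference, n)
    if not ref:
        return 0.0
    pred = char_ngrams(prediction, n)
    # Clip each n-gram's credit at its count in the reference.
    overlap = sum(min(count, pred[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


print(char_ngram_recall("The capital is Paris.", "Paris"))  # 1.0
print(char_ngram_recall("London", "Paris"))                 # 0.0
```

Because the score works on characters rather than words, it applies without a tokenizer to languages such as zh and ja that do not separate words with spaces.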
[Figure 12 heatmaps omitted. Panels: Aya-23-8B, Aya-23-35B, Qwen2-7B-Instruct, Qwen2-72B-Instruct, GPT-3.5-Turbo, GPT-4o. Both axes list the languages en, fr, es, pt, zh, ko, ja, ar.]
Figure 12: Performance of RALMs on multilingual knowledge selection when using LLM for evaluation. Specifically, we employ GPT-4-Turbo to assess the predictions of RALMs.
A.3 Experiments on the Refined Prompt
Model              Prompt      en     fr     es     pt     zh     ko     ja     ar    avg
Flexible Language Setting
Qwen2-7B-Instruct  Vanilla  69.35  67.34  67.62  62.74  45.21  37.98  46.57  32.15  52.36
                   Refined  68.89  69.08  68.63  64.87  51.55  42.22  50.53  32.66  56.05
GPT-3.5-Turbo      Vanilla  65.14  56.13  61.58  61.43  34.86  34.43  42.67  32.38  48.58
                   Refined  73.78  64.54  71.68  72.56  50.85  44.77  55.67  38.99  59.11
Strict Language Setting
Qwen2-7B-Instruct  Vanilla  27.98  26.51  26.64  29.03  20.99  25.05  25.65  20.60  25.31
                   Refined  29.04  26.76  26.47  29.76  23.30  25.91  25.52  19.97  25.84
GPT-3.5-Turbo      Vanilla  21.45  18.30  20.40  22.40   8.43   8.40  13.78   7.16  15.04
                   Refined  26.83  20.10  26.17  26.09  12.86  13.85  20.31  10.45  19.58
Table 5: Comparison of performance between using the Vanilla prompt and the Refined prompt across all languages.
As mentioned in §3.4, we refine the original prompt, requiring RALMs to first provide answers from documents in various languages, and then translate them into the query language. This method leverages the chain-of-thought to enhance RALMs' performance in cross-lingual knowledge transfer. Table 5 presents the experimental results of Qwen2-7B-Instruct and GPT-3.5-Turbo across all languages, confirming the effectiveness of our method.
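As a reading aid for Table 5, the sketch below computes the per-language gain of the Refined prompt over the Vanilla prompt for one row pair (GPT-3.5-Turbo, flexible language setting). The scores are copied from the table; the helper names are hypothetical and the averages are recomputed from the per-language columns rather than read from the avg column.

```python
LANGS = ["en", "fr", "es", "pt", "zh", "ko", "ja", "ar"]

# Per-language scores copied from Table 5 (flexible setting, GPT-3.5-Turbo).
vanilla = [65.14, 56.13, 61.58, 61.43, 34.86, 34.43, 42.67, 32.38]
refined = [73.78, 64.54, 71.68, 72.56, 50.85, 44.77, 55.67, 38.99]


def gains(before, after):
    """Per-language difference (after - before), rounded to two decimals."""
    return {lang: round(a - b, 2) for lang, b, a in zip(LANGS, before, after)}


def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


delta = gains(vanilla, refined)
best = max(delta, key=delta.get)
print(delta)
print(best, delta[best])                         # zh 15.99
print(round(mean(refined) - mean(vanilla), 2))   # 10.53
```

Swapping in another pair of rows from the table gives the corresponding gains for the other model or for the strict language setting.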
B Implementation Details
When collecting data, we used the MediaWiki Action API [7]
to gather Wikipedia data. During evaluation, we simplify the
retrieval process by directly providing the gold documents.
Each document is divided into 200-token chunks, and we
employ mcontriever-msmarco [8] as the retriever to select the
5 chunks most relevant to each QA pair as context, which
keeps the input length under control. We limit the RALMs'
context input to 8,192 tokens and instruct RALMs that the
current year is 2200 to ensure that the timestamps in the QA
pairs are reasonable. RALMs are also instructed to respond
based on the provided context or reply that the information
is insufficient.
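The chunk-and-select step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: whitespace tokens stand in for the model tokenizer, a toy token-overlap score stands in for the mcontriever-msmarco similarity, and all function names are hypothetical.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: distinct query tokens found in the chunk.

    Assumption: a stand-in for a dense retriever's similarity score.
    """
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def top_k_chunks(query: str, chunks: List[str], k: int = 5,
                 score: Callable[[str, str], float] = overlap_score) -> List[str]:
    """Return the k highest-scoring chunks, best first."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def build_context(chunks: List[str], max_tokens: int = 8192) -> str:
    """Concatenate chunks in order, stopping before the budget is exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        size = len(chunk.split())
        if used + size > max_tokens:
            break
        kept.append(chunk)
        used += size
    return "\n\n".join(kept)


document = " ".join(f"w{i}" for i in range(1000)) + " premiere in 2150"
chunks = split_into_chunks(document)
context = build_context(top_k_chunks("premiere 2150", chunks))
print(len(chunks), len(context.split()))  # 6 803
```

With 200-token chunks and k = 5, the selected context is at most 1,000 tokens, well inside the 8,192-token limit; the budget check only matters if larger chunks or a larger k are used.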
C Instructions in the Experiments
In this section, we provide the English instructions used for
the data construction and experiments. Instructions in other
languages were derived through translation from the English
version.
[7] https://www.mediawiki.org/wiki/API:Main_page
[8] https://huggingface.co/facebook/mcontriever-msmarco
Instruction for Entities Modification
Given the original entity {ENTITY}, you should provide 8
unique and reasonable entities of the same type as the
original entity. If the given entity is a person's name, a movie
title, a game title, etc., return a fictional but reasonable
entity. If the given entity is a country or a city, provide a real
and comparatively similar entity. Return the result in the
form of Python lists, with no additional context.
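Since the instruction above asks for the result as a Python list, a reply can be validated before it is used. The sketch below is one possible check, assuming the reply is a single list literal of strings; the function name and the sample reply are hypothetical.

```python
import ast
from typing import List


def parse_entity_list(reply: str, expected: int = 8) -> List[str]:
    """Parse a reply that should be a Python list of unique entity strings."""
    try:
        # literal_eval only accepts literals, so no code in the reply is run.
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"reply is not a Python literal: {reply!r}") from exc
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("reply must be a list of strings")
    entities = [v.strip() for v in value]
    if len(entities) != expected or len(set(entities)) != expected:
        raise ValueError(f"expected {expected} unique entities, got {entities}")
    return entities


# Hypothetical reply for a city entity.
reply = '["Lyon", "Porto", "Turin", "Ghent", "Graz", "Bergen", "Gdansk", "Cork"]'
print(parse_entity_list(reply))
```

Rejecting malformed or duplicated lists at this point lets the request be retried instead of propagating a bad entity into the documents.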
Instruction for Document Update
Your task is to modify a given document by replacing certain
entities within it to ensure logical coherence and
consistency. Please adhere to the following instructions:
1. The modified document must be relevant and capable
of answering the question: {QUERY} with the answer
{ANSWER}.
2. Ensure all mentioned entities are thoroughly replaced.
For example, if replacing a person's name, ensure both the
first and the last name are replaced everywhere.
3. For entities with aliases (e.g., "United States" and
"America" in English), replace all variations.
4. After replacing an entity (e.g., changing "USA" to "UK"),
adjust related content accordingly (e.g., changing
"American" to "British").
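Points 2 to 4 of the instruction above can be spot-checked automatically after an update. The sketch below reports which old surface forms (names, aliases, derived adjectives) still occur in a modified document. It is an assumption-laden illustration: the function name is hypothetical, and the word-boundary matching suits space-delimited languages but not zh or ja.

```python
import re
from typing import Iterable, List


def leftover_mentions(document: str, old_forms: Iterable[str]) -> List[str]:
    """Return the old surface forms that still appear as whole words."""
    found = []
    for form in old_forms:
        pattern = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
        if pattern.search(document):
            found.append(form)
    return found


# Hypothetical update in which "USA" was replaced by "UK".
updated = ("The British director was born in the UK, "
           "but an American studio funded the film.")
old_forms = ["USA", "United States", "America", "American"]
print(leftover_mentions(updated, old_forms))  # ['American']
```

A non-empty result flags a document in which related content was not fully adjusted, as in the missed adjective here.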
Instruction for Monolingual Knowledge Extraction
You are an accurate and reliable AI assistant that can
answer queries with the help of external documents, and you
need to use the same language as the query to give your
answer. If the information in the documents does not contain
the answer, you will generate "The document contains
insufficient information, so I cannot answer the query based
on the document." If the information in the documents
contains the correct answer, you will succinctly and directly
give all the answers without including any context, and if
there are multiple answers, separate them by ", ". Note that
the current time is the year 2200, and the temporal
information, names, and other entity information mentioned in
the text are all correct. Now the Document is :{DOCS} ...
the query is:{QUERY}"
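The output format this instruction fixes (a set refusal sentence, or answers separated by ", ") can be parsed mechanically. The following is a minimal sketch for English replies; the function name is hypothetical, and it assumes individual answers do not themselves contain ", ".

```python
from typing import List, Optional

# Refusal sentence copied from the instruction above.
INSUFFICIENT = ("The document contains insufficient information, "
                "so I cannot answer the query based on the document.")


def parse_reply(reply: str) -> Optional[List[str]]:
    """Return None for the refusal sentence, otherwise the list of answers."""
    text = reply.strip()
    if text == INSUFFICIENT:
        return None
    return [part.strip() for part in text.split(", ") if part.strip()]


print(parse_reply("Lyon, Porto"))   # ['Lyon', 'Porto']
print(parse_reply(INSUFFICIENT))    # None
```

For the other languages, the refusal sentence would have to be replaced by its translated counterpart, since the instructions were translated from the English version.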
Instruction for Cross-lingual Knowledge Transfer
You are an accurate and reliable AI assistant that can
answer queries with the help of external documents in
different languages, and you need to use the same language
as the query to give your answer. If the information in the
documents does not contain the answer, you will generate
"The document contains insufficient information, so I
cannot answer the query based on the document." If the
information in the documents contains the correct answer, you
will succinctly and directly give all the answers without
including any context, and if there are multiple answers,
separate them by ", ". Note that the current time is the year
2200, and the temporal information, names, and other
entity information mentioned in the text are all correct. Now
the Document is :{DOCS} ... the query is:{QUERY}"
Refined Instruction for Cross-lingual Knowledge Transfer
You are an accurate and reliable AI assistant that can
answer queries with the help of external documents in
different languages, and you need to use the same language
as the query to give your answer. If the information in the
documents does not contain the answer, you will generate
"The document contains insufficient information, so I
cannot answer the query based on the document." If the
information in the documents contains the correct answer, you
will succinctly and directly give all the answers without
including any context, and if there are multiple answers,
separate them by ", ". Note you need to provide the
answer from the original text and then transfer it into the
query language. The format is in the format [answer in
query language (answer in the original text), ...]. Note that
the current time is the year 2200, and the temporal
information, names, and other entity information mentioned in
the text are all correct. Now the Document is :{DOCS} ...
the query is:{QUERY}"
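The bracketed format requested by the refined instruction, [answer in query language (answer in the original text), ...], lends itself to a small parser. The sketch below is illustrative only: the function name and the sample reply are hypothetical, and it assumes ASCII parentheses and answers that contain no commas or brackets.

```python
import re
from typing import List, Tuple

# One "translated (original)" pair; separators and brackets are excluded.
PAIR = re.compile(r"([^,()\[\]]+?)\s*\(([^()]*)\)")


def parse_refined_reply(reply: str) -> List[Tuple[str, str]]:
    """Extract (answer in query language, answer in original text) pairs."""
    return [(a.strip(), b.strip()) for a, b in PAIR.findall(reply)]


# Hypothetical reply: English query, Spanish document.
print(parse_refined_reply("[London (Londres), Germany (Alemania)]"))
# [('London', 'Londres'), ('Germany', 'Alemania')]
```

Keeping both halves of each pair makes it possible to score the translated answer against the query-language ground truth and the quoted answer against the source document.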
Instruction for Multilingual Knowledge Selection
You are an accurate and reliable AI assistant that can
answer queries with the help of multiple external documents
in various languages with different answers, and you need
to use the same language as the query to give your answer.
If the information in the documents does not contain the
answer, you will generate "The document contains
insufficient information, so I cannot answer the query based on
the document." Otherwise, you will succinctly and directly
give all the answers you think are correct without
including any context, and if there are multiple answers, separate
them by ", ". Note that the current time is the year 2200, and the
temporal information, names, and other entity information
mentioned in the text are all correct. Now the Document is
:{DOCS} ... the query is:{QUERY}"
D Question Construction Guidelines
Below are the annotation guidelines for manually reviewing
and modifying data to ensure quality.
This task requires reviewing the provided documents and
QA to ensure they meet the following requirements; if they
do not, modifications must be made:
• The document must contain the knowledge mentioned
in the QA and be able to answer the QA.
• The document must not contain conflicting information.
• If the QA in the target language differs from the English
QA, modify it according to the English version.
• Ensure that the timestamps in the document and QA
are consistent, with all dates falling between the years
2124 and 2200.
How to modify the QA or the document:
• If there is a significant difference between the QA and
the English QA, a re-translation is required.
• If the document lacks information to answer the QA,
add or revise sentences to include the necessary details.
• Ensure all instances of names, last names, and aliases
are thoroughly consistent. If the main character is named
Michael Johnson, it is essential to maintain consistency
throughout. Use pronouns such as "he" (e.g., "He went to
the store"), refer to him by his last name "Johnson" (e.g.,
"Johnson attended the meeting"), or use his full name.
• Resolve any conflicts present in the document.
  - Adjust the document to reflect a future time without
  logical inconsistencies. For instance, if the document
  states that a film was shot in 2021 but is set to be
  released in 2038, the filming date should be changed to
  2038 to maintain consistency.
  - Geographical Conflicts: These typically involve
  discrepancies between city and country names. For
  example, "New York City in Italy."
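The timestamp requirement in the guidelines above (all dates between 2124 and 2200) can be pre-screened before manual review. The sketch below assumes that every standalone four-digit number in a document or QA pair is a year; the function name is hypothetical, and flagged items would still need a human decision.

```python
import re
from typing import List

# A four-digit run that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def out_of_range_years(text: str, low: int = 2124, high: int = 2200) -> List[int]:
    """Return four-digit numbers in text that fall outside [low, high]."""
    return [int(y) for y in YEAR.findall(text) if not low <= int(y) <= high]


doc = "The film was shot in 2021 and released in 2138."
print(out_of_range_years(doc))  # [2021]
```

An empty list means no year-like number violates the range; a non-empty list points the annotator to the sentences that need adjusting.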