Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention
Summary
This paper introduces Inference-Time Cross-Lingual Intervention (INCLINE), a framework designed to bridge performance gaps between high- and low-performing languages in Large Language Models (LLMs). Most existing methods to address these gaps rely on resource-intensive pretraining or fine-tuning. INCLINE instead aligns internal representations of low-performing languages with those of high-performing languages during inference, using learned alignment matrices derived from parallel sentences. These matrices are trained via Least-Squares optimization and applied layer-by-layer to transform source language representations into the target language space. Experiments across nine benchmarks and five LLMs show INCLINE significantly improves performance on both discriminative and generative tasks, with average accuracy gains up to +4.96 over strong baselines. The method is cost-effective, requiring minimal computational resources and no additional training of the LLM. Results also indicate INCLINE is effective across various model sizes, in-context learning settings, and with high-quality parallel data. The authors release code to encourage further research.
Bridging the Language Gaps in Large Language Models with
Inference-Time Cross-Lingual Intervention
Weixuan Wang1* Minghao Wu2* Barry Haddow1 Alexandra Birch1
1School of Informatics, University of Edinburgh
2Monash University
{weixuan.wang, bhaddow, a.birch}@ed.ac.uk
minghao.wu@monash.edu
*Equal contribution.
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in natural language processing but exhibit significant performance gaps among different languages. Most existing approaches to address these disparities rely on pretraining or fine-tuning, which are resource-intensive. To overcome these limitations without incurring significant costs, we propose Inference-Time Cross-Lingual Intervention (INCLINE), a novel framework that enhances LLM performance on low-performing (source) languages by aligning their internal representations with those of high-performing (target) languages during inference. INCLINE initially learns alignment matrices using parallel sentences from source and target languages through a Least-Squares optimization, and then applies these matrices during inference to transform the low-performing language representations toward the high-performing language space. Extensive experiments on nine benchmarks with five LLMs demonstrate that INCLINE significantly improves performance across diverse tasks and languages, compared to recent strong baselines. Our analysis demonstrates that INCLINE is highly cost-effective and applicable to a wide range of applications. In addition, we release the code to foster research along this line (https://github.com/weixuan-wang123/INCLINE).
1 Introduction
Large Language Models (LLMs) have achieved remarkable success across a variety of natural language processing tasks, demonstrating strong capabilities in language understanding and generation (OpenAI, 2023; Dubey et al., 2024; Mesnard et al., 2024; Anthropic, 2024; OpenAI, 2024a,b). However, despite these advancements, most state-of-the-art LLMs remain predominantly English-centric, exhibiting significant performance gaps among different languages (Petrov et al., 2023; Kumar et al., 2024), which can adversely affect user experience and potentially exclude large portions of the global population from accessing advanced AI services (Lai et al., 2023a; Wang et al., 2024a).

Figure 1: Bivariate kernel density estimation plots displaying the representations (hidden states of the last token) from 100 random examples in English (blue) and their Portuguese translations (orange) from XCOPA (Ponti et al., 2020), (a) before and (b) after intervention. After intervention using INCLINE, the Portuguese representations are aligned closer to the English representations.

Addressing the performance gaps across languages is highly challenging. Recent approaches are mostly data-driven, such as multilingual supervised fine-tuning or continued pre-training (Üstün et al., 2024; Cui et al., 2023; Kuulmets et al., 2024). However, collecting and annotating large-scale datasets for numerous underrepresented languages is both time-consuming and resource-intensive (Lai et al., 2023b). Furthermore, training LLMs on multilingual data requires substantial computational resources, limiting their practicality for widespread applications, especially in resource-constrained settings (Muennighoff et al., 2023; Li et al., 2023a).
Given these limitations, a natural question arises: How can we bridge the performance gaps between high-performing and low-performing languages without incurring prohibitive costs? Inspired by Lample et al. (2018), who show that word embeddings in different languages can be aligned to a shared representation space through learned rotations for word translation, we propose Inference-Time Cross-Lingual Intervention (INCLINE). This novel framework utilizes a group of learned alignment matrices that transform the representations (e.g., hidden states) of a low-performing (source) language into those of a high-performing (target) language during inference. Our framework comprises two main steps. First, we train the alignment matrices for each layer of the LLM using parallel sentences from the source and target languages. The learning process is formulated as a Least-Squares optimization problem, where the alignment matrices are learned by minimizing the distance between the projected source language representations and their corresponding target language representations, without extensive retraining or fine-tuning of the LLM. Second, we apply the learned alignment matrices to transform the source language input representations into the target language representation space at each layer during inference. By integrating these steps, INCLINE leverages the rich representations learned from high-performing languages to enhance performance on downstream tasks involving low-performing languages. As shown in Figure 1, INCLINE effectively aligns the input representations in Portuguese to their parallel representations in English.
In this study, we conduct extensive experiments to validate the effectiveness of INCLINE on nine widely used benchmarks using five LLMs. Our results demonstrate that aligning internal representations using INCLINE significantly improves performance on diverse tasks across languages. Our contributions are summarized as follows:

• We propose INCLINE, a cross-lingual intervention approach that enhances LLMs by transforming source language representations into a target language representation space during inference, without requiring additional training of LLMs (see Section 3).

• We conduct extensive evaluations across five discriminative tasks and four generative tasks, covering 21 languages. Our experimental results show that INCLINE significantly improves model performance, boosting average accuracy by up to +4.96 compared to strong baselines (see Section 4).

• Our detailed analysis indicates that INCLINE is highly cost-effective, as it requires minimal computational resources while delivering substantial performance improvements (see Section 5). Moreover, we demonstrate that INCLINE is effective across LLM backbones, model sizes, and in-context learning settings, underscoring its general applicability and potential for broader use in enhancing LLMs for underrepresented languages (see Section 6).

2 Related Work

Multilingual LLMs. LLMs are pivotal in multilingual NLP tasks, typically leveraging external parallel datasets for training (Xue et al., 2021; Muennighoff et al., 2023; Chung et al., 2024). For low-resource languages, data augmentation techniques generate parallel data by mining sentence pairs or translating monolingual text using machine translation tools (Edunov et al., 2018; Zhao et al., 2021; Ranaldi et al., 2023). However, these methods heavily rely on robust parallel corpora. To reduce data costs, studies have shifted toward Parameter-Efficient Fine-Tuning (PEFT) techniques (Pfeiffer et al., 2020; Parović et al., 2022; Agrawal et al., 2023; Wu et al., 2024) and cross-lingual embedding mapping methods (Mikolov et al., 2013; Ormazabal et al., 2019; Wang et al., 2022), which still demand considerable computational resources.
Multilingual Prompting. There is a growing interest in methods that do not require parameter adjustments. Prompting techniques have emerged, utilizing LLMs with multilingual prompts (Lin et al., 2021c, 2022; Shi et al., 2022b; Huang et al., 2023). However, these strategies face challenges like poor translation quality and prompt framing interference (Wang et al., 2024c). Additionally, their effectiveness varies by task, as recent research indicates that few-shot learning may not outperform zero-shot learning in translation tasks (Hendy et al., 2023).

Intervention. To address these challenges, we explore inference-time intervention techniques as cost-effective and efficient alternatives to traditional fine-tuning. Prior research in style transfer (Subramani et al., 2022; Turner et al., 2023), knowledge editing (Meng et al., 2022), and truthfulness shifting (Li et al., 2023b; Rimsky et al., 2024) demonstrates the potential of linear probe-based interventions. However, these methods have been largely limited to monolingual contexts. Our goal is to design a novel cross-lingual inference-time intervention that effectively aligns representations across languages, aiming to improve performance across multiple languages.
Figure 2: Framework of INCLINE. INCLINE involves two steps: (a) Learning the Cross-Lingual Alignment: sentence representations from a parallel dataset are used to train alignment matrices that map source (Portuguese) representations to the target (English) representations. (b) Inference-Time Transformation: this step adapts the source representations from downstream tasks into the target representation space using the alignment matrices.

3 Methodology
In Figure 2, we illustrate the framework of INCLINE, which enhances LLMs through inference-time cross-lingual intervention. Our approach comprises two main steps:

• Learning the Cross-Lingual Alignment: Using parallel corpora, we train alignment matrices for each layer to map source language representations to target language representations (see Section 3.1).

• Inference-Time Transformation: During inference, we utilize the learned alignment matrices to transform input representations from the source language into the target language representation space, thereby improving the LLM's performance on tasks in the source language (see Section 3.2).
By minimizing the distance between the source language representations and their corresponding target language representations, we effectively reduce cross-lingual representation gaps and align representation spaces across languages.
3.1 Learning the Cross-Lingual Alignment
Inspired by Schuster et al. (2019), who align embeddings across languages with learned linear transformations, we aim to learn a cross-lingual alignment matrix $W_l$ that aligns sentence representations from the source language to the target language at the $l$-th layer of the LLM. We are given a parallel dataset $D = \{(x^s_i, x^t_i)\}_{i=1}^{N}$, where each $x^s_i$ is the $i$-th source sentence and $x^t_i$ is its corresponding translation in the target language. Both $x^s_i$ and $x^t_i$ are sequences of tokens. From these sequences, we extract sentence representations by taking the hidden state of the last token in each sequence, denoted as $h^s_{i,l} \in \mathbb{R}^d$ and $h^t_{i,l} \in \mathbb{R}^d$ for the source and target sentence, respectively, where $d$ is the dimensionality of the hidden states.
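For concreteness, the per-layer last-token hidden states used above can be collected with a standard Hugging Face forward pass. The following is a minimal sketch under the assumption that the backbone is available as a causal LM with output_hidden_states support; the helper name last_token_states and the loading choices are illustrative, not taken from the paper's released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# BLOOMZ-7B1-MT is the paper's default backbone; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1-mt")
model.eval()

@torch.no_grad()
def last_token_states(sentence: str):
    """Return one vector per layer: the hidden state of the sentence's last token."""
    ids = tok(sentence, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # out.hidden_states = (embedding output, layer 1, ..., layer L)
    return [h[0, -1, :] for h in out.hidden_states[1:]]
```

Running this over each source sentence and its translation yields the paired per-layer representations from which the alignment matrices are learned.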
To minimize the difference between the projected source sentence representations and the target sentence representations, our objective can be defined as a Least-Squares optimization problem:

$$W_l^* = \arg\min_{W_l} \sum_{i=1}^{N} \left\| W_l h^s_{i,l} - h^t_{i,l} \right\|^2 \qquad (1)$$

This problem seeks the optimal $W_l^*$ that aligns the source representations with the target representations by minimizing the distance between them. Hence, the closed-form solution to this optimization problem is:

$$W_l^* = \left( \sum_{i=1}^{N} (h^s_{i,l})^{\top} h^s_{i,l} \right)^{-1} \left( \sum_{i=1}^{N} (h^s_{i,l})^{\top} h^t_{i,l} \right) \qquad (2)$$

By applying the learned alignment matrix $W_l^*$ to the source sentence representations, we effectively map them into the target language's representation space. This alignment reduces cross-lingual representation discrepancies, allowing the model to leverage knowledge from the target language to improve performance on tasks in the source language.

3.2 Inference-Time Transformation

With the learned alignment matrix $W_l^*$, we can enhance the LLM's processing of source language inputs by transforming their representations into the target representation space during inference. We denote the hidden state of the last token of the test input $q^s$ in the source language at the $l$-th layer of the LLM as $h^s_{q,l}$, and project this source language representation into the target representation space using the alignment matrix $W_l^*$:

$$\hat{h}^t_{q,l} = W_l^* h^s_{q,l} \qquad (3)$$

To perform the cross-lingual intervention at the $l$-th layer using the intervention vector $\hat{h}^t_{q,l}$, we adjust the original hidden state in the source language, $h^s_{q,l}$, by blending it with the projected hidden state in the target language, $\hat{h}^t_{q,l}$. This adjustment is controlled by a hyperparameter $\alpha$, which balances the influence between the source and target hidden states:

$$h^{\mathrm{mix}}_{q,l} = h^s_{q,l} + \alpha \hat{h}^t_{q,l} \qquad (4)$$

Equation 4 shifts the source language representation towards the target language representation by a magnitude controlled by $\alpha$.
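As a minimal illustration of Equations 1-4, the sketch below fits one alignment matrix per layer in closed form with NumPy and applies the blended shift to a single hidden state. The array and function names (H_src, H_tgt, fit_alignment, intervene) are our own illustrative choices; the sketch follows the row-per-example convention used in Equation 2.

```python
import numpy as np

def fit_alignment(H_src: np.ndarray, H_tgt: np.ndarray) -> np.ndarray:
    """Least-squares alignment for one layer (Eqs. 1-2).

    H_src, H_tgt: (N, d) arrays of last-token hidden states for N parallel
    source/target sentences at the same layer.
    """
    # Closed-form solution of min_W ||H_src @ W - H_tgt||^2.
    W, *_ = np.linalg.lstsq(H_src, H_tgt, rcond=None)
    return W  # shape (d, d)

def intervene(h_src: np.ndarray, W: np.ndarray, alpha: float) -> np.ndarray:
    """Blend a source hidden state with its projection (Eqs. 3-4)."""
    h_proj = h_src @ W                 # Eq. (3): project into the target space
    return h_src + alpha * h_proj      # Eq. (4): shift with strength alpha

# Toy usage with random stand-ins for real hidden states (the paper uses
# 500 parallel sentence pairs per language pair).
rng = np.random.default_rng(0)
H_src, H_tgt = rng.normal(size=(500, 16)), rng.normal(size=(500, 16))
W = fit_alignment(H_src, H_tgt)
h_mix = intervene(H_src[0], W, alpha=0.5)
```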
Decoding with Minimal Intervention. In this work, we conduct only a single intervention on the last token of $q^s$, replacing $h^s_{q,l}$ with $h^{\mathrm{mix}}_{q,l}$ for the test input $q^s$ at the $l$-th layer of the LLM. In this way, we can effectively intervene in the model output while preserving the features of the source language.
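One way such a single last-token intervention could be wired into a decoder-only model is with forward hooks that edit only the final prompt position during the prefill pass, as sketched below. This is an assumption-laden sketch: it assumes the per-layer module list is exposed as model.model.layers (as in LLaMA-style Hugging Face models; other families use different attribute names), and alignment_mats, alpha, and prompt_len are illustrative names rather than the authors' implementation.

```python
import torch

def add_incline_hooks(model, alignment_mats, alpha, prompt_len):
    """Blend the last prompt token's hidden state with its projection (Eq. 4).

    alignment_mats: one (d, d) tensor W_l per decoder layer.
    prompt_len: token length of the test input; only that position is edited,
    and only while the full prompt is being encoded (later decoding steps,
    which see one token at a time, are left untouched).
    """
    handles = []
    for layer, W in zip(model.model.layers, alignment_mats):
        def hook(module, inputs, output, W=W):
            hidden = output[0] if isinstance(output, tuple) else output
            if hidden.shape[1] == prompt_len:                    # prefill pass only
                h_src = hidden[:, -1, :]
                hidden[:, -1, :] = h_src + alpha * (h_src @ W)   # Eq. (4)
            return output
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the model
```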
Comparison with ITI and CAA. Recently, ITI (Li et al., 2023b) and CAA (Rimsky et al., 2024) have been proposed as interventions in model behavior that manipulate selected attention heads and hidden states, respectively. INCLINE is distinct from ITI and CAA due to three primary differences. Firstly, ITI and CAA utilize a learned static intervention vector to alter model behavior, whereas INCLINE leverages a set of alignment matrices to dynamically align input representations from the source language to the target language. Secondly, ITI and CAA apply the intervention vector across all token positions following the instruction, potentially causing excessive perturbation during inference; in contrast, INCLINE performs a single intervention solely on the last token of the input. Additionally, unlike ITI and CAA, which target only a limited number of layers, INCLINE modifies the representations across all layers. These modifications enable the LLMs to comprehensively leverage their target language capabilities for multilingual prediction.

4 Experiments

In this section, we introduce our experimental setup (Section 4.1) and present our results in Section 4.2.

4.1 Experimental Setup

We present our evaluation tasks, model backbones, implementation details of INCLINE, and baselines in this section.
Evaluation Tasks. We conduct extensive evaluations across nine diverse downstream tasks, categorized into two groups:

• Discriminative Tasks: XCOPA (Ponti et al., 2020), XStoryCloze (Lin et al., 2021b), XWinograd (Lin et al., 2021b), XCSQA (Lin et al., 2021a), XNLI (Conneau et al., 2018);

• Generative Tasks: MZsRE (Wang et al., 2024b), Flores (Goyal et al., 2021), WMT23 (Kocmi et al., 2023), MGSM (Shi et al., 2022a).

These tasks cover 21 languages, including English (en), Arabic (ar), German (de), Greek (el), Spanish (es), Estonian (et), French (fr), Hindi (hi), Indonesian (id), Italian (it), Japanese (ja), Dutch (nl), Portuguese (pt), Russian (ru), Swahili (sw), Tamil (ta), Thai (th), Turkish (tr), Ukrainian (uk), Vietnamese (vi), and Chinese (zh). We include more details of these tasks in Appendix A.

Model Backbones. In this work, we mainly use BLOOMZ-7B1-MT as our model backbone for all the baseline approaches, unless otherwise specified. To demonstrate the effectiveness of INCLINE across various model backbones, we include four additional LLMs: LLAMA3-8B-INSTRUCT (Dubey et al., 2024), LLAMA2-7B-CHAT (Touvron et al., 2023), MISTRAL-7B-INSTRUCT (Jiang et al., 2023), and FALCON-7B-INSTRUCT (Almazrouei et al., 2023). We present these results in Section 6. For the MGSM task, we employ MATHOCTOPUS (Chen et al., 2023), a specialized model fine-tuned from LLAMA2-7B for mathematical reasoning tasks, as the backbone (https://huggingface.co/Mathoctopus/Parallel_7B).
Table 1: Main results of discriminative tasks. All tasks are evaluated using accuracy; † denotes the best result. μALL, μSEEN, and μUNSEEN indicate the macro-average of results across all languages, the seen languages, and the unseen languages, respectively. Each cell lists μALL / μSEEN / μUNSEEN; values in parentheses are the gains of adding INCLINE over the corresponding BASELINE or SFT row.

BASELINE: XCOPA 61.62 / 69.00 / 52.40; XStoryCloze 74.96 / 77.83 / 57.78; XWinograd 57.05 / 59.71 / 53.06; XCSQA 47.35 / 55.31 / 34.62; XNLI 46.48 / 50.04 / 41.48
MT-GOOGLE: XCOPA 73.31† / 73.52 / 73.05†; XStoryCloze 76.63 / 76.05 / 80.08†; XWinograd 57.63 / 57.12 / 57.90†; XCSQA 58.52† / 54.84 / 64.40†; XNLI 50.72 / 49.80 / 52.00
MT-LLM: XCOPA 59.84 / 67.16 / 50.70; XStoryCloze 79.41 / 82.23 / 62.48; XWinograd 43.02 / 41.67 / 45.04; XCSQA 30.73 / 35.38 / 23.30; XNLI 43.64 / 47.83 / 37.77
ITI: XCOPA 60.91 / 67.56 / 52.60; XStoryCloze 76.38 / 79.33 / 58.70; XWinograd 48.24 / 58.37 / 33.06; XCSQA 46.32 / 55.33 / 31.92; XNLI 46.32 / 49.51 / 41.84
CAA: XCOPA 63.96 / 71.80 / 54.15; XStoryCloze 78.16 / 80.92 / 61.61; XWinograd 58.42 / 60.70 / 55.01; XCSQA 47.97 / 56.01 / 35.10; XNLI 46.17 / 50.92 / 39.52
INCLINE: XCOPA 65.22 (+3.60) / 72.56 (+3.56) / 56.05 (+3.65); XStoryCloze 79.92 (+4.96) / 82.03 (+4.20) / 67.24 (+9.46); XWinograd 59.35† (+2.30) / 62.04† (+2.33) / 55.32 (+2.26); XCSQA 48.45 (+1.10) / 56.45† (+1.14) / 35.64 (+1.02); XNLI 48.12 (+1.64) / 51.44 (+1.40) / 43.47 (+1.99)
SFT: XCOPA 66.89 / 76.84 / 54.45; XStoryCloze 87.36 / 89.50 / 74.52; XWinograd 43.78 / 48.63 / 36.50; XCSQA 42.18 / 47.95 / 32.96; XNLI 69.68 / 76.76 / 59.76
SFT+INCLINE: XCOPA 69.24 (+2.35) / 79.28† (+2.44) / 61.22 (+6.77); XStoryCloze 88.11† (+0.75) / 90.00† (+0.50) / 76.77 (+2.25); XWinograd 49.84 (+6.06) / 57.58 (+8.95) / 38.24 (+1.74); XCSQA 42.55 (+0.37) / 48.38 (+0.43) / 33.22 (+0.26); XNLI 71.17† (+1.49) / 77.83† (+1.07) / 61.84† (+2.08)
INCLINE (Ours). In this work, we mainly focus on aligning the low-performing language (source) representations closer to the English (target) representations, as LLMs are predominantly English-centric. For training the alignment matrices between languages, we randomly sample 500 parallel sentence pairs for each language pair involving English and other languages. These pairs are sourced from the News Commentary v16 dataset (Barrault et al., 2019), and for languages not covered by this dataset, we use the CCAligned dataset (El-Kishky et al., 2020). Following Rimsky et al. (2024), the value of α controlling the intervention strength lies in the range from -1 to 1 and is determined by the validation results for each language across tasks. We use one A100 GPU (40G) for all experiments.
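Since α is chosen on validation data, the per-language search can be as simple as the grid sweep sketched below; the evaluate callable (running the intervened model on a validation split and returning accuracy) and the grid resolution are assumptions for illustration, not details specified by the paper.

```python
import numpy as np

def select_alpha(evaluate, grid=np.linspace(-1.0, 1.0, 9)):
    """Grid-search the intervention strength alpha in [-1, 1].

    evaluate: callable mapping a candidate alpha to validation accuracy.
    Returns the best alpha and the full score table.
    """
    scores = {float(a): evaluate(float(a)) for a in grid}
    best = max(scores, key=scores.get)
    return best, scores
```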
Baselines. We compare INCLINE against several established techniques: (1) BASELINE indicates the predictions given by the original BLOOMZ-7B1-MT; (2) MT-GOOGLE utilizes GOOGLE TRANSLATE to translate non-English questions into English; (3) MT-LLM leverages BLOOMZ-7B1-MT to translate questions in non-English languages into English, employing the structured prompt template "{Source Language}: {Inputs} English:"; (4) SFT represents task-specific supervised fine-tuning, which updates all parameters of the LLM on the English training set for each downstream task individually, with the hyperparameters described in Appendix B, and evaluates the resulting model on the multilingual test sets; (5) ITI (Li et al., 2023b) is an intervention method that identifies attention heads with high linear probing accuracy for truthfulness and adjusts activations along these truth-correlated directions during inference; originally used to shift models from generating false statements to truthful ones, we adapt it to encourage the generation of English text over non-English text; (6) CAA (Rimsky et al., 2024) employs the mean difference in hidden states between positive and negative examples from additional training data as an intervention vector to adjust the model's behavior towards the desired direction; initially designed for monolingual alignment-relevant tasks, we utilize it to shift the model's output from non-English to English.
4.2 Results

In this section, we present our results on the discriminative tasks (Table 1) and generative tasks (Table 2). Furthermore, we categorize the languages involved in the downstream tasks into two groups based on the training data of BLOOMZ-7B1-MT: seen languages (ar, es, fr, hi, id, pt, sw, ta, vi, and zh) and unseen languages (de, el, et, it, ja, nl, ru, th, tr, and uk). The breakdown results are provided in Table 7 (see Appendix C).

INCLINE significantly improves discriminative task performance. The experimental results in Table 1 clearly demonstrate the effectiveness of INCLINE. Although methods like SFT, MT-GOOGLE, and MT-LLM achieve high performance, they come with substantial costs, including the need for extensive fine-tuning of LLMs and reliance on third-party tools. Activation intervention methods, such as ITI and CAA, offer a more cost-effective solution but yield only minimal improvements, indicating a potential inadequacy in capturing the complexities of multilingual tasks. In contrast, INCLINE provides significant performance gains by enhancing multilingual representation alignment at inference time without requiring extensive resources or dependencies. This results in a more efficient improvement in multilingual performance.
For example, INCLINE increases the average accuracy by +4.96 on XStoryCloze. Additionally, it delivers improvements of +4.20 and +9.46 for seen and unseen languages, respectively. Moreover, INCLINE can further improve the performance of task-specific SFT.

Table 2: Main results of generative tasks. † denotes the best result. μALL, μSEEN, and μUNSEEN indicate the macro-average of results across all languages, the seen languages, and the unseen languages, respectively. We use Exact Match (EM) to evaluate MZsRE, BLEU to evaluate Flores and WMT23, and accuracy to evaluate MGSM. Each cell lists μALL / μSEEN / μUNSEEN; values in parentheses are the gains of INCLINE over the BASELINE.

BASELINE: MZsRE 39.96 / 45.79 / 32.67; Flores 46.09 / 58.57 / 21.12; WMT23 13.78 / 14.39 / 13.63; MGSM 39.35 / 39.80 / 38.90
MT-GOOGLE: MZsRE 73.56† / 72.76† / 74.56†; Flores - / - / -; WMT23 - / - / -; MGSM 46.70† / 47.70† / 45.70†
MT-LLM: MZsRE 33.18 / 39.25 / 25.61; Flores - / - / -; WMT23 - / - / -; MGSM 21.40 / 30.00 / 12.80
ITI: MZsRE 36.31 / 41.72 / 29.54; Flores 2.85 / 2.97 / 1.95; WMT23 2.34 / 3.16 / 2.13; MGSM 40.50 / 41.90 / 39.10
CAA: MZsRE 42.88 / 50.17 / 33.78; Flores 47.87 / 60.63 / 16.75; WMT23 13.74 / 14.86 / 13.46; MGSM 39.43 / 40.85 / 38.00
INCLINE: MZsRE 43.22 (+3.26) / 50.21 (+4.42) / 34.49 (+1.82); Flores 48.19† (+2.10) / 61.28† (+2.71) / 22.00† (+0.88); WMT23 14.23† (+0.45) / 15.05† (+0.66) / 14.02† (+0.39); MGSM 42.85 (+3.50) / 43.30 (+3.50) / 42.40 (+3.50)

INCLINE significantly enhances generative task performance. The experimental results presented in Table 2 demonstrate the effectiveness of INCLINE in enhancing performance across generative tasks. Unlike ITI and CAA, which show only marginal improvements similar to those observed in discriminative tasks, INCLINE achieves substantial advancements. Notably, ITI struggles significantly in machine translation tasks, such as Flores and WMT23, highlighting its limitations. Furthermore, INCLINE boosts accuracy in the MGSM task by up to +3.50 across various languages.
This finding suggests that, although mathematical capabilities are independent of the language, understanding questions written in different languages still requires language-specific knowledge; INCLINE successfully transfers the LLMs' natural language understanding capabilities from English to other languages. It is important to note that SFT is not evaluated on generative tasks because there are no training sets associated with these tasks.

In summary, these results demonstrate that INCLINE offers a significant improvement in both discriminative and generative tasks by effectively aligning multilingual representations.

5 Analysis

In this section, we conduct an in-depth analysis of INCLINE, focusing on four key aspects: computational costs, enhanced consistency after intervention, the impact of the intervened components of LLMs, and the choice of intervention strength α. This analysis provides a comprehensive understanding of how INCLINE operates and its implications for model performance and efficiency.

Figure 3: (a) Training costs of INCLINE with regard to the number of parallel sentences and the time used for training alignment matrices, evaluated on XStoryCloze in Swahili. (b) Correct Prediction Consistency (CPC) between non-English and English on XStoryCloze for the model using INCLINE.
INCLINE is highly efficient for training and introduces only marginal overhead for inference. To analyze the relationship between computational costs and accuracy, we measure both the training and inference costs of INCLINE using the XStoryCloze task in Swahili. As shown in Figure 3(a), increasing the amount of training data does not necessarily lead to improved accuracy, even though the training time is directly proportional to the number of samples. In our study, we empirically determine that using 500 samples for training the alignment matrices provides the best balance between performance gains and computational costs. Consequently, the training process takes only 172 seconds. During inference, our approach involves a single intervention at the last token, resulting in a time complexity of O(1). This method incurs only a 12% increase in inference time, taking 0.80 seconds per item compared to 0.71 seconds without it, thereby maintaining a low inference cost.

Table 3: Averaged results on XCOPA, XCSQA, Flores, and MGSM with four configurations of INCLINE, given by BLOOMZ-7B1-MT.

BASELINE: XCOPA 61.60; XCSQA 47.35; Flores 46.09; MGSM 39.35
INCLINE-HIDDEN: XCOPA 65.22; XCSQA 48.45; Flores 48.19; MGSM 42.85
INCLINE-ATTN: XCOPA 63.87; XCSQA 48.18; Flores 47.54; MGSM 41.55
INCLINE-FFN: XCOPA 64.20; XCSQA 47.96; Flores 46.10; MGSM 41.80
INCLINE-EMB: XCOPA 63.16; XCSQA 47.59; Flores 39.23; MGSM 40.90

INCLINE effectively enhances the consistency of correct predictions between non-English languages (source) and English (target). Recent non-English test sets are commonly translated from their English versions, either by humans or machines, creating parallel datasets.
To quantify the alignment between non-English languages (source) and English (target), we propose using the Correct Prediction Consistency (CPC) rate. This metric measures the proportion of questions answered correctly in both languages, with a higher CPC rate indicating better alignment. The results in Figure 3(b) demonstrate that CPC significantly improves after intervention by INCLINE, suggesting that INCLINE effectively aligns non-English representations with English ones for more accurate predictions. Notably, CPC for Swahili (sw) increases from 0.54 to 0.65 with INCLINE, showing its effectiveness for low-resource languages.
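A small helper makes the metric concrete. The paper does not spell out the denominator, so the sketch below assumes CPC is the fraction of questions answered correctly in English that are also answered correctly in the source language; the function and argument names are illustrative.

```python
def cpc(correct_src, correct_en):
    """Correct Prediction Consistency between a source language and English.

    correct_src, correct_en: parallel booleans, one per test question,
    marking whether the model answered it correctly in each language.
    """
    both = sum(s and e for s, e in zip(correct_src, correct_en))
    en_total = sum(correct_en)
    return both / en_total if en_total else 0.0
```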
Intervening on hidden states yields the greatest performance improvements. We apply INCLINE to various components of LLMs, including the hidden states (INCLINE-HIDDEN), the outputs of attention heads (INCLINE-ATTN), the outputs of FFN blocks (INCLINE-FFN), and the embeddings (INCLINE-EMB). The results presented in Table 3 indicate that intervening on the hidden states (INCLINE-HIDDEN) leads to the most significant improvements across multilingual tasks. This finding suggests that hidden states capture comprehensive semantic information that is crucial for cross-lingual alignment. While INCLINE-ATTN, INCLINE-FFN, and INCLINE-EMB also enhance performance, their gains vary across tasks. These findings justify our design choice of using hidden states in INCLINE.

Figure 4: The accuracy as a function of the hyperparameter α on the XStoryCloze task (Spanish and Chinese) with BLOOMZ-7B1-MT.

Table 4: Results on the XStoryCloze dataset with five LLM backbones (accuracy per language and average).

BLOOMZ-7B1-MT: BASELINE ar 79.22, es 87.89, hi 76.37, id 84.45, ru 57.78, sw 50.50, zh 88.55, AVG 74.96; INCLINE ar 83.12, es 90.60, hi 81.47, id 86.10, ru 67.24, sw 59.70, zh 91.20, AVG 79.92
LLAMA3-8B-INSTRUCT: BASELINE ar 86.50, es 91.73, hi 84.84, id 37.46, ru 66.98, sw 54.00, zh 92.39, AVG 73.41; INCLINE ar 87.36, es 92.39, hi 85.31, id 64.53, ru 73.73, sw 55.66, zh 92.72, AVG 78.81
LLAMA2-7B-CHAT: BASELINE ar 49.37, es 47.25, hi 39.25, id 48.18, ru 34.94, sw 0.93, zh 55.53, AVG 39.35; INCLINE ar 51.42, es 56.65, hi 47.25, id 49.97, ru 41.03, sw 17.67, zh 60.69, AVG 46.38
MISTRAL-7B-INSTRUCT: BASELINE ar 18.33, es 81.34, hi 24.95, id 76.64, ru 83.65, sw 2.58, zh 90.07, AVG 53.94; INCLINE ar 36.71, es 84.23, hi 35.77, id 80.18, ru 85.13, sw 25.71, zh 90.34, AVG 62.58
FALCON-7B-INSTRUCT: BASELINE ar 53.61, es 58.31, hi 53.21, id 55.59, ru 54.60, sw 51.16, zh 54.00, AVG 54.35; INCLINE ar 54.33, es 61.81, hi 54.33, id 58.04, ru 57.91, sw 53.47, zh 59.70, AVG 57.09

The value of α varies across languages and depends on language relatedness. In this study, we introduce α to control the strength of the intervention in Equation 4. To investigate the impact of α, we conduct a grid search to find the optimal α values across the languages in XStoryCloze. We present the results for Spanish and Chinese in Figure 4. We observe that the optimal α values for these two languages have opposite signs: positive for Spanish and negative for Chinese. These findings suggest that the value of α likely depends on language relatedness, as both Spanish and English belong to the Indo-European language family, while Chinese belongs to the Sino-Tibetan language family. Results for more languages are provided in Appendix D.

6 Discussions

In this section, we conduct a series of experiments to investigate how variations in LLMs, model sizes, in-context learning, and the data used for training alignment matrices affect our results. Additionally, we explore using French as the target language (Appendix E) and examine the effects of layer-specific intervention (Appendix F).
Table 5: BLEU results on the Flores dataset with INCLINE and INCLINE-FDEV.

BASELINE: ar 66.59, el 15.30, es 48.52, fr 67.86, hi 71.97, ru 35.66, tr 12.38, vi 40.40, zh 56.11, AVG 46.09
INCLINE: ar 68.68, el 15.63, es 50.79, fr 69.93, hi 76.92, ru 37.95, tr 12.42, vi 43.11, zh 58.27, AVG 48.19
INCLINE-FDEV: ar 73.95, el 15.76, es 56.11, fr 75.84, hi 77.85, ru 39.33, tr 12.92, vi 46.49, zh 60.19, AVG 50.94

Figure 5: (a) Exact Match (left y-axis) and relative improvements over the baseline (right y-axis) on MZsRE with respect to various model sizes of BLOOMZ. (b) Exact Match scores on the MZsRE dataset with INCLINE in the zero-shot and few-shot settings, given by BLOOMZ-7B1-MT.

INCLINE consistently enhances performance across multiple LLMs. To demonstrate the versatility of INCLINE across different LLMs, we apply it to another four high-performing models on the XStoryCloze task. As shown in Table 4, INCLINE consistently enhances performance compared to the BASELINE. Specifically, we observe increases of +4.96 for BLOOMZ-7B1-MT, +5.40 for LLAMA3-8B-INSTRUCT, +7.03 for LLAMA2-7B-CHAT, +8.64 for MISTRAL-7B-INSTRUCT, and +2.74 for FALCON-7B-INSTRUCT.

Larger LLMs benefit more from INCLINE. Building on the work of Wang et al. (2024b), who demonstrate a scaling relationship between the size of backbone models and their performance, we evaluate the impact of different model sizes within the BLOOMZ series on the MZsRE dataset.
Our findings, illustrated in Figure 5(a), show that the relative performance gain of INCLINE over the baseline increases with the size of the backbone model. Specifically, the Exact Match (EM) scores (in the stacked columns) and the improvement percentages (in the line chart) indicate that larger models in the BLOOMZ series exhibit more significant enhancements when INCLINE is applied. This observation demonstrates that larger LLMs can benefit more from INCLINE.

INCLINE can further enhance model performance when combined with in-context learning. In-context learning (ICL) has been shown to improve the performance of LLMs on the MZsRE task (Wang et al., 2024b). Building upon this finding, we evaluate the effectiveness of combining INCLINE with ICL. As illustrated in Figure 5(b), INCLINE demonstrates enhanced performance, achieving an additional increase of +1.02 in average Exact Match (EM) score with four in-context examples compared to the baseline using ICL alone. While this improvement is smaller than the +3.26 increase observed in the zero-shot setting, it suggests that the benefits of INCLINE and ICL are complementary, with both methods capturing features from different perspectives. This highlights the versatility of INCLINE in various applications.

High-quality parallel sentences improve alignment in INCLINE. We explore how the quality of parallel sentences affects the performance of INCLINE. By default, the alignment matrices of INCLINE are trained using 500 random samples from the News Commentary dataset. To assess the impact of sentence quality, we also train the alignment matrices using 500 high-quality parallel sentences from the development set of Flores, which are carefully translated by professional human translators. We refer to this variant as INCLINE-FDEV.
In Table 5, INCLINE-FDEV significantly outperforms both the standard INCLINE and the BASELINE, highlighting the importance of high-quality parallel sentences.

7 Conclusion

In this paper, we introduce Inference-Time Cross-Lingual Intervention (INCLINE), an innovative framework that bridges the performance gaps between high-performing and low-performing languages in LLMs. By training alignment matrices to transform source (low-performing) language representations into the target (high-performing) language representation space, INCLINE enhances performance on underrepresented languages without requiring additional training or fine-tuning of LLMs. Extensive experiments across nine benchmarks and five LLMs demonstrate that INCLINE delivers significant improvements of up to +4.96 accuracy points compared to strong baselines, while requiring only minimal computational cost.

8 Limitations

While INCLINE demonstrates significant enhancements on multilingual tasks with cross-lingual intervention, the alignment matrices are trained for specific pairs of source and target languages. Future work will focus on developing multilingual alignment matrices that can accommodate multiple languages simultaneously, reducing the need for language pair-specific training and enhancing scalability. Implementing INCLINE also requires access to the internal layers and representations of LLMs.
For proprietary or closed-source models, or models accessible only through APIs without exposure of internal representations (e.g., GPT-4o), applying this method may not be feasible. How to perform cross-lingual alignment as a plug-and-play tool for all LLMs, including those with restricted access, requires further investigation.

References

Priyanka Agrawal, Chris Alberti, Fantine Huot, Joshua Maynez, Ji Ma, Sebastian Ruder, Kuzman Ganchev, Dipanjan Das, and Mirella Lapata. 2023. QAmeleon: Multilingual QA with only 5 examples. Transactions of the Association for Computational Linguistics, 11:1754-1771.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The Falcon series of open language models. CoRR, abs/2311.16867.

Anthropic. 2024. Claude 3.5 Sonnet.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1-61, Florence, Italy. Association for Computational Linguistics.

Nuo Chen, Zinan Zheng, Ning Wu, Linjun Shou, Ming Gong, Yangqiu Song, Dongmei Zhang, and Jia Li. 2023. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. CoRR, abs/2310.20246.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1-53.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Yiming Cui, Ziqing Yang, and Xin Yao. 2023. Efficient and effective text encoding for Chinese LLaMA and Alpaca. CoRR, abs/2304.08177.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. CoRR, abs/2407.21783.
Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 489-500. Association for Computational Linguistics.

Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2020. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 5960-5969. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation.

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are GPT models at machine translation? A comprehensive evaluation. CoRR, abs/2302.09210.
Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not all languages are created equal in LLMs: Improving multilingual capability by cross-lingual-thought prompting. arXiv preprint arXiv:2305.07004.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popovic, and Mariya Shmatova. 2023. Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet. In Proceedings of the Eighth Conference on Machine Translation, WMT 2023, Singapore, December 6-7, 2023, pages 1-42. Association for Computational Linguistics.

Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Tanuja Ganu, Sunayana Sitaram, Kalika Bali, and Akshay Nambi. 2024. Bridging the gap: Dynamic learning strategies for improving multilingual performance in LLMs. CoRR, abs/2405.18359.

Hele-Andra Kuulmets, Taido Purason, Agnes Luhtaru, and Mark Fishel. 2024. Teaching Llama a new language through cross-lingual knowledge transfer. In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 3309-3325. Association for Computational Linguistics.
Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023a. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 13171-13189. Association for Computational Linguistics.

Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023b. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 318-327. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023a. Bactrian-X: A multilingual replicable instruction-following model with low-rank adaptation. CoRR, abs/2305.15011.

Kenneth Li, Oam Patel, Fernanda B. Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023b. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023.
Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren. 2021a. Common sense beyond English: Evaluating and improving multilingual language models for commonsense reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1274-1287, Online. Association for Computational Linguistics.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona T. Diab, Veselin Stoyanov, and Xian Li. 2021b. Few-shot learning with multilingual language models. CoRR, abs/2112.10668.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021c. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019-9052.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359-17372.

Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, et al. 2024. Gemma: Open models based on Gemini research and technology. CoRR, abs/2403.08295.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 15991-16111. Association for Computational Linguistics.
OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

OpenAI. 2024a. Hello GPT-4o.

OpenAI. 2024b. Learning to reason with LLMs.

Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, and Eneko Agirre. 2019. Analyzing the limitations of cross-lingual word embedding mappings. arXiv preprint arXiv:1906.05407.

Marinela Parović, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2022. BAD-X: Bilingual adapters improve zero-shot cross-lingual transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1791–1799.

Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. 2023. Language model tokenizers introduce unfairness between languages. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. arXiv preprint arXiv:2005.00052.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. arXiv preprint arXiv:2005.00333.

Leonardo Ranaldi, Giulia Pucci, and André Freitas. 2023. Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations. CoRR, abs/2308.14186.
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. 2024. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand. Association for Computational Linguistics.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers), pages 1599–1613, Minneapolis, MN, USA. Association for Computational Linguistics.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022a. Language models are multilingual chain-of-thought reasoners. Preprint, arXiv:2210.03057.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022b. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057.

Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. 2022. Extracting latent steering vectors from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 566–581, Dublin, Ireland. Association for Computational Linguistics.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. Activation addition: Steering language models without optimization. CoRR, abs/2308.10248.

Ahmet Üstün, Viraat Aryabumi, Zheng Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. 2024. Aya model: An instruction finetuned open-access multilingual language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15894–15939, Bangkok, Thailand. Association for Computational Linguistics.
Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, AiTi Aw, and Nancy Chen. 2024a. SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 370–390, Mexico City, Mexico. Association for Computational Linguistics.

Weixuan Wang, Barry Haddow, and Alexandra Birch. 2024b. Retrieval-augmented multilingual knowledge editing. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 335–354, Bangkok, Thailand. Association for Computational Linguistics.

Weixuan Wang, Barry Haddow, Alexandra Birch, and Wei Peng. 2024c. Assessing factual reliability of large language model knowledge. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 805–819, Mexico City, Mexico. Association for Computational Linguistics.

Xinyi Wang, Sebastian Ruder, and Graham Neubig. 2022. Expanding pretrained models to thousands more languages via lexicon-based adaptation. arXiv preprint arXiv:2203.09435.

Minghao Wu, Thuy-Trang Vu, Lizhen Qu, George F. Foster, and Gholamreza Haffari. 2024. Adapting large language models for document-level machine translation. CoRR, abs/2401.06468.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Wei Zhao, Steffen Eger, Johannes Bjerva, and Isabelle Augenstein. 2021. Inducing language-agnostic multilingual representations. In Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics, pages 229–240, Online. Association for Computational Linguistics.

A Details of Datasets

The tasks, together with their output formats, prompt templates, evaluation metrics, and numbers of languages, are shown in Table 6.

B Hyperparameters for SFT

We fine-tune all parameters of the LLMs using the AdamW optimizer with a learning rate of 2 Ɨ 10⁻⁶ and a batch size of 4. Training runs for three epochs on four NVIDIA A100 GPUs (80GB). During training, we use a linear learning rate schedule with a warm-up phase covering 10% of the total training steps.
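The setup above corresponds to a standard full-parameter fine-tuning configuration. The following is a minimal sketch of how it could be expressed with the Hugging Face transformers Trainer; it assumes the batch size of 4 is per device, and the output directory and mixed-precision choice are placeholders rather than details taken from the paper.

```python
# Sketch of the Appendix B SFT configuration (assumptions noted above).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sft-checkpoints",      # placeholder
    per_device_train_batch_size=4,     # "batch size of 4" (assumed per device)
    num_train_epochs=3,                # three epochs
    learning_rate=2e-6,                # AdamW learning rate of 2e-6
    optim="adamw_torch",               # AdamW optimizer
    lr_scheduler_type="linear",        # linear schedule ...
    warmup_ratio=0.1,                  # ... with a 10% warm-up phase
    bf16=True,                         # assumed mixed precision on A100 GPUs
    logging_steps=50,
)

# Full-parameter fine-tuning would then pass these arguments to a Trainer,
# e.g. Trainer(model=model, args=training_args, train_dataset=sft_dataset),
# where `model` and `sft_dataset` are the LLM and the instruction data.
```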
C Detailed Results of Intervention

The detailed results of BASELINE, MT-GOOGLE, MT-LLM, SFT, ITI, CAA, INCLINE, and SFT+INCLINE for each language across the discriminative and generative tasks are shown in Table 7.

D The value of α across languages

We explore the optimal value of α for each language in XStoryCloze using grid search, as shown in Figure 6.
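For reference, the grid search over α is straightforward to script. The sketch below is a toy illustration rather than the paper's evaluation harness: it assumes the intervened hidden state has the form h + α(Wh āˆ’ h), which is one plausible reading of the strength axis in Figure 6, and it replaces the XStoryCloze accuracy with a stand-in scoring function on synthetic vectors.

```python
# Toy grid search over the intervention strength alpha (assumptions above).
import numpy as np

rng = np.random.default_rng(0)
dim = 16
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)   # placeholder alignment matrix
source = rng.normal(size=(32, dim))              # placeholder source-language states
target = source @ W.T                            # placeholder target-language states

def intervene(h, alpha):
    """Assumed form: shift h toward its aligned projection W @ h by alpha."""
    return h + alpha * (h @ W.T - h)

def score(h):
    """Stand-in for dev-set accuracy: closeness to the target representations."""
    return -float(np.mean((h - target) ** 2))

# alpha is swept over [-1, 1], matching the strength axis of Figure 6.
grid = np.round(np.arange(-1.0, 1.001, 0.1), 2)
best_alpha = max(grid, key=lambda a: score(intervene(source, a)))
print("best alpha on this toy setup:", best_alpha)
```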
E Projection to Non-English

We have demonstrated the effectiveness of INCLINE in aligning representations from non-English to English. To further prove the generalizability of INCLINE with another high-performing language, we conduct an ablation study aligning representations of various languages with French. As shown in Table 8, INCLINE enhances translation performance into non-English languages, with an average BLEU score increase of +5.35. This further demonstrates that INCLINE can effectively align representations across different languages.

F Layer-Specific Intervention

To examine the effects of layer-specific interventions, we apply interventions at different layers and evaluate the results on the MZsRE Portuguese test set. The findings, shown in Figure 7, demonstrate how accuracy varies with the intervened layer. Intervening in a single layer (scoring below 50) yields lower performance than intervening across all layers (52.09 in Table 7). According to the trends in Figure 7, interventions in the higher layers lead to greater improvements than those in the lower layers, likely because they mitigate information forgetting. Notably, interventions in the hidden states significantly outperform the other types. However, not every intervention leads to performance gains; both INCLINE-HIDDEN and INCLINE-FFN show substantial declines when intervening in the middle layers. The mechanisms underlying these effects merit further investigation.
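In practice, layer-specific intervention of this kind can be implemented with forward hooks, leaving the model weights untouched. The sketch below is a hypothetical illustration for a LLaMA-style decoder stack in PyTorch: the checkpoint name, the chosen layer index, the identity alignment matrix, and the h + α(Wh āˆ’ h) update are placeholders or assumptions, and only the hidden-state variant (the analogue of INCLINE-hidden) is shown.

```python
# Sketch: intervene on the hidden states produced by one decoder layer.
# Assumes a LLaMA-style Hugging Face model whose decoder layers live at
# model.model.layers[i]; W, alpha, and the layer index are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layer_idx = 24                                   # hypothetical higher layer
alpha = 1.0                                      # intervention strength
W = torch.eye(model.config.hidden_size)          # placeholder alignment matrix

def hidden_state_hook(module, inputs, output):
    # Decoder layers typically return (hidden_states, ...); handle both cases.
    hidden = output[0] if isinstance(output, tuple) else output
    aligned = hidden @ W.to(hidden.device, hidden.dtype).T
    updated = hidden + alpha * (aligned - hidden)
    if isinstance(output, tuple):
        return (updated,) + output[1:]
    return updated

handle = model.model.layers[layer_idx].register_forward_hook(hidden_state_hook)
inputs = tokenizer("Um exemplo em português.", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=20)
handle.remove()                                  # restore the unmodified model
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```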
G Details of Visualizing
Following Li et al. (2023b), we use Linear Regression to examine multilingual input representations. For each English and corresponding Portuguese sample from the News Commentary dataset (500 items in total), we extract the hidden states at the last token to create a probing dataset for each layer. We randomly divide this dataset into training and validation sets in a 4:1 ratio and fit a binary linear classifier to the training set. Similar to principal component analysis (PCA), we train a second linear probe on the same dataset, constrained to be orthogonal to the first probe. This orthogonality ensures that the two probes capture distinct aspects of the data. Finally, we project the hidden states of each sample in the MZsRE test set onto the directions defined by the probes from the last layer, allowing us to visualize and analyze the multilingual representations effectively.
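As a concrete reference for this procedure, the sketch below shows one way to fit such a probe pair on toy data: a first linear classifier separates English from Portuguese representations, and a second probe is made orthogonal to the first by removing the first direction from the features before refitting. The synthetic features, the use of logistic regression, and the deflation trick are illustrative assumptions, not the paper's code.

```python
# Toy sketch: two orthogonal linear probes over last-token hidden states.
# Assumptions: synthetic features stand in for English/Portuguese hidden
# states; orthogonality is imposed by deflating the first probe's direction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d, n = 64, 500
X_en = rng.normal(loc=0.5, size=(n, d))     # stand-in English hidden states
X_pt = rng.normal(loc=-0.5, size=(n, d))    # stand-in Portuguese hidden states
X = np.vstack([X_en, X_pt])
y = np.array([0] * n + [1] * n)             # 0 = en, 1 = pt

# 4:1 train/validation split, as in Appendix G.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# First probe: binary linear classifier on the raw features.
probe1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
d1 = probe1.coef_[0] / np.linalg.norm(probe1.coef_[0])

def deflate(Z):
    """Remove the component along the first probe's direction."""
    return Z - np.outer(Z @ d1, d1)

# Second probe: fit on the deflated features so its direction is ~orthogonal to d1.
probe2 = LogisticRegression(max_iter=1000).fit(deflate(X_tr), y_tr)
d2 = probe2.coef_[0] / np.linalg.norm(probe2.coef_[0])

print("validation accuracy of probe 1:", probe1.score(X_va, y_va))
print("|cos(d1, d2)| (should be ~0):", abs(float(d1 @ d2)))

# Visualization then projects held-out hidden states onto the two directions.
coords = np.stack([X_va @ d1, X_va @ d2], axis=1)
```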
Dataset | Output | Prompt | Metric | |L|
XCOPA | 2-way class | Here is a premise: "{premise}". A: "{choice1}" B: "{choice2}" What is the {question}? "A" or "B"? | acc. | 10
XStoryCloze | 2-way class | {input} What is a possible continuation for the story given the following options? A: {quiz1} B: {quiz2} | acc. | 8
XWinograd | 2-way class | {input} Replace the _ in the above sentence with the correct option: - {option1} - {option2} | acc. | 6
XNLI | 3-way class | Take the following as truth: {premise} Then the following statement: "{hypothesis}" is "true", "false", or "inconclusive"? | acc. | 13
XCSQA | multi-choice | Question: {question} {choice} Answer: | acc. | 14
MZsRE | answer | {context} Question: {question} Answer: | EM | 10
Flores | answer | Translate the following sentence from {language} to English: {input} | BLEU | 10
WMT23 | answer | Translate the following sentence from {language} to English: {input} | BLEU | 5
MGSM | answer | Write a response that appropriately completes the request in {language}. Please answer in {language}. ### Instruction: {query} ### Response: | EM | 9

Table 6: The nine datasets used to evaluate multilingual intervention. |L| indicates the number of languages. EM is the Exact Match score and acc. represents accuracy.
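To make the templates concrete, the snippet below fills the XCOPA template from Table 6 with a made-up example; the premise, choices, and question type are invented for illustration, and the field names simply mirror the placeholders in the table.

```python
# Fill the XCOPA prompt template from Table 6 with a made-up example.
XCOPA_TEMPLATE = (
    'Here is a premise: "{premise}". A: "{choice1}" B: "{choice2}"\n'
    'What is the {question}? "A" or "B"?'
)

example = {  # invented example, for illustration only
    "premise": "The glass fell off the table.",
    "choice1": "It broke into pieces.",
    "choice2": "It started to rain.",
    "question": "effect",
}

print(XCOPA_TEMPLATE.format(**example))
```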
[Plot omitted: accuracy vs. intervention strength α for INCLINE and the baseline, one panel per language (ar, es, hi, id, ru, sw, zh); x-axis: strength, y-axis: accuracy.]
Figure 6: Accuracy as a function of the hyperparameter α on the XStoryCloze task.
[Plot omitted: accuracy vs. layer index (0–30) for INCLINE-hidden, INCLINE-attn, INCLINE-ffn, and the baseline, with an inset zooming on layers 15–30.]
Figure 7: Accuracy under layer-specific intervention, where INCLINE-hidden, INCLINE-attn, and INCLINE-ffn denote intervention on the hidden states, the output of the attention heads, and the output of the FFN block, respectively.
Discriminative tasks

XCOPA en et id it sw ta th tr vi zh AVG
BASELINE 76.40 50.80 69.60 58.60 55.20 71.60 50.60 49.60 71.20 77.40 61.62
MT-GOOGLE - 75.40* 75.00 76.00* 76.20* 62.20 62.40* 78.40* 76.40 77.80 73.31*
MT-LLM - 44.80 69.80 59.40 60.20 71.20 47.40 51.20 61.60 73.00 59.84
SFT 86.40 50.60 78.40 67.80 59.00 77.20 47.60 53.00 83.00 84.60 66.80
ITI - 50.80 70.80 60.00 55.40 63.20 49.00 50.60 69.00 79.40 60.91
CAA - 51.20 72.20 61.20 59.20 73.00 52.20 52.00 74.80 79.80 63.96
INCLINE - 55.40 73.40 62.80 59.80 73.40 52.60 53.40 76.20 80.00 65.22
SFT+INCLINE - 53.20 81.20* 65.80 60.80 85.00* 54.40 53.40 84.40* 85.00* 69.24

XStoryCloze en ar es hi id ru sw zh AVG
BASELINE 91.46 79.22 87.89 76.37 84.45 57.78 50.50 88.55 74.96
MT-GOOGLE - 79.48 81.34 50.69 80.81 80.08* 77.04 86.96 76.63
MT-LLM - 81.80 86.83 82.59 83.59 62.48 73.66 84.91 79.41
SFT 94.11 90.47 92.85 88.22 91.59 74.52 81.14 92.72 87.36
ITI - 78.23 90.54 80.28 85.70 58.70 52.55 88.68 76.38
CAA - 86.04 90.47 79.15 88.22 61.61 52.61 89.01 78.16
INCLINE - 83.12 90.60 81.47 86.10 67.24 59.70 91.20 79.92
SFT+INCLINE - 90.93* 92.98* 89.08* 91.99* 76.77 81.93* 93.05* 88.11*

XWinograd en fr ja pt ru zh AVG
BASELINE 73.76 59.04 51.51 57.80 54.60 62.30 57.05
MT-GOOGLE - 61.45 58.39* 59.32* 57.41 50.60 57.63
MT-LLM - 54.22 47.86 33.08 42.22 37.70 43.02
SFT 78.06 62.65 14.91 43.35 58.09 39.89 43.78
ITI - 54.22 51.51 57.79 14.60 63.10 48.24
CAA - 60.24 52.87 58.17 57.14 63.69 58.42
INCLINE - 63.86* 53.18 58.56 57.46 63.69* 59.35*
SFT+INCLINE - 63.86* 16.48 46.39 60.00* 62.50 49.84

XCSQA en ar de es fr hi it ja nl pt ru sw vi zh AVG
BASELINE 76.50 52.40 33.90 64.30 63.30 48.50 41.30 36.00 28.70 61.30 33.20 40.50 55.20 57.00 47.35
MT-GOOGLE - 61.60* 65.00* 68.00* 67.20* 32.10 68.70* 57.30* 66.50* 66.90* 64.50* 19.60 62.90* 60.40* 58.52*
MT-LLM - 32.30 26.30 42.70 42.30 30.40 25.60 25.60 17.40 39.90 21.60 24.00 31.60 39.80 30.73
SFT 65.70 48.20 32.90 54.10 53.60 43.10 40.40 32.60 29.00 53.60 29.90 31.80 48.40 50.80 42.18
ITI - 52.10 34.20 64.50 63.70 48.10 40.00 25.90 26.00 61.20 33.50 40.90 54.90 57.20 46.32
CAA - 52.80 34.10 64.50 63.30 48.40 42.20 36.40 29.30 62.80 33.50 41.90 56.00 58.40 47.97
INCLINE - 53.20 34.90 65.00 63.80 48.80* 42.90 36.80 29.80 62.60 33.80 42.20* 57.30 58.70 48.45
SFT+INCLINE - 48.50 33.30 54.40 53.70 43.90 40.60 33.00 29.30 53.70 29.90 32.50 49.10 51.20 42.55

XNLI en ar de el es fr hi ru sw th tr vi zh AVG
BASELINE 54.81 53.63 43.33 41.04 51.36 50.54 50.16 47.80 45.01 40.32 34.93 49.68 49.92 46.48
MT-GOOGLE - 51.46 53.13 52.71 51.84 50.82 41.58 51.68 50.54 50.50 52.00 51.94 50.42 50.72
MT-LLM - 46.87 43.25 36.29 52.12 51.40 45.31 42.08 43.43 34.07 33.17 47.23 48.42 43.64
SFT 86.37 77.17 68.10 59.48 82.71 81.48 72.42 66.87 67.15 54.55 49.80 77.62 78.76 69.68
ITI - 53.69 45.37 41.36 50.18 51.20 50.34 47.74 43.35 38.98 35.77 48.96 48.86 46.32
CAA - 53.59 44.67 41.62 52.83 52.75 50.28 34.40 45.75 40.48 36.41 50.32 50.92 46.17
INCLINE - 53.89 47.74 41.96 54.33 53.11 50.50* 49.22 45.99 41.28 37.17 51.12 51.16* 48.12
SFT+INCLINE - 78.44* 71.02* 61.22* 83.07* 82.14* 73.85* 69.68* 69.14* 55.69* 51.60* 78.64* 79.52* 71.17*

Generative tasks

MZsRE en de es fr pt ru th tr vi zh AVG
BASELINE 96.23 55.05 48.86 49.53 45.49 30.55 6.33 38.76 51.68 33.38 39.96
MT-GOOGLE - 78.73* 76.18* 75.50* 71.74* 63.66* 78.47* 77.39* 60.97* 79.41* 73.56*
MT-LLM - 49.13 54.78 51.28 6.86 2.69 9.69 40.92 34.72 48.59 33.18
ITI - 53.84 44.41 43.34 41.99 19.11 6.59 38.63 46.70 32.17 36.31
CAA - 57.07 53.30 52.36 52.76 31.49 7.13 39.43 55.05 38.36 42.99
INCLINE - 57.20 53.30 51.82 52.09 31.49 7.40 41.86 55.32 38.49 43.22

Flores en ar el es fr hi ru tr vi zh AVG
BASELINE - 66.59 15.30 48.52 67.86 71.97 35.66 12.38 40.40 56.11 46.09
ITI - 2.39 2.34 3.71 4.40 3.31 2.44 3.03 3.64 0.37 2.85
CAA - 67.88 15.92 54.85 68.16 72.98 38.99 12.09 43.01 56.93 47.87
INCLINE - 73.95* 15.79* 56.11* 75.84* 77.85* 39.33* 12.92* 48.62* 60.19* 51.18*

WMT23 en de ja ru uk zh AVG
BASELINE - 18.26 10.17 14.73 11.36 14.39 11.78
ITI - 2.75 1.79 2.32 1.66 3.16 2.34
CAA - 16.96 10.22 15.11 11.54 14.86 13.74
INCLINE - 18.85* 10.30* 15.24* 11.71* 15.05* 14.23*

MGSM en de es fr ja ru sw th zh AVG
BASELINE 51.20 46.40 42.40 42.40 35.20 38.40 34.80 35.60 39.60 39.35
MT-GOOGLE - 46.00 50.40* 47.20* 44.40* 46.80* 45.60* 45.60* 47.60* 46.70*
MT-LLM - 20.40 38.80 32.40 10.80 18.40 22.00 1.60 26.80 21.40
ITI - 46.00 43.20 44.80 35.60 40.00 36.80 34.80 42.80 40.50
CAA - 42.40 42.00 40.00 34.40 40.80 36.20 34.40 45.20 39.43
INCLINE - 48.40* 46.80 45.20 37.60 44.80 38.00 38.80 43.20 42.85

Table 7: The overall results of the nine NLP tasks with multilingual intervention. * denotes the best results.

en ar el es hi ru tr vi zh AVG
BASELINE 45.11 44.70 15.37 39.37 50.18 36.99 10.51 38.77 42.20 35.91
INCLINE 52.36 52.33 15.62 51.37 55.40 39.69 10.94 46.48 47.14 41.26

Table 8: INCLINE on the many-to-French translation task.