ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
Summary
This paper introduces ShifCon, a framework designed to enhance the performance of non-dominant languages in large language models (LLMs). Despite multilingual fine-tuning, non-dominant languages lag behind dominant ones like English due to imbalanced training data. ShifCon addresses this by aligning the internal forward process of non-dominant languages with that of the dominant language. It shifts non-dominant language representations into the dominant language subspace to access richer model information, then shifts them back before generation. A subspace distance metric identifies optimal layers for this shift, and multilingual contrastive learning further aligns representations. Experiments show ShifCon significantly improves performance, especially for low-resource languages, across multiple models and tasks. The framework is effective across different model scales and families, with analysis confirming the importance of selected layers for information aggregation.
arXiv:2410.19453v6 [cs.CL] 27 Jun 2025
ShifCon: Enhancing Non-Dominant Language Capabilities with a
Shift-based Multilingual Contrastive Framework
Hengyuan Zhang1*†, Chenming Shang1†, Sizhe Wang3, Dongdong Zhang2✉,
Yiyao Yu1, Feng Yao4, Renliang Sun5, Yujiu Yang1✉, Furu Wei2
1 Tsinghua University 2 Microsoft 3 University of Southern California
4 University of California, San Diego 5 University of California, Los Angeles
{zhang-hy22,scm22}@mails.tsinghua.edu.cn
Abstract
Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance the multilingual capabilities of LLMs, they still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based multilingual Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.
1 Introduction
While LLMs have demonstrated strong multilingual capabilities (Lin et al., 2022; Achiam et al., 2023; Anil et al., 2023), a performance gap remains between the dominant language and non-dominant ones, primarily due to the imbalance in training data across languages (Shi et al., 2022; Huang et al., 2023; Gurgurov et al., 2024). A common strategy to mitigate this issue is translating dominant language data into non-dominant languages and applying Multilingual Supervised Fine-Tuning (MSFT) on the resulting multilingual datasets (Chen et al., 2023a; Zhang et al., 2023b).

* This work was done during an internship at Microsoft. † Equal contribution. ✉ Corresponding author.

Figure 1: Two different projections of the sentence representations, visualized using LDA. Projection (a) shows that the representations are mutually aligned, implying a language-agnostic status, whereas projection (b) shows separated representations in distinct spaces, suggesting a language-specific status. The sentence representations are obtained by mean-pooling the hidden states from the 15th layer of Llama-2 7B.

While MSFT provides initial capabilities for non-dominant languages, two key challenges limit further progress: 1) annotating high-quality data for non-dominant languages is expensive, even for the dominant language that serves as the source for translation (Kholodna et al., 2024); 2) translation errors often lead to error propagation in subsequent procedures (Agrawal et al., 2024), thus requiring extensive verification to ensure data quality. As a result, high-quality data for non-dominant languages is limited in scale, which restricts the effectiveness of MSFT. This raises an
important question: Can we improve the performance of non-dominant languages with limited MSFT data?

Considering this external limitation, previous work has explored internal representation alignment to improve performance (Yoon et al., 2024; Li et al., 2024a). A growing consensus indicates that the language-agnostic representations exhibited in the middle layers of the model are what facilitate this enhancement (Kojima et al., 2024; Tang et al., 2024). Beyond those efforts, we argue that the representations, even in the middle layers, still retain language-specific information. Specifically, by visualizing sentence representations of translation pairs using linear discriminant analysis (LDA) in Fig. 1, we observe that representations under projection (a) in the middle layers are mapped closely together (e.g., the 15th of the 32 layers of Llama-2 7B), suggesting a language-agnostic status, consistent with findings in prior research. However, in projection (b), we find that different languages occupy distinct subspaces across layers, indicating that language-specific information is consistently encoded within the representations (see Appendix A.1 for complete results across all languages, layers, and models). This information enables the model to differentiate between languages. Moreover, we consider that the superior performance of dominant languages is due to their representations being able to access more information during the internal forward process. This is because dominant language data predominates during pre-training, so much of the model's knowledge is encoded in the dominant language
format, which is more easily accessible through its representations (Kassner et al., 2021; Yin et al., 2022; Zhao et al., 2024).

Based on these findings, we propose a Shift-based multilingual Contrastive framework (ShifCon) to boost the performance of non-dominant languages. It includes shift-toward and shift-backward projections, as well as multilingual contrastive learning (MCL). The shift-toward process maps non-dominant language representations into the dominant language subspace to obtain their dominant-like representations, allowing them to access more information encoded in the model, similar to how the dominant language operates. As language-specific information is crucial for generating outputs in the target language (Li and Murray, 2023; Xu et al., 2023; Tang et al., 2024), a shift-backward process is needed to project the enriched dominant-like representations back into the original non-dominant language subspace before generation. During this process, a subspace distance metric is introduced to pinpoint the optimal layer area for shifting representations. Moreover, our analysis reveals that even after shifting, the alignment between the dominant-like representations of non-dominant languages and their dominant language counterparts remains insufficient. Therefore, we further apply multilingual contrastive learning to enhance their alignment.

To summarize, our contributions are as follows:
1) We present the ShifCon framework, designed to boost the performance of non-dominant languages by aligning their internal forward process with that of the dominant language. We also define a subspace distance metric to pinpoint the optimal
layer area for implementing shift projection.
2) Extensive experiments validate the efficacy of ShifCon across diverse tasks and model scales, e.g., an 18.9% improvement on MGSM for low-resource languages with Llama-2 7B. Further analysis confirms the effectiveness of the layer area identified for shift projection using the subspace distance metric. The improved alignment between dominant-like representations and their dominant counterparts enhances overall performance.
3) Moreover, we speculate that the 30% of model layers with the lowest distance are likely focused on information aggregation, and we show that directly applying MCL to original representations may compromise the language-specific information within representations, which impedes the model's ability to generate in that language.
2 The Framework
Our ShifCon (shown in Fig. 2) includes two modules: 1) Shift Projection (§ 2.1), which maps the representations of a non-dominant language into the dominant language subspace to obtain its dominant-like representations during the internal forward process, and then shifts them back to their native space before generation; 2) Multilingual Contrastive Learning (§ 2.2), which further aligns dominant-like representations of non-dominant languages with their dominant language counterparts.
2.1 Shift Projection
2.1.1 Shift-toward and Shift-backward
To obtain the dominant-like representations for non-dominant languages, thereby enabling them to access more information encoded in the model parameters during the internal forward process, our shift-toward module maps non-dominant language representations into the dominant language subspace.
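The shift-toward and shift-backward projections (Eqs. 1 and 2) reduce to subtracting one language vector and adding another. The NumPy sketch below illustrates that algebra on toy data; it is not the authors' implementation. The function names, the random stand-in "hidden states", and the choice of mean pooling for the language vectors are assumptions made for illustration.

```python
import numpy as np


def language_vector(sentence_hidden_states):
    """Average of per-sentence pooled vectors for one language at one layer.

    Assumption: mean pooling over tokens; the paper compares pooling
    methods in its appendix, so this is only one possible choice.
    """
    sentence_vecs = [h.mean(axis=0) for h in sentence_hidden_states]  # each h: (n_tokens, d)
    return np.mean(sentence_vecs, axis=0)  # shape (d,)


def shift_toward(h_l, v_l, v_d):
    """Eq. (1): move language-l hidden states toward the dominant subspace."""
    return h_l - v_l + v_d


def shift_backward(h_tilde, v_l, v_d):
    """Eq. (2): move dominant-like hidden states back to the subspace of l."""
    return h_tilde - v_d + v_l


# Toy data: hypothetical hidden states for 5 sentences per language.
rng = np.random.default_rng(0)
d = 8
zh_sents = [rng.normal(loc=2.0, size=(int(rng.integers(3, 7)), d)) for _ in range(5)]
en_sents = [rng.normal(loc=-1.0, size=(int(rng.integers(3, 7)), d)) for _ in range(5)]

v_zh = language_vector(zh_sents)  # non-dominant language vector
v_en = language_vector(en_sents)  # dominant language vector

h = zh_sents[0]                               # (n_tokens, d) hidden states of one query
h_tilde = shift_toward(h, v_zh, v_en)         # dominant-like representation
h_back = shift_backward(h_tilde, v_zh, v_en)  # back in the original subspace
```

In the framework itself the two shifts are applied at different layers (L_to and L_bk) with layer-specific language vectors, and the layers in between transform the representation, so the round trip is not an identity in practice. The sketch reuses a single pair of vectors only to check the algebra.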
Figure 2: An illustration of our ShifCon framework, with panels (a) ShifCon and (b) Multilingual Contrastive Learning: (I) We shift non-dominant language representations (e.g., Chinese and Russian) into the dominant language subspace (e.g., English) to obtain their dominant-like representations. (II) Using parallel translation inputs between the non-dominant and dominant languages as positive samples, multilingual contrastive learning pushes the dominant-like representations of non-dominant languages closer to the dominant language and pushes them away from other representations.

Specifically, given an input query in a non-dominant language l, the shift-toward process can be formulated as follows:

$$\tilde{h}^{L_{to}}_{l} = h^{L_{to}}_{l} - v^{L_{to}}_{l} + v^{L_{to}}_{d}, \quad 1 \le L_{to} < L \tag{1}$$

where $L_{to}$ is the layer at which we shift the representation toward the dominant subspace, $h^{L_{to}}_{l} \in \mathbb{R}^{n \times d}$ denotes the $L_{to}$-th layer hidden states of the input query in language l, n is the number of tokens in the input query, and d is the hidden dimension of the LLM. $v^{L_{to}}_{l} \in \mathbb{R}^{d}$ and $v^{L_{to}}_{d} \in \mathbb{R}^{d}$ are the $L_{to}$-th layer language vectors for the non-dominant language l and the dominant language, respectively.¹ To compute the language vectors across all layers for each language l, a set of sentences in
that language is fed into the LLM. From the i-th layer of the LLM, sentence vectors are obtained by pooling the token representations² within the sentence. These sentence vectors are then averaged to produce $v^{i}_{l} \in \mathbb{R}^{d}$. In this way, we gather a set of vectors $V_l = [v^{1}_{l}, v^{2}_{l}, \ldots, v^{L}_{l}]$, where L denotes the number of layers in the LLM. The obtained dominant-like representations of the non-dominant language are then fed to the succeeding layers to access the relatively rich information encoded in the model parameters.

¹ We utilize language vectors in the shift projection process, as this has been demonstrated to be an effective approach for language space mapping (Libovický et al., 2020; Xu et al., 2023; Tang et al., 2024).
² We explore different pooling methods in Appendix A.3.

Since language-specific information is crucial for models to generate answers in that language, we shift the dominant-like representations of the non-dominant language back to its native subspace at the $L_{bk}$-th layer before generation:

$$h'^{L_{bk}}_{l} = \tilde{h}^{L_{bk}}_{l} - v^{L_{bk}}_{d} + v^{L_{bk}}_{l}, \quad L_{to} < L_{bk} \le L \tag{2}$$

where $L_{bk}$ is the layer at which we shift the representation backward, and $\tilde{h}^{L_{bk}}_{l}$ represents the $L_{bk}$-th layer hidden states of the non-dominant language l. These hidden states are dominant-like because of the shift-toward projection. They are shifted back into their original subspace, resulting in $h'^{L_{bk}}_{l}$. The representations, now containing language-specific information of l, are then fed into the subsequent layers to produce responses in language l.
2.1.2 Language Subspace Distance
It is crucial to establish an effective criterion for determining the optimal layer area for conducting shift projection
procedure. A practical solution is to select layers where the subspace of the non-dominant language's dominant-like representations³ aligns well with the subspace of the dominant language counterparts, as greater alignment indicates that they can be more similar in the internal forward process. Therefore, we introduce a subspace distance metric to measure the alignment between their subspaces, where smaller distances indicate stronger alignment.

³ We term the subspace of a non-dominant language's dominant-like representations the "dominant-like subspace".

Specifically, for a language A, we define an affine subspace $S_A$ using the language's mean representation $\mu_A \in \mathbb{R}^{d}$ along with $k_A$ principal directions of maximal variance in the language, defined by an orthonormal basis $V_A \in \mathbb{R}^{d \times k_A}$. We consider that this basis with $k_A$ directions best describes the language-specific information of language A. To identify this subspace, we use $X_A \in \mathbb{R}^{n \times d}$ to obtain $\mu_A$ and employ singular value decomposition (SVD) on $X_A$ to obtain $V_A$, which is selected according to the top-$k_A$ singular values $\Sigma_A \in \mathbb{R}^{k_A \times k_A}$. Here, $X_A$ denotes n contextualized token representations of dimensionality d in language A from the desired layer. We select the subspace dimensionality k such that the subspace accounts for 90% of the total variance in the language.⁴

Due to the varying dimensionality k of V across different languages, we adopt a Riemannian distance metric that measures distances between positive definite matrices (Bonnabel and Sepulchre, 2009; Chang et al., 2022) to quantify the distance between the dominant-like subspace $S_{D'}$ and the corresponding dominant language subspace
$S_D$:⁵

$$\mathrm{Dist}(S_{D'}, S_D) = \sqrt{\sum_{i=1}^{d} \log^2(\lambda_i)} + \lVert \mu_{D'} - \mu_D \rVert_2 \tag{3}$$

where $\lambda_i$ is the i-th positive real eigenvalue of $K_{D'}^{-1} K_D$. Here $K_D \in \mathbb{R}^{d \times d}$ can be calculated from the right singular matrix $V_D$ obtained from the SVD:

$$K_D = \frac{1}{n-1} V_D \Sigma_D^{2} V_D^{T} \tag{4}$$

We present the distance results for XGLM 7.5B in Fig. 3. We observe that the subspace distances in the middle layers are minimal, while the distances on both sides are larger, with steep slopes. This observation suggests that the middle layers of the model achieve superior alignment between dominant-like representations and their dominant language counterparts, enabling them to access richer information analogous to dominant language representations, which renders these layers suitable for shift projection.

⁴ See more details of the computing process in Appendix A.2.
⁵ After applying shift projection, the centroids of the two subspaces coincide, so that $\lVert \mu_{D'} - \mu_D \rVert_2 = 0$.

Figure 3: The distance between the dominant-like subspace $S_{D'}$ and the corresponding dominant language subspace $S_D$ in XGLM 7.5B, using 1k FLORES samples per language. Its low subspace distance area, [13, 22], is identified by β=30% (Finding 1), indicating shifting toward in the 13th layer and backward in the 22nd layer.

To precisely identify these layers, we propose a simple method of sorting the distances in ascending order and selecting the top-β⁶ layers with the smallest distances to establish the low subspace distance area. We find that the layers within the low subspace distance area are contiguous across models of different families and scales, making them ideally suited for shift projection.
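The subspace distance (Eqs. 3 and 4) and the top-β layer selection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are mean-centered before the SVD, a pseudo-inverse stands in for the inverse because the rank-k matrix K is singular, and the centroid term is added outside the square root, which is one reading of Eq. (3). All function names and the toy inputs are hypothetical.

```python
import math
import numpy as np


def fit_subspace(X, var_ratio=0.90):
    """Return (mean, K) for token representations X of shape (n, d)."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    # Assumption: the SVD is taken on mean-centered representations.
    _, s, vt = np.linalg.svd(X - mu, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(explained, var_ratio) + 1)  # smallest k reaching var_ratio
    V = vt[:k].T                                        # (d, k) orthonormal basis
    K = (V * s[:k] ** 2) @ V.T / (n - 1)                # Eq. (4): V Sigma^2 V^T / (n - 1)
    return mu, K


def subspace_distance(X_a, X_b, tol=1e-8):
    """Riemannian-style distance between two language subspaces plus centroid gap."""
    mu_a, K_a = fit_subspace(X_a)
    mu_b, K_b = fit_subspace(X_b)
    # Assumption: K_a is rank-deficient, so a pseudo-inverse replaces K_a^{-1}.
    eig = np.linalg.eigvals(np.linalg.pinv(K_a, rcond=1e-10) @ K_b)
    # Keep only the positive real eigenvalues, as in the definition of lambda_i.
    lam = eig.real[(eig.real > tol) & (np.abs(eig.imag) < tol)]
    return float(np.sqrt(np.sum(np.log(lam) ** 2)) + np.linalg.norm(mu_a - mu_b))


def low_distance_area(distances, beta=0.30):
    """Pick the ceil(N * beta) layers with the smallest distance (1-indexed)."""
    n_layers = len(distances)
    count = math.ceil(round(n_layers * beta, 9))  # rounding guards against float noise
    chosen = np.argsort(distances, kind="stable")[:count]
    return sorted(int(i) + 1 for i in chosen)


# Toy check 1: identical data gives distance ~0; a pure translation of the
# data changes only the centroid term.
rng = np.random.default_rng(0)
scales = np.array([5.0, 4.0, 3.0, 0.5, 0.3, 0.1])
X = rng.normal(size=(200, 6)) * scales
d_same = subspace_distance(X, X)
d_shift = subspace_distance(X, X + 3.0)

# Toy check 2: a hypothetical U-shaped distance curve over 32 layers.
toy_distances = [(layer - 17.5) ** 2 for layer in range(1, 33)]
area = low_distance_area(toy_distances, beta=0.30)
```

With 32 layers and β = 30%, the selection keeps ⌈9.6⌉ = 10 layers, which matches the size of the [13, 22] area reported for XGLM 7.5B. The toy curve is constructed so that the same interval is recovered; real distances would come from hidden states at each layer.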
2.2 Multilingual Contrastive Learning (MCL)
However, as shown in Fig. 3, some subspace distance still remains even in the low subspace distance area (e.g., the 16th layer of XGLM 7.5B still exhibits a subspace distance of about 47), which requires further alignment to reduce. To address this, we employ multilingual contrastive learning to achieve a more refined alignment. We use translation pairs from the dominant and non-dominant languages as positive pairs, pulling the dominant-like representations of the non-dominant language closer to their dominant language counterparts, while the dominant-like representations of other sentences in the same batch serve as negative samples.
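The contrastive objective of this section has the familiar InfoNCE shape. The NumPy sketch below computes only the forward value of such a loss for one layer, following the form of Eq. (5): each pooled dominant-like representation is scored against every pooled dominant-language representation in the batch using cosine similarity and a temperature. The function name, the temperature value 0.1, and the random toy embeddings are assumptions for illustration; a training implementation would use an autodiff framework.

```python
import numpy as np


def mcl_loss(e_tilde, e_dom, tau=0.1):
    """InfoNCE-style loss summed over a batch of N translation pairs.

    e_tilde: (N, d) pooled dominant-like representations (non-dominant inputs)
    e_dom:   (N, d) pooled representations of the dominant-language translations
    tau:     temperature (0.1 is a placeholder, not a value from the paper)
    """
    a = e_tilde / np.linalg.norm(e_tilde, axis=1, keepdims=True)
    b = e_dom / np.linalg.norm(e_dom, axis=1, keepdims=True)
    logits = a @ b.T / tau                                # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # shift rows for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.trace(log_prob))                     # positives sit on the diagonal


# Toy batch: row i of e_tilde is a noisy copy of row i of e_dom.
rng = np.random.default_rng(0)
e_dom = rng.normal(size=(4, 16))
e_tilde = e_dom + 0.05 * rng.normal(size=(4, 16))

loss_aligned = mcl_loss(e_tilde, e_dom)
# Rolling the rows pairs every sentence with a wrong translation.
loss_shuffled = mcl_loss(e_tilde, np.roll(e_dom, 1, axis=0))
```

Correctly paired inputs should yield a much smaller loss than mismatched ones, which is the behavior the alignment objective relies on. Summing this per-layer value over the shifted layers would give the total contrastive term.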
Formally, given a mini-batch of translation pairs from non-dominant and dominant languages $\{(s^{i}_{l}, s^{i}_{d})\}_{i=1}^{N}$, the Multilingual Contrastive Learning (MCL) loss at the t-th layer is:

$$\tilde{e}^{i}_{l} = g\big([\tilde{h}^{t}_{l}]^{i}\big); \qquad e^{i}_{d} = g\big([h^{t}_{d}]^{i}\big)$$

$$\mathcal{L}^{t}_{\mathrm{MCL}}(\theta) = \sum_{i=1}^{N} -\log \frac{\exp\big(\mathrm{sim}(\tilde{e}^{i}_{l}, e^{i}_{d})/\tau\big)}{\sum_{j} \exp\big(\mathrm{sim}(\tilde{e}^{i}_{l}, e^{j}_{d})/\tau\big)} \tag{5}$$

where g(·) is the pooling method used to obtain sentence representations, $[\tilde{h}^{t}_{l}]^{i}$ denotes the t-th layer dominant-like representations of $s^{i}_{l}$, $[h^{t}_{d}]^{i}$ is the t-th layer representations of $s^{i}_{d}$, sim(·, ·) is the cosine similarity function, and τ is a temperature hyperparameter. MCL is performed on the layers in $[L_{to}, L_{bk})$ to achieve better alignment, resulting in the total MCL loss $\mathcal{L}_{\mathrm{MCL}} = \sum_{t=L_{to}}^{L_{bk}-1} \mathcal{L}^{t}_{\mathrm{MCL}}$.

⁶ We test β from 0% to 100%, choosing ⌈N × β⌉ layers to define the low subspace distance area, where ⌈·⌉ is the ceiling function.

| Model | MGSM (High) | MGSM (Low) | FLORES en-xx (High) | FLORES en-xx (Low) | FLORES xx-en (High) | FLORES xx-en (Low) | XCOPA (High) | XCOPA (Low) | XNLI (High) | XNLI (Low) | XStoryCloze (High) | XStoryCloze (Low) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2 7B | 35.2 | 5.1 | 33.5 | 15.9 | 39.8 | 21.4 | 63.2 | 49.7 | 45.2 | 35.2 | 74.7 | 56.6 |
| +MSFT | 44.9 | 29.5 | 34.7 | 18.4 | 40.4 | 24.7 | 64.2 | 52.0 | 46.4 | 37.6 | 75.3 | 58.7 |
| +AFP | 46.3 | 31.7 | 35.2 | 19.1 | 41.0 | 25.3 | 65.0 | 52.8 | 46.8 | 38.7 | 76.0 | 59.8 |
| +ShifCon | 48.2 | 35.1 | 35.6 | 19.7 | 41.8 | 26.4 | 65.5 | 53.5 | 47.2 | 40.1 | 76.6 | 60.8 |
| XGLM 7.5B | 4.0 | 1.9 | 32.2 | 31.5 | 41.2 | 35.8 | 63.8 | 57.3 | 44.5 | 41.4 | 65.2 | 58.4 |
| +MSFT | 10.6 | 7.0 | 33.5 | 32.8 | 42.3 | 37.3 | 64.9 | 58.3 | 45.9 | 42.3 | 66.7 | 60.1 |
| +AFP | 12.1 | 9.6 | 34.0 | 33.3 | 43.2 | 37.7 | 65.7 | 58.9 | 47.0 | 43.3 | 67.4 | 60.9 |
| +ShifCon | 13.7 | 11.7 | 34.5 | 34.1 | 43.7 | 38.5 | 66.8 | 60.1 | 48.6 | 44.3 | 68.1 | 62.2 |
| BLOOM 7.1B | 13.2 | 3.7 | 41.4 | 24.3 | 45.7 | 30.7 | 57.7 | 52.1 | 42.4 | 36.6 | 67.3 | 58.1 |
| +MSFT | 21.9 | 12.5 | 42.3 | 25.9 | 46.5 | 33.1 | 59.2 | 53.9 | 44.0 | 38.9 | 68.6 | 59.8 |
| +AFP | 22.9 | 15.7 | 43.0 | 26.6 | 47.0 | 33.6 | 59.9 | 54.8 | 44.9 | 39.9 | 68.9 | 60.2 |
| +ShifCon | 24.5 | 18.8 | 43.4 | 27.2 | 47.2 | 34.5 | 60.3 | 56.3 | 45.5 | 40.8 | 69.5 | 60.9 |

Table 1: The average results of high- and low-resource languages across five tasks within three distinct model families. MGSM and FLORES are generation tasks; XCOPA, XNLI, and XStoryCloze are classification tasks. Detailed results for each language can be found in Appendix A.7. "en-xx" denotes translation from English to another language, while "xx-en" indicates translation from another language to English. The base model, e.g., Llama-2 7B, indicates fine-tuning solely with English data.

We illustrate the process of MCL in Fig. 2 (b) and train our ShifCon using the following loss:

$$\mathcal{L}_{\mathrm{ShifCon}}(\theta) = \mathcal{L}_{\mathrm{MSFT}}(\theta) + \alpha \mathcal{L}_{\mathrm{MCL}}(\theta) \tag{6}$$

where $\mathcal{L}_{\mathrm{MSFT}}$ denotes the loss of MSFT, computed through autoregressive language modeling on the multilingual dataset, and $\alpha \in \mathbb{R}^{+}$ is a hyperparameter to balance these two losses. It is important to note that when computing $\mathcal{L}_{\mathrm{MSFT}}$ for non-dominant language samples, their dominant-like representations are used during the internal forward process
instead of their original ones.⁷

⁷ In this work, we introduce a new strategy to obtain better language vectors for shift projection in the training phase. The details are illustrated in Appendix A.6.

3 Experiment
3.1 Experiment Settings
Evaluation Tasks We conduct evaluations on a variety of multilingual benchmarks, covering both generation and classification tasks. 1) For generation tasks, we consider FLORES (Team, 2022), a benchmark for machine translation, and MGSM (Shi et al.), a multilingual math reasoning task. 2) For classification tasks, we utilize XNLI (Conneau et al., 2018), XCOPA (Ponti et al., 2020), and XStoryCloze (Lin et al., 2022), which are widely used generic reasoning datasets.

For the evaluation of MGSM, we utilize MGSM8KInstruct (Chen et al., 2023a) as the training set, which translates GSM8K into nine non-English languages. For the evaluation of the other tasks, we follow Li et al. (2024a) and utilize Bactrian-X (Li et al., 2023b), which has been translated into 52 languages from Alpaca (Taori et al., 2023) and Dolly (Conover et al., 2023), as the training set. See Appendix A.4 for more details about the datasets used in the experiments.

Metrics For MGSM, we implement a rule-based extraction strategy (Chen et al., 2023a) to derive accuracy results in a zero-shot manner. We utilize the evaluation framework introduced by Zhang et al. (2024c) for assessing the other benchmarks in a 4-shot manner. Specifically, we assess the performance on the FLORES dataset using the ChrF++ (Popović, 2017) score, while the performance on the other datasets is evaluated based on rank classification accuracy.⁸

⁸ The scoring function
averages per-token logarithmic probabilities, excluding shared prefixes. The candidate with the highest score is chosen as the prediction.

Figure 4: The average results of all benchmarks across different β ratios in three distinct family models.

Training Setup We incorporate LLMs from different families, such as Llama (Touvron et al., 2023), BLOOM (Scao et al., 2022), and XGLM (Lin et al., 2022), in our experiments. We utilize English as the dominant language in these three model families, as its data predominates in their corresponding pre-training corpora. The models trained using MSFT and the state-of-the-art alignment framework AFP (Li et al., 2024a) serve as the baselines for comparison. Since both MGSM8KInstruct and Bactrian-X are constructed through translation, we directly extract the instruction content from their respective datasets to acquire the translation pairs for MCL. The details of model information and training settings can be found in Appendix A.5.
3.2 Performance of ShifCon
We categorize the experimental languages into high- and low-resource languages based on their data ratios in the LLM pre-training corpus, and report their average results across different tasks in Table 1. As shown in Table 1, despite the initial capabilities provided by MSFT for non-dominant languages, our ShifCon consistently further boosts their performance. Specifically, for XGLM 7.5B, our ShifCon improves performance by 2.1% for the high-resource languages on XCOPA, with a more substantial improvement of 3.5% for the low-resource languages. Moreover, we observe that the enhancement of multilingual understanding also
facilitates generation. For example, with Llama-2 7B on MGSM, ShifCon exhibits a relative improvement over MSFT of 7.3% for high-resource languages (44.9 → 48.2) and a more significant relative improvement of 18.9% for low-resource languages (29.5 → 35.1). Based on these observations, we conclude that: ShifCon improves the performance of non-dominant languages, especially for low-resource languages.

| Model | XCOPA (High) | XCOPA (Low) | XNLI (High) | XNLI (Low) | XStoryCloze (High) | XStoryCloze (Low) |
|---|---|---|---|---|---|---|
| XGLM 564M | 54.3 | 51.1 | 37.6 | 35.2 | 56.1 | 53.0 |
| +MSFT | 56.5 | 52.7 | 40.4 | 37.8 | 57.5 | 55.4 |
| +AFP | 57.3 | 53.9 | 41.5 | 39.1 | 58.0 | 56.6 |
| +ShifCon | 58.4 | 55.8 | 42.6 | 40.5 | 59.8 | 58.1 |
| XGLM 2.9B | 61.5 | 54.9 | 41.8 | 37.6 | 61.7 | 54.9 |
| +MSFT | 63.4 | 57.2 | 44.6 | 40.5 | 64.1 | 57.6 |
| +AFP | 64.0 | 58.4 | 45.2 | 41.4 | 65.3 | 58.8 |
| +ShifCon | 65.5 | 59.8 | 46.8 | 43.3 | 66.5 | 60.4 |
| BLOOM 560M | 53.8 | 51.2 | 39.8 | 34.2 | 60.3 | 54.2 |
| +MSFT | 55.1 | 52.3 | 41.7 | 35.4 | 62.2 | 54.1 |
| +AFP | 55.8 | 53.2 | 42.6 | 36.6 | 62.8 | 55.3 |
| +ShifCon | 56.7 | 54.8 | 43.5 | 38.2 | 63.6 | 56.8 |
| BLOOM 1.7B | 55.4 | 51.7 | 41.5 | 35.3 | 62.4 | 54.8 |
| +MSFT | 56.9 | 53.4 | 43.2 | 36.3 | 64.6 | 56.3 |
| +AFP | 57.8 | 54.5 | 44.0 | 37.3 | 65.2 | 57.6 |
| +ShifCon | 58.7 | 55.8 | 44.8 | 38.9 | 66.8 | 59.2 |
| Llama-3 8B | 68.6 | 54.3 | 50.6 | 41.5 | 78.5 | 63.9 |
| +MSFT | 69.0 | 55.1 | 51.1 | 42.4 | 78.8 | 64.7 |
| +AFP | 69.3 | 56.0 | 51.3 | 43.1 | 79.1 | 65.6 |
| +ShifCon | 69.7 | 56.9 | 51.6 | 44.2 | 79.5 | 66.4 |

Table 2: The average performance of high- and low-resource languages across three classification tasks under models of different scales and families. The base model indicates fine-tuning solely with English data.
3.3 Further Analysis
Suitable β for Shift Projection We conduct extra experiments to determine the number of layers through which non-dominant languages should be forwarded in their dominant-like representations during the internal forward process. In Fig. 4, the average performance of all benchmarks across three model families is shown for various
selection ratios β (as defined in § 2.1), ranging from 0% to 100%. The results show performance initially increasing, peaking at β = 30%, and subsequently declining. Similar trends can be observed in three models of different families. Therefore, we set β to 30% by default to obtain the low subspace distance area in our ShifCon framework and give the following speculation:

Finding 1. The N × 30% layers with the lowest subspace distance are likely focused on information aggregation, making them suitable for non-dominant languages to forward in dominant-like representations.

Here N denotes the number of layers in the model. This speculation also aligns with the findings of Zhang et al. (2024a).

Figure 5: Pooled sentence representations obtained with 300 FLORES samples per language from the 15th layer of Llama-2 7B after utilizing the shift projection and MCL modules (panels: Shift Projection, MCL). Visualization is based on LDA components 1 and 3.

| Setting | Llama-2 7B | XGLM 7.5B | BLOOM 7.1B |
|---|---|---|---|
| ShifCon | 46.0 | 43.8 | 44.1 |
| w/o Shift Projection | 42.2 | 39.9 | 40.7 |
| w/o MCL | 44.5 | 43.1 | 42.8 |

Table 3: The impact of Shift Projection and MCL in ShifCon on the average results of all benchmarks. "w/o" means excluding this module from ShifCon.

Performance of ShifCon across Different Scales Having verified the effectiveness of our ShifCon across different model families, we further assess its generalization to different model scales across three classification datasets. In the BLOOM family models, experiments are conducted at scales of 560M and 1.7B. For the XGLM family models, we utilize the 564M and 2.9B scales, and for the Llama family, we employ the Llama-3 8B
(Grattafiori et al., 2024). The average results for high- and low-resource languages are presented in Table 2. The results reveal that our ShifCon framework continues to exhibit superior performance compared to MSFT. Specifically, in the XGLM family models, ShifCon demonstrates average improvements of 4.9% and 4.5% for the 564M and 2.9B scales, respectively. For the BLOOM family models, ShifCon shows average improvements of 4.1% and 4.3% for the 560M and 1.7B scales, respectively. For Llama-3 8B, ShifCon achieves an average improvement of 2.2%, a relatively modest gain compared to other models. This can be attributed to the inherently stronger multilingual capabilities of Llama-3 8B. Nonetheless, the application of ShifCon still brings benefits, particularly for low-resource languages. We believe this improvement is due to the notable performance gaps that remain for these languages, which our framework helps to mitigate. Based on these observations, we derive the conclusion below: ShifCon can generalize to models across different families and scales, which could be attributed to the selection of appropriate layers determined by the subspace distance metric.

| Setting | Llama-2 7B | XGLM 7.5B | BLOOM 7.1B |
|---|---|---|---|
| ShifCon | 96.9 | 97.6 | 94.9 |
| w/o Shift Projection | 87.6 | 91.6 | 88.8 |
| w/o MCL | 96.6 | 97.3 | 95.5 |

Table 4: The average results of language consistency on the MGSM task. "w/o" means excluding this module from ShifCon.

Figure 6: The subspace distances of Llama-2 7B after implementing shift projection and MCL.

Impact of Shift Projection and MCL Moreover, we investigate the impact of Shift Projection and MCL within ShifCon. Table 3 shows a performance decrease on "ShifCon w/o Shift
Figure 7: The low subspace distance areas of different models are delineated with dashed boxes. (a) shows the results for different model families; (b) shows the results for different scales of XGLM.

Impact of Shift Projection and MCL. Moreover, we investigate the impact of Shift Projection and MCL within ShifCon. Table 3 shows a performance decrease on "ShifCon w/o Shift Projection", indicating that directly implementing MCL using the original representations of non-dominant languages, instead of their dominant-like counterparts, leads to this decline. We posit that applying MCL directly to the original representations may compromise language-specific information within the representations, as it aims to bring representations of different languages with the same meaning closer together, making them language-agnostic. To explore this further, we follow Zhang et al. (2024b) and employ a language detector tool (footnote 9: https://pypi.org/project/langdetect) to assess the language consistency of input and output between ShifCon and "ShifCon w/o Shift Projection". As shown in Table 4, a decrease in language consistency occurs when MCL is directly applied to the original representations. Based on this observation, we give the following conclusion:

Finding 2. Directly applying MCL to original representations may compromise the language-specific information within representations, which impedes the model's ability to generate in that language, thereby adversely affecting performance.
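A minimal sketch of how a language-consistency score of this kind could be computed, assuming it is the percentage of outputs whose detected language matches the language of the input. The names `language_consistency` and `toy_detect` and the sample strings are hypothetical. The toy detector looks at Unicode script only and stands in for a real detector such as the `detect` function of the langdetect package, whose language codes may need normalizing before comparison.

```python
from typing import Callable, Iterable, Tuple


def language_consistency(
    pairs: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
) -> float:
    """Percentage of outputs whose detected language equals the input language.

    Each pair is (input_language_code, generated_output_text).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (language, output) pair")
    hits = sum(1 for lang, text in pairs if detect(text) == lang)
    return 100.0 * hits / len(pairs)


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector: classifies by Unicode script only."""
    for ch in text:
        if "\u0e00" <= ch <= "\u0e7f":   # Thai block
            return "th"
        if "\u4e00" <= ch <= "\u9fff":   # CJK Unified Ideographs
            return "zh"
    return "en"


# Illustrative samples (not from the paper): the third output drifts into English.
samples = [
    ("th", "คำตอบคือ 18"),
    ("zh", "答案是 18"),
    ("zh", "The answer is 18"),
    ("en", "The answer is 18"),
]

score = language_consistency(samples, toy_detect)
print(f"language consistency: {score:.1f}%")
```

Passing the detector in as a callable keeps the scoring rule separate from the choice of detection tool, so a stronger detector can be swapped in without touching the metric.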
Moreover, compared with "ShifCon w/o MCL", ShifCon achieves higher performance. To delve deeper, we visualize the distribution of sentence representations and the subspace distance for ShifCon and "ShifCon w/o MCL" in Fig. 5 and Fig. 6, respectively. The visualization reveals that:

Finding 3. MCL can further align the dominant-like representations of non-dominant languages with their dominant language counterparts, thereby improving overall performance.

Low Subspace Distance Area. In Fig. 7, we show the low subspace distance areas of different models, utilizing the β value discovered in Finding 1. As depicted in Fig. 7 (a), we observe that the low subspace distance areas of Llama-2-7B, XGLM-7.5B, and BLOOM-7.1B are [11, 20], [13, 22], and [14, 22], respectively. This indicates that:

Finding 4. The low subspace distance areas of models from different families vary but are generally located in the middle and late-middle layers.

Figure 8: The subspace distance of XGLM-564M and its average performance across three classification tasks using various layer areas. Each point denotes a model trained with the specific layer index as the midpoint of the layer area; for example, the 5th layer index indicates a model trained with the [2, 8] layer area. (Axes: Distance, Accuracy; legend: Low Subspace Distance Area, Sliding.)

Moreover, the subspace distances of XGLM-7.5B and BLOOM-7.1B are higher than those of Llama-2-7B, possibly because they are pre-trained on large-scale multilingual data, allowing them to learn more isolated representations for each language. Another observation is that:

Finding 5. Models from the same family, despite having different numbers of layers, exhibit similar locations in the model for their low subspace distance areas.

Specifically, in Fig. 7 (b), the low subspace distance areas of XGLM-7.5B and XGLM-564M are [13, 22] and [9, 16], respectively, both situated in the middle of the model.
Additionally, the subspace distance of XGLM-7.5B is higher than that of XGLM-564M, possibly because larger models have stronger language discrimination abilities.

Effectiveness of Subspace Distance Area and Metric. We conduct extra experiments to verify whether the layers within the low subspace distance area are suitable for our ShifCon framework. Specifically, for XGLM-564M with 24 layers, we select $\lceil 24 \times 30\% \rceil = 8$ layers to apply ShifCon. We explore the performance of shift projection in regions beyond its low subspace distance area [9, 16] using an 8-layer sliding window. As shown in Fig. 8, as we slide the experimental layer area window from left to right, conducting ShifCon in layer areas that overlap substantially with the low subspace distance area results in improved performance. Moreover, as depicted in Fig. 7, we find that the subspace distances of layers within the low subspace distance area are close to one another. This suggests that the language-specific information within the representations remains relatively unchanged, resulting in a stable distance between the subspaces of languages. We speculate that the model in these layers may focus on processing semantic information. Based on these two observations, we give the following speculation:

Finding 6. Layers in the low subspace distance area are likely focused on information aggregation, thus aiding in gathering more information for non-dominant languages and enhancing performance.

This observation also highlights the effectiveness of our proposed distance metric (§ 2.1.2) in identifying the optimal layer area for ShifCon.
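The sliding-window setup can be sketched as follows. This is an illustration only: it assumes layers are indexed from 1 and that the window moves one layer at a time, and the function names are hypothetical. It reproduces the window width of 8 for a 24-layer model and counts how many layers each candidate window shares with the low subspace distance area [9, 16].

```python
from typing import List, Tuple

LayerArea = Tuple[int, int]  # inclusive (first_layer, last_layer), 1-indexed


def window_width(num_layers: int, percent: int = 30) -> int:
    """Ceiling of num_layers * percent / 100, computed with integer arithmetic."""
    return -(-num_layers * percent // 100)


def sliding_windows(num_layers: int, width: int) -> List[LayerArea]:
    """All contiguous layer areas of the given width, moving one layer per step."""
    return [(start, start + width - 1) for start in range(1, num_layers - width + 2)]


def overlap(a: LayerArea, b: LayerArea) -> int:
    """Number of layers shared by two inclusive layer areas."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


NUM_LAYERS = 24               # XGLM-564M, as stated in the text
LOW_DISTANCE_AREA = (9, 16)   # low subspace distance area reported for XGLM-564M

width = window_width(NUM_LAYERS)
windows = sliding_windows(NUM_LAYERS, width)

for area in windows:
    shared = overlap(area, LOW_DISTANCE_AREA)
    print(f"layers {area[0]:>2}-{area[1]:>2}: {shared} shared with the low-distance area")
```

Under these assumptions, the overlap count rises as the window approaches [9, 16] and falls again afterwards, which is the quantity the text relates to the performance trend in Fig. 8.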
4 Related Work

Multilingual Bias in LLMs. Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities as a result of their training on extensive and diverse multilingual datasets. These models have shown proficiency in various aspects of language processing across multiple languages, including multilingual reasoning, understanding, and generation (Xue et al., 2021; Lin et al., 2022; Anil et al., 2023). However, empirical analysis indicates limited proficiency in low-resource languages, stemming from training data imbalances (Huang et al., 2023; Zhu et al., 2024b; Gurgurov et al., 2024) and distinct representation spaces (Wen-Yi and Mimno, 2023; Liu et al., 2024; Yao et al., 2024). Several studies have focused on scaling multilingual corpora through translation, which can provide preliminary capabilities for non-dominant languages. However, this approach is limited in both scale and quality due to the high cost of translated annotations and the presence of translation errors (Muennighoff et al., 2023; Zhang et al., 2023b; Chen et al., 2023b; Tan et al., 2024). In this study, we propose an internal alignment framework to further enhance the performance of non-dominant languages with limited MSFT data.

Representation Alignment. Previous studies have shown that projecting representations from the source to the target domain can mitigate domain discrepancies, facilitating effective cross-domain alignment and enhancing performance without disturbing the original domain subspace (Kozhevnikov and Titov, 2014; Chang et al., 2022; Xu et al., 2023; Zhu et al., 2024a). However, this method often results in coarse alignment due to its unsupervised nature.
On the other hand, contrastive learning offers a more fine-grained representation learning approach by utilizing positive and negative pairs to encourage proximity within positive pairs and distance between negative pairs in a supervised manner. This method is better at capturing the complex relationships between representations and achieving precise alignment (Radford et al., 2021; Zhang et al., 2022; Li et al., 2023a; Zhang et al., 2023a, 2025b; Li et al., 2024a). Drawing from these insights, our framework first employs mean-shifted projection to map non-dominant language representations into the dominant language subspace, preserving language-specific information, and then applies contrastive learning for further alignment.

5 Conclusion

This work aims to improve the performance of non-dominant languages with limited MSFT data. To achieve this, we propose the ShifCon framework, which aligns the internal forward process of non-dominant languages with that of the dominant language. It maps the representations of non-dominant languages into the dominant language's subspace to acquire their dominant-like representations, allowing them to access more information encoded in the model parameters. The dominant-like representations are then shifted back to their native subspace to yield answers in their languages. Furthermore, we propose a subspace distance metric to determine the optimal layer area for shift projection, and we apply multilingual contrastive learning to further enhance the internal alignment.
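The two ingredients summarized here, shifting into the dominant subspace and back, and contrastive alignment, can be illustrated with a small sketch. It rests on assumptions that are not confirmed by this section: the shift is taken to be a subtraction of the source-language mean vector followed by an addition of the target-language mean, and the contrastive objective is taken to be an InfoNCE-style loss over cosine similarities with a temperature. All names and the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Per-dimension mean of a set of representations."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def shift(h: Vector, source_mean: Vector, target_mean: Vector) -> Vector:
    """Move h from the source-language subspace towards the target one (assumed form)."""
    return [x - s + t for x, s, t in zip(h, source_mean, target_mean)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor: Vector, positive: Vector, negatives: Sequence[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE-style loss: small when the anchor is closest to its positive."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy representations: two non-dominant-language sentences and their translations.
non_dominant = [[1.0, 5.0], [3.0, 7.0]]
dominant = [[10.0, 0.0], [12.0, 2.0]]
nd_mean, dom_mean = mean_vector(non_dominant), mean_vector(dominant)

h = non_dominant[0]
dominant_like = shift(h, nd_mean, dom_mean)       # shift into the dominant subspace
restored = shift(dominant_like, dom_mean, nd_mean)  # shift back before generation

# The translation of sentence 0 is the positive; the other sentence is a negative.
loss_raw = info_nce(h, dominant[0], [dominant[1]])
loss_shifted = info_nce(dominant_like, dominant[0], [dominant[1]])
print(dominant_like, restored, round(loss_raw, 3), round(loss_shifted, 3))
```

In this toy setup the shifted representation sits closer to its translation than the raw one does, so the contrastive loss starts lower, and shifting back recovers the original vector exactly.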
The experimental results demonstrate that our proposed ShifCon effectively improves the performance of non-dominant languages across models of various families and scales. Our comprehensive analysis offers valuable insights for future research.

6 Limitations

The ShifCon framework leverages translation pairs to conduct multilingual contrastive learning, which may pose challenges for low-resource languages or those lacking substantial parallel corpora. Furthermore, due to computational resource limitations, our evaluation of the framework is restricted to multilingual generative language models with no more than 8B parameters.

Additionally, our future research will explore alternative model architectures, such as encoder-decoder models, to showcase the full potential and versatility of our proposed framework.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Ashish Agrawal, Barah Fazili, and Preethi Jyothi. 2024. Translation errors significantly impact low-resource languages in cross-lingual learning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 319–329, St. Julian's, Malta. Association for Computational Linguistics.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
Silvère Bonnabel and Rodolphe Sepulchre. 2009. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM Journal on Matrix Analysis and Applications, 31:1055–1070.

Tyler Chang, Zhuowen Tu, and Benjamin Bergen. 2022. The geometry of multilingual language model representations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 119–136, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Nuo Chen, Zinan Zheng, Ning Wu, Linjun Shou, Ming Gong, Yangqiu Song, Dongmei Zhang, and Jia Li. 2023a. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246.

Nuo Chen, Zinan Zheng, Ning Wu, Linjun Shou, Ming Gong, Yangqiu Song, Dongmei Zhang, and Jia Li. 2023b. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. Free dolly: Introducing the world's first truly open instruction-tuned llm. databricks.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407.
Daniil Gurgurov, Tanja Bäumel, and Tatiana Anikina. 2024. Multilingual large language models and curse of multilinguality. arXiv preprint arXiv:2406.10602.

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not all languages are created equal in llms: Improving multilingual capability by cross-lingual-thought prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023.

Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3250–3258, Online. Association for Computational Linguistics.

Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, and Michael Granitzer. 2024. Llms in the loop: Leveraging large language model annotations for active learning in low-resource languages. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 397–412. Springer.

Takeshi Kojima, Itsuki Okimura, Yusuke Iwasawa, Hitomi Yanaka, and Yutaka Matsuo. 2024. On the multilingual ability of decoder-based pre-trained language models: Finding and controlling language-specific neurons. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6919–6971, Mexico City, Mexico. Association for Computational Linguistics.
Cunliang Kong, Yun Chen, Hengyuan Zhang, Liner Yang, and Erhong Yang. 2022a. Multitasking framework for unsupervised simple definition generation. arXiv preprint arXiv:2203.12926.

Cunliang Kong, Yujie Wang, Ruining Chong, Liner Yang, Hengyuan Zhang, Erhong Yang, and Yaping Huang. 2022b. Blcu-icall at semeval-2022 task 1: Cross-attention multitasking framework for definition modeling. arXiv preprint arXiv:2204.07701.

Mikhail Kozhevnikov and Ivan Titov. 2014. Cross-lingual model transfer using feature representation projection. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 579–585.

Chong Li, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2024a. Improving in-context learning of multilingual generative language models with cross-lingual alignment. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8058–8076, Mexico City, Mexico. Association for Computational Linguistics.

Dawei Li, Zhen Tan, Tianlong Chen, and Huan Liu. 2024b. Contextualization distillation from large language model for knowledge graph completion. arXiv preprint arXiv:2402.01729.

Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sunkwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024c. Dalk: Dynamic co-augmentation of llms and kg to answer alzheimer's disease questions with scientific literature. arXiv preprint arXiv:2405.04819.

Dawei Li, Hengyuan Zhang, Yanran Li, and Shiping Yang. 2023a. Multi-level contrastive learning for script-based character understanding. arXiv preprint arXiv:2310.13231.
Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023b. Bactrian-x: Multilingual replicable instruction-following models with low-rank adaptation. arXiv preprint arXiv:2305.15011.

Tianjian Li and Kenton Murray. 2023. Why does zero-shot cross-lingual generation fail? an explanation and a solution. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12461–12476, Toronto, Canada. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2020. On the language neutrality of pre-trained multilingual representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1663–1674, Online. Association for Computational Linguistics.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yihong Liu, Chunlan Ma, Haotian Ye, and Hinrich Schuetze. 2024. TransliCo: A contrastive learning framework to address the script barrier in multilingual pretrained language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2476–2499, Bangkok, Thailand. Association for Computational
Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022. Language models are multilingual chain-of-thought reasoners. In International Conference on Learning Representations (ICLR).

Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. Large language models for data annotation: A survey. arXiv preprint arXiv:2402.13446.

Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Xin Zhao, Furu Wei, and Ji-Rong Wen. 2024. Language-specific neurons: The key to multilingual capabilities in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5701–5715, Bangkok, Thailand. Association for Computational Linguistics.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
NLLB Team. 2022. No language left behind: Scaling human-centered machine translation.

Yongqi Tong, Dawei Li, Sizhe Wang, Yujia Wang, Fei Teng, and Jingbo Shang. 2024. Can llms learn from previous mistakes? investigating llms' errors to boost for reasoning. arXiv preprint arXiv:2403.20046.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Sizhe Wang, Yongqi Tong, Hengyuan Zhang, Dawei Li, Xin Zhang, and Tianlong Chen. 2024. Bpo: Towards balanced preference optimization between knowledge breadth and depth in alignment. arXiv preprint arXiv:2411.10914.
Andrea W Wen-Yi and David Mimno. 2023. Hyperpolyglot LLMs: Cross-lingual interpretability in token embeddings. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1124–1131, Singapore. Association for Computational Linguistics.

Shaoyang Xu, Junzhuo Li, and Deyi Xiong. 2023. Language representation projection: Can we transfer factual knowledge across languages in multilingual language models? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3692–3702, Singapore. Association for Computational Linguistics.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, and Jingbo Shang. 2024. Data contamination can cross language barriers. arXiv preprint arXiv:2406.13236.

Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, and Kai-Wei Chang. 2022. GeoMLAMA: Geo-diverse commonsense probing on multilingual pre-trained language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2039–2055, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, and Minjoon Seo. 2024. LangBridge: Multilingual reasoning without multilingual supervision. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7502–7522, Bangkok, Thailand. Association for Computational Linguistics.

Hengyuan Zhang, Xinrong Chen, Yingmin Qiu, Xiao Liang, Ziyue Li, Guanyu Wang, Weiping Li, Tong Mo, Wenyue Li, Hayden Kwok-Hay So, et al. 2025a. Guilomo: Allocating expert number and rank for lora-moe via bilevel optimization with guidedselection vectors. arXiv preprint arXiv:2506.14646.

Hengyuan Zhang, Dawei Li, Yanran Li, Chenming Shang, Chufan Shi, and Yong Jiang. 2023a. Assisting language learners: Automated trans-lingual definition generation via contrastive prompt learning. arXiv preprint arXiv:2306.06058.

Hengyuan Zhang, Dawei Li, Shiping Yang, and Yanran Li. 2022. Fine-grained contrastive learning for definition generation.

Hengyuan Zhang, Zitao Liu, Chenming Shang, Dawei Li, and Yong Jiang. 2025b. A question-centric multi-experts contrastive learning framework for improving the accuracy and interpretability of deep sequential knowledge tracing models. ACM Transactions on Knowledge Discovery from Data, 19(2):1–25.

Hengyuan Zhang, Yanru Wu, Dawei Li, Sak Yang, Rui Zhao, Yong Jiang, and Fei Tan. 2024a. Balancing speciality and versatility: a coarse to fine framework for supervised fine-tuning large language model. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7467–7509.
Liang Zhang, Qin Jin, Haoyang Huang, Dongdong Zhang, and Furu Wei. 2024b. Respond in my language: Mitigating language inconsistency in response generation based on large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4177–4192, Bangkok, Thailand. Association for Computational Linguistics.

Miaoran Zhang, Vagrant Gautam, Mingyang Wang, Jesujoba Alabi, Xiaoyu Shen, Dietrich Klakow, and Marius Mosbach. 2024c. The impact of demonstrations on multilingual in-context learning: A multidimensional analysis. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7342–7371, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.

Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, et al. 2023b. Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.

Xin Zhao, Naoki Yoshinaga, and Daisuke Oba. 2024. Tracing the roots of facts in multilingual language models: Independent, shared, and transferred knowledge. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2088–2102, St. Julian's, Malta. Association for Computational Linguistics.

Mu Zhu, Qingzhou Wu, Zhongli Bai, Yu Song, and Qiang Gao. 2024a. Eeg-eye movement based subject dependence, cross-subject, and cross-session emotion recognition with multidimensional homogeneous encoding space alignment. Expert Systems with Applications, 251:124001.
Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2024b. Multilingual machine translation with large language models: Empirical results and analysis. In Findings of the Association for Computational Linguistics: NAACL 2024.
A Appendix
A.1 Visualization of Sentence Representations across Layers
[Figure 9 panels: Layers 1, 3, 5, ..., 31; legend: EN, ES, FR, TH, SW, ZH, JA]
Figure 9: We follow Chang et al. (2022) to conduct LDA and present the visualization of sentence representations obtained by mean-pooling from Llama-2-7B across layers along LDA components 1 and 3. We utilize 300 samples for each language from the FLORES dataset.
[Figure 10 panels: Layers 1, 3, 5, ..., 29; legend: EN, ES, FR, TH, SW, ZH, JA]
Figure 10: We follow Chang et al. (2022) to conduct LDA and present the visualization of sentence representations obtained by mean-pooling from BLOOM-7.1B across layers along LDA components 1 and 3. We utilize 300 samples for each language from the FLORES dataset.
[Figure 11 panels: Layers 1, 3, 5, ..., 31; legend: EN, ES, FR, TH, SW, ZH, JA]
Figure 11: We follow Chang et al. (2022) to conduct LDA and present the visualization of sentence representations obtained by mean-pooling from XGLM-7.5B across layers along LDA components 1 and 3. We utilize 300 samples for each language from the FLORES dataset.
A.2 Details of Language Subspace Distance
For each language $A$, we obtain a data matrix $X_A \in \mathbb{R}^{n \times d}$ of $n$ contextualized token representations of dimensionality $d$, using 1k FLORES samples per language from the desired layer. The language subspace $S_A$ (see footnote 10) is described by the language's mean representation $\mu_A \in \mathbb{R}^{d}$ along with $k_A$ principal directions of maximal variance in the language, defined by an orthonormal basis $V_A \in \mathbb{R}^{d \times k_A}$. In particular, $\mu_A$ is the mean of $X_A$ along the token dimension $n$. As for $V_A$, we first perform a singular value decomposition (SVD) of $X_A$: $X_A = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{d \times d}$ are orthogonal. $\Sigma \in \mathbb{R}^{n \times d}$ consists of a diagonal matrix $\Sigma' \in \mathbb{R}^{d \times d}$ and a zero matrix, where $\Sigma' = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_d \ge 0$. The singular values in $\Sigma'$ measure how strongly $X_A$ varies along the corresponding directions in $V$, which can be used for feature selection. We select the first $k_A$ values to get $\Sigma_A = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_{k_A}) \in \mathbb{R}^{k_A \times k_A}$, choosing $k_A$ so that the subspace accounts for 90% of the total variance in the language (see footnote 11). Based on $\Sigma_A$, we obtain the corresponding $V_A$ (the first $k_A$ columns of $V$) and use $U \Sigma_A V_A^\top$ (restricting $U$ to its first $k_A$ columns) to estimate $X_A$. Since the covariance is $K_A = \frac{1}{n-1} X_A^\top X_A$ for mean-centered $X_A$ (Chang et al., 2022), $K_A \in \mathbb{R}^{d \times d}$ can be calculated as $\frac{1}{n-1} V_A \Sigma_A^2 V_A^\top$.
A.3 Impact of Different Pooling Methods
We also investigate the impact of three different pooling methods, namely mean-pooling, max-pooling, and last token representation, to derive sentence embeddings for our ShifCon framework.
               Llama-2-7B  XGLM-7.5B  BLOOM-7.1B
Mean-pooling   46.0        43.8       44.1
Max-pooling    45.2        43.3       43.6
Last token     45.8        44.1       43.7
Table 5: The average performance of our ShifCon framework across all benchmarks for the three different
pooling methods.
As shown in Table 5, mean pooling and the last-token representation perform best. The gap between the
best and worst pooling method is at most 0.8 points for every model, so our approach is not very
sensitive to the choice of pooling method.
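The three pooling methods compared in Table 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pool_sentence`, the NumPy representation of hidden states, and the right-padding convention for the mask are all assumptions.

```python
import numpy as np

def pool_sentence(hidden, mask, method="mean"):
    """Pool token representations into one sentence embedding.

    hidden: (T, d) array of token representations from one layer.
    mask:   (T,) array with 1 for real tokens and 0 for padding
            (assumed convention; padding is assumed to be on the right).
    """
    tokens = hidden[np.asarray(mask, dtype=bool)]  # drop padding positions
    if method == "mean":
        return tokens.mean(axis=0)   # average over real tokens
    if method == "max":
        return tokens.max(axis=0)    # element-wise maximum over real tokens
    if method == "last":
        return tokens[-1]            # representation of the last real token
    raise ValueError(f"unknown pooling method: {method}")

if __name__ == "__main__":
    h = np.array([[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]])
    m = np.array([1, 1, 0])  # third position is padding
    for name in ("mean", "max", "last"):
        print(name, pool_sentence(h, m, name))
```

Masking before pooling matters mainly for mean- and max-pooling, where padded positions would otherwise leak into the sentence embedding.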
A.4 Details of Evaluation
Because training on all languages included in Bactrian-X would require extensive training time, we
sample a subset of representative languages that covers both high- and low-resource languages.
During evaluation, we assess the selected languages on the corresponding benchmarks. The languages
and evaluation metrics for each dataset are listed in Table 6, and the evaluation prompt templates
are presented in Table 7.
Dataset         |Lang.|  Languages                                                                             Metric    Data Type
Bactrian-X      8        English, Chinese, Indonesian, Spanish, Swahili, Thai, Turkish, Hindi                  -         Train
MGSM8KInstruct  10       English, Chinese, Spanish, French, German, Russian, Japanese, Swahili, Thai, Bengali  -         Train
MGSM            10       English, Chinese, Spanish, French, German, Russian, Japanese, Swahili, Thai, Bengali  Accuracy  Test
XNLI            7        English, Spanish, Chinese, Turkish, Thai, Hindi, Swahili                              Accuracy  Test
XCOPA           5        Chinese, Indonesian, Turkish, Thai, Swahili                                           Accuracy  Test
XStoryCloze     6        English, Spanish, Chinese, Indonesian, Hindi, Swahili                                 Accuracy  Test
FLORES          6        Spanish, Chinese, Indonesian, Turkish, Thai, Swahili                                  ChrF++    Test
Table 6: Multilingual datasets used in our experiments. We use the ChrF++ metric (Popović, 2017) to evaluate
translation performance.
10 We follow Chang et al. (2022) to define the language subspace.
11 Results were qualitatively similar for subspaces accounting for variance proportions in [75%, 90%, 95%, 99%].
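The subspace construction of A.2, including the variance threshold that footnote 11 varies, can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper name `language_subspace`, the `var_ratio` parameter, and the explicit mean-centering before the SVD are assumptions made so that the singular values correspond to variance.

```python
import numpy as np

def language_subspace(X, var_ratio=0.90):
    """Estimate (mu, V_k, K) for one language from an (n, d) matrix X.

    Assumption: X is mean-centered before the SVD; the name and
    signature of this helper are illustrative only.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)                       # language mean representation
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose leading directions reach the requested variance share.
    k = min(int(np.searchsorted(explained, var_ratio)) + 1, len(s))
    V_k = Vt[:k].T                            # (d, k) orthonormal basis
    K = (V_k * s[:k] ** 2) @ V_k.T / (n - 1)  # V_k diag(s^2) V_k^T / (n - 1)
    return mu, V_k, K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 16))           # stand-in for token representations
    for ratio in (0.75, 0.90, 0.95, 0.99):
        _, V_k, _ = language_subspace(X, ratio)
        print(ratio, V_k.shape[1])
```

Raising `var_ratio` can only keep or increase the number of retained directions, which is one way to probe the sensitivity mentioned in footnote 11.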
Task: XNLI
Pattern:
  {premise} Based on the previous passage, is it true that
  {hypothesis}? Yes, No, or Maybe? {label}
Verbalizer: Yes || Maybe || No

Task: XCOPA
Pattern:
  {premise} {% if question == "cause" %}This happened because...
  {% else %} As a consequence...{% endif %}
  Help me pick the more plausible option:
  - {choice1}
  - {choice2}
  {label}
Verbalizer: {choice1} || {choice2}

Task: XStoryCloze
Pattern:
  {input_sentence_1} {input_sentence_2}
  {input_sentence_3} {input_sentence_4}
  What is a possible continuation for the story given the following
  options?
  - {sentence_quiz_1}
  - {sentence_quiz_2}
  {label}
Verbalizer: {sentence_quiz_1} || {sentence_quiz_2}

Task: FLORES
Pattern:
  Translate the following {src_language} text to {tgt_language}:
  {src_sentence} {tgt_sentence}
Verbalizer: {tgt_sentence}

Table 7: The prompt templates used for evaluation following Muennighoff et al. (2023) and Zhang et al. (2024c).
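As an illustration of how the Table 7 patterns could be instantiated, the sketch below fills the XNLI and XCOPA templates in plain Python. The function names and the exact whitespace and line breaks are assumptions (the original layout is not recoverable from the table), and the `{label}` slot is left out because it holds the answer being scored.

```python
# Verbalizer for XNLI as listed in Table 7.
XNLI_VERBALIZER = ("Yes", "Maybe", "No")

def xnli_prompt(premise: str, hypothesis: str) -> str:
    """Fill the XNLI pattern; the answer ({label}) is appended by the caller."""
    return (f"{premise} Based on the previous passage, is it true that "
            f"{hypothesis}? Yes, No, or Maybe?")

def xcopa_prompt(premise: str, question: str, choice1: str, choice2: str) -> str:
    """Fill the XCOPA pattern, resolving the cause/effect conditional."""
    connector = ("This happened because..." if question == "cause"
                 else "As a consequence...")
    return (f"{premise} {connector}\n"
            "Help me pick the more plausible option:\n"
            f"- {choice1}\n"
            f"- {choice2}")

if __name__ == "__main__":
    print(xnli_prompt("A man is cooking.", "Someone is in a kitchen"))
    print(xcopa_prompt("The ground is wet.", "cause", "It rained.", "It was sunny."))
```

For classification tasks, each verbalizer option would then be scored as a candidate continuation of the filled prompt.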
A.5 Implementation Details
              Dimension  Heads  Layers
Llama-2-7B    4096       32     32
Llama-3-8B    4096       32     32
BLOOM-7.1B    4096       32     30
BLOOM-1.7B    2048       16     24
BLOOM-560M    1024       16     24
XGLM-7.5B     4096       32     32
XGLM-2.9B     2048       16     48
XGLM-564M     1024       16     24
Table 8: The detailed information of the models utilized in our experiment. "Dimension", "Heads", and "Layers"
denote the dimension of representation, attention heads, and number of layers, respectively.
Model Information. Table 8 provides details of the models used in our experiments. Here, "Dimension",
"Heads", and "Layers" denote the representation dimension, the number of attention heads, and the
number of layers, respectively.
Training Settings. Our experiments are conducted on 4×A100 GPUs. Each experiment is run with three different random seeds, and the results are averaged to obtain the final outcome. The temperature $\tau$ is set to 0.05 in the multilingual contrastive learning procedure. Following previous multitasking works (Kong et al., 2022b,a; Zhang et al., 2023a, 2025a), we search the $\alpha$ value in Eq. 6 over [0.5, 1.0, 1.5, 2.0] and keep the best-performing setting. Following the training settings of previous works (Li et al., 2024b,c; Tong et al., 2024; Wang et al., 2024), we set the learning rate to 1e-5 for models with more than 7 billion parameters and to 3e-5 for the others. We set the maximum sequence length to 512 and the global batch size to 128. For generation tasks, we use greedy decoding so that our results can be replicated accurately. We use a cosine scheduler with a 3% warm-up period. Mixed precision training and ZeRO are employed within the DeepSpeed training framework to accelerate training and reduce memory usage. The AdamW optimizer (Loshchilov and Hutter, 2019) is used to update the model parameters. For the AFP baseline, we follow the training configuration of Li et al. (2024a). Specifically, we define $p_{src}$ for cross-lingual guidance during training and perform multilingual contrastive learning on the first layer.
Additionally, we explore our ShifCon framework with a two-stage training strategy, which first trains solely with the MSFT loss to establish a preliminary model and then further fine-tunes it with our ShifCon framework.
As depicted in Table 9, the two-stage training strategy leads to better performance, improving the average score over one-stage training by 1.2 to 2.1 points depending on the model. We posit that the preliminary model obtained by MSFT in the first stage offers better representations for each language, facilitating shift projection and multilingual contrastive learning. Consequently, all results in our paper are reported with the two-stage training strategy.
                      Llama-2-7B  XGLM-7.5B  BLOOM-7.1B
MSFT                  43.8        41.6       42.2
ShifCon w/ Two-Stage  46.0        43.8       44.1
ShifCon w/ One-Stage  44.8        41.7       42.5
Table 9: The average performance of our ShifCon framework across all benchmarks for the three model families, comparing the two-stage and one-stage training strategies.
A.6 New Strategy for Obtaining Better Language Vectors
Given that model parameters are updated at each training step, the language vectors need to be updated correspondingly. Inspired by the batch normalization paradigm, we introduce a novel strategy aimed at improving the quality of language vectors. Because calculating the mean representation of all samples in language $a$ after every parameter update is computationally expensive, we use the mean representation of the language $a$ samples in the $t$-th batch to estimate it. Specifically, for the representations of language $a$ in the $t$-th batch at the $l$-th layer, let $v_t$ denote the mean representation of language $a$ samples from the first batch to the $t$-th batch, and let $u_t$ denote the mean representation of the language $a$ samples in the $t$-th batch (note that $v_t$ is computed by the model at step $t$). The estimate of $v_t$, denoted $\hat{v}_t$, can be obtained from the batch representations, each computed by the model at its corresponding step:
$$\hat{v}_t = \frac{\sum_{i=1}^{t} \eta^{i-1} u_i}{\sum_{i=1}^{t} \eta^{i-1}} \quad (7)$$
where $\eta \ge 1$ denotes the enhancement factor and $\eta^{i-1}$ denotes the $(i-1)$-th power of $\eta$.
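The weighted mean of Eq. 7 can be sketched in two equivalent ways: directly over all batch means seen so far, and incrementally with one update per batch. This is an illustrative sketch only; the names `weighted_mean_direct` and `RunningLanguageVector` are hypothetical, and NumPy arrays stand in for the layer representations.

```python
import numpy as np

def weighted_mean_direct(us, eta):
    """Eq. 7: weighted mean of batch means u_1..u_t with weights eta**(i-1)."""
    us = np.asarray(us, dtype=float)
    w = eta ** np.arange(len(us))            # weights eta^0, eta^1, ..., eta^(t-1)
    return (w[:, None] * us).sum(axis=0) / w.sum()

class RunningLanguageVector:
    """Maintains the same estimate incrementally, one batch mean at a time."""

    def __init__(self, eta):
        self.eta = eta
        self.weight_sum = 0.0    # sum of the weights used so far
        self.next_weight = 1.0   # weight assigned to the next batch mean
        self.value = None        # current estimate

    def update(self, u):
        u = np.asarray(u, dtype=float)
        new_sum = self.weight_sum + self.next_weight
        if self.value is None:
            self.value = u.copy()
        else:
            # Convex combination of the new batch mean and the old estimate.
            self.value = (self.next_weight * u
                          + self.weight_sum * self.value) / new_sum
        self.weight_sum = new_sum
        self.next_weight *= self.eta
        return self.value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch_means = rng.normal(size=(5, 3))
    running = RunningLanguageVector(eta=1.5)
    for u in batch_means:
        estimate = running.update(u)
    print(np.allclose(estimate, weighted_mean_direct(batch_means, 1.5)))
```

The incremental form needs only the previous estimate and the newest batch mean, so no earlier batch representations have to be stored.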
As $t$ increases, the model becomes more accurate, leading to a more precise representation $u_t$; consequently, later batches receive larger weights. We can then estimate the mean representation for the next batch as follows:
$$\hat{v}_{t+1} = \frac{\sum_{i=1}^{t+1} \eta^{i-1} u_i}{\sum_{i=1}^{t+1} \eta^{i-1}} = \frac{1}{\sum_{i=1}^{t+1} \eta^{i-1}} \eta^{t} u_{t+1} + \frac{\sum_{i=1}^{t} \eta^{i-1}}{\sum_{i=1}^{t+1} \eta^{i-1}} \cdot \frac{1}{\sum_{i=1}^{t} \eta^{i-1}} \sum_{i=1}^{t} \eta^{i-1} u_i = \frac{\eta^{t}}{\sum_{i=0}^{t} \eta^{i}} u_{t+1} + \frac{\sum_{i=0}^{t-1} \eta^{i}}{\sum_{i=0}^{t} \eta^{i}} \hat{v}_t \quad (8)$$
Here, we only need the estimated mean representation $\hat{v}_t$ and the true mean representation $u_{t+1}$ of the samples in batch $t+1$ to compute the estimate $\hat{v}_{t+1}$. For simplicity, we directly set $\frac{\eta^{t}}{\sum_{i=0}^{t} \eta^{i}} = \frac{1}{4}$ and $\frac{\sum_{i=0}^{t-1} \eta^{i}}{\sum_{i=0}^{t} \eta^{i}} = \frac{3}{4}$ in this work.
We conduct an extra ablation experiment on XGLM-564M to verify the effectiveness of our proposed strategy. As shown in Table 10, our strategy outperforms the straightforward method of simply mean pooling the representations by 0.2 to 0.5 points in every setting.
                   XCOPA         XNLI          XStoryCloze
                   High  Low     High  Low     High  Low
w/ New Strategy    58.4  55.8    42.6  40.5    59.8  58.1
w/ Mean Pooling    58.1  55.3    42.3  40.1    59.6  57.6
Table 10: The average performance of high- and low-resource languages across three classification tasks with two different language vector strategies.
A.7 Detailed Results of Each Language across All the Benchmarks
(In Tables 11 to 17, columns list the high-resource languages first, followed by the low-resource languages.)
                EN    ZH    DE    ES    FR    JA    RU    SW    BN    TH
Llama-2-7B    51.4  29.6  37.2  34.8  36.4  26.2  30.8   2.8   7.2   5.2
+MSFT         59.8  43.2  45.2  46.0  42.4  34.4  43.6  31.6  22.8  34.2
+AFP          60.0  42.8  46.4  47.2  45.6  37.2  45.2  34.4  25.2  35.6
+ShifCon      58.2  48.4  48.8  45.6  47.2  40.4  48.8  38.0  28.4  38.8
XGLM-7.5B      7.6   4.8   3.6   3.2   2.8   2.8   2.8   1.2   2.0   2.4
+MSFT         14.4   9.6  10.0  10.4  10.8   8.0  10.8   6.8   7.2   6.8
+AFP          16.4   9.2  12.4  13.2  12.4  11.2  10.4   9.6   9.2  10.0
+ShifCon      15.6  12.8  14.0  12.8  15.2  12.0  13.6  11.2  11.6  14.4
Table 11: The detailed results of each language on the MGSM task in Llama-2-7B and XGLM-7.5B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
                EN    ZH    ES    FR    SW    BN    TH    DE    JA    RU
BLOOM-7.1B    20.0   9.2  11.6  12.0   2.4   5.2   1.6   4.0   2.4   6.8
+MSFT         26.8  18.8  21.6  20.4  11.6  13.2  10.4  13.6  12.4  14.0
+AFP          28.4  18.0  23.2  22.0  14.8  15.6  14.4  15.2  16.4  18.0
+ShifCon      28.0  21.2  24.8  24.0  19.2  18.8  17.6  19.6  18.4  19.4
Table 12: The detailed results of each language on the MGSM task in BLOOM-7.1B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
                ES    ZH    ID    SW    TH    TR
Llama-2-7B    42.6  17.1  40.9  14.7  12.9  20.0
+MSFT         43.4  18.9  41.8  18.1  15.4  21.8
+AFP          43.9  19.5  42.4  18.9  16.2  22.4
+ShifCon      44.5  19.8  42.6  20.2  16.6  22.3
XGLM-7.5B     36.1  17.8  42.9  33.2  31.5  30.0
+MSFT         36.8  19.1  44.5  35.7  32.1  30.5
+AFP          37.4  19.8  45.0  35.9  32.9  31.2
+ShifCon      37.8  20.8  44.9  36.8  33.8  31.8
BLOOM-7.1B    40.2  35.2  48.8  37.1  16.2  19.6
+MSFT         40.5  36.0  50.5  37.8  17.6  22.3
+AFP          41.2  36.9  51.1  38.5  18.2  23.1
+ShifCon      41.6  36.8  51.7  39.2  18.4  23.9
Table 13: The detailed results of each language on the FLORES (en-xx) task in Llama-2-7B, XGLM-7.5B, and BLOOM-7.1B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
                ES    ZH    ID    SW    TH    TR
Llama-2-7B    49.2  18.8  51.5  23.5  11.6  29.1
+MSFT         48.6  19.4  53.2  26.9  16.4  30.8
+AFP          49.0  19.7  54.4  27.5  17.1  31.6
+ShifCon      49.5  21.2  54.8  29.4  17.8  32.5
XGLM-7.5B     41.8  33.4  48.3  42.9  26.7  37.9
+MSFT         43.0  33.9  50.1  43.9  28.3  39.6
+AFP          44.0  34.8  51.2  44.5  28.9  39.9
+ShifCon      43.8  35.6  51.7  45.2  29.8  40.4
BLOOM-7.1B    45.8  39.6  51.6  43.8  20.3  28.1
+MSFT         46.4  39.9  53.3  45.4  23.0  30.8
+AFP          46.9  40.5  53.8  46.0  23.7  31.6
+ShifCon      47.6  41.3  52.8  46.8  24.5  32.4
Table 14: The detailed results of each language on the FLORES (xx-en) task in Llama-2-7B, XGLM-7.5B, and BLOOM-7.1B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
                ZH    ID    TR    TH    SW
Llama-2-7B    63.8  62.6  49.0  51.4  48.8
+MSFT         65.0  63.4  51.8  52.6  51.5
+AFP          65.8  64.2  52.9  53.4  52.3
+ShifCon      66.8  64.2  54.1  53.2  53.2
XGLM-7.5B     63.6  64.0  56.8  57.1  58.2
+MSFT         64.4  65.4  58.4  58.8  57.6
+AFP          65.3  66.2  59.3  59.2  58.3
+ShifCon      66.8  66.8  60.2  59.4  60.6
BLOOM-7.1B    57.1  58.4  53.2  50.8  52.1
+MSFT         58.6  59.8  55.5  51.6  54.6
+AFP          59.4  60.5  56.3  52.8  55.5
+ShifCon      60.2  60.4  57.6  54.4  56.8
Table 15: The detailed results of each language on the XCOPA task in Llama-2-7B, XGLM-7.5B, and BLOOM-7.1B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
                EN    ES    ZH    TR    TH    HI    SW
Llama-2-7B    49.1  42.6  44.0  35.8  37.2  37.1  30.8
+MSFT         50.8  43.8  44.5  37.5  39.5  38.8  34.6
+AFP          50.8  44.4  45.3  38.6  40.7  39.6  35.8
+ShifCon      50.4  45.0  46.1  40.8  41.8  40.2  38.1
XGLM-7.5B     46.9  41.6  45.0  39.8  43.2  42.6  40.1
+MSFT         48.7  42.4  46.7  41.3  44.4  42.2  41.2
+AFP          49.9  43.3  47.8  43.1  45.2  43.1  42.0
+ShifCon      51.2  45.8  48.9  44.7  44.8  43.8  43.8
BLOOM-7.1B    46.0  40.2  41.1  34.9  35.4  38.6  37.5
+MSFT         47.1  42.1  42.9  36.8  38.2  41.1  39.7
+AFP          47.9  43.4  43.6  37.9  39.3  41.8  40.6
+ShifCon      48.3  43.2  45.0  39.3  40.7  41.8  41.5
Table 16: The detailed results of each language on the XNLI task in Llama-2-7B, XGLM-7.5B, and BLOOM-7.1B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
                EN    ES    ZH    ID    HI    SW
Llama-2-7B    84.4  75.5  69.4  69.4  57.9  55.3
+MSFT         85.5  76.9  70.5  68.3  59.6  57.8
+AFP          86.4  77.3  71.6  69.2  60.5  59.2
+ShifCon      86.2  77.5  72.8  70.1  60.2  61.5
XGLM-7.5B     73.5  63.7  60.4  63.2  59.5  57.2
+MSFT         74.4  65.8  62.8  64.0  61.2  59.1
+AFP          75.5  66.7  63.0  65.1  61.9  60.0
+ShifCon      75.2  67.4  62.4  67.2  62.8  61.5
BLOOM-7.1B    72.2  66.3  66.2  64.7  60.4  55.8
+MSFT         72.8  66.8  67.1  67.5  61.6  58.0
+AFP          72.2  67.5  67.9  67.2  61.9  58.5
+ShifCon      72.6  68.2  68.5  68.8  62.8  59.1
Table 17: The detailed results of each language on the XStoryCloze task in Llama-2-7B, XGLM-7.5B, and BLOOM-7.1B. High- and low-resource languages are categorized based on their data ratios in the pre-training corpus.
A.8 Low Subspace Distance Areas of Models across Different Families and Scales
              Low Subspace Distance Area  Layers
Llama-2-7B    [11, 20]                    32
Llama-3-8B    [11, 20]                    32
BLOOM-7.1B    [14, 22]                    30
BLOOM-1.7B    [10, 17]                    24
BLOOM-560M    [10, 17]                    24
XGLM-7.5B     [13, 22]                    32
XGLM-2.9B     [9, 23]                     48
XGLM-564M     [9, 16]                     24
Table 18: The low subspace distance areas of models in our experiments.
A.9 Language Code
ISO 639-1  Language    Family
BN         Bengali     Indo-European
DE         German      Indo-European
EN         English     Indo-European
ES         Spanish     Indo-European
FR         French      Indo-European
HI         Hindi       Indo-European
ID         Indonesian  Austronesian
JA         Japanese    Japonic
RU         Russian     Indo-European
ZH         Chinese     Sino-Tibetan
TH         Thai        Kra-Dai
SW         Swahili     Niger-Congo
TR         Turkish     Turkic
Table 19: Details of the language codes used in this work.