Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Summary

This paper introduces a novel pre-training approach to enhance cross-lingual capabilities in multilingual large language models (LLMs). Existing methods struggle with data imbalances and monolingual bias, leading to poor performance in low-resource languages. The authors propose a Cross-Lingual Mapping (CL) task during pre-training, which bi-directionally aligns languages in the model's embedding space without compromising monolingual fluency. They also introduce the Language Alignment Coefficient (LAC), a metric to quantify cross-lingual consistency, especially in low-data scenarios. Experiments on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show significant improvements: up to 11.9 BLEU score gains in MT, a 6.72 increase in CLQA BERTScore-Precision, and over 5% in CLNLU accuracy over strong baselines. The method is evaluated on diverse language pairs, including high- and low-resource scenarios, using datasets like WMT2024 and CrossSum. Results demonstrate robust performance gains, particularly for low-resource languages, and the approach generalizes well to unseen language pairs. The study highlights the effectiveness of integrating explicit cross-lingual objectives into pre-training, offering a promising direction for improving multilingual LLMs.

PDF viewer

Chunks(45)

Chunk 0 · 1,981 chars

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset
for Enhanced Multilingual LLM Performance
WEIHUA ZHENG, Singapore University of Technology and Design, Singapore
CHANG LIU, ByteDance, Singapore
ZHENGYUAN LIU, Agency for Science, Technology and Research, Singapore
XIN HUANG, Agency for Science, Technology and Research, Singapore
KUI WU, Agency for Science, Technology and Research, Singapore
MUHAMMAD HUZAIFAH MD SHAHRIN, Agency for Science, Technology and Research, Singapore
AITI AW, Agency for Science, Technology and Research, Singapore
ROY KA-WEI LEE, Singapore University of Technology and Design, Singapore
Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and
low-resource languages and the monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive
alignment, improve cross-lingual performance but often require extensive parallel data or suffer from instability. To address these
challenges, we introduce a Cross-Lingual Mapping Task in the pre-training phase, which enhances cross-lingual alignment without
compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM’s embedding space, improving
both language generation and comprehension. We further introduce a Language Alignment Coefficient to robustly quantify cross-
lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language
understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves up to 11.9 BLEU score gains in
MT, an increase of 6.72 in CLQA BERTScore-Precision and more than a 5% increase in CLNLU accuracy over strong multilingual
baselines. Our findings highlight the potential of embedding cross-lingual objectives into pre-training, improving multilingual LLMs.
CCS Concepts: • Computing methodologies → Natural language

Chunk 1 · 1,993 chars

in
MT, an increase of 6.72 in CLQA BERTScore-Precision and more than a 5% increase in CLNLU accuracy over strong multilingual
baselines. Our findings highlight the potential of embedding cross-lingual objectives into pre-training, improving multilingual LLMs.
CCS Concepts: • Computing methodologies → Natural language generation.
Additional Key Words and Phrases: Cross-Lingual, Large Language Models, Low-resource Language
ACM Reference Format:
WEIHUA ZHENG, CHANG LIU, ZHENGYUAN LIU, XIN HUANG, KUI WU, MUHAMMAD HUZAIFAH MD SHAHRIN, AITI AW,
and ROY KA-WEI LEE. 2025. Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual
LLM Performance. Proc. ACM Meas. Anal. Comput. Syst. 37, 4, Article 111 (August 2025), 20 pages. https://doi.org/XXXXXXX.XXXXXXX
Authors’ Contact Information: WEIHUA ZHENG, weihua_zheng@mymail.sutd.edu.sg, Singapore University of Technology and Design, Singa-
pore; CHANG LIU, ByteDance, Singapore, Liuc0058@e.ntu.edu.sg; ZHENGYUAN LIU, Agency for Science, Technology and Research, Singapore,
liu_zhengyuan@i2r.a-star.edu.sg; XIN HUANG, Agency for Science, Technology and Research, Singapore; KUI WU, Agency for Science, Technology
and Research, Singapore, wuk@i2r.a-star.edu.sg; MUHAMMAD HUZAIFAH MD SHAHRIN, Agency for Science, Technology and Research, Singapore,
huzaifah_md_shahrin@i2r.a-star.edu.sg; AITI AW, Agency for Science, Technology and Research, Singapore, aaiti@i2r.a-star.edu.sg; ROY KA-WEI LEE,
Singapore University of Technology and Design, Singapore, s.roylee@gmail.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or

Chunk 2 · 1,997 chars

ithout fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM
Manuscript submitted to ACM 1
arXiv:2604.10590v1 [cs.CL] 12 Apr 2026

-- 1 of 20 --

2 Zheng et al.
1 Introduction
Motivation. Recent advancements in Large Language Models (LLMs) have significantly improved Natural Language
Processing (NLP) capabilities, achieving state-of-the-art results in tasks such as cross-lingual question answering
(CLQA), text summarization, and machine translation (MT) [ 9]. Multilingual LLMs, including mBERT [ 9], mT5 [42 ],
and Llama3 [10 ], have extended these advancements to multilingual settings. Nevertheless, these models remain
suboptimal in cross-lingual tasks requiring text generation and comprehension [ 44 ], particularly in MT and cross-
lingual summarization (CLSum), where maintaining semantic fidelity and linguistic coherence remains challenging.
A key factor behind this limitation is the imbalance in training data, with high-resource languages dominating pre-
training corpora [12 ]. Consequently, low-resource languages are underrepresented, leading to weaker cross-lingual
generalization. Moreover, standard pre-training predominantly relies on monolingual next-token prediction, reinforcing
fluency in individual languages but limiting cross-lingual transfer.
Several approaches have attempted to address these limitations. Continued pre-training on low-resource languages
improves cross-lingual proficiency [ 41 ], while word-level substitution and alignment

Chunk 3 · 1,997 chars

on monolingual next-token prediction, reinforcing
fluency in individual languages but limiting cross-lingual transfer.
Several approaches have attempted to address these limitations. Continued pre-training on low-resource languages
improves cross-lingual proficiency [ 41 ], while word-level substitution and alignment techniques enhance language
transfer [8 , 35, 39 ]. However, these methods struggle with polysemy, grammatical inconsistencies, and code-switching.
Furthermore, techniques like bilingual sentence masking, though promising, are often incompatible with decoder-only
LLMs that generate text autoregressively. These challenges highlight the need for a more effective pre-training paradigm
that explicitly integrates cross-lingual alignment while preserving monolingual fluency.
Research Objectives. This study aims to address these research gaps to improve the cross-lingual capabilities of
multilingual LLMs. Specifically, we seek to (i) develop an effective pre-training strategy that enhances cross-lingual
alignment while preserving monolingual fluency, (ii) introduce a robust evaluation metric to quantify language alignment,
and (iii) demonstrate the impact of our approach across diverse multilingual NLP tasks.
To achieve these objectives, we propose a novel Cross-Lingual Mapping (CL) Task that explicitly models linguistic
correspondences during the pre-training phase. Unlike existing approaches that rely on bilingual fine-tuning or
contrastive learning, our method integrates bi-directional CL within the model’s learning process, facilitating improved
alignment across languages. Additionally, we introduce the Language Alignment Coefficient (LAC), a new metric
designed to evaluate the consistency of language representations in multilingual LLMs, particularly in low-resource
scenarios.
Contributions. This work makes the following key contributions: (i) We propose a CL Task that enhances language
alignment during pre-training, allowing multilingual LLMs to learn

Chunk 4 · 1,994 chars

a new metric
designed to evaluate the consistency of language representations in multilingual LLMs, particularly in low-resource
scenarios.
Contributions. This work makes the following key contributions: (i) We propose a CL Task that enhances language
alignment during pre-training, allowing multilingual LLMs to learn direct cross-lingual correspondences. (ii) We
introduce the LAC, a novel metric that quantifies cross-lingual consistency and provides a robust evaluation of
multilingual representations. (iii) We construct an extensive evaluation framework encompassing MT, CLSum, CLQA
and CLNLU, enabling a comprehensive assessment of multilingual performance. (iv) Our experiments demonstrate
that the proposed approach significantly improves multilingual LLM performance, achieving up to 11.8 BLEU scores of
improvement in MT, a 6.72 points increase in BERTScore-Precision in CLQA and more than a 5% increase in CLNLU
accuracy over strong baseline„ namely the Llama-3-8B model obtained by applying supervised fine-tuning (SFT) with
the same instruction-tuning dataset.
These findings highlight the effectiveness of incorporating explicit cross-lingual objectives into pre-training, offering
a promising direction for enhancing multilingual LLMs without sacrificing monolingual fluency.
Manuscript submitted to ACM

-- 2 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 3
2 Related Work
Improving language alignment in multilingual models is critical for enhancing cross-lingual performance, particularly
for low-resource languages. Prior research has primarily focused on three strategies: parameter-sharing techniques,
contrastive learning, and bilingual data integration during pre-training. While effective to some extent, these methods
exhibit limitations that hinder robust cross-lingual generalization.
Parameter-sharing facilitates knowledge transfer by sharing model parameters and vocabularies across

Chunk 5 · 1,991 chars

parameter-sharing techniques,
contrastive learning, and bilingual data integration during pre-training. While effective to some extent, these methods
exhibit limitations that hinder robust cross-lingual generalization.
Parameter-sharing facilitates knowledge transfer by sharing model parameters and vocabularies across languages [ 7,
21, 30]. This paradigm underlies many strong multilingual encoders such as InfoXLM [ 5], VECO [ 27 ], LaBSE [ 11],
and RemBERT [ 6 ], which jointly train a single model over shared subword vocabularies while injecting additional
cross-lingual signals. However, it struggles with languages that have vastly different structures or limited training data.
Another approach, masked token prediction on bilingual parallel sentences [ 8], improves cross-lingual representations
but is constrained by the availability of parallel corpora.
Contrastive learning has also been widely explored for aligning multilingual representations. Wang et al . [38]
proposed aligning sentence embeddings via independent encoders, but these methods often accumulate errors, leading
to instability. InfoXLM and LaBSE, for instance, apply contrastive or translation-ranking objectives on parallel sentences
to obtain language-agnostic representations [5, 11 ]. Advances in unsupervised contrastive learning [ 6 , 13, 40 , 45 ]
have improved alignment using data augmentation, yet they often require additional fine-tuning, which limits their
generalization. Our approach circumvents these challenges by embedding cross-lingual alignment directly into pre-
training, improving stability without post-training adjustments.
Bilingual data integration has shown promise in optimizing multilingual models. Ham and Kim [16] employed a
teacher-student framework for monolingual semantic alignment, while reinforcement learning methods [ 4, 14 , 34 , 43 ]
used semantic similarity as a reward signal to improve language alignment. More recently, PreAlign [ 22] demonstrates
that explicitly

Chunk 6 · 1,997 chars

ise in optimizing multilingual models. Ham and Kim [16] employed a
teacher-student framework for monolingual semantic alignment, while reinforcement learning methods [ 4, 14 , 34 , 43 ]
used semantic similarity as a reward signal to improve language alignment. More recently, PreAlign [ 22] demonstrates
that explicitly establishing multilingual alignment before large-scale language model pre-training can further boost cross-
lingual transfer. However, these methods are computationally demanding and require fine-tuning stability. Similarly,
bilingual pre-training techniques [ 20, 29 ] and instruction fine-tuning [ 12 ] have demonstrated moderate success but
remain insufficient for deep cross-lingual generalization.
Our work builds on these insights by introducing a novel CL task that enhances multilingual alignment at the
pre-training stage. Unlike prior approaches that depend on post-hoc alignment techniques or large-scale bilingual
corpora, our method embeds cross-lingual objectives directly into training, strengthening both multilingual generation
and comprehension, providing a scalable and effective solution for improving multilingual LLMs.
3 Language Alignment Coefficient (LAC)
Measuring language alignment in multilingual models is essential for evaluating their cross-lingual transferability.
Existing methods primarily rely on the average cosine similarity of sentence embeddings across intermediate layers of
LLMs [ 24 ]. Additionally, projection variance has been used to quantify alignment stability. However, these metrics
often exhibit inconsistencies, as high cosine similarity does not necessarily imply robust alignment, particularly when
test samples are small or susceptible to outliers. The variance in similarity scores across layers can lead to unstable
assessments, limiting their reliability in evaluating multilingual representation alignment.
To overcome these limitations, we propose the Language Alignment Coefficient (LAC), a novel metric that accounts

Chunk 7 · 1,996 chars

amples are small or susceptible to outliers. The variance in similarity scores across layers can lead to unstable
assessments, limiting their reliability in evaluating multilingual representation alignment.
To overcome these limitations, we propose the Language Alignment Coefficient (LAC), a novel metric that accounts for
both alignment strength and stability. LAC is defined as the ratio of the average cosine similarity of sentence embeddings
to the standard deviation of these similarity values across selected layers 𝐿sub. This formulation is statistically equivalent
Manuscript submitted to ACM

-- 3 of 20 --

4 Zheng et al.
to the inverse of the Coefficient of Variation (CV) [ 32 ], ensuring a more stable and reliable measurement of alignment by
incorporating both similarity magnitude and dispersion.
Let 𝑋 = {𝑥𝑘 }𝑛
𝑘=1 and 𝑌 = {𝑦𝑘 }𝑚
𝑘=1 be sentence pairs in different languages, where 𝐸𝑖 (𝑋 ) and 𝐸𝑖 (𝑌 ) denote their
embeddings extracted from layer 𝑖 of the model by taking the hidden states of the last tokens 𝑥𝑛 and 𝑦𝑚 , respectively.
The LAC metric is defined as:
LAC = 1
|𝐿sub |
∑︁
 𝑖 ∈𝐿sub
cos(𝐸𝑖 (𝑋 ), 𝐸𝑖 (𝑌 ))
std({cos(𝐸𝑖 (𝑋 ), 𝐸𝑖 (𝑌 ))}𝑖 ∈𝐿sub ) , (1)
where 𝐿sub = {5, 10, 15, 20, 25} represents the subset of model layers used for evaluation, following prior work on
multilingual alignment assessment [24].
A higher LAC value indicates greater alignment stability, signifying that similarity scores remain consistent across
different model layers. Unlike prior metrics that consider only mean similarity, LAC normalizes similarity scores by
their dispersion, mitigating the effect of outliers and ensuring robust evaluation even in low-data multilingual scenarios.
By integrating alignment strength and variability, LAC provides a more consistent metric for assessing multilingual
LLMs, particularly in cases where test data is sparse or noisy.
4 Language Alignment in Multilingual Large Language Models
To address the challenges of

Chunk 8 · 1,999 chars

evaluation even in low-data multilingual scenarios.
By integrating alignment strength and variability, LAC provides a more consistent metric for assessing multilingual
LLMs, particularly in cases where test data is sparse or noisy.
4 Language Alignment in Multilingual Large Language Models
To address the challenges of language alignment in multilingual LLMs, we introduce a continued pre-training approach
that simultaneously optimizes two objectives: the Next-Token Prediction (NTP) task and the Cross-Lingual Mapping (CL)
task. These tasks are designed to enhance both the model’s monolingual generation capabilities and its ability to align
representations across languages. The overall pre-training framework is illustrated in Figure 1. In the NTP task, the
model predicts the next token 𝑥𝑖+1 based solely on the preceding tokens in the same language. In contrast„ in the CL
task, the model generates a token 𝑧𝑖+1 in a target language by using the complete sequence of the source language
𝑌 and previously generated target tokens. This dual-task approach ensures that multilingual alignment is achieved
without sacrificing monolingual fluency.
4.1 Next-Token Prediction Task
A major challenge in training multilingual LLMs is the imbalance in pre-training data, which results in performance
disparities between high-resource and low-resource languages. Prior studies have shown that improving low-resource
language modeling enhances overall multilingual alignment [30 , 41 ]. However, continued pre-training on low-resource
languages alone often leads to catastrophic forgetting of previously learned high-resource languages [ 15 , 19, 26 , 28 , 33].
To mitigate this issue, we adopt a decoder-only transformer model, parameterized by 𝜃 , for the NTP task. Given a
monolingual sentence 𝑋 = (𝑥0, 𝑥1, . . . , 𝑥𝑇 ), the objective is to predict the next token 𝑥𝑡 in the sequence, conditioning on
the preceding tokens. This is formulated as a negative log-likelihood loss:
LNTP (𝜃 ) =

Chunk 9 · 1,996 chars

ssue, we adopt a decoder-only transformer model, parameterized by 𝜃 , for the NTP task. Given a
monolingual sentence 𝑋 = (𝑥0, 𝑥1, . . . , 𝑥𝑇 ), the objective is to predict the next token 𝑥𝑡 in the sequence, conditioning on
the preceding tokens. This is formulated as a negative log-likelihood loss:
LNTP (𝜃 ) = −
𝑇 ∑︁
𝑡 =0
log 𝑃 (𝑥𝑡 |𝑥0, . . . , 𝑥𝑡 −1; 𝜃 ). (2)
This formulation ensures that the model maintains high-resource language proficiency while progressively improving
its low-resource language capabilities. By training on a mixture of high- and low-resource languages, the model achieves
better overall multilingual robustness without overfitting to dominant languages.
Manuscript submitted to ACM

-- 4 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 5
Multilingual LLM with better language Alignment
<s> x0 mi+1 mT ...
Cross-Lingual Mapping
xi ... <s> y0 yi+1 yT ... yi ...
z0 mi+1 mT ... zi ...
<s> x0 xi+1 xT ... xi ... <s> y0 yi+1 yT ... yi ...
z0 zi+1 zT ... zi ...
Next Token Prediction
Token Masking
Fig. 1. Continue pre-training including NTP and CL. "<s>" is "start token", "𝑚𝑖 " is the mask of the 𝑖𝑡ℎ token. "x", "y", "z" are tokens
from different languages.
4.2 Cross-Lingual Mapping Task
Monolingual next-token prediction lacks explicit cross-lingual alignment signals. Without bilingual contexts, models
trained on monolingual corpora reinforce same-language generation, limiting cross-lingual transfer in translation and
multilingual understanding tasks.
To address this limitation, we introduce the CL task, which explicitly trains the model to generate sequences in
a target language while conditioning on a source language sequence. Given a bilingual parallel sentence pair (𝑋, 𝑌 ),
where 𝑋 is the source sentence and 𝑌 is the target sentence, the model learns to generate 𝑌 based on the information in
𝑋 .
Prior work explored translation during

Chunk 10 · 1,998 chars

e model to generate sequences in
a target language while conditioning on a source language sequence. Given a bilingual parallel sentence pair (𝑋, 𝑌 ),
where 𝑋 is the source sentence and 𝑌 is the target sentence, the model learns to generate 𝑌 based on the information in
𝑋 .
Prior work explored translation during pretraining under Masked Language Modeling (MLM)[ 8 ], and extensively in
encoder-decoder architectures[26]. However, MLM’s bidirectional context differs fundamentally from Causal Language
Modeling’s left-to-right constraint, making cross-lingual alignment more challenging in decoder-only models. Research
on effective cross-lingual alignment within CLM-based decoder-only architectures remains limited.
Additionally, LLMs differ from traditional models in their instruction-driven paradigm. Rather than fixed input
formats, LLMs rely on prompts to determine task behavior. Our approach differs from traditional translation-based
pretraining in two ways: (i) We enhance cross-lingual embedding similarity directly within decoder-only models, without
explicit encoding-decoding separation; (ii) We use no explicit task instructions, enabling generalizable, instruction-
agnostic cross-lingual representations. Unlike contrastive methods requiring explicit alignment, our CL task enables
implicit transfer through a sequence-to-sequence objective:
LCL(𝜃 ) = −∑︁ 𝑡 = 0𝑇 log 𝑃 (𝑦𝑡 |𝑦0, . . . , 𝑦𝑡 −1; 𝑋 ; 𝜃 ). (3)
This ensures joint learning of semantic correspondences and syntactic transformations without predefined alignment
constraints.
Manuscript submitted to ACM

-- 5 of 20 --

6 Zheng et al.
Fig. 2. Language alignment of different pre-trained models. LAC is Language Alignment Coefficient.
4.3 Joint Optimization of Multilingual Alignment
We jointly optimize the NTP and CL objectives during pre-training. The final loss function for continued pre-training is
given by:
LPT (𝜃 ) = LNTP (𝜃 ) + LCL (𝜃 ). (4)
This dual-task training strategy enables the model

Chunk 11 · 1,996 chars

d models. LAC is Language Alignment Coefficient.
4.3 Joint Optimization of Multilingual Alignment
We jointly optimize the NTP and CL objectives during pre-training. The final loss function for continued pre-training is
given by:
LPT (𝜃 ) = LNTP (𝜃 ) + LCL (𝜃 ). (4)
This dual-task training strategy enables the model to retain strong monolingual generation capabilities while
simultaneously improving cross-lingual mapping. By jointly optimizing both objectives, the model learns to maintain
high-resource language fluency while developing robust alignment mechanisms for low-resource languages. This
approach results in a more effective multilingual LLM capable of both generating text fluently in a given language and
transferring knowledge efficiently across linguistic boundaries.
5 Dataset Creation
To ensure a comprehensive evaluation, we select the following language pairs: English-Chinese (EN-ZH), English-Czech
(EN-CS), Chinese-Japanese (ZH-JP), and Czech-Ukrainian (CS-UK). These pairs represent both high- and low-resource
scenarios, with EN-ZH (high-resource) and EN-CS (mid-resource) evaluating English-centric performance, while ZH-JP
(high-resource) and CS-UK (low-resource) assess non-English alignment.
Our fine-tuning dataset is derived from Cleaned Alpaca [ 36 ], augmented with machine translation (MT), cross-lingual
question answering (CLQA), and cross-lingual summarization (CLSum) tasks to enhance the model’s multilingual
capabilities. For CLSum, we utilize CrossSum [3 ] for EN-ZH and ZH-JP, while CLQA data is sourced from OpenHermes-
2.5 [ 37 ], supplemented with 2000 manually verified QA pairs (1000 EN-CS, 1000 CS-UK) generated using Claude 3.5.
Evaluation is conducted using the XQuAD dataset [1 ] for CLQA and Belebele [ 2] for CLNLU, focusing on our selected
language pairs.
Monolingual data for continued pre-training is sourced from CulturaX [ 31], covering English, Chinese, Czech,
Ukrainian, and Japanese. English data is included to prevent catastrophic

Chunk 12 · 1,994 chars

.5.
Evaluation is conducted using the XQuAD dataset [1 ] for CLQA and Belebele [ 2] for CLNLU, focusing on our selected
language pairs.
Monolingual data for continued pre-training is sourced from CulturaX [ 31], covering English, Chinese, Czech,
Ukrainian, and Japanese. English data is included to prevent catastrophic forgetting. Parallel data for MT and cross-
lingual mapping tasks come from WMT2024, with 50,000 sentence pairs per language pair for MT and 3.7 million pairs
for cross-lingual mapping. During continued pre-training, we optimize two separate objectives: next-token prediction
Manuscript submitted to ACM

-- 6 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 7
Task Name Sub-Task Language /
Language Pairs Data Volume Data Source
Monolingual Continue
Pretraining NA
English 37,182,019 tokens
ClutureX
Chinese 93,124,828 tokens
Japanese 85,415,308 tokens
Ukrainian 187,336,165 tokens
Czech 185,235,692 tokens
Cross-Lingual
Mapping NA EN-ZH, EN-CS,
CS-UK, ZH-JP
3.7M sentence
pairs per language pair WMT2024
Instruction
Finetuning for
general task
MT EN-ZH, EN-CS,
CS-UK, ZH-JP 50K per language pair WMT2024
CLQA EN-CS & CS-UK 1000 per language pair Tranlate from
OpenHermes-2.5
CL-Sum EN-ZH 3,893 CrossSum
ZH-JP 809
Evaluation
MT EN-ZH, EN-CS,
CS-UK, ZH-JP
FLORES-200:
1012 per language pair FLORES-200,
WMT 2022
WMT cs-uk: 1930
WMT en-cs: 2037
WMT en-zh: 2037
Xquad EN-ZH 1190 Xquad
Open-End
CLQA EN-CS & CS-UK 500 per language pair Tranlate from
OpenHermes-2.5
CL-Sum EN-ZH 101 CrossSum
ZH-JP 486
CLNLU EN-ZH, EN-CS,
CS-UK, ZH-JP 900 per language pair Belebele
Table 1. Task Details and Data Statistics
(NTP) and cross-lingual mapping. The NTP objective is trained on 4,128,308 instances (≈4.1M), and the per-batch
mixture allocates 52.7% and 47.3% of training examples to NTP and cross-lingual mapping, respectively. Since the two
losses are of comparable magnitude, we do not apply additional

Chunk 13 · 1,993 chars

Details and Data Statistics
(NTP) and cross-lingual mapping. The NTP objective is trained on 4,128,308 instances (≈4.1M), and the per-batch
mixture allocates 52.7% and 47.3% of training examples to NTP and cross-lingual mapping, respectively. Since the two
losses are of comparable magnitude, we do not apply additional task-specific weighting. No dataset overlap occurs
between training and evaluation phases.
Further details on dataset sources, language pairs, dataset sizes, and task-specific prompts are provided in Tables 1
and 2.
6 Experiments
6.1 Experiment Settings
We use Llama-3-8B [ 10 ] as the base model, given its extensive multilingual capabilities across 31 languages. Continued
pre-training is performed on Llama-3-8B, followed by fine-tuning on both cross-lingual and monolingual tasks. To
evaluate language alignment and generation quality, we compute sentence embedding similarities and perplexity scores.
For instruction fine-tuning, we train models on general and cross-lingual tasks to assess downstream performance.
Given that MT is the most representative cross-lingual task, ablation studies focus on it to isolate key factors influencing
cross-lingual effectiveness.
Manuscript submitted to ACM

-- 7 of 20 --

8 Zheng et al.
Task Prompts
MT Translate the following <Source Language> to <Target Language>:
<Source Sentence>
CLSum Please read the following <Source Language> text and generate a short
and precise <Target Language> summary: <Source Language Para-
graph>
Xquad <Source Language Paragraph> Based on the above paragraph, answer
the following questions in <Target Language> or numbers.
Cross-Lingual Open-ended Question and Answering Answer the following questions in <Target Language>. <Question in
Source Language>
CLNLU Answer the following single-choice questions and output only the
option. <Source Question> <Target Options>.
Table 2. Prompts for Different Tasks
6.2 Training details
We conducted training with a warm-up ratio of 0.01 and a sequence

Chunk 14 · 1,997 chars

swer the following questions in <Target Language>. <Question in
Source Language>
CLNLU Answer the following single-choice questions and output only the
option. <Source Question> <Target Options>.
Table 2. Prompts for Different Tasks
6.2 Training details
We conducted training with a warm-up ratio of 0.01 and a sequence length of 2048 tokens, limiting the process to 1
epoch for pre-training and 3 epochs for general fine-tuning. The training employed bf16 precision with Low-Rank
Adaptation (LoRA) [ 18 ], configured with a LoRA rank of 16 and target modules applied to all layers. We utilized 8
H100 GPUs, with each GPU processing 16 batches and an 8-step gradient accumulation, yielding an effective batch
size of 1024. The initial learning rate was set to 1e-4 with the AdamW optimizer. Under this configuration, continued
pre-training required approximately 70 hours, while fine-tuning was completed in 53 minutes.
6.3 Language Alignment and Perplexity Evaluation of Base Models
We assess language alignment and monolingual perplexity on Llama-3-8B and its variants following different pre-training
strategies:
(1) Llama_NTP: Trained with monolingual next-token prediction.
(2) Llama_Bi_NTP: The model is trained on a hybrid corpus comprising both monolingual data and bilingual sen-
tence pairs [ 20 ]. While maintaining the standard next-token prediction objective, the training data encompasses
both monolingual sequences and concatenated bilingual sentence pairs (e.g., "How are you? 你好吗？"). This
training configuration serves to augment the model’s cross-lingual transition probability.
(3) Llama_CLI: The model is jointly trained on monolingual next-token prediction and translation tasks during the
pre-training stage, where translation instances are formulated with an explicit instruction prompt, e.g., “Translate
the following language 1 to language 2.”
(4) Llama_CL: Our proposed model, incorporating monolingual next-token prediction and cross-lingual mapping.
The CL objective is

Chunk 15 · 1,992 chars

and translation tasks during the
pre-training stage, where translation instances are formulated with an explicit instruction prompt, e.g., “Translate
the following language 1 to language 2.”
(4) Llama_CL: Our proposed model, incorporating monolingual next-token prediction and cross-lingual mapping.
The CL objective is designed to align cross-lingual semantic spaces by predicting target-language tokens from source-
language sentence embeddings. While Bi_NTP remains fundamentally a Next-Token Prediction (NTP) task, it extends
standard monolingual NTP by incorporating code-switched sequences constructed from concatenated bilingual parallel
sentences. Under this configuration, the model primarily learns the mechanisms of linguistic transition. Although CL
conceptually overlaps with machine translation, its scope is broader; by aligning latent semantic spaces, it provides a
robust foundation for a wide range of cross-lingual downstream applications beyond simple translation. We evaluate
bilingual sentence alignment using Flores-200 and WMT2022, while monolingual perplexity is tested on 1,000 samples
per language from CulturaX [31].
Manuscript submitted to ACM

-- 8 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 9
Lang L-3-8B L_NTP L_Bi_NTP L_CL-only L_CL
EN 30.1 27.4 26.9 30.7 25.5
ZH 22.4 22.1 21.3 23.3 20.6
CS 21.1 16.8 16.1 22.1 15.2
UK 19.1 17.2 16.9 19.5 16.0
JP 25.1 22.9 22.2 25.5 20.8
Table 3. Perplexity Scores of Languages for Different Models. Best scores are in bold.
Results in Table 3 and Figure 2 show that while Llama_NTP and Llama_Bi_NTP significantly reduce monolingual
perplexity, particularly in low-resource languages like CS (-4.33 points), their improvements in cross-lingual alignment
remain inconsistent. Llama_Bi_NTP achieves gains primarily for CS-UK but shows limited impact on EN-CS due to the
lack of an explicit CL objective.
In contrast, Llama_CL achieves a more

Chunk 16 · 1,997 chars

monolingual
perplexity, particularly in low-resource languages like CS (-4.33 points), their improvements in cross-lingual alignment
remain inconsistent. Llama_Bi_NTP achieves gains primarily for CS-UK but shows limited impact on EN-CS due to the
lack of an explicit CL objective.
In contrast, Llama_CL achieves a more balanced improvement across both monolingual perplexity and cross-lingual
alignment. The integration of CL objective explicitly reinforces bilingual relationships, leading to notable performance
advantages: a 0.57-point improvement over Llama_Bi_NTP for EN-CS and consistent gains for EN-ZH (0.2), CS-UK
(0.15), and ZH-JP (0.15).
Further, Llama_CL demonstrates strong generalization to unseen language pairs, outperforming other models on
EN-UK alignment by 0.77, 0.87, and 0.62 points against Llama-3-8B, Llama_NTP, and Llama_Bi_NTP, respectively, while
also achieving stable alignment for EN-JP. This suggests that our method is able to generalized beyond trained bilingual
pairs, reinforcing multilingual robustness.
Overall, these findings highlight the necessity of explicit cross-lingual mapping in pre-training. While NTP enhances
monolingual fluency, our approach successfully bridges linguistic gaps, making it particularly effective for low-resource
language processing and multilingual applications.
6.4 Cross-Lingual Task Evaluation of Instruction-Tuned Models
To ensure a fair comparison and isolate the effectiveness of our pre-training strategy, we applied an identical instruction
fine-tuning (IFT) process to all model variants described in Section 6.3. The set of models includes the Llama-3-8B
base model, baseline variants (Llama_NTP and Llama_Bi_NTP), a joint-objective baseline (Llama_CLI, trained with
next-token prediction and translation objectives), and our proposed model, Llama_CL. Consequently, all downstream
performance gains reported in this section represent comparisons among fine-tuned checkpoints. We evaluated their
performance on MT, CLSum,

Chunk 17 · 1,998 chars

lama_Bi_NTP), a joint-objective baseline (Llama_CLI, trained with
next-token prediction and translation objectives), and our proposed model, Llama_CL. Consequently, all downstream
performance gains reported in this section represent comparisons among fine-tuned checkpoints. We evaluated their
performance on MT, CLSum, CLQA and CLNLU. For CLSum, we used the CrossSum dataset [ 3 ] for EN-ZH and ZH-JP
summarization. CLQA evaluation consisted of two components: reference-based QA using XQuAD [1] (EN-ZH) and
open QA, constructed from OpenHermes-2.5 [ 37]. And Belebele [ 2] is used as the CLNLU evaluation dataset. Reference
evaluation results are presented in Table 4.
To assess generalizability, we replicated these experiments on BLOOM-3B [21 ], observing similar trends to Llama-3-
8B. Since many of the languages in our test set are not supported by BLOOM, we have chosen to report scores only for
the EN-ZH language pair on the BLOOM-3B models. This ensures the validity and fairness of our evaluation.
Similar to the results obtained on the Llama-3-8B model, as shown in Table 5, our approach also demonstrates
significant improvements in translation tasks on the BLOOM-3B model. In particular, on the Flores-EN-ZH dataset, our
method achieves an impressive gain of 14.43, highlighting its effectiveness.
Manuscript submitted to ACM

-- 9 of 20 --

10 Zheng et al.
Tasks Data Sets Metric Llama-3-8B Llama_NTP Llama_Bi_NTP Llama_CLI Llama_CL
MT
WMT-EN-ZH
BLEU
36.2±0.5 37.8±0.5 (+1.6) 44.0±0.5 (+7.8) 49.4±0.5 (+13.1) 48.2±0.6 (+12.0)
WMT-EN-CS 19.4±0.3 25.1±0.4 (+5.7) 26.3±0.4 (+6.9) 31.4±0.6 (+12.0) 31.2±0.5 (+11.8)
WMT-CS-UK 18.2±0.3 21.0±0.4 (+2.8) 23.4±0.4 (+5.2) 25.1±0.4 (+6.9) 25.8±0.5 (+7.6)
Flores-EN-ZH 28.6±0.4 29.9±0.4 (+1.3) 33.6±0.5 (+5.0) 37.6±0.4 (+9.0) 36.2±0.5 (+7.6)
Flores-EN-CS 17.2±0.3 21.0±0.4 (+3.8) 21.6±0.4 (+4.4) 25.7±0.5 (+8.5) 23.2±0.4 (+6.0)
Flores-CS-UK 14.3±0.3 18.0±0.3 (+3.7) 18.5±0.4 (+4.2) 20.0±0.3 (+5.7) 19.4±0.4 (+5.1)
Flores-ZH-JP 15.7±0.3 17.4±0.3

Chunk 18 · 1,997 chars

5.1±0.4 (+6.9) 25.8±0.5 (+7.6)
Flores-EN-ZH 28.6±0.4 29.9±0.4 (+1.3) 33.6±0.5 (+5.0) 37.6±0.4 (+9.0) 36.2±0.5 (+7.6)
Flores-EN-CS 17.2±0.3 21.0±0.4 (+3.8) 21.6±0.4 (+4.4) 25.7±0.5 (+8.5) 23.2±0.4 (+6.0)
Flores-CS-UK 14.3±0.3 18.0±0.3 (+3.7) 18.5±0.4 (+4.2) 20.0±0.3 (+5.7) 19.4±0.4 (+5.1)
Flores-ZH-JP 15.7±0.3 17.4±0.3 (+1.7) 18.5±0.4 (+2.8) 21.6±0.4 (+5.9) 21.0±0.4 (+5.3)
WMT-EN-ZH
COMET
67.31±0.29 70.30±0.33 (+2.99) 70.90±0.34 (+3.59) 72.34±0.21 (+5.03) 71.10±0.35 (+3.79)
WMT-EN-CS 64.39±0.28 67.40±0.32 (+3.01) 70.10±0.36 (+5.71) 73.56±0.39 (+9.17) 72.60±0.40 (+8.21)
WMT-CS-UK 75.33±0.31 82.60±0.38 (+7.27) 84.00±0.39 (+8.67) 86.28±0.42 (+10.95) 85.10±0.41 (+9.77)
Flores-EN-ZH 75.74±0.31 82.90±0.38 (+7.16) 82.80±0.38 (+7.06) 83.90±0.11 (+8.16) 83.20±0.39 (+7.46)
Flores-EN-CS 75.18±0.30 81.20±0.36 (+6.02) 83.30±0.39 (+8.12) 88.03±0.23 (+12.85) 85.00±0.42 (+9.82)
Flores-CS-UK 76.47±0.30 84.30±0.39 (+7.83) 84.90±0.40 (+8.43) 87.46±0.38 (+10.99) 86.30±0.43 (+9.83)
Flores-ZH-JP 81.63±0.32 84.60±0.35 (+2.97) 85.90±0.37 (+4.27) 88.99±0.30 (+7.36) 87.60±0.40 (+5.97)
CLSum CrossSum
R-1 15.95 14.03(-1.92) 15.52(-0.43) 13.27(-2.68) 16.80(+0.85)
R-2 5.964 4.992(-0.972) 4.824(-1.140) 4.329(-1.635) 5.937(-0.027)
R-L 15.77 13.91(-1.86) 15.34(-0.43) 12.78(-2.99) 16.39(+0.62)
Rec 73.4 73.24(-0.16) 73.09(-0.31) 71.52(-1.88) 74.12(+0.72)
CLQA Xquad F1 86.80±0.27 87.50±0.28 (+0.70) 88.70±0.30 (+1.90) 85.71±0.38 (-1.09) 88.50±0.30 (+1.70)
EM 17.48 18.82(+1.34) 19.16(+1.68) 16.35(-1.13) 19.24(+1.76)
Prec 80.54 86.28(+5.74) 86.78(+6.24) 79.33(-1.21) 87.26(+6.72)
CLNLU
Belebele-EN-ZH
Acc
58.78% 61.67%(+2.89%) 62.89%(+4.11%) 61.33%(+2.55%) 61.78%(+3.00%)
Belebele-EN-CS 55.67% 59.78%(+4.11%) 60.78%(+5.11%) 58.56%(+2.89%) 61.33%(+5.66%)
Belebele-CS-UK 47.56% 49.78%(+2.22%) 51.33%(+3.77%) 47.02%(-0.54%) 54.22%(+6.66%)
Belebele-ZH-JP 40.22% 42.44%(+2.22%) 45.67%(+5.45%) 43.85%(+3.63%) 46.56%(+6.34%)
Table 4. Evaluation Results on Cross-Lingual Tasks for Llama-3. All results in this table are

Chunk 19 · 1,992 chars

% 59.78%(+4.11%) 60.78%(+5.11%) 58.56%(+2.89%) 61.33%(+5.66%)
Belebele-CS-UK 47.56% 49.78%(+2.22%) 51.33%(+3.77%) 47.02%(-0.54%) 54.22%(+6.66%)
Belebele-ZH-JP 40.22% 42.44%(+2.22%) 45.67%(+5.45%) 43.85%(+3.63%) 46.56%(+6.34%)
Table 4. Evaluation Results on Cross-Lingual Tasks for Llama-3. All results in this table are obtained by evaluating the instruction-
fine-tuned models.
R-1, R-2, R-L, Rec, EM, Prec, and Acc respectively represent ROUGE-1, ROUGE-2, ROUGE-L, BERTScore-Recall, Exact Match,
BERTScore-Precision, and Accuracy. Best scores are in bold.
For the CLQA task, while the Llama-3-8B model slightly lags behind the Llama_Bi_NTP model in F1 score, our
method consistently outperforms other approaches on the BLOOM-3B model, showcasing significant improvements
over the baseline. Notably, it achieves a 2.89% increase in F1, a 1.17% gain in Exact Match Rate, and a 5.69% boost in
BERTScore-Precision.
Surprisingly, our method demonstrated greater improvements on the CLSum task with the Bloom-3B model compared
to Llama-3-8B, achieving over a 1-point increase in ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore-Precision. While
there is still room for further enhancement, these results are promising. Additionally, our approach also delivered
significant improvements on the CLNLU task with the BLOOM model.
By comparing Table 4 and Table 5, it becomes clear that our method yields even greater improvements when applied
to a base model with weaker foundational capabilities.
6.4.1 Reference-based Evaluation. Table 4 presents a comparative analysis of our proposed model (Llama_CL) against
the Llama-3-8B baseline and several cross-lingual variants. The results indicate that while all adaptation strategies
improve upon the vanilla baseline in Machine Translation (MT) and CLNLU, they exhibit distinct trade-offs in terms of
task-specific performance versus general-purpose cross-lingual capability.
Manuscript submitted to ACM

-- 10 of 20 --

Bridging Linguistic Gaps: Cross-Lingual

Chunk 20 · 1,998 chars

cate that while all adaptation strategies
improve upon the vanilla baseline in Machine Translation (MT) and CLNLU, they exhibit distinct trade-offs in terms of
task-specific performance versus general-purpose cross-lingual capability.
Manuscript submitted to ACM

-- 10 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 11
Tasks Data Sets Metric BLOOM-3B BLOOM_NTP BLOOM_Bi_NTP BLOOM_CL
MT
WMT-EN-ZH BLEU 11.3±0.4 13.6±0.4 (+2.3) 16.6±0.4 (+5.3) 17.5±0.5 (+6.2)
Flores-EN-ZH 13.0±0.4 16.7±0.4 (+3.7) 18.9±0.5 (+5.9) 22.6±0.5 (+9.6)
WMT-EN-ZH COMET 57.93±0.28 58.49±0.30 (+0.56) 59.82±0.34 (+1.89) 59.70±0.34 (+1.77)
Flores-EN-ZH 60.25±0.30 62.63±0.33 (+2.38) 70.41±0.40 (+10.16) 74.68±0.43 (+14.43)
CLSum CrossSum-EN-ZH
R-1 6.20 6.70(+0.5) 7.10(+0.9) 7.30(+1.1)
R-2 1.90 2.20(+0.3) 2.90(+1.0) 3.40(+1.5)
R-L 5.12 5.43(+0.31) 5.71(+0.59) 6.88(+1.76)
Rec 62.70 63.12(+0.42) 63.89(+1.19) 64.13(+1.43)
CLQA Xquad-EN-ZH
F1 24.42±0.25 25.73±0.27 (+1.31) 27.08±0.28 (+2.66) 27.31±0.29 (+2.89)
EM 8.24 8.82(+0.58) 9.24(+1.00) 9.41(+1.17)
Prec 37.35 39.97(+2.62) 42.18(+4.83) 43.04(+5.69)
CLNLU Belebele-EN-ZH Acc 41.11% 43.11%(+2.00%) 44.67%(+3.56%) 46.33%(+5.22%)
Table 5. Evaluation Results on Cross-Lingual Tasks for BLOOM. All results in this table are obtained by evaluating the instruction-
fine-tuned models.
R-1, R-2, R-L, Rec, EM, Prec, and Acc respectively represent ROUGE-1, ROUGE-2, ROUGE-L, BERTScore-Recall, Exact Match,
BERTScore-Precision, and Accuracy. Best scores are in bold.
Specifically, the basic variants Llama_NTP and Llama_Bi_NTP demonstrate steady improvements in translation and
understanding. The competitive performance of Llama_Bi_NTP in CLQA (e.g., 88.70 F1 on Xquad) suggests that bilingual
Next Token Prediction (NTP) enhances the model’s capacity to perform language transitions at contextually appropriate
positions, thereby providing a solid foundation for cross-lingual alignment. However, a

Chunk 21 · 1,999 chars

erstanding. The competitive performance of Llama_Bi_NTP in CLQA (e.g., 88.70 F1 on Xquad) suggests that bilingual
Next Token Prediction (NTP) enhances the model’s capacity to perform language transitions at contextually appropriate
positions, thereby providing a solid foundation for cross-lingual alignment. However, a more complex pattern emerges
when considering Llama_CLI, which incorporates explicit translation instructions during the pre-training phase. While
Llama_CLI achieves the highest performance in most MT benchmarks—surpassing the baseline by up to 13.1 BLEU
points, it suffers from significant performance degradation in CLSum and CLQA. For instance, in CLSum, Llama_CLI
experiences a 2.68-point drop in ROUGE-1 compared to the baseline.
These results suggest that retaining task-specific instructions during the pre-training stage can easily trap the model
in a task-specific local optimum. Although such a strategy may yield marginal gains in certain downstream tasks
like CLNLU, it appears to induce a degree of overfitting to the translation objective that even subsequent fine-tuning
cannot fully rectify. This rigid optimization limits the model’s flexibility in handling complex, open-ended cross-lingual
reasoning tasks like summarization.
In contrast, our approach (Llama_CL) effectively mitigates this risk. By balancing instruction-tuning with robust
cross-lingual mapping, Llama_CL achieves MT performance comparable to the specialized Llama_CLI while remaining
the only variant to consistently outperform the baseline in CLSum. Furthermore, in CLQA, Llama_CL attains the highest
Exact Match (19.24%) and BERTScore-Precision (87.26) scores. These findings underscore that our method fosters a more
versatile cross-lingual representation space, ensuring high-fidelity generation and comprehension without sacrificing
the model’s generalizability across diverse task formats.
6.4.2 LLM-Assisted Reference-Free Evaluation. To further assess cross-lingual generalization, we

Chunk 22 · 1,997 chars

ings underscore that our method fosters a more
versatile cross-lingual representation space, ensuring high-fidelity generation and comprehension without sacrificing
the model’s generalizability across diverse task formats.
6.4.2 LLM-Assisted Reference-Free Evaluation. To further assess cross-lingual generalization, we evaluated models on
open-domain question answering, where responses must be coherent and contextually relevant without explicit refer-
ences. We constructed CLQA datasets for EN-CS and CS-UK using a subset of open-ended questions from OpenHermes-
2.5 [ 37]. Common question types and examples are provided in Table 6. Given the subjective nature of queries such as
Manuscript submitted to ACM

-- 11 of 20 --

12 Zheng et al.
Types of open-ended questions Question Example
Creative prompts Compose a poem about childhood.
Commonsense reasoning Why do people wear sunscreen?
Knowledge-based queries What are the key features of quantum computing?
List-style questions List five characteristics of a good employee.
Table 6. Examples of open-ended question types
Rank L-3-8B L_NTP L_Bi_NTP L_CL
1 181 271 252 325
2 332 182 281 190
3 261 213 212 202
4 163 271 192 220
Same 63
Table 7. Evaluation Results on Open-Ended CLQA Task. Best scores are in bold.
“Could you share a library-related joke?”, responses were rated on a 1–4 scale using Claude 3.5, with a ‘Same’ rating
assigned when models produced indistinguishable or uniformly incorrect answers.
As shown in Table 7, Llama_CL achieved the highest performance, receiving 325 top ratings out of 1,000 responses.
It consistently outperformed other models, particularly in tasks requiring nuanced comprehension and complex text
generation. Notably, when prompted to “Compose a poem about childhood”, other models provided basic descriptions,
whereas Llama_CL employed advanced poetic techniques such as parallelism, demonstrating improved contextual
understanding.
These results highlight the benefits of cross-lingual mapping in

Chunk 23 · 1,984 chars

and complex text
generation. Notably, when prompted to “Compose a poem about childhood”, other models provided basic descriptions,
whereas Llama_CL employed advanced poetic techniques such as parallelism, demonstrating improved contextual
understanding.
These results highlight the benefits of cross-lingual mapping in enhancing multilingual generative capabilities.
However, all models exhibited weaknesses in complex reasoning tasks, underscoring the need for further improvements
in logical inference and cross-lingual reasoning.
7 Ablation Study
7.1 Analysis of Cross-Lingual Training Objectives
We conducted ablation experiments on MT to isolate key factors influencing model performance. The experiments
were structured as follows:
(1) 𝐸𝑠𝑒𝑝 : Bilingual data was split into monolingual sentences and used for next-token prediction.
(2) 𝐸𝑝𝑜𝑠𝑡_𝑚𝑡 : Bilingual data was converted into instruction-tuning format and used during fine-tuning with the
prompt: “Translate the following <Source Language> sentence into <Target Language>.”
(3) 𝐸𝑝𝑟𝑒_𝑚𝑡 : Bilingual data was incorporated as instruction-tuning data during continued pre-training using the
same translation prompt as 𝐸𝑝𝑜𝑠𝑡_𝑚𝑡 .
(4) 𝐸𝑐𝑟𝑜𝑠𝑠 : Our proposed approach, integrating cross-lingual mapping into pre-training.
As shown in Table 8, the comparison between 𝐸𝑠𝑒𝑝 and 𝐸𝑐𝑟𝑜𝑠𝑠 confirms that improved cross-lingual performance is
not merely a result of increased data volume. The significant performance gap between 𝐸𝑝𝑜𝑠𝑡 _𝑚𝑡 and 𝐸𝑐𝑟𝑜𝑠𝑠 underscores
the importance of embedding cross-lingual tasks during pre-training, as instruction fine-tuning alone provides limited
benefits.
Manuscript submitted to ACM

-- 12 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 13
Test Datasets 𝐸_𝑠𝑒𝑝 𝐸_𝑝𝑜𝑠𝑡_𝑚𝑡 𝐸_𝑝𝑟𝑒_𝑚𝑡 𝐸_𝑐𝑟𝑜𝑠𝑠
WMT-en-zh 62.04 67.26 68.87 71.03
WMT-en-cs 66.72 63.42 69.78

Chunk 24 · 1,994 chars

des limited
benefits.
Manuscript submitted to ACM

-- 12 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 13
Test Datasets 𝐸_𝑠𝑒𝑝 𝐸_𝑝𝑜𝑠𝑡_𝑚𝑡 𝐸_𝑝𝑟𝑒_𝑚𝑡 𝐸_𝑐𝑟𝑜𝑠𝑠
WMT-en-zh 62.04 67.26 68.87 71.03
WMT-en-cs 66.72 63.42 69.78 72.47
WMT-cs-uk 77.92 74.21 81.17 84.97
Flores-en-zh 75.64 78.31 81.09 83.05
Flores-en-cs 79.95 76.44 83.02 84.87
Flores-cs-uk 79.96 60.11 83.45 86.04
Flores-zh-jp 76.21 71.45 85.97 87.51
Table 8. COMET score for MT task of models based on different pre-training settings. E_cross is our method.
Model MMLU LogiQA Alpaca_Eval
(Accuracy %) (Accuracy %) (Win Rate %)
L-3-8B-Ins 56.77 36.71 50.25
L_CL_Ins 57.01 36.25 49.75
Table 9. Model Performance Comparison on MMLU, LogiQA, and Alpaca_Eval. Best scores are in bold.
While 𝐸𝑝𝑟𝑒_𝑚𝑡 and 𝐸𝑐𝑟𝑜𝑠𝑠 achieve comparable results, on average, 𝐸𝑐𝑟𝑜𝑠𝑠 outperforms 𝐸𝑝𝑟𝑒𝑚𝑡 by 2.37 COMET points
across the seven language pairs, with the largest margin observed on the CS-UK pair (3.8 points), indicating stronger
language alignment when cross-lingual tasks are explicitly modeled during pre-training. This suggests that premature
instruction fine-tuning, as in 𝐸𝑝𝑟𝑒_𝑚𝑡 , may lead to overfitting, whereas 𝐸𝑐𝑟𝑜𝑠𝑠 facilitates better generalization across
language pairs.
These findings highlight the necessity of cross-lingual mapping in pre-training for establishing a robust multilingual
foundation. Explicitly integrating cross-lingual objectives during pre-training enables the model to develop deeper
alignment, resulting in improved transferability across languages compared to post-hoc instruction fine-tuning.
7.2 Performance on English Monolingual Downstream Tasks
Furthermore, we compared the results of fine-tuning Llama-3-8B (L-3-8B-Ins) and Llama_CL (L_CL_Ins) directly using
the Alpaca dataset [36 ] on English-only downstream tasks. For each base model, we conducted LoRA fine-tuning for
two

Chunk 25 · 1,991 chars

t-hoc instruction fine-tuning.
7.2 Performance on English Monolingual Downstream Tasks
Furthermore, we compared the results of fine-tuning Llama-3-8B (L-3-8B-Ins) and Llama_CL (L_CL_Ins) directly using
the Alpaca dataset [36 ] on English-only downstream tasks. For each base model, we conducted LoRA fine-tuning for
two epochs with a rank of 16, an initial learning rate of 1𝑒−4, a warm-up rate of 0.1, and a batch size of 256. We selected
MMLU [17], LogiQA [25 ], and AlpacaEval [23 ] to evaluate the models’ capabilities in language understanding, logical
reasoning, and open-ended generation, respectively.
As shown in Table 9, the model trained with our method achieves performance on downstream English tasks that is
comparable to the original base model, and even slightly surpasses it on the MMLU benchmark. These results suggest
that our pre-training strategy enhances cross-lingual transfer while maintaining English monolingual proficiency with
negligible degradation. We attribute this to retaining the Next-Token Prediction (NTP) objective and preserving a portion
of English monolingual data during continued pre-training, which together help mitigate catastrophic forgetting.
8 Case Study
To complement quantitative evaluations, we analyzed model outputs to identify performance patterns. Our findings
align with LLM-assisted evaluations, confirming Llama_CL’s consistent superiority.
Manuscript submitted to ACM

-- 13 of 20 --

14 Zheng et al.
Task Question L-3-8B L_NTP L_Bi_NTP L_CL Remark
MT Translate the fol-
lowing English to
Chinese: "It also
arguably makes read-
ing easier, though
writing is somewhat
complicated by the
need to find out
whether a verb or
adjective is used in a
substantivized form."
It also arguably makes
reading easier, though
writing is somewhat
complicated by the
need to find out
whether a verb or
adjective is used in a
substantivized form.
It also ar-
guably makes
reading eas-
ier, though
writing is
somewhat
complicated
by the need
to find

Chunk 26 · 1,990 chars

verb or
adjective is used in a
substantivized form."
It also arguably makes
reading easier, though
writing is somewhat
complicated by the
need to find out
whether a verb or
adjective is used in a
substantivized form.
It also ar-
guably makes
reading eas-
ier, though
writing is
somewhat
complicated
by the need
to find out
whether a
verb or adjec-
tive is used
in a substan-
tivized form.
它 也 使 阅 读 更 容
易 ， 但 由 于 需 要 确
定 动 词 或 形 容 词 是
否 使 用 了 名 词 化 形
式 ， 因 此 写 作 有 些
复杂。
它 也 可 以 说
使 阅 读 更 容
易 ， 虽 然 写
作 是 有 点 复
杂 的 ， 因 为
需 要 找 到 一
个 动 词 或 形
容 词 是 否 在
名 词 形 式 中
使用。
Llama-3-8B and
Llama_NTP failed to
translate the source
sentence.
Translate the fol-
lowing English to
Chinese: "According
to the lawsuit, waste
from the UN camp
was not properly
sanitized, causing
bacteria to enter
the tributary of the
Artibonite River, one
of Haiti’s largest."
根 据 诉 讼,来 自 联 合
国 营 地 的 废 物 未 得
到 适 当 消 毒,导 致 细
菌 进 入 艺 术 博 尼 特
河,海 地 最 大 的 河 流
之一。
根 据 诉 讼,联
合 国 营 地 的
垃 圾 没 有 得
到 适 当 的 消
毒,使 细 菌 进
入 阿 尔 蒂 波
内 特 河,该
河 是 海 地 最
大 的 河 流 之
一。
根 据 诉 讼,联 合 国 营
地 的 垃 圾 没 有 得 到
适 当 的 消 毒,使 细 菌
进 入 阿 尔 蒂 波 尼 特
河,是 海 地 最 大 的 河
流之一。
根 据 诉 讼 ，
联 合 国 营 地
的 废 物 没 有
得 到 适 当 的
消 毒 ， 导 致
细 菌 进 入 阿
尔 蒂 博 尼 特
河 的 分 流 之
一 ， 即 海 地
最 大 的 分 流
之一。
Llama_CL generates a
more detailed, accu-
rate translation.
Table 10. Translation Results Comparison
In MT, Llama-3-8B and Llama_NTP occasionally produced source-language outputs—an issue largely mitigated in
Llama_Bi_NTP and Llama_CL. Llama_CL showed fewer omissions and redundancies than Llama_Bi_NTP, indicating
stronger cross-lingual comprehension. However, all models struggled with rare words and domain-specific terms in low-
resource languages, revealing linguistic coverage limitations. For cross-lingual summarization, Llama_CL sometimes
defaulted to extractive summaries rather than abstractive content, exposing gaps in information synthesis. Sporadic
Slovak usage appeared in Llama_NTP, Llama_Bi_NTP, and Llama_CL summaries, likely from monolingual corpus
noise during

Chunk 27 · 1,997 chars

linguistic coverage limitations. For cross-lingual summarization, Llama_CL sometimes
defaulted to extractive summaries rather than abstractive content, exposing gaps in information synthesis. Sporadic
Slovak usage appeared in Llama_NTP, Llama_Bi_NTP, and Llama_CL summaries, likely from monolingual corpus
noise during continued pre-training. Despite improvements in cross-lingual QA, all models struggled with complex
logical reasoning, highlighting persistent challenges in multilingual reasoning capabilities. Case study examples are in
Tables 10, 11, and 12.
9 Conclusion
We introduced a novel continued pre-training approach that enhances cross-lingual capabilities in multilingual LLMs
by integrating cross-lingual mapping tasks alongside language-specific next-token prediction. This dual-task strategy
improved language alignment while preserving monolingual fluency, addressing key limitations in existing multilingual
models.
Extensive experiments validated our approach, demonstrating improvements in language alignment, monolingual
perplexity, and cross-lingual task performance. Our proposed LAC offers a robust metric for assessing inter-language
coherence, even with limited test data. Additionally, our cross-lingual open-domain QA dataset and LLM-assisted
evaluation highlight the effectiveness of explicit cross-lingual pre-training in enhancing generative capabilities across
seen and unseen language pairs.
Manuscript submitted to ACM

-- 14 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 15
Task Question L-3-8B L_NTP L_Bi_NTP L_CL Remark
CLSum Please read the following English text and generate a short
and precise Chinese summary: "By Peter BilesBBC World
Affairs Correspondent The then-prime minister only saw it
was likely after getting "raw intelligence" two days before
the Argentines landed. Papers released under the 30-year
rule show Mrs Thatcher was acutely worried about retaking
the

Chunk 28 · 1,999 chars

sh text and generate a short
and precise Chinese summary: "By Peter BilesBBC World
Affairs Correspondent The then-prime minister only saw it
was likely after getting "raw intelligence" two days before
the Argentines landed. Papers released under the 30-year
rule show Mrs Thatcher was acutely worried about retaking
the islands. One historian said the documents were among
the "most powerful material" declassified in the last three
decades. In October 1982, a few months after the war ended,
Mrs Thatcher gave evidence behind closed doors to the Falk-
land Islands Review Committee, chaired by Lord Franks. The
transcript of that dramatic testimony has now been published
for the first time. "I never, never expected the Argentines to
invade the Falklands head-on. It was such a stupid thing to do,
as events happened, such a stupid thing even to contemplate
doing", Mrs Thatcher told the Franks Committee. There had
been some British contingency planning in the month before
the Argentine invasion of the Falklands. On 26 March 1982,
Ministry of Defence officials came back to Mrs Thatcher with
a plan to deter a full-scale invasion. One sentence shocked
her, and she wrote it in her diary: "Moreover, if faced with
Argentine occupation on arrival there would be no certainty
that such a force would be able to retake the dependency."
She told the committee: "You can imagine that turned a knife
in my heart, that lot." However, in her oral evidence, she said
she had still considered an invasion unlikely: "I again stress,
I thought that they would be so absurd and ridiculous to
invade the Falklands that I did not think it would happen.
"Nevertheless, one has always to make contingency plans,
and soon after we got the South Georgia incident happening."
The picture changed on 31 March 1982 when Mrs Thatcher
was shown intelligence suggesting that an invasion was on
the cards: "I just say it was the worst I think moment of my
life." She also told the committee: "That night no-one could
tell

Chunk 29 · 1,998 chars

contingency plans,
and soon after we got the South Georgia incident happening."
The picture changed on 31 March 1982 when Mrs Thatcher
was shown intelligence suggesting that an invasion was on
the cards: "I just say it was the worst I think moment of my
life." She also told the committee: "That night no-one could
tell me whether we could retake the Falklands - no-one. We
did not know - we did not know." The British foreign secre-
tary at the time, Lord Carrington, also gave evidence to the
Franks Committee, where he too held the view that Argentina
was not going to invade the Falklands. "I have been accused,
and was accused violently after 2 April in the House of Com-
mons, and in the press and elsewhere - of wilfully ignoring
signs and statements and evidence, and just ignoring it. I can
truthfully say I did not do any of these things. . . they (the
signs of an invasion) were not there," ... It also bases its claim
on the islands’ proximity to the South American mainland.
Some 255 British servicemen and three Falklands civilians
died during the conflict. The number of Argentine dead is
estimated at about 650 All document images courtesy of the
National Archives.
玛 格 丽
特·撒 切
尔 前 英
国 首 相
在1982年
福克兰群
岛战争之
前就有所
预料，但
当时的政
府官员却
没有意识
到 这 一
点。
英国前首
相玛格丽
特·撒 切
尔 说 ，
她1982年
对英国失
去福克兰
群岛的战
争时期所
面临的压
力 是"人
生中最糟
糕 的 一
刻"。
1982年4月2日 ， 阿
根 廷 突 然 入 侵 福
克 兰 群 岛 ， 英 国
迅 速 集 结 军 队 回
应。30年后，英国
政府公布了关于这
场战争的新档案。
马 格 丽
特·撒 切
尔夫人当
时说，她
的生命从
来没有经
历过如此
严重的危
机 。 现
在，刚公
布的政府
文 件 显
示，这个
英国首相
在1982年
阿根廷对
福克兰群
岛的入侵
之前，并
没有料到
会发生这
样的事。
L_CL lists
two facts
but does
not gen-
erate a
summary
state-
ment.
Table 11. CLSum Results Comparison
While our method significantly improves cross-lingual performance, challenges remain in summarization and
complex reasoning tasks. Future work should explore better integration of summarization objectives and enhanced
multilingual reasoning. Expanding our framework to underrepresented languages will further strengthen multilingual
LLMs, reinforcing the value of

Chunk 30 · 1,998 chars

s-lingual performance, challenges remain in summarization and
complex reasoning tasks. Future work should explore better integration of summarization objectives and enhanced
multilingual reasoning. Expanding our framework to underrepresented languages will further strengthen multilingual
LLMs, reinforcing the value of explicitly modeling cross-lingual relationships in pre-training.
Manuscript submitted to ACM

-- 15 of 20 --

16 Zheng et al.
Task Question L-3-8B L_NTP L_Bi_NTP L_CL Remark
CLQA Answer the follow-
ing questions in
Czech. "Write a lim-
erick about the joys
of childhood in the
style of Lewis Car-
roll."
Kdybys si vzpom-
něl na dětství,
považte si to
rádi!Bylo to tak
plné nadšené
vzrušení, plné
úžasu a ra-
dosti!Nakonec
to skončilo, ale
jaká úžasná doba,
když jsme byli malí
a takoví, jaké jsme
dnes.
Jako malý jsem
si užíval,Až se
mi v životě zata-
jvalo,Vždycky
jsem věděl,Co
jsem chtěl,Jako
malé dítě, bylo to
nejlepší.
Jako malý holky v
lese, \nKterý nás
potěšil,\nTento svět
byl plný štěstí,\nS
radostí a veselím,
\nNikdy jsme se
necítily starosti. Over-
postingotázkou
Koukal jsem na
svět se štěstím
plný,\nA všude
kolem se tají
radost,\nVysoké
stromy, malé
holky, \nNávštěvy
kamarádů, hrací
kosti,\nVšechno
bylo úžasné, ne?
To je pravda!
While both L_CL
and L_Bi_NTP
employ parallel
structure, a tech-
nique common
in poetry, L_CL’s
writing style is
more poetic, with a
rhythm that better
aligns with the
musical quality of
verse.
Answer the follow-
ing questions in
Czech. "Explain to
me the importance
of the following
quotation.’The
only way to do
great work is to
love what you do.’"
Tento citát zna-
mená, že pokud
chcete dosáhnout
úspěchu v jakékoli
oblasti, musíte
milovat to, co
děláte. Kreativita
a odhodlání jsou
důležitými aspekty
úspěchu, ale pokud
se nebudete zají-
mat o to, co děláte,
pravděpodobně
nebudete schopni
dosáhnout svých
cílů. Mít opravdový
zájem o své pra-
covní povinnosti
je klíčem k tomu,
abyste se dostali
k tomu, kde se
chcete dostat.
To

Chunk 31 · 1,997 chars

ékoli
oblasti, musíte
milovat to, co
děláte. Kreativita
a odhodlání jsou
důležitými aspekty
úspěchu, ale pokud
se nebudete zají-
mat o to, co děláte,
pravděpodobně
nebudete schopni
dosáhnout svých
cílů. Mít opravdový
zájem o své pra-
covní povinnosti
je klíčem k tomu,
abyste se dostali
k tomu, kde se
chcete dostat.
To love what you
do je důležitý as-
pekt úspěšné práce.
Když si člověk
vybere práci,
kterou opravdu
miluje, je mnohem
pravděpodobnější,
že se mu podaří
vykonávat ji s
velkým nasazením
a přinášet kvalitní
výsledky. Milovat
svou práci zajišťuje,
že člověk se bude
cítit spokojenějším
a bude mít větší
motivaci a nadšení
k práci. Navíc,
když se člověk
skutečně zajímá
o svou práci, je
pravděpodobnější,
že bude mít větší
znalosti a schop-
nosti a bude lépe
splňovat poža-
davky a očekávání
zaměstnavatele.
To mít velký úspěch
ve svém povolání, je
třeba milovat to, co
děláte. Když máte rádi
to, co děláte, máte mo-
tivaci pracovat tvrdě,
vyvíjet se a dosahovat
cílů. Milujete své po-
volání, budete se cítit
spokojenější a vaše
práce se bude lépe pro-
jevovat v práci a živ-
otě.
To nejlepší, co
můžete dělat, je
milovat to, co
děláte. Pokud vás
vaše práce nudí,
vaše výsledky
budou značně
omezené. Když
milujete svou práci,
získáte z ní větší
spokojenost a z
ní se vám líbí být.
To vám umožní
dělat svou práci
s větší odvahou
a odhodláním a
získáte více z jejího
výkonu.
Responses of L_CL
and L-3-8B are
relatively accurate,
with the answer of
L_CL answer being
more relevant
to the question.
The response of
L_NTP contains
English content,
while the answer
of L_Bi_NTP
has repetitive
elements.
Table 12. CLQA Results Comparison
10 Limitation
While our approach significantly improves the cross-lingual generation and comprehension capabilities of multilingual
LLMs, certain limitations remain. Notably, we observed only marginal improvements in tasks requiring cross-lingual
text summarization and complex reasoning. This indicates that while our method enhances language alignment

Chunk 32 · 1,991 chars

ignificantly improves the cross-lingual generation and comprehension capabilities of multilingual
LLMs, certain limitations remain. Notably, we observed only marginal improvements in tasks requiring cross-lingual
text summarization and complex reasoning. This indicates that while our method enhances language alignment and
generation fluency, it struggles with tasks that demand deeper information extraction, filtering, and logical reasoning
across languages. These limitations may arise from the inherent complexity of summarization and reasoning tasks,
which require more sophisticated mechanisms to retain and manipulate detailed semantic information across languages.
Additionally, although our approach effectively handles low-resource languages and demonstrates strong alignment
for bilingual tasks, the gains in highly subjective or culturally nuanced tasks, such as humor or sentiment analysis, are
Manuscript submitted to ACM

-- 16 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 17
less pronounced. Addressing these challenges will be a focus of future research, where we aim to develop methods that
strengthen cross-lingual logical reasoning and enhance the model’s capacity for processing complex, abstract concepts.
Acknowledgments
This research is supported by the National Research Foundation, Singapore under its National Large Language Models
Funding Initiative. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the
author(s) and do not reflect the views of the National Research Foundation, Singapore. The research is also supported
by the National Research Foundation, Singapore under its National Large Language Models Funding Initiative (AISG
Award No: AISG-NMLP-2024-005 and AISG-NMLP-2024-004).
References
[1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the Cross-lingual Transferability of Monolingual Representations. In

Chunk 33 · 1,998 chars

upported
by the National Research Foundation, Singapore under its National Large Language Models Funding Initiative (AISG
Award No: AISG-NMLP-2024-005 and AISG-NMLP-2024-004).
References
[1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the Cross-lingual Transferability of Monolingual Representations. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.).
Association for Computational Linguistics, Online, 4623–4637. doi:10.18653/v1/2020.acl-main.421
[2] Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke
Zettlemoyer, and Madian Khabsa. 2024. The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Proceedings
of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics,
Bangkok, Thailand and virtual meeting, 749–775. https://aclanthology.org/2024.acl-long.44
[3] Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, and Rifat Shahriyar. 2021. CrossSum: Beyond English-centric
cross-lingual summarization for 1,500+ language pairs. arXiv preprint arXiv:2112.08804 (2021).
[4] Guanlin Chen, Xiaolong Shi, Moke Chen, and Liang Zhou. 2020. Text similarity semantic calculation based on deep reinforcement learning.
International Journal of Security and Networks 15, 1 (2020), 59–66.
[5] Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. InfoXLM:
An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer,
Dilek Hakkani-Tur, Iz Beltagy,

Chunk 34 · 1,999 chars

oXLM:
An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer,
Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (Eds.). Association for Computational
Linguistics, Online, 3576–3588. doi:10.18653/v1/2021.naacl-main.280
[6] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen tau Yih, Yoon Kim, and James
Glass. 2022. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings. arXiv:2204.10298 [cs.CL] https://arxiv.org/abs/2204.10298
[7] A Conneau. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[8] Alexis CONNEAU and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems,
H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.
cc/paper_files/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang,
Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston
Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya
Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong,

Chunk 35 · 1,995 chars

hie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston
Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya
Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus
Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano,
Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang,
Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah
Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan
Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng
Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng
Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik,
Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan,
Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas,
Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes
Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li,
Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura,

Chunk 36 · 1,996 chars

Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes
Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li,
Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan
Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre,
Manuscript submitted to ACM

-- 17 of 20 --

18 Zheng et al.
Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell,
Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang,
Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara
Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor
Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin
Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh
Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe
Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay
Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu,
Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf,
Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe,

Chunk 37 · 1,997 chars

Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu,
Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf,
Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie
Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic,
Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang
Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David
Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa
Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun,
Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella
Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid
Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim
Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya,
Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon
Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay
Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan,

Chunk 38 · 1,998 chars

er Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon
Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay
Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia,
Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca
Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso,
Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov,
Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini
Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong,
Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji,
Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao,
Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan
Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh
Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar,
Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe,
Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian,

Chunk 39 · 1,989 chars

ndsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar,
Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe,
Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal
Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked,
Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru,
Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang,
Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi,
Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei
Zhao. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
[11] Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT Sentence Embedding.
arXiv:2007.01852 [cs.CL] https://arxiv.org/abs/2007.01852
[12] Changjiang Gao, Hongda Hu, Peng Hu, Jiajun Chen, Jixing Li, and Shujian Huang. 2024. Multilingual Pretraining and Instruction Tuning Improve
Cross-Lingual Knowledge Alignment, But Only Shallowly. arXiv:2404.04659 [cs.CL] https://arxiv.org/abs/2404.04659
[13] Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2022. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv:2104.08821 [cs.CL]
https://arxiv.org/abs/2104.08821
[14] Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong, and Wen-mei Hwu. 2019. Reinforcement learning based text style transfer without parallel
training corpus. arXiv preprint arXiv:1903.10671 (2019).
[15] Suchin Gururangan, Ana

Chunk 40 · 1,997 chars

Learning of Sentence Embeddings. arXiv:2104.08821 [cs.CL]
https://arxiv.org/abs/2104.08821
[14] Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong, and Wen-mei Hwu. 2019. Reinforcement learning based text style transfer without parallel
training corpus. arXiv preprint arXiv:1903.10671 (2019).
[15] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining:
Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020).
[16] Jiyeon Ham and Eun-Sol Kim. 2021. Semantic Alignment with Calibrated Similarity for Multilingual Sentence Embedding. In Findings of the
Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.).
Association for Computational Linguistics, Punta Cana, Dominican Republic, 1781–1791. doi:10.18653/v1/2021.findings-emnlp.153
[17] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask
Language Understanding. arXiv:2009.03300 [cs.CY] https://arxiv.org/abs/2009.03300
Manuscript submitted to ACM

-- 18 of 20 --

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM
Performance 19
[18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank
Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685
[19] Zixuan Ke and Bing Liu. 2022. Continual learning of natural language processing tasks: A survey. arXiv preprint arXiv:2211.12701 (2022).
[20] Minato Kondo, Takehito Utsuro, and Masaaki Nagata. 2024. Enhancing Translation Accuracy of Large Language Models through Continual
Pre-Training on Parallel Data. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Elizabeth Salesky,
Marcello Federico, and Marine

Chunk 41 · 1,997 chars

2022).
[20] Minato Kondo, Takehito Utsuro, and Masaaki Nagata. 2024. Enhancing Translation Accuracy of Large Language Models through Continual
Pre-Training on Parallel Data. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Elizabeth Salesky,
Marcello Federico, and Marine Carpuat (Eds.). Association for Computational Linguistics, Bangkok, Thailand (in-person and online), 203–220.
doi:10.18653/v1/2024.iwslt-1.26
[21] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François
Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model. (2023).
[22] Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, and Jiajun Chen. 2024. PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of
Multilingual Alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal,
and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 10246–10257. doi:10.18653/v1/2024.emnlp-main.572
[23] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaEval:
An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval.
[24] Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani, Ninghao Liu, and Mengnan Du. 2024. Quantifying Multilingual Performance of Large
Language Models Across Languages. arXiv:2404.11553 [cs.CL] https://arxiv.org/abs/2404.11553
[25] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A Challenge Dataset for Machine Reading
Comprehension with Logical Reasoning. arXiv:2007.08124 [cs.CL] https://arxiv.org/abs/2007.08124
[26] Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[27] Fuli Luo,

Chunk 42 · 1,991 chars

n Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A Challenge Dataset for Machine Reading
Comprehension with Logical Reasoning. arXiv:2007.08124 [cs.CL] https://arxiv.org/abs/2007.08124
[26] Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[27] Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2021. VECO: Variable and flexible cross-lingual pre-training
for language understanding and generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 3980–3994.
[28] Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of
learning and motivation. Vol. 24. Elsevier, 109–165.
[29] Zhongtao Miao, Qiyu Wu, Kaiyan Zhao, Zilong Wu, and Yoshimasa Tsuruoka. 2024. Enhancing Cross-lingual Sentence Embedding for Low-resource
Languages with Word Alignment. arXiv preprint arXiv:2404.02490 (2024).
[30] Benjamin Muller, Yanai Elazar, Benoît Sagot, and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of
Multilingual BERT. arXiv:2101.11109 [cs.CL] https://arxiv.org/abs/2101.11109
[31] Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023.
Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400 (2023).
[32] Karl Pearson. 1896. VII. Mathematical contributions to the theory of evolution.—III. Regression, heredity, and panmixia. Philosophical Transactions
of the Royal Society of London. Series A, containing papers of a mathematical or physical character 187 (1896), 253–318.
[33] Roger Ratcliff. 1990. Connectionist models of recognition memory: constraints imposed by

Chunk 43 · 1,996 chars

ibutions to the theory of evolution.—III. Regression, heredity, and panmixia. Philosophical Transactions
of the Royal Society of London. Series A, containing papers of a mathematical or physical character 187 (1896), 253–318.
[33] Roger Ratcliff. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review
97, 2 (1990), 285.
[34] AB Siddique, Samet Oymak, and Vagelis Hristidis. 2020. Unsupervised paraphrasing via deep reinforcement learning. In Proceedings of the 26th ACM
SIGKDD international conference on knowledge discovery & data mining. 1800–1809.
[35] Henry Tang, Ameet Deshpande, and Karthik Narasimhan. 2022. ALIGN-MLM: Word Embedding Alignment is Crucial for Multilingual Pre-training.
arXiv:2211.08547 [cs.CL] https://arxiv.org/abs/2211.08547
[36] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford
Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[37] Teknium. 2023. OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. https://huggingface.co/datasets/teknium/
OpenHermes-2.5
[38] Liang Wang, Wei Zhao, and Jingming Liu. 2021. Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast.
arXiv:2109.00253 [cs.CL] https://arxiv.org/abs/2109.00253
[39] Shijie Wu and Mark Dredze. 2020. Do Explicit Alignments Robustly Improve Multilingual Encoders?. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational
Linguistics, Online, 4471–4482. doi:10.18653/v1/2020.emnlp-main.362
[40] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive Learning for Sentence Representation.
arXiv:2012.15466 [cs.CL] https://arxiv.org/abs/2012.15466
[41] Haoran Xu, Young Jin Kim,

Chunk 44 · 1,712 chars

ion for Computational
Linguistics, Online, 4471–4482. doi:10.18653/v1/2020.emnlp-main.362
[40] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive Learning for Sentence Representation.
arXiv:2012.15466 [cs.CL] https://arxiv.org/abs/2012.15466
[41] Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2024. A Paradigm Shift in Machine Translation: Boosting Translation
Performance of Large Language Models. arXiv:2309.11674 [cs.CL] https://arxiv.org/abs/2309.11674
[42] L Xue. 2020. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020).
[43] Go Yasui, Yoshimasa Tsuruoka, and Masaaki Nagata. 2019. Using Semantic Similarity as Reward for Reinforcement Learning in Sentence Generation.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Fernando Alva-Manchego,
Eunsol Choi, and Daniel Khashabi (Eds.). Association for Computational Linguistics, Florence, Italy, 400–406. doi:10.18653/v1/P19-2056
Manuscript submitted to ACM

-- 19 of 20 --

20 Zheng et al.
[44] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan
Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An Open Bilingual Pre-trained Model.
arXiv:2210.02414 [cs.CL] https://arxiv.org/abs/2210.02414
[45] Kun Zhou, Beichen Zhang, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Debiased Contrastive Learning of Unsupervised Sentence Representations.
arXiv:2205.00656 [cs.CL] https://arxiv.org/abs/2205.00656
Manuscript submitted to ACM

-- 20 of 20 --