Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
Summary
This paper introduces a novel pre-training approach to enhance cross-lingual capabilities in multilingual large language models (LLMs). Existing methods struggle with data imbalances and monolingual bias, leading to poor performance in low-resource languages. The authors propose a Cross-Lingual Mapping (CL) task during pre-training, which bi-directionally aligns languages in the model's embedding space without compromising monolingual fluency. They also introduce the Language Alignment Coefficient (LAC), a metric to quantify cross-lingual consistency, especially in low-data scenarios. Experiments on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show significant improvements: up to 11.9 BLEU score gains in MT, a 6.72 increase in CLQA BERTScore-Precision, and over 5% in CLNLU accuracy over strong baselines. The method is evaluated on diverse language pairs, including high- and low-resource scenarios, using datasets like WMT2024 and CrossSum. Results demonstrate robust performance gains, particularly for low-resource languages, and the approach generalizes well to unseen language pairs. The study highlights the effectiveness of integrating explicit cross-lingual objectives into pre-training, offering a promising direction for improving multilingual LLMs.
PDF viewer
Chunks(45)
Chunk 0 · 1,981 chars
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance WEIHUA ZHENG, Singapore University of Technology and Design, Singapore CHANG LIU, ByteDance, Singapore ZHENGYUAN LIU, Agency for Science, Technology and Research, Singapore XIN HUANG, Agency for Science, Technology and Research, Singapore KUI WU, Agency for Science, Technology and Research, Singapore MUHAMMAD HUZAIFAH MD SHAHRIN, Agency for Science, Technology and Research, Singapore AITI AW, Agency for Science, Technology and Research, Singapore ROY KA-WEI LEE, Singapore University of Technology and Design, Singapore Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages and the monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, improve cross-lingual performance but often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task in the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM’s embedding space, improving both language generation and comprehension. We further introduce a Language Alignment Coefficient to robustly quantify cross- lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves up to 11.9 BLEU score gains in MT, an increase of 6.72 in CLQA BERTScore-Precision and more than a 5% increase in CLNLU accuracy over strong multilingual baselines. Our findings highlight the potential of embedding cross-lingual objectives into pre-training, improving multilingual LLMs. CCS Concepts: • Computing methodologies → Natural language
Chunk 1 · 1,993 chars
in MT, an increase of 6.72 in CLQA BERTScore-Precision and more than a 5% increase in CLNLU accuracy over strong multilingual baselines. Our findings highlight the potential of embedding cross-lingual objectives into pre-training, improving multilingual LLMs. CCS Concepts: • Computing methodologies → Natural language generation. Additional Key Words and Phrases: Cross-Lingual, Large Language Models, Low-resource Language ACM Reference Format: WEIHUA ZHENG, CHANG LIU, ZHENGYUAN LIU, XIN HUANG, KUI WU, MUHAMMAD HUZAIFAH MD SHAHRIN, AITI AW, and ROY KA-WEI LEE. 2025. Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance. Proc. ACM Meas. Anal. Comput. Syst. 37, 4, Article 111 (August 2025), 20 pages. https://doi.org/XXXXXXX.XXXXXXX Authors’ Contact Information: WEIHUA ZHENG, weihua_zheng@mymail.sutd.edu.sg, Singapore University of Technology and Design, Singa- pore; CHANG LIU, ByteDance, Singapore, Liuc0058@e.ntu.edu.sg; ZHENGYUAN LIU, Agency for Science, Technology and Research, Singapore, liu_zhengyuan@i2r.a-star.edu.sg; XIN HUANG, Agency for Science, Technology and Research, Singapore; KUI WU, Agency for Science, Technology and Research, Singapore, wuk@i2r.a-star.edu.sg; MUHAMMAD HUZAIFAH MD SHAHRIN, Agency for Science, Technology and Research, Singapore, huzaifah_md_shahrin@i2r.a-star.edu.sg; AITI AW, Agency for Science, Technology and Research, Singapore, aaiti@i2r.a-star.edu.sg; ROY KA-WEI LEE, Singapore University of Technology and Design, Singapore, s.roylee@gmail.com. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
Chunk 2 · 1,997 chars
ithout fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. Manuscript submitted to ACM Manuscript submitted to ACM 1 arXiv:2604.10590v1 [cs.CL] 12 Apr 2026 -- 1 of 20 -- 2 Zheng et al. 1 Introduction Motivation. Recent advancements in Large Language Models (LLMs) have significantly improved Natural Language Processing (NLP) capabilities, achieving state-of-the-art results in tasks such as cross-lingual question answering (CLQA), text summarization, and machine translation (MT) [ 9]. Multilingual LLMs, including mBERT [ 9], mT5 [42 ], and Llama3 [10 ], have extended these advancements to multilingual settings. Nevertheless, these models remain suboptimal in cross-lingual tasks requiring text generation and comprehension [ 44 ], particularly in MT and cross- lingual summarization (CLSum), where maintaining semantic fidelity and linguistic coherence remains challenging. A key factor behind this limitation is the imbalance in training data, with high-resource languages dominating pre- training corpora [12 ]. Consequently, low-resource languages are underrepresented, leading to weaker cross-lingual generalization. Moreover, standard pre-training predominantly relies on monolingual next-token prediction, reinforcing fluency in individual languages but limiting cross-lingual transfer. Several approaches have attempted to address these limitations. Continued pre-training on low-resource languages improves cross-lingual proficiency [ 41 ], while word-level substitution and alignment
Chunk 3 · 1,997 chars
on monolingual next-token prediction, reinforcing fluency in individual languages but limiting cross-lingual transfer. Several approaches have attempted to address these limitations. Continued pre-training on low-resource languages improves cross-lingual proficiency [ 41 ], while word-level substitution and alignment techniques enhance language transfer [8 , 35, 39 ]. However, these methods struggle with polysemy, grammatical inconsistencies, and code-switching. Furthermore, techniques like bilingual sentence masking, though promising, are often incompatible with decoder-only LLMs that generate text autoregressively. These challenges highlight the need for a more effective pre-training paradigm that explicitly integrates cross-lingual alignment while preserving monolingual fluency. Research Objectives. This study aims to address these research gaps to improve the cross-lingual capabilities of multilingual LLMs. Specifically, we seek to (i) develop an effective pre-training strategy that enhances cross-lingual alignment while preserving monolingual fluency, (ii) introduce a robust evaluation metric to quantify language alignment, and (iii) demonstrate the impact of our approach across diverse multilingual NLP tasks. To achieve these objectives, we propose a novel Cross-Lingual Mapping (CL) Task that explicitly models linguistic correspondences during the pre-training phase. Unlike existing approaches that rely on bilingual fine-tuning or contrastive learning, our method integrates bi-directional CL within the model’s learning process, facilitating improved alignment across languages. Additionally, we introduce the Language Alignment Coefficient (LAC), a new metric designed to evaluate the consistency of language representations in multilingual LLMs, particularly in low-resource scenarios. Contributions. This work makes the following key contributions: (i) We propose a CL Task that enhances language alignment during pre-training, allowing multilingual LLMs to learn
Chunk 4 · 1,994 chars
a new metric designed to evaluate the consistency of language representations in multilingual LLMs, particularly in low-resource scenarios. Contributions. This work makes the following key contributions: (i) We propose a CL Task that enhances language alignment during pre-training, allowing multilingual LLMs to learn direct cross-lingual correspondences. (ii) We introduce the LAC, a novel metric that quantifies cross-lingual consistency and provides a robust evaluation of multilingual representations. (iii) We construct an extensive evaluation framework encompassing MT, CLSum, CLQA and CLNLU, enabling a comprehensive assessment of multilingual performance. (iv) Our experiments demonstrate that the proposed approach significantly improves multilingual LLM performance, achieving up to 11.8 BLEU scores of improvement in MT, a 6.72 points increase in BERTScore-Precision in CLQA and more than a 5% increase in CLNLU accuracy over strong baseline„ namely the Llama-3-8B model obtained by applying supervised fine-tuning (SFT) with the same instruction-tuning dataset. These findings highlight the effectiveness of incorporating explicit cross-lingual objectives into pre-training, offering a promising direction for enhancing multilingual LLMs without sacrificing monolingual fluency. Manuscript submitted to ACM -- 2 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 3 2 Related Work Improving language alignment in multilingual models is critical for enhancing cross-lingual performance, particularly for low-resource languages. Prior research has primarily focused on three strategies: parameter-sharing techniques, contrastive learning, and bilingual data integration during pre-training. While effective to some extent, these methods exhibit limitations that hinder robust cross-lingual generalization. Parameter-sharing facilitates knowledge transfer by sharing model parameters and vocabularies across
Chunk 5 · 1,991 chars
parameter-sharing techniques, contrastive learning, and bilingual data integration during pre-training. While effective to some extent, these methods exhibit limitations that hinder robust cross-lingual generalization. Parameter-sharing facilitates knowledge transfer by sharing model parameters and vocabularies across languages [ 7, 21, 30]. This paradigm underlies many strong multilingual encoders such as InfoXLM [ 5], VECO [ 27 ], LaBSE [ 11], and RemBERT [ 6 ], which jointly train a single model over shared subword vocabularies while injecting additional cross-lingual signals. However, it struggles with languages that have vastly different structures or limited training data. Another approach, masked token prediction on bilingual parallel sentences [ 8], improves cross-lingual representations but is constrained by the availability of parallel corpora. Contrastive learning has also been widely explored for aligning multilingual representations. Wang et al . [38] proposed aligning sentence embeddings via independent encoders, but these methods often accumulate errors, leading to instability. InfoXLM and LaBSE, for instance, apply contrastive or translation-ranking objectives on parallel sentences to obtain language-agnostic representations [5, 11 ]. Advances in unsupervised contrastive learning [ 6 , 13, 40 , 45 ] have improved alignment using data augmentation, yet they often require additional fine-tuning, which limits their generalization. Our approach circumvents these challenges by embedding cross-lingual alignment directly into pre- training, improving stability without post-training adjustments. Bilingual data integration has shown promise in optimizing multilingual models. Ham and Kim [16] employed a teacher-student framework for monolingual semantic alignment, while reinforcement learning methods [ 4, 14 , 34 , 43 ] used semantic similarity as a reward signal to improve language alignment. More recently, PreAlign [ 22] demonstrates that explicitly
Chunk 6 · 1,997 chars
ise in optimizing multilingual models. Ham and Kim [16] employed a teacher-student framework for monolingual semantic alignment, while reinforcement learning methods [ 4, 14 , 34 , 43 ] used semantic similarity as a reward signal to improve language alignment. More recently, PreAlign [ 22] demonstrates that explicitly establishing multilingual alignment before large-scale language model pre-training can further boost cross- lingual transfer. However, these methods are computationally demanding and require fine-tuning stability. Similarly, bilingual pre-training techniques [ 20, 29 ] and instruction fine-tuning [ 12 ] have demonstrated moderate success but remain insufficient for deep cross-lingual generalization. Our work builds on these insights by introducing a novel CL task that enhances multilingual alignment at the pre-training stage. Unlike prior approaches that depend on post-hoc alignment techniques or large-scale bilingual corpora, our method embeds cross-lingual objectives directly into training, strengthening both multilingual generation and comprehension, providing a scalable and effective solution for improving multilingual LLMs. 3 Language Alignment Coefficient (LAC) Measuring language alignment in multilingual models is essential for evaluating their cross-lingual transferability. Existing methods primarily rely on the average cosine similarity of sentence embeddings across intermediate layers of LLMs [ 24 ]. Additionally, projection variance has been used to quantify alignment stability. However, these metrics often exhibit inconsistencies, as high cosine similarity does not necessarily imply robust alignment, particularly when test samples are small or susceptible to outliers. The variance in similarity scores across layers can lead to unstable assessments, limiting their reliability in evaluating multilingual representation alignment. To overcome these limitations, we propose the Language Alignment Coefficient (LAC), a novel metric that accounts
Chunk 7 · 1,996 chars
amples are small or susceptible to outliers. The variance in similarity scores across layers can lead to unstable
assessments, limiting their reliability in evaluating multilingual representation alignment.
To overcome these limitations, we propose the Language Alignment Coefficient (LAC), a novel metric that accounts for
both alignment strength and stability. LAC is defined as the ratio of the average cosine similarity of sentence embeddings
to the standard deviation of these similarity values across selected layers 𝐿sub. This formulation is statistically equivalent
Manuscript submitted to ACM
-- 3 of 20 --
4 Zheng et al.
to the inverse of the Coefficient of Variation (CV) [ 32 ], ensuring a more stable and reliable measurement of alignment by
incorporating both similarity magnitude and dispersion.
Let 𝑋 = {𝑥𝑘 }𝑛
𝑘=1 and 𝑌 = {𝑦𝑘 }𝑚
𝑘=1 be sentence pairs in different languages, where 𝐸𝑖 (𝑋 ) and 𝐸𝑖 (𝑌 ) denote their
embeddings extracted from layer 𝑖 of the model by taking the hidden states of the last tokens 𝑥𝑛 and 𝑦𝑚 , respectively.
The LAC metric is defined as:
LAC = 1
|𝐿sub |
∑︁
𝑖 ∈𝐿sub
cos(𝐸𝑖 (𝑋 ), 𝐸𝑖 (𝑌 ))
std({cos(𝐸𝑖 (𝑋 ), 𝐸𝑖 (𝑌 ))}𝑖 ∈𝐿sub ) , (1)
where 𝐿sub = {5, 10, 15, 20, 25} represents the subset of model layers used for evaluation, following prior work on
multilingual alignment assessment [24].
A higher LAC value indicates greater alignment stability, signifying that similarity scores remain consistent across
different model layers. Unlike prior metrics that consider only mean similarity, LAC normalizes similarity scores by
their dispersion, mitigating the effect of outliers and ensuring robust evaluation even in low-data multilingual scenarios.
By integrating alignment strength and variability, LAC provides a more consistent metric for assessing multilingual
LLMs, particularly in cases where test data is sparse or noisy.
4 Language Alignment in Multilingual Large Language Models
To address the challenges ofChunk 8 · 1,999 chars
evaluation even in low-data multilingual scenarios. By integrating alignment strength and variability, LAC provides a more consistent metric for assessing multilingual LLMs, particularly in cases where test data is sparse or noisy. 4 Language Alignment in Multilingual Large Language Models To address the challenges of language alignment in multilingual LLMs, we introduce a continued pre-training approach that simultaneously optimizes two objectives: the Next-Token Prediction (NTP) task and the Cross-Lingual Mapping (CL) task. These tasks are designed to enhance both the model’s monolingual generation capabilities and its ability to align representations across languages. The overall pre-training framework is illustrated in Figure 1. In the NTP task, the model predicts the next token 𝑥𝑖+1 based solely on the preceding tokens in the same language. In contrast„ in the CL task, the model generates a token 𝑧𝑖+1 in a target language by using the complete sequence of the source language 𝑌 and previously generated target tokens. This dual-task approach ensures that multilingual alignment is achieved without sacrificing monolingual fluency. 4.1 Next-Token Prediction Task A major challenge in training multilingual LLMs is the imbalance in pre-training data, which results in performance disparities between high-resource and low-resource languages. Prior studies have shown that improving low-resource language modeling enhances overall multilingual alignment [30 , 41 ]. However, continued pre-training on low-resource languages alone often leads to catastrophic forgetting of previously learned high-resource languages [ 15 , 19, 26 , 28 , 33]. To mitigate this issue, we adopt a decoder-only transformer model, parameterized by 𝜃 , for the NTP task. Given a monolingual sentence 𝑋 = (𝑥0, 𝑥1, . . . , 𝑥𝑇 ), the objective is to predict the next token 𝑥𝑡 in the sequence, conditioning on the preceding tokens. This is formulated as a negative log-likelihood loss: LNTP (𝜃 ) =
Chunk 9 · 1,996 chars
ssue, we adopt a decoder-only transformer model, parameterized by 𝜃 , for the NTP task. Given a monolingual sentence 𝑋 = (𝑥0, 𝑥1, . . . , 𝑥𝑇 ), the objective is to predict the next token 𝑥𝑡 in the sequence, conditioning on the preceding tokens. This is formulated as a negative log-likelihood loss: LNTP (𝜃 ) = − 𝑇 ∑︁ 𝑡 =0 log 𝑃 (𝑥𝑡 |𝑥0, . . . , 𝑥𝑡 −1; 𝜃 ). (2) This formulation ensures that the model maintains high-resource language proficiency while progressively improving its low-resource language capabilities. By training on a mixture of high- and low-resource languages, the model achieves better overall multilingual robustness without overfitting to dominant languages. Manuscript submitted to ACM -- 4 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 5 Multilingual LLM with better language Alignment <s> x0 mi+1 mT ... Cross-Lingual Mapping xi ... <s> y0 yi+1 yT ... yi ... z0 mi+1 mT ... zi ... <s> x0 xi+1 xT ... xi ... <s> y0 yi+1 yT ... yi ... z0 zi+1 zT ... zi ... Next Token Prediction Token Masking Fig. 1. Continue pre-training including NTP and CL. "<s>" is "start token", "𝑚𝑖 " is the mask of the 𝑖𝑡ℎ token. "x", "y", "z" are tokens from different languages. 4.2 Cross-Lingual Mapping Task Monolingual next-token prediction lacks explicit cross-lingual alignment signals. Without bilingual contexts, models trained on monolingual corpora reinforce same-language generation, limiting cross-lingual transfer in translation and multilingual understanding tasks. To address this limitation, we introduce the CL task, which explicitly trains the model to generate sequences in a target language while conditioning on a source language sequence. Given a bilingual parallel sentence pair (𝑋, 𝑌 ), where 𝑋 is the source sentence and 𝑌 is the target sentence, the model learns to generate 𝑌 based on the information in 𝑋 . Prior work explored translation during
Chunk 10 · 1,998 chars
e model to generate sequences in a target language while conditioning on a source language sequence. Given a bilingual parallel sentence pair (𝑋, 𝑌 ), where 𝑋 is the source sentence and 𝑌 is the target sentence, the model learns to generate 𝑌 based on the information in 𝑋 . Prior work explored translation during pretraining under Masked Language Modeling (MLM)[ 8 ], and extensively in encoder-decoder architectures[26]. However, MLM’s bidirectional context differs fundamentally from Causal Language Modeling’s left-to-right constraint, making cross-lingual alignment more challenging in decoder-only models. Research on effective cross-lingual alignment within CLM-based decoder-only architectures remains limited. Additionally, LLMs differ from traditional models in their instruction-driven paradigm. Rather than fixed input formats, LLMs rely on prompts to determine task behavior. Our approach differs from traditional translation-based pretraining in two ways: (i) We enhance cross-lingual embedding similarity directly within decoder-only models, without explicit encoding-decoding separation; (ii) We use no explicit task instructions, enabling generalizable, instruction- agnostic cross-lingual representations. Unlike contrastive methods requiring explicit alignment, our CL task enables implicit transfer through a sequence-to-sequence objective: LCL(𝜃 ) = −∑︁ 𝑡 = 0𝑇 log 𝑃 (𝑦𝑡 |𝑦0, . . . , 𝑦𝑡 −1; 𝑋 ; 𝜃 ). (3) This ensures joint learning of semantic correspondences and syntactic transformations without predefined alignment constraints. Manuscript submitted to ACM -- 5 of 20 -- 6 Zheng et al. Fig. 2. Language alignment of different pre-trained models. LAC is Language Alignment Coefficient. 4.3 Joint Optimization of Multilingual Alignment We jointly optimize the NTP and CL objectives during pre-training. The final loss function for continued pre-training is given by: LPT (𝜃 ) = LNTP (𝜃 ) + LCL (𝜃 ). (4) This dual-task training strategy enables the model
Chunk 11 · 1,996 chars
d models. LAC is Language Alignment Coefficient. 4.3 Joint Optimization of Multilingual Alignment We jointly optimize the NTP and CL objectives during pre-training. The final loss function for continued pre-training is given by: LPT (𝜃 ) = LNTP (𝜃 ) + LCL (𝜃 ). (4) This dual-task training strategy enables the model to retain strong monolingual generation capabilities while simultaneously improving cross-lingual mapping. By jointly optimizing both objectives, the model learns to maintain high-resource language fluency while developing robust alignment mechanisms for low-resource languages. This approach results in a more effective multilingual LLM capable of both generating text fluently in a given language and transferring knowledge efficiently across linguistic boundaries. 5 Dataset Creation To ensure a comprehensive evaluation, we select the following language pairs: English-Chinese (EN-ZH), English-Czech (EN-CS), Chinese-Japanese (ZH-JP), and Czech-Ukrainian (CS-UK). These pairs represent both high- and low-resource scenarios, with EN-ZH (high-resource) and EN-CS (mid-resource) evaluating English-centric performance, while ZH-JP (high-resource) and CS-UK (low-resource) assess non-English alignment. Our fine-tuning dataset is derived from Cleaned Alpaca [ 36 ], augmented with machine translation (MT), cross-lingual question answering (CLQA), and cross-lingual summarization (CLSum) tasks to enhance the model’s multilingual capabilities. For CLSum, we utilize CrossSum [3 ] for EN-ZH and ZH-JP, while CLQA data is sourced from OpenHermes- 2.5 [ 37 ], supplemented with 2000 manually verified QA pairs (1000 EN-CS, 1000 CS-UK) generated using Claude 3.5. Evaluation is conducted using the XQuAD dataset [1 ] for CLQA and Belebele [ 2] for CLNLU, focusing on our selected language pairs. Monolingual data for continued pre-training is sourced from CulturaX [ 31], covering English, Chinese, Czech, Ukrainian, and Japanese. English data is included to prevent catastrophic
Chunk 12 · 1,994 chars
.5. Evaluation is conducted using the XQuAD dataset [1 ] for CLQA and Belebele [ 2] for CLNLU, focusing on our selected language pairs. Monolingual data for continued pre-training is sourced from CulturaX [ 31], covering English, Chinese, Czech, Ukrainian, and Japanese. English data is included to prevent catastrophic forgetting. Parallel data for MT and cross- lingual mapping tasks come from WMT2024, with 50,000 sentence pairs per language pair for MT and 3.7 million pairs for cross-lingual mapping. During continued pre-training, we optimize two separate objectives: next-token prediction Manuscript submitted to ACM -- 6 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 7 Task Name Sub-Task Language / Language Pairs Data Volume Data Source Monolingual Continue Pretraining NA English 37,182,019 tokens ClutureX Chinese 93,124,828 tokens Japanese 85,415,308 tokens Ukrainian 187,336,165 tokens Czech 185,235,692 tokens Cross-Lingual Mapping NA EN-ZH, EN-CS, CS-UK, ZH-JP 3.7M sentence pairs per language pair WMT2024 Instruction Finetuning for general task MT EN-ZH, EN-CS, CS-UK, ZH-JP 50K per language pair WMT2024 CLQA EN-CS & CS-UK 1000 per language pair Tranlate from OpenHermes-2.5 CL-Sum EN-ZH 3,893 CrossSum ZH-JP 809 Evaluation MT EN-ZH, EN-CS, CS-UK, ZH-JP FLORES-200: 1012 per language pair FLORES-200, WMT 2022 WMT cs-uk: 1930 WMT en-cs: 2037 WMT en-zh: 2037 Xquad EN-ZH 1190 Xquad Open-End CLQA EN-CS & CS-UK 500 per language pair Tranlate from OpenHermes-2.5 CL-Sum EN-ZH 101 CrossSum ZH-JP 486 CLNLU EN-ZH, EN-CS, CS-UK, ZH-JP 900 per language pair Belebele Table 1. Task Details and Data Statistics (NTP) and cross-lingual mapping. The NTP objective is trained on 4,128,308 instances (≈4.1M), and the per-batch mixture allocates 52.7% and 47.3% of training examples to NTP and cross-lingual mapping, respectively. Since the two losses are of comparable magnitude, we do not apply additional
Chunk 13 · 1,993 chars
Details and Data Statistics (NTP) and cross-lingual mapping. The NTP objective is trained on 4,128,308 instances (≈4.1M), and the per-batch mixture allocates 52.7% and 47.3% of training examples to NTP and cross-lingual mapping, respectively. Since the two losses are of comparable magnitude, we do not apply additional task-specific weighting. No dataset overlap occurs between training and evaluation phases. Further details on dataset sources, language pairs, dataset sizes, and task-specific prompts are provided in Tables 1 and 2. 6 Experiments 6.1 Experiment Settings We use Llama-3-8B [ 10 ] as the base model, given its extensive multilingual capabilities across 31 languages. Continued pre-training is performed on Llama-3-8B, followed by fine-tuning on both cross-lingual and monolingual tasks. To evaluate language alignment and generation quality, we compute sentence embedding similarities and perplexity scores. For instruction fine-tuning, we train models on general and cross-lingual tasks to assess downstream performance. Given that MT is the most representative cross-lingual task, ablation studies focus on it to isolate key factors influencing cross-lingual effectiveness. Manuscript submitted to ACM -- 7 of 20 -- 8 Zheng et al. Task Prompts MT Translate the following <Source Language> to <Target Language>: <Source Sentence> CLSum Please read the following <Source Language> text and generate a short and precise <Target Language> summary: <Source Language Para- graph> Xquad <Source Language Paragraph> Based on the above paragraph, answer the following questions in <Target Language> or numbers. Cross-Lingual Open-ended Question and Answering Answer the following questions in <Target Language>. <Question in Source Language> CLNLU Answer the following single-choice questions and output only the option. <Source Question> <Target Options>. Table 2. Prompts for Different Tasks 6.2 Training details We conducted training with a warm-up ratio of 0.01 and a sequence
Chunk 14 · 1,997 chars
swer the following questions in <Target Language>. <Question in Source Language> CLNLU Answer the following single-choice questions and output only the option. <Source Question> <Target Options>. Table 2. Prompts for Different Tasks 6.2 Training details We conducted training with a warm-up ratio of 0.01 and a sequence length of 2048 tokens, limiting the process to 1 epoch for pre-training and 3 epochs for general fine-tuning. The training employed bf16 precision with Low-Rank Adaptation (LoRA) [ 18 ], configured with a LoRA rank of 16 and target modules applied to all layers. We utilized 8 H100 GPUs, with each GPU processing 16 batches and an 8-step gradient accumulation, yielding an effective batch size of 1024. The initial learning rate was set to 1e-4 with the AdamW optimizer. Under this configuration, continued pre-training required approximately 70 hours, while fine-tuning was completed in 53 minutes. 6.3 Language Alignment and Perplexity Evaluation of Base Models We assess language alignment and monolingual perplexity on Llama-3-8B and its variants following different pre-training strategies: (1) Llama_NTP: Trained with monolingual next-token prediction. (2) Llama_Bi_NTP: The model is trained on a hybrid corpus comprising both monolingual data and bilingual sen- tence pairs [ 20 ]. While maintaining the standard next-token prediction objective, the training data encompasses both monolingual sequences and concatenated bilingual sentence pairs (e.g., "How are you? 你好吗?"). This training configuration serves to augment the model’s cross-lingual transition probability. (3) Llama_CLI: The model is jointly trained on monolingual next-token prediction and translation tasks during the pre-training stage, where translation instances are formulated with an explicit instruction prompt, e.g., “Translate the following language 1 to language 2.” (4) Llama_CL: Our proposed model, incorporating monolingual next-token prediction and cross-lingual mapping. The CL objective is
Chunk 15 · 1,992 chars
and translation tasks during the pre-training stage, where translation instances are formulated with an explicit instruction prompt, e.g., “Translate the following language 1 to language 2.” (4) Llama_CL: Our proposed model, incorporating monolingual next-token prediction and cross-lingual mapping. The CL objective is designed to align cross-lingual semantic spaces by predicting target-language tokens from source- language sentence embeddings. While Bi_NTP remains fundamentally a Next-Token Prediction (NTP) task, it extends standard monolingual NTP by incorporating code-switched sequences constructed from concatenated bilingual parallel sentences. Under this configuration, the model primarily learns the mechanisms of linguistic transition. Although CL conceptually overlaps with machine translation, its scope is broader; by aligning latent semantic spaces, it provides a robust foundation for a wide range of cross-lingual downstream applications beyond simple translation. We evaluate bilingual sentence alignment using Flores-200 and WMT2022, while monolingual perplexity is tested on 1,000 samples per language from CulturaX [31]. Manuscript submitted to ACM -- 8 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 9 Lang L-3-8B L_NTP L_Bi_NTP L_CL-only L_CL EN 30.1 27.4 26.9 30.7 25.5 ZH 22.4 22.1 21.3 23.3 20.6 CS 21.1 16.8 16.1 22.1 15.2 UK 19.1 17.2 16.9 19.5 16.0 JP 25.1 22.9 22.2 25.5 20.8 Table 3. Perplexity Scores of Languages for Different Models. Best scores are in bold. Results in Table 3 and Figure 2 show that while Llama_NTP and Llama_Bi_NTP significantly reduce monolingual perplexity, particularly in low-resource languages like CS (-4.33 points), their improvements in cross-lingual alignment remain inconsistent. Llama_Bi_NTP achieves gains primarily for CS-UK but shows limited impact on EN-CS due to the lack of an explicit CL objective. In contrast, Llama_CL achieves a more
Chunk 16 · 1,997 chars
monolingual perplexity, particularly in low-resource languages like CS (-4.33 points), their improvements in cross-lingual alignment remain inconsistent. Llama_Bi_NTP achieves gains primarily for CS-UK but shows limited impact on EN-CS due to the lack of an explicit CL objective. In contrast, Llama_CL achieves a more balanced improvement across both monolingual perplexity and cross-lingual alignment. The integration of CL objective explicitly reinforces bilingual relationships, leading to notable performance advantages: a 0.57-point improvement over Llama_Bi_NTP for EN-CS and consistent gains for EN-ZH (0.2), CS-UK (0.15), and ZH-JP (0.15). Further, Llama_CL demonstrates strong generalization to unseen language pairs, outperforming other models on EN-UK alignment by 0.77, 0.87, and 0.62 points against Llama-3-8B, Llama_NTP, and Llama_Bi_NTP, respectively, while also achieving stable alignment for EN-JP. This suggests that our method is able to generalized beyond trained bilingual pairs, reinforcing multilingual robustness. Overall, these findings highlight the necessity of explicit cross-lingual mapping in pre-training. While NTP enhances monolingual fluency, our approach successfully bridges linguistic gaps, making it particularly effective for low-resource language processing and multilingual applications. 6.4 Cross-Lingual Task Evaluation of Instruction-Tuned Models To ensure a fair comparison and isolate the effectiveness of our pre-training strategy, we applied an identical instruction fine-tuning (IFT) process to all model variants described in Section 6.3. The set of models includes the Llama-3-8B base model, baseline variants (Llama_NTP and Llama_Bi_NTP), a joint-objective baseline (Llama_CLI, trained with next-token prediction and translation objectives), and our proposed model, Llama_CL. Consequently, all downstream performance gains reported in this section represent comparisons among fine-tuned checkpoints. We evaluated their performance on MT, CLSum,
Chunk 17 · 1,998 chars
lama_Bi_NTP), a joint-objective baseline (Llama_CLI, trained with next-token prediction and translation objectives), and our proposed model, Llama_CL. Consequently, all downstream performance gains reported in this section represent comparisons among fine-tuned checkpoints. We evaluated their performance on MT, CLSum, CLQA and CLNLU. For CLSum, we used the CrossSum dataset [ 3 ] for EN-ZH and ZH-JP summarization. CLQA evaluation consisted of two components: reference-based QA using XQuAD [1] (EN-ZH) and open QA, constructed from OpenHermes-2.5 [ 37]. And Belebele [ 2] is used as the CLNLU evaluation dataset. Reference evaluation results are presented in Table 4. To assess generalizability, we replicated these experiments on BLOOM-3B [21 ], observing similar trends to Llama-3- 8B. Since many of the languages in our test set are not supported by BLOOM, we have chosen to report scores only for the EN-ZH language pair on the BLOOM-3B models. This ensures the validity and fairness of our evaluation. Similar to the results obtained on the Llama-3-8B model, as shown in Table 5, our approach also demonstrates significant improvements in translation tasks on the BLOOM-3B model. In particular, on the Flores-EN-ZH dataset, our method achieves an impressive gain of 14.43, highlighting its effectiveness. Manuscript submitted to ACM -- 9 of 20 -- 10 Zheng et al. Tasks Data Sets Metric Llama-3-8B Llama_NTP Llama_Bi_NTP Llama_CLI Llama_CL MT WMT-EN-ZH BLEU 36.2±0.5 37.8±0.5 (+1.6) 44.0±0.5 (+7.8) 49.4±0.5 (+13.1) 48.2±0.6 (+12.0) WMT-EN-CS 19.4±0.3 25.1±0.4 (+5.7) 26.3±0.4 (+6.9) 31.4±0.6 (+12.0) 31.2±0.5 (+11.8) WMT-CS-UK 18.2±0.3 21.0±0.4 (+2.8) 23.4±0.4 (+5.2) 25.1±0.4 (+6.9) 25.8±0.5 (+7.6) Flores-EN-ZH 28.6±0.4 29.9±0.4 (+1.3) 33.6±0.5 (+5.0) 37.6±0.4 (+9.0) 36.2±0.5 (+7.6) Flores-EN-CS 17.2±0.3 21.0±0.4 (+3.8) 21.6±0.4 (+4.4) 25.7±0.5 (+8.5) 23.2±0.4 (+6.0) Flores-CS-UK 14.3±0.3 18.0±0.3 (+3.7) 18.5±0.4 (+4.2) 20.0±0.3 (+5.7) 19.4±0.4 (+5.1) Flores-ZH-JP 15.7±0.3 17.4±0.3
Chunk 18 · 1,997 chars
5.1±0.4 (+6.9) 25.8±0.5 (+7.6) Flores-EN-ZH 28.6±0.4 29.9±0.4 (+1.3) 33.6±0.5 (+5.0) 37.6±0.4 (+9.0) 36.2±0.5 (+7.6) Flores-EN-CS 17.2±0.3 21.0±0.4 (+3.8) 21.6±0.4 (+4.4) 25.7±0.5 (+8.5) 23.2±0.4 (+6.0) Flores-CS-UK 14.3±0.3 18.0±0.3 (+3.7) 18.5±0.4 (+4.2) 20.0±0.3 (+5.7) 19.4±0.4 (+5.1) Flores-ZH-JP 15.7±0.3 17.4±0.3 (+1.7) 18.5±0.4 (+2.8) 21.6±0.4 (+5.9) 21.0±0.4 (+5.3) WMT-EN-ZH COMET 67.31±0.29 70.30±0.33 (+2.99) 70.90±0.34 (+3.59) 72.34±0.21 (+5.03) 71.10±0.35 (+3.79) WMT-EN-CS 64.39±0.28 67.40±0.32 (+3.01) 70.10±0.36 (+5.71) 73.56±0.39 (+9.17) 72.60±0.40 (+8.21) WMT-CS-UK 75.33±0.31 82.60±0.38 (+7.27) 84.00±0.39 (+8.67) 86.28±0.42 (+10.95) 85.10±0.41 (+9.77) Flores-EN-ZH 75.74±0.31 82.90±0.38 (+7.16) 82.80±0.38 (+7.06) 83.90±0.11 (+8.16) 83.20±0.39 (+7.46) Flores-EN-CS 75.18±0.30 81.20±0.36 (+6.02) 83.30±0.39 (+8.12) 88.03±0.23 (+12.85) 85.00±0.42 (+9.82) Flores-CS-UK 76.47±0.30 84.30±0.39 (+7.83) 84.90±0.40 (+8.43) 87.46±0.38 (+10.99) 86.30±0.43 (+9.83) Flores-ZH-JP 81.63±0.32 84.60±0.35 (+2.97) 85.90±0.37 (+4.27) 88.99±0.30 (+7.36) 87.60±0.40 (+5.97) CLSum CrossSum R-1 15.95 14.03(-1.92) 15.52(-0.43) 13.27(-2.68) 16.80(+0.85) R-2 5.964 4.992(-0.972) 4.824(-1.140) 4.329(-1.635) 5.937(-0.027) R-L 15.77 13.91(-1.86) 15.34(-0.43) 12.78(-2.99) 16.39(+0.62) Rec 73.4 73.24(-0.16) 73.09(-0.31) 71.52(-1.88) 74.12(+0.72) CLQA Xquad F1 86.80±0.27 87.50±0.28 (+0.70) 88.70±0.30 (+1.90) 85.71±0.38 (-1.09) 88.50±0.30 (+1.70) EM 17.48 18.82(+1.34) 19.16(+1.68) 16.35(-1.13) 19.24(+1.76) Prec 80.54 86.28(+5.74) 86.78(+6.24) 79.33(-1.21) 87.26(+6.72) CLNLU Belebele-EN-ZH Acc 58.78% 61.67%(+2.89%) 62.89%(+4.11%) 61.33%(+2.55%) 61.78%(+3.00%) Belebele-EN-CS 55.67% 59.78%(+4.11%) 60.78%(+5.11%) 58.56%(+2.89%) 61.33%(+5.66%) Belebele-CS-UK 47.56% 49.78%(+2.22%) 51.33%(+3.77%) 47.02%(-0.54%) 54.22%(+6.66%) Belebele-ZH-JP 40.22% 42.44%(+2.22%) 45.67%(+5.45%) 43.85%(+3.63%) 46.56%(+6.34%) Table 4. Evaluation Results on Cross-Lingual Tasks for Llama-3. All results in this table are
Chunk 19 · 1,992 chars
% 59.78%(+4.11%) 60.78%(+5.11%) 58.56%(+2.89%) 61.33%(+5.66%) Belebele-CS-UK 47.56% 49.78%(+2.22%) 51.33%(+3.77%) 47.02%(-0.54%) 54.22%(+6.66%) Belebele-ZH-JP 40.22% 42.44%(+2.22%) 45.67%(+5.45%) 43.85%(+3.63%) 46.56%(+6.34%) Table 4. Evaluation Results on Cross-Lingual Tasks for Llama-3. All results in this table are obtained by evaluating the instruction- fine-tuned models. R-1, R-2, R-L, Rec, EM, Prec, and Acc respectively represent ROUGE-1, ROUGE-2, ROUGE-L, BERTScore-Recall, Exact Match, BERTScore-Precision, and Accuracy. Best scores are in bold. For the CLQA task, while the Llama-3-8B model slightly lags behind the Llama_Bi_NTP model in F1 score, our method consistently outperforms other approaches on the BLOOM-3B model, showcasing significant improvements over the baseline. Notably, it achieves a 2.89% increase in F1, a 1.17% gain in Exact Match Rate, and a 5.69% boost in BERTScore-Precision. Surprisingly, our method demonstrated greater improvements on the CLSum task with the Bloom-3B model compared to Llama-3-8B, achieving over a 1-point increase in ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore-Precision. While there is still room for further enhancement, these results are promising. Additionally, our approach also delivered significant improvements on the CLNLU task with the BLOOM model. By comparing Table 4 and Table 5, it becomes clear that our method yields even greater improvements when applied to a base model with weaker foundational capabilities. 6.4.1 Reference-based Evaluation. Table 4 presents a comparative analysis of our proposed model (Llama_CL) against the Llama-3-8B baseline and several cross-lingual variants. The results indicate that while all adaptation strategies improve upon the vanilla baseline in Machine Translation (MT) and CLNLU, they exhibit distinct trade-offs in terms of task-specific performance versus general-purpose cross-lingual capability. Manuscript submitted to ACM -- 10 of 20 -- Bridging Linguistic Gaps: Cross-Lingual
Chunk 20 · 1,998 chars
cate that while all adaptation strategies improve upon the vanilla baseline in Machine Translation (MT) and CLNLU, they exhibit distinct trade-offs in terms of task-specific performance versus general-purpose cross-lingual capability. Manuscript submitted to ACM -- 10 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 11 Tasks Data Sets Metric BLOOM-3B BLOOM_NTP BLOOM_Bi_NTP BLOOM_CL MT WMT-EN-ZH BLEU 11.3±0.4 13.6±0.4 (+2.3) 16.6±0.4 (+5.3) 17.5±0.5 (+6.2) Flores-EN-ZH 13.0±0.4 16.7±0.4 (+3.7) 18.9±0.5 (+5.9) 22.6±0.5 (+9.6) WMT-EN-ZH COMET 57.93±0.28 58.49±0.30 (+0.56) 59.82±0.34 (+1.89) 59.70±0.34 (+1.77) Flores-EN-ZH 60.25±0.30 62.63±0.33 (+2.38) 70.41±0.40 (+10.16) 74.68±0.43 (+14.43) CLSum CrossSum-EN-ZH R-1 6.20 6.70(+0.5) 7.10(+0.9) 7.30(+1.1) R-2 1.90 2.20(+0.3) 2.90(+1.0) 3.40(+1.5) R-L 5.12 5.43(+0.31) 5.71(+0.59) 6.88(+1.76) Rec 62.70 63.12(+0.42) 63.89(+1.19) 64.13(+1.43) CLQA Xquad-EN-ZH F1 24.42±0.25 25.73±0.27 (+1.31) 27.08±0.28 (+2.66) 27.31±0.29 (+2.89) EM 8.24 8.82(+0.58) 9.24(+1.00) 9.41(+1.17) Prec 37.35 39.97(+2.62) 42.18(+4.83) 43.04(+5.69) CLNLU Belebele-EN-ZH Acc 41.11% 43.11%(+2.00%) 44.67%(+3.56%) 46.33%(+5.22%) Table 5. Evaluation Results on Cross-Lingual Tasks for BLOOM. All results in this table are obtained by evaluating the instruction- fine-tuned models. R-1, R-2, R-L, Rec, EM, Prec, and Acc respectively represent ROUGE-1, ROUGE-2, ROUGE-L, BERTScore-Recall, Exact Match, BERTScore-Precision, and Accuracy. Best scores are in bold. Specifically, the basic variants Llama_NTP and Llama_Bi_NTP demonstrate steady improvements in translation and understanding. The competitive performance of Llama_Bi_NTP in CLQA (e.g., 88.70 F1 on Xquad) suggests that bilingual Next Token Prediction (NTP) enhances the model’s capacity to perform language transitions at contextually appropriate positions, thereby providing a solid foundation for cross-lingual alignment. However, a
Chunk 21 · 1,999 chars
erstanding. The competitive performance of Llama_Bi_NTP in CLQA (e.g., 88.70 F1 on Xquad) suggests that bilingual Next Token Prediction (NTP) enhances the model’s capacity to perform language transitions at contextually appropriate positions, thereby providing a solid foundation for cross-lingual alignment. However, a more complex pattern emerges when considering Llama_CLI, which incorporates explicit translation instructions during the pre-training phase. While Llama_CLI achieves the highest performance in most MT benchmarks—surpassing the baseline by up to 13.1 BLEU points, it suffers from significant performance degradation in CLSum and CLQA. For instance, in CLSum, Llama_CLI experiences a 2.68-point drop in ROUGE-1 compared to the baseline. These results suggest that retaining task-specific instructions during the pre-training stage can easily trap the model in a task-specific local optimum. Although such a strategy may yield marginal gains in certain downstream tasks like CLNLU, it appears to induce a degree of overfitting to the translation objective that even subsequent fine-tuning cannot fully rectify. This rigid optimization limits the model’s flexibility in handling complex, open-ended cross-lingual reasoning tasks like summarization. In contrast, our approach (Llama_CL) effectively mitigates this risk. By balancing instruction-tuning with robust cross-lingual mapping, Llama_CL achieves MT performance comparable to the specialized Llama_CLI while remaining the only variant to consistently outperform the baseline in CLSum. Furthermore, in CLQA, Llama_CL attains the highest Exact Match (19.24%) and BERTScore-Precision (87.26) scores. These findings underscore that our method fosters a more versatile cross-lingual representation space, ensuring high-fidelity generation and comprehension without sacrificing the model’s generalizability across diverse task formats. 6.4.2 LLM-Assisted Reference-Free Evaluation. To further assess cross-lingual generalization, we
Chunk 22 · 1,997 chars
ings underscore that our method fosters a more versatile cross-lingual representation space, ensuring high-fidelity generation and comprehension without sacrificing the model’s generalizability across diverse task formats. 6.4.2 LLM-Assisted Reference-Free Evaluation. To further assess cross-lingual generalization, we evaluated models on open-domain question answering, where responses must be coherent and contextually relevant without explicit refer- ences. We constructed CLQA datasets for EN-CS and CS-UK using a subset of open-ended questions from OpenHermes- 2.5 [ 37]. Common question types and examples are provided in Table 6. Given the subjective nature of queries such as Manuscript submitted to ACM -- 11 of 20 -- 12 Zheng et al. Types of open-ended questions Question Example Creative prompts Compose a poem about childhood. Commonsense reasoning Why do people wear sunscreen? Knowledge-based queries What are the key features of quantum computing? List-style questions List five characteristics of a good employee. Table 6. Examples of open-ended question types Rank L-3-8B L_NTP L_Bi_NTP L_CL 1 181 271 252 325 2 332 182 281 190 3 261 213 212 202 4 163 271 192 220 Same 63 Table 7. Evaluation Results on Open-Ended CLQA Task. Best scores are in bold. “Could you share a library-related joke?”, responses were rated on a 1–4 scale using Claude 3.5, with a ‘Same’ rating assigned when models produced indistinguishable or uniformly incorrect answers. As shown in Table 7, Llama_CL achieved the highest performance, receiving 325 top ratings out of 1,000 responses. It consistently outperformed other models, particularly in tasks requiring nuanced comprehension and complex text generation. Notably, when prompted to “Compose a poem about childhood”, other models provided basic descriptions, whereas Llama_CL employed advanced poetic techniques such as parallelism, demonstrating improved contextual understanding. These results highlight the benefits of cross-lingual mapping in
Chunk 23 · 1,984 chars
and complex text generation. Notably, when prompted to “Compose a poem about childhood”, other models provided basic descriptions, whereas Llama_CL employed advanced poetic techniques such as parallelism, demonstrating improved contextual understanding. These results highlight the benefits of cross-lingual mapping in enhancing multilingual generative capabilities. However, all models exhibited weaknesses in complex reasoning tasks, underscoring the need for further improvements in logical inference and cross-lingual reasoning. 7 Ablation Study 7.1 Analysis of Cross-Lingual Training Objectives We conducted ablation experiments on MT to isolate key factors influencing model performance. The experiments were structured as follows: (1) 𝐸𝑠𝑒𝑝 : Bilingual data was split into monolingual sentences and used for next-token prediction. (2) 𝐸𝑝𝑜𝑠𝑡_𝑚𝑡 : Bilingual data was converted into instruction-tuning format and used during fine-tuning with the prompt: “Translate the following <Source Language> sentence into <Target Language>.” (3) 𝐸𝑝𝑟𝑒_𝑚𝑡 : Bilingual data was incorporated as instruction-tuning data during continued pre-training using the same translation prompt as 𝐸𝑝𝑜𝑠𝑡_𝑚𝑡 . (4) 𝐸𝑐𝑟𝑜𝑠𝑠 : Our proposed approach, integrating cross-lingual mapping into pre-training. As shown in Table 8, the comparison between 𝐸𝑠𝑒𝑝 and 𝐸𝑐𝑟𝑜𝑠𝑠 confirms that improved cross-lingual performance is not merely a result of increased data volume. The significant performance gap between 𝐸𝑝𝑜𝑠𝑡 _𝑚𝑡 and 𝐸𝑐𝑟𝑜𝑠𝑠 underscores the importance of embedding cross-lingual tasks during pre-training, as instruction fine-tuning alone provides limited benefits. Manuscript submitted to ACM -- 12 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 13 Test Datasets 𝐸_𝑠𝑒𝑝 𝐸_𝑝𝑜𝑠𝑡_𝑚𝑡 𝐸_𝑝𝑟𝑒_𝑚𝑡 𝐸_𝑐𝑟𝑜𝑠𝑠 WMT-en-zh 62.04 67.26 68.87 71.03 WMT-en-cs 66.72 63.42 69.78
Chunk 24 · 1,994 chars
des limited benefits. Manuscript submitted to ACM -- 12 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 13 Test Datasets 𝐸_𝑠𝑒𝑝 𝐸_𝑝𝑜𝑠𝑡_𝑚𝑡 𝐸_𝑝𝑟𝑒_𝑚𝑡 𝐸_𝑐𝑟𝑜𝑠𝑠 WMT-en-zh 62.04 67.26 68.87 71.03 WMT-en-cs 66.72 63.42 69.78 72.47 WMT-cs-uk 77.92 74.21 81.17 84.97 Flores-en-zh 75.64 78.31 81.09 83.05 Flores-en-cs 79.95 76.44 83.02 84.87 Flores-cs-uk 79.96 60.11 83.45 86.04 Flores-zh-jp 76.21 71.45 85.97 87.51 Table 8. COMET score for MT task of models based on different pre-training settings. E_cross is our method. Model MMLU LogiQA Alpaca_Eval (Accuracy %) (Accuracy %) (Win Rate %) L-3-8B-Ins 56.77 36.71 50.25 L_CL_Ins 57.01 36.25 49.75 Table 9. Model Performance Comparison on MMLU, LogiQA, and Alpaca_Eval. Best scores are in bold. While 𝐸𝑝𝑟𝑒_𝑚𝑡 and 𝐸𝑐𝑟𝑜𝑠𝑠 achieve comparable results, on average, 𝐸𝑐𝑟𝑜𝑠𝑠 outperforms 𝐸𝑝𝑟𝑒𝑚𝑡 by 2.37 COMET points across the seven language pairs, with the largest margin observed on the CS-UK pair (3.8 points), indicating stronger language alignment when cross-lingual tasks are explicitly modeled during pre-training. This suggests that premature instruction fine-tuning, as in 𝐸𝑝𝑟𝑒_𝑚𝑡 , may lead to overfitting, whereas 𝐸𝑐𝑟𝑜𝑠𝑠 facilitates better generalization across language pairs. These findings highlight the necessity of cross-lingual mapping in pre-training for establishing a robust multilingual foundation. Explicitly integrating cross-lingual objectives during pre-training enables the model to develop deeper alignment, resulting in improved transferability across languages compared to post-hoc instruction fine-tuning. 7.2 Performance on English Monolingual Downstream Tasks Furthermore, we compared the results of fine-tuning Llama-3-8B (L-3-8B-Ins) and Llama_CL (L_CL_Ins) directly using the Alpaca dataset [36 ] on English-only downstream tasks. For each base model, we conducted LoRA fine-tuning for two
Chunk 25 · 1,991 chars
t-hoc instruction fine-tuning. 7.2 Performance on English Monolingual Downstream Tasks Furthermore, we compared the results of fine-tuning Llama-3-8B (L-3-8B-Ins) and Llama_CL (L_CL_Ins) directly using the Alpaca dataset [36 ] on English-only downstream tasks. For each base model, we conducted LoRA fine-tuning for two epochs with a rank of 16, an initial learning rate of 1𝑒−4, a warm-up rate of 0.1, and a batch size of 256. We selected MMLU [17], LogiQA [25 ], and AlpacaEval [23 ] to evaluate the models’ capabilities in language understanding, logical reasoning, and open-ended generation, respectively. As shown in Table 9, the model trained with our method achieves performance on downstream English tasks that is comparable to the original base model, and even slightly surpasses it on the MMLU benchmark. These results suggest that our pre-training strategy enhances cross-lingual transfer while maintaining English monolingual proficiency with negligible degradation. We attribute this to retaining the Next-Token Prediction (NTP) objective and preserving a portion of English monolingual data during continued pre-training, which together help mitigate catastrophic forgetting. 8 Case Study To complement quantitative evaluations, we analyzed model outputs to identify performance patterns. Our findings align with LLM-assisted evaluations, confirming Llama_CL’s consistent superiority. Manuscript submitted to ACM -- 13 of 20 -- 14 Zheng et al. Task Question L-3-8B L_NTP L_Bi_NTP L_CL Remark MT Translate the fol- lowing English to Chinese: "It also arguably makes read- ing easier, though writing is somewhat complicated by the need to find out whether a verb or adjective is used in a substantivized form." It also arguably makes reading easier, though writing is somewhat complicated by the need to find out whether a verb or adjective is used in a substantivized form. It also ar- guably makes reading eas- ier, though writing is somewhat complicated by the need to find
Chunk 26 · 1,990 chars
verb or adjective is used in a substantivized form." It also arguably makes reading easier, though writing is somewhat complicated by the need to find out whether a verb or adjective is used in a substantivized form. It also ar- guably makes reading eas- ier, though writing is somewhat complicated by the need to find out whether a verb or adjec- tive is used in a substan- tivized form. 它 也 使 阅 读 更 容 易 , 但 由 于 需 要 确 定 动 词 或 形 容 词 是 否 使 用 了 名 词 化 形 式 , 因 此 写 作 有 些 复杂。 它 也 可 以 说 使 阅 读 更 容 易 , 虽 然 写 作 是 有 点 复 杂 的 , 因 为 需 要 找 到 一 个 动 词 或 形 容 词 是 否 在 名 词 形 式 中 使用。 Llama-3-8B and Llama_NTP failed to translate the source sentence. Translate the fol- lowing English to Chinese: "According to the lawsuit, waste from the UN camp was not properly sanitized, causing bacteria to enter the tributary of the Artibonite River, one of Haiti’s largest." 根 据 诉 讼,来 自 联 合 国 营 地 的 废 物 未 得 到 适 当 消 毒,导 致 细 菌 进 入 艺 术 博 尼 特 河,海 地 最 大 的 河 流 之一。 根 据 诉 讼,联 合 国 营 地 的 垃 圾 没 有 得 到 适 当 的 消 毒,使 细 菌 进 入 阿 尔 蒂 波 内 特 河,该 河 是 海 地 最 大 的 河 流 之 一。 根 据 诉 讼,联 合 国 营 地 的 垃 圾 没 有 得 到 适 当 的 消 毒,使 细 菌 进 入 阿 尔 蒂 波 尼 特 河,是 海 地 最 大 的 河 流之一。 根 据 诉 讼 , 联 合 国 营 地 的 废 物 没 有 得 到 适 当 的 消 毒 , 导 致 细 菌 进 入 阿 尔 蒂 博 尼 特 河 的 分 流 之 一 , 即 海 地 最 大 的 分 流 之一。 Llama_CL generates a more detailed, accu- rate translation. Table 10. Translation Results Comparison In MT, Llama-3-8B and Llama_NTP occasionally produced source-language outputs—an issue largely mitigated in Llama_Bi_NTP and Llama_CL. Llama_CL showed fewer omissions and redundancies than Llama_Bi_NTP, indicating stronger cross-lingual comprehension. However, all models struggled with rare words and domain-specific terms in low- resource languages, revealing linguistic coverage limitations. For cross-lingual summarization, Llama_CL sometimes defaulted to extractive summaries rather than abstractive content, exposing gaps in information synthesis. Sporadic Slovak usage appeared in Llama_NTP, Llama_Bi_NTP, and Llama_CL summaries, likely from monolingual corpus noise during
Chunk 27 · 1,997 chars
linguistic coverage limitations. For cross-lingual summarization, Llama_CL sometimes defaulted to extractive summaries rather than abstractive content, exposing gaps in information synthesis. Sporadic Slovak usage appeared in Llama_NTP, Llama_Bi_NTP, and Llama_CL summaries, likely from monolingual corpus noise during continued pre-training. Despite improvements in cross-lingual QA, all models struggled with complex logical reasoning, highlighting persistent challenges in multilingual reasoning capabilities. Case study examples are in Tables 10, 11, and 12. 9 Conclusion We introduced a novel continued pre-training approach that enhances cross-lingual capabilities in multilingual LLMs by integrating cross-lingual mapping tasks alongside language-specific next-token prediction. This dual-task strategy improved language alignment while preserving monolingual fluency, addressing key limitations in existing multilingual models. Extensive experiments validated our approach, demonstrating improvements in language alignment, monolingual perplexity, and cross-lingual task performance. Our proposed LAC offers a robust metric for assessing inter-language coherence, even with limited test data. Additionally, our cross-lingual open-domain QA dataset and LLM-assisted evaluation highlight the effectiveness of explicit cross-lingual pre-training in enhancing generative capabilities across seen and unseen language pairs. Manuscript submitted to ACM -- 14 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 15 Task Question L-3-8B L_NTP L_Bi_NTP L_CL Remark CLSum Please read the following English text and generate a short and precise Chinese summary: "By Peter BilesBBC World Affairs Correspondent The then-prime minister only saw it was likely after getting "raw intelligence" two days before the Argentines landed. Papers released under the 30-year rule show Mrs Thatcher was acutely worried about retaking the
Chunk 28 · 1,999 chars
sh text and generate a short and precise Chinese summary: "By Peter BilesBBC World Affairs Correspondent The then-prime minister only saw it was likely after getting "raw intelligence" two days before the Argentines landed. Papers released under the 30-year rule show Mrs Thatcher was acutely worried about retaking the islands. One historian said the documents were among the "most powerful material" declassified in the last three decades. In October 1982, a few months after the war ended, Mrs Thatcher gave evidence behind closed doors to the Falk- land Islands Review Committee, chaired by Lord Franks. The transcript of that dramatic testimony has now been published for the first time. "I never, never expected the Argentines to invade the Falklands head-on. It was such a stupid thing to do, as events happened, such a stupid thing even to contemplate doing", Mrs Thatcher told the Franks Committee. There had been some British contingency planning in the month before the Argentine invasion of the Falklands. On 26 March 1982, Ministry of Defence officials came back to Mrs Thatcher with a plan to deter a full-scale invasion. One sentence shocked her, and she wrote it in her diary: "Moreover, if faced with Argentine occupation on arrival there would be no certainty that such a force would be able to retake the dependency." She told the committee: "You can imagine that turned a knife in my heart, that lot." However, in her oral evidence, she said she had still considered an invasion unlikely: "I again stress, I thought that they would be so absurd and ridiculous to invade the Falklands that I did not think it would happen. "Nevertheless, one has always to make contingency plans, and soon after we got the South Georgia incident happening." The picture changed on 31 March 1982 when Mrs Thatcher was shown intelligence suggesting that an invasion was on the cards: "I just say it was the worst I think moment of my life." She also told the committee: "That night no-one could tell
Chunk 29 · 1,998 chars
contingency plans, and soon after we got the South Georgia incident happening." The picture changed on 31 March 1982 when Mrs Thatcher was shown intelligence suggesting that an invasion was on the cards: "I just say it was the worst I think moment of my life." She also told the committee: "That night no-one could tell me whether we could retake the Falklands - no-one. We did not know - we did not know." The British foreign secre- tary at the time, Lord Carrington, also gave evidence to the Franks Committee, where he too held the view that Argentina was not going to invade the Falklands. "I have been accused, and was accused violently after 2 April in the House of Com- mons, and in the press and elsewhere - of wilfully ignoring signs and statements and evidence, and just ignoring it. I can truthfully say I did not do any of these things. . . they (the signs of an invasion) were not there," ... It also bases its claim on the islands’ proximity to the South American mainland. Some 255 British servicemen and three Falklands civilians died during the conflict. The number of Argentine dead is estimated at about 650 All document images courtesy of the National Archives. 玛 格 丽 特·撒 切 尔 前 英 国 首 相 在1982年 福克兰群 岛战争之 前就有所 预料,但 当时的政 府官员却 没有意识 到 这 一 点。 英国前首 相玛格丽 特·撒 切 尔 说 , 她1982年 对英国失 去福克兰 群岛的战 争时期所 面临的压 力 是"人 生中最糟 糕 的 一 刻"。 1982年4月2日 , 阿 根 廷 突 然 入 侵 福 克 兰 群 岛 , 英 国 迅 速 集 结 军 队 回 应。30年后,英国 政府公布了关于这 场战争的新档案。 马 格 丽 特·撒 切 尔夫人当 时说,她 的生命从 来没有经 历过如此 严重的危 机 。 现 在,刚公 布的政府 文 件 显 示,这个 英国首相 在1982年 阿根廷对 福克兰群 岛的入侵 之前,并 没有料到 会发生这 样的事。 L_CL lists two facts but does not gen- erate a summary state- ment. Table 11. CLSum Results Comparison While our method significantly improves cross-lingual performance, challenges remain in summarization and complex reasoning tasks. Future work should explore better integration of summarization objectives and enhanced multilingual reasoning. Expanding our framework to underrepresented languages will further strengthen multilingual LLMs, reinforcing the value of
Chunk 30 · 1,998 chars
s-lingual performance, challenges remain in summarization and complex reasoning tasks. Future work should explore better integration of summarization objectives and enhanced multilingual reasoning. Expanding our framework to underrepresented languages will further strengthen multilingual LLMs, reinforcing the value of explicitly modeling cross-lingual relationships in pre-training. Manuscript submitted to ACM -- 15 of 20 -- 16 Zheng et al. Task Question L-3-8B L_NTP L_Bi_NTP L_CL Remark CLQA Answer the follow- ing questions in Czech. "Write a lim- erick about the joys of childhood in the style of Lewis Car- roll." Kdybys si vzpom- něl na dětství, považte si to rádi!Bylo to tak plné nadšené vzrušení, plné úžasu a ra- dosti!Nakonec to skončilo, ale jaká úžasná doba, když jsme byli malí a takoví, jaké jsme dnes. Jako malý jsem si užíval,Až se mi v životě zata- jvalo,Vždycky jsem věděl,Co jsem chtěl,Jako malé dítě, bylo to nejlepší. Jako malý holky v lese, \nKterý nás potěšil,\nTento svět byl plný štěstí,\nS radostí a veselím, \nNikdy jsme se necítily starosti. Over- postingotázkou Koukal jsem na svět se štěstím plný,\nA všude kolem se tají radost,\nVysoké stromy, malé holky, \nNávštěvy kamarádů, hrací kosti,\nVšechno bylo úžasné, ne? To je pravda! While both L_CL and L_Bi_NTP employ parallel structure, a tech- nique common in poetry, L_CL’s writing style is more poetic, with a rhythm that better aligns with the musical quality of verse. Answer the follow- ing questions in Czech. "Explain to me the importance of the following quotation.’The only way to do great work is to love what you do.’" Tento citát zna- mená, že pokud chcete dosáhnout úspěchu v jakékoli oblasti, musíte milovat to, co děláte. Kreativita a odhodlání jsou důležitými aspekty úspěchu, ale pokud se nebudete zají- mat o to, co děláte, pravděpodobně nebudete schopni dosáhnout svých cílů. Mít opravdový zájem o své pra- covní povinnosti je klíčem k tomu, abyste se dostali k tomu, kde se chcete dostat. To
Chunk 31 · 1,997 chars
ékoli oblasti, musíte milovat to, co děláte. Kreativita a odhodlání jsou důležitými aspekty úspěchu, ale pokud se nebudete zají- mat o to, co děláte, pravděpodobně nebudete schopni dosáhnout svých cílů. Mít opravdový zájem o své pra- covní povinnosti je klíčem k tomu, abyste se dostali k tomu, kde se chcete dostat. To love what you do je důležitý as- pekt úspěšné práce. Když si člověk vybere práci, kterou opravdu miluje, je mnohem pravděpodobnější, že se mu podaří vykonávat ji s velkým nasazením a přinášet kvalitní výsledky. Milovat svou práci zajišťuje, že člověk se bude cítit spokojenějším a bude mít větší motivaci a nadšení k práci. Navíc, když se člověk skutečně zajímá o svou práci, je pravděpodobnější, že bude mít větší znalosti a schop- nosti a bude lépe splňovat poža- davky a očekávání zaměstnavatele. To mít velký úspěch ve svém povolání, je třeba milovat to, co děláte. Když máte rádi to, co děláte, máte mo- tivaci pracovat tvrdě, vyvíjet se a dosahovat cílů. Milujete své po- volání, budete se cítit spokojenější a vaše práce se bude lépe pro- jevovat v práci a živ- otě. To nejlepší, co můžete dělat, je milovat to, co děláte. Pokud vás vaše práce nudí, vaše výsledky budou značně omezené. Když milujete svou práci, získáte z ní větší spokojenost a z ní se vám líbí být. To vám umožní dělat svou práci s větší odvahou a odhodláním a získáte více z jejího výkonu. Responses of L_CL and L-3-8B are relatively accurate, with the answer of L_CL answer being more relevant to the question. The response of L_NTP contains English content, while the answer of L_Bi_NTP has repetitive elements. Table 12. CLQA Results Comparison 10 Limitation While our approach significantly improves the cross-lingual generation and comprehension capabilities of multilingual LLMs, certain limitations remain. Notably, we observed only marginal improvements in tasks requiring cross-lingual text summarization and complex reasoning. This indicates that while our method enhances language alignment
Chunk 32 · 1,991 chars
ignificantly improves the cross-lingual generation and comprehension capabilities of multilingual LLMs, certain limitations remain. Notably, we observed only marginal improvements in tasks requiring cross-lingual text summarization and complex reasoning. This indicates that while our method enhances language alignment and generation fluency, it struggles with tasks that demand deeper information extraction, filtering, and logical reasoning across languages. These limitations may arise from the inherent complexity of summarization and reasoning tasks, which require more sophisticated mechanisms to retain and manipulate detailed semantic information across languages. Additionally, although our approach effectively handles low-resource languages and demonstrates strong alignment for bilingual tasks, the gains in highly subjective or culturally nuanced tasks, such as humor or sentiment analysis, are Manuscript submitted to ACM -- 16 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 17 less pronounced. Addressing these challenges will be a focus of future research, where we aim to develop methods that strengthen cross-lingual logical reasoning and enhance the model’s capacity for processing complex, abstract concepts. Acknowledgments This research is supported by the National Research Foundation, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore. The research is also supported by the National Research Foundation, Singapore under its National Large Language Models Funding Initiative (AISG Award No: AISG-NMLP-2024-005 and AISG-NMLP-2024-004). References [1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the Cross-lingual Transferability of Monolingual Representations. In
Chunk 33 · 1,998 chars
upported by the National Research Foundation, Singapore under its National Large Language Models Funding Initiative (AISG Award No: AISG-NMLP-2024-005 and AISG-NMLP-2024-004). References [1] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the Cross-lingual Transferability of Monolingual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 4623–4637. doi:10.18653/v1/2020.acl-main.421 [2] Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 749–775. https://aclanthology.org/2024.acl-long.44 [3] Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, and Rifat Shahriyar. 2021. CrossSum: Beyond English-centric cross-lingual summarization for 1,500+ language pairs. arXiv preprint arXiv:2112.08804 (2021). [4] Guanlin Chen, Xiaolong Shi, Moke Chen, and Liang Zhou. 2020. Text similarity semantic calculation based on deep reinforcement learning. International Journal of Security and Networks 15, 1 (2020), 59–66. [5] Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy,
Chunk 34 · 1,999 chars
oXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (Eds.). Association for Computational Linguistics, Online, 3576–3588. doi:10.18653/v1/2021.naacl-main.280 [6] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen tau Yih, Yoon Kim, and James Glass. 2022. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings. arXiv:2204.10298 [cs.CL] https://arxiv.org/abs/2204.10298 [7] A Conneau. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019). [8] Alexis CONNEAU and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips. cc/paper_files/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018). [10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong,
Chunk 35 · 1,995 chars
hie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura,
Chunk 36 · 1,996 chars
Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Manuscript submitted to ACM -- 17 of 20 -- 18 Zheng et al. Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe,
Chunk 37 · 1,997 chars
Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan,
Chunk 38 · 1,998 chars
er Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian,
Chunk 39 · 1,989 chars
ndsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783 [11] Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT Sentence Embedding. arXiv:2007.01852 [cs.CL] https://arxiv.org/abs/2007.01852 [12] Changjiang Gao, Hongda Hu, Peng Hu, Jiajun Chen, Jixing Li, and Shujian Huang. 2024. Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly. arXiv:2404.04659 [cs.CL] https://arxiv.org/abs/2404.04659 [13] Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2022. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv:2104.08821 [cs.CL] https://arxiv.org/abs/2104.08821 [14] Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong, and Wen-mei Hwu. 2019. Reinforcement learning based text style transfer without parallel training corpus. arXiv preprint arXiv:1903.10671 (2019). [15] Suchin Gururangan, Ana
Chunk 40 · 1,997 chars
Learning of Sentence Embeddings. arXiv:2104.08821 [cs.CL] https://arxiv.org/abs/2104.08821 [14] Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong, and Wen-mei Hwu. 2019. Reinforcement learning based text style transfer without parallel training corpus. arXiv preprint arXiv:1903.10671 (2019). [15] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020). [16] Jiyeon Ham and Eun-Sol Kim. 2021. Semantic Alignment with Calibrated Similarity for Multilingual Sentence Embedding. In Findings of the Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic, 1781–1791. doi:10.18653/v1/2021.findings-emnlp.153 [17] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY] https://arxiv.org/abs/2009.03300 Manuscript submitted to ACM -- 18 of 20 -- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance 19 [18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685 [19] Zixuan Ke and Bing Liu. 2022. Continual learning of natural language processing tasks: A survey. arXiv preprint arXiv:2211.12701 (2022). [20] Minato Kondo, Takehito Utsuro, and Masaaki Nagata. 2024. Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Elizabeth Salesky, Marcello Federico, and Marine
Chunk 41 · 1,997 chars
2022). [20] Minato Kondo, Takehito Utsuro, and Masaaki Nagata. 2024. Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data. In Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024), Elizabeth Salesky, Marcello Federico, and Marine Carpuat (Eds.). Association for Computational Linguistics, Bangkok, Thailand (in-person and online), 203–220. doi:10.18653/v1/2024.iwslt-1.26 [21] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model. (2023). [22] Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, and Jiajun Chen. 2024. PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 10246–10257. doi:10.18653/v1/2024.emnlp-main.572 [23] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval. [24] Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani, Ninghao Liu, and Mengnan Du. 2024. Quantifying Multilingual Performance of Large Language Models Across Languages. arXiv:2404.11553 [cs.CL] https://arxiv.org/abs/2404.11553 [25] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. arXiv:2007.08124 [cs.CL] https://arxiv.org/abs/2007.08124 [26] Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019). [27] Fuli Luo,
Chunk 42 · 1,991 chars
n Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. arXiv:2007.08124 [cs.CL] https://arxiv.org/abs/2007.08124 [26] Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019). [27] Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2021. VECO: Variable and flexible cross-lingual pre-training for language understanding and generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 3980–3994. [28] Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. Vol. 24. Elsevier, 109–165. [29] Zhongtao Miao, Qiyu Wu, Kaiyan Zhao, Zilong Wu, and Yoshimasa Tsuruoka. 2024. Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment. arXiv preprint arXiv:2404.02490 (2024). [30] Benjamin Muller, Yanai Elazar, Benoît Sagot, and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. arXiv:2101.11109 [cs.CL] https://arxiv.org/abs/2101.11109 [31] Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400 (2023). [32] Karl Pearson. 1896. VII. Mathematical contributions to the theory of evolution.—III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character 187 (1896), 253–318. [33] Roger Ratcliff. 1990. Connectionist models of recognition memory: constraints imposed by
Chunk 43 · 1,996 chars
ibutions to the theory of evolution.—III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character 187 (1896), 253–318. [33] Roger Ratcliff. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review 97, 2 (1990), 285. [34] AB Siddique, Samet Oymak, and Vagelis Hristidis. 2020. Unsupervised paraphrasing via deep reinforcement learning. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1800–1809. [35] Henry Tang, Ameet Deshpande, and Karthik Narasimhan. 2022. ALIGN-MLM: Word Embedding Alignment is Crucial for Multilingual Pre-training. arXiv:2211.08547 [cs.CL] https://arxiv.org/abs/2211.08547 [36] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca. [37] Teknium. 2023. OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. https://huggingface.co/datasets/teknium/ OpenHermes-2.5 [38] Liang Wang, Wei Zhao, and Jingming Liu. 2021. Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast. arXiv:2109.00253 [cs.CL] https://arxiv.org/abs/2109.00253 [39] Shijie Wu and Mark Dredze. 2020. Do Explicit Alignments Robustly Improve Multilingual Encoders?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4471–4482. doi:10.18653/v1/2020.emnlp-main.362 [40] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive Learning for Sentence Representation. arXiv:2012.15466 [cs.CL] https://arxiv.org/abs/2012.15466 [41] Haoran Xu, Young Jin Kim,
Chunk 44 · 1,712 chars
ion for Computational Linguistics, Online, 4471–4482. doi:10.18653/v1/2020.emnlp-main.362 [40] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive Learning for Sentence Representation. arXiv:2012.15466 [cs.CL] https://arxiv.org/abs/2012.15466 [41] Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2024. A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models. arXiv:2309.11674 [cs.CL] https://arxiv.org/abs/2309.11674 [42] L Xue. 2020. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934 (2020). [43] Go Yasui, Yoshimasa Tsuruoka, and Masaaki Nagata. 2019. Using Semantic Similarity as Reward for Reinforcement Learning in Sentence Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Fernando Alva-Manchego, Eunsol Choi, and Daniel Khashabi (Eds.). Association for Computational Linguistics, Florence, Italy, 400–406. doi:10.18653/v1/P19-2056 Manuscript submitted to ACM -- 19 of 20 -- 20 Zheng et al. [44] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An Open Bilingual Pre-trained Model. arXiv:2210.02414 [cs.CL] https://arxiv.org/abs/2210.02414 [45] Kun Zhou, Beichen Zhang, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Debiased Contrastive Learning of Unsupervised Sentence Representations. arXiv:2205.00656 [cs.CL] https://arxiv.org/abs/2205.00656 Manuscript submitted to ACM -- 20 of 20 --