Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Summary

Marco-LLM is a multilingual large language model developed by Alibaba’s MarcoPolo Team to address performance gaps in low-resource languages. Built upon the Qwen2 foundation, it undergoes massive multilingual continual pre-training using 300 billion tokens curated from 29 languages, including web data, parallel corpora, high-quality knowledge sets, and synthetic data. The training employs a two-stage strategy: Stage I balances adaptation and catastrophic forgetting with a mixed data distribution, while Stage II increases low-resource language proportions with a lower learning rate. Comprehensive evaluations on benchmarks such as MMMLU, Belebele, Flores, and AGIEval demonstrate that Marco-LLM significantly outperforms state-of-the-art models like Llama3 and Qwen2.5. Notably, Marco-7B and Marco-72B achieve leading scores in low-resource languages like Nepali and Kazakh, while maintaining strong performance in high-resource languages like English and Chinese. The model also excels in any-to-any machine translation tasks, surpassing commercial systems in several language pairs. Subsequent post-training includes multilingual supervised fine-tuning and preference alignment to enhance instruction following and cultural appropriateness. These results validate that targeted massive multilingual training effectively bridges the capability gap between high- and low-resource languages.

2024-12-6
Marco-LLM: Bridging Languages via Massive
Multilingual Training for Cross-Lingual
Enhancement
Lingfeng Ming*, Bo Zeng*, Chenyang Lyu*, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu,
Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue
Wang†, Weihua Luo, Kaifu Zhang
MarcoPolo Team, Alibaba International Digital Commerce
Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their
excellent performance is still largely limited to major world languages, primarily English. Many LLMs
continue to face challenges with multilingual tasks, especially when it comes to low-resource languages.
To address this issue, we introduce Marco-LLM: Massive multilingual training for cross-lingual
enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource
languages and conducted extensive continual pre-training using the Qwen2 models. This effort has
resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various
multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others,
Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore,
Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the
effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only
perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain
strong performance in English and other major languages, closing the performance gap between high-
and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication
to ensuring LLMs work accurately across various languages.
Figure 1 | Comparison of English-centric performance vs Multilingual performance on MMMLU and
Flores. Our Marco-LLM demonstrates strong performance on both dimensions.
∗Equal Contribution.
†Corresponding Author: wanglongyue.wly@alibaba-inc.com
© 2024 Alibaba. All rights reserved
arXiv:2412.04003v1 [cs.CL] 5 Dec 2024

Contents
1 Introduction
2 Related Work
3 Massive Multilingual Continual Pretraining for Large Language Models
3.1 Data Collection and Filtering
3.1.1 Multilingual Web Data Curation
3.1.2 Parallel Data
3.1.3 High-quality Knowledge Data
3.1.4 Synthetic Data
3.1.5 Data Statistics
3.1.6 Data Utilization
3.2 Two-Stage Continual Pretraining Strategy
3.3 Evaluation Results
3.3.1 Benchmarks and Evaluation Protocol
3.3.2 Baseline LLMs
3.3.3 Experimental Results
3.4 Evolution of performance during continual pretraining
3.5 Data ablations on Parallel Data
3.6 Effect of Learning Rate
4 Extensive Multilingual Post-training for Large Language Models
4.1 Multilingual Supervised Fine-tuning
4.1.1 Data Construction
4.1.2 Training Setup
4.1.3 Evaluation Benchmarks
4.1.4 Baseline LLMs
4.1.5 Results and Discussion
4.2 Multilingual Preference Alignment
4.2.1 Dataset Construction from Existing Preference Data
4.2.2 Multilingual Preference Data Generation and Translation
4.2.3 Evaluation Results
5 Conclusion and Future Work
A Appendix
A.1 Dataset Preprocessing
A.1.1 Translation Templates
A.1.2 Prompt Templates
A.2 More Evaluation Results for Instruct LLMs
1. Introduction
Large Language Models (LLMs) [Brown et al., 2020, OpenAI, 2023, Touvron et al., 2023b,c, Dubey
and et al, 2024, Guo et al., 2024] have transformed the field of Natural Language Processing (NLP) by
achieving impressive results across a variety of tasks such as language understanding, generation, and
translation [Bang et al., 2023, Wang et al., 2023, Jiao et al., 2023, Bai et al., 2023, Hui et al., 2024,
Yao et al., 2024]. Models like GPT-4 [OpenAI, 2023], GPT-4o [Hurst et al., 2024], PaLM [Chowdhery
et al., 2024], and LLaMA [Dubey and et al, 2024] have demonstrated that scaling up model size and
training data leads to significant performance gains. However, the majority of these advances have
been centered around high-resource languages, predominantly English [Bang et al., 2023, Jiao et al.,
2023, Lai et al., 2023]. This focus has resulted in a performance gap when these models are applied
to multilingual tasks, especially involving low-resource languages.
Multilingual NLP faces unique challenges due to the diversity and imbalance of linguistic resources
[Pires et al., 2019, Conneau and Lample, 2019]. Low-resource languages often lack the
extensive textual data required for training large models, which hinders the development of effective
language technologies for these languages. As a result, speakers of low-resource languages are
underrepresented in the benefits brought by recent advancements in NLP.
To address this disparity, we have developed Marco-LLM, a multilingual language model trained
with a focus on low-resource languages. An overview of the framework of our Marco-LLM is shown in
Figure 2. By collecting a substantial amount of multilingual data covering a series of underrepresented
languages, and through massive multilingual continual pre-training and extensive multilingual post-
training (including multilingual supervised finetuning and multilingual preference alignment) using
the Qwen2 model [Bai et al., 2023] as a foundation, Marco-LLM aims to bridge the performance gap
in multilingual NLP tasks.
Our main contributions can be summarized as follows:
• We compile and curate a large-scale multilingual dataset tailored for low-resource languages,
enhancing the diversity and richness of training data.
• We perform massive multilingual continual pre-training and post-training on the Qwen2 model to
develop Marco-LLM, a multilingual LLM that substantially improves performance on low-resource
language tasks.
• We conduct comprehensive evaluations on benchmarks such as MMMLU, Flores, Belebele, AGIEval,
and multilingual MT-bench, demonstrating that Marco-LLM outperforms state-of-the-art models in
multilingual settings.
The rest of this paper is structured as follows:
• In Section 2, we give an overview of the relevant literature on the development of LLMs,
especially multilingual LLMs, and on continual pre-training for LLMs.
• In Section 3, we present the details of our continual pre-training experiments, including
monolingual and multilingual data collection and curation, the training setup, and evaluation results.
• In Section 4, we describe the post-training (supervised finetuning and preference alignment)
of the continually pre-trained Marco-LLM from Section 3, including how we construct our
multilingual supervised finetuning data, how we formulate the task format, the supervised training
details, and the corresponding evaluation results on multilingual benchmark datasets.
Figure 2 | An overview of the training and evaluation paradigm of our Marco-LLM. We conduct
massive multilingual continual pre-training, multilingual supervised finetuning, and preference
alignment, and then perform extensive evaluation on multilingual benchmarks to validate the efficacy
of our Marco-LLM.
2. Related Work
In recent years, Large Language Models (LLMs) such as GPT-3 [Brown et al., 2020], GPT-4 [OpenAI,
2023], PaLM [Chowdhery et al., 2024], and LLaMA [Touvron et al., 2023b] have revolutionized
the research paradigm of Natural Language Processing (NLP). These transformer-based [Vaswani
et al., 2017] models have demonstrated exceptional capabilities in generating and understanding
text, achieving state-of-the-art results in tasks like language translation, summarization, and question-
answering [Bang et al., 2023, Wang et al., 2023]. However, their primary focus has been on high-
resource languages, leaving a gap in multilingual performance [Jiao et al., 2023, Bang et al., 2023, Lai
et al., 2023]. To bridge the gap in multilingual applications, multilingual models such as mBERT [Pires
et al., 2019], XLM-R [Conneau and Lample, 2019], mT5 [Xue et al., 2021], and PolyLM [Wei et al.,
2023b] have been developed. These models aim to provide cross-lingual capabilities by being trained
on diverse language datasets. Despite their promise, they often struggle with low-resource languages
due to insufficient training data and the inherent difficulty of balancing multiple languages within
one model [Shi et al., 2023, Dac Lai et al., 2023, Singh et al., 2024a, Lovenia et al., 2024]. Efforts
in this area have included data augmentation, transfer learning, and specialized models for specific
languages or tasks [Conneau et al., 2018, Artetxe et al., 2018, Conneau and Lample, 2019, Goyal
et al., 2021, Team et al., 2022]. These approaches, while beneficial, have not fully harnessed the
potential of large-scale language models [Üstün et al., 2024]. Continual pre-training offers a viable
solution to enhance the adaptability of LLMs by allowing models to incorporate new information
without starting from scratch [Ke et al., 2023, Çağatay Yıldız et al., 2024]. This method enables
models to improve performance in underrepresented areas, particularly for low-resource languages.
By leveraging existing strengths and optimizing computational resources, continual pre-training
presents a scalable approach to overcoming current limitations in multilingual and low-resource
language processing. Building on these insights, we introduce Marco-LLM, which leverages continual
pre-training of the Qwen2 model with a focus on low-resource languages. Our approach integrates
large-scale multilingual data collection and advanced training techniques to enhance the model’s
capabilities in multilingual NLP tasks.
3. Massive Multilingual Continual Pretraining for Large Language Models
In this section, we present the technical details of how we conduct continual pretraining on LLMs.
Specifically, the process of continual pretraining for multilingual LLMs encompasses the following:
(1) the curation and filtering of a large-scale training corpus, which will be introduced in Section 3.1;
(2) simple and scalable strategies to efficiently conduct continual pretraining at large scale for LLMs,
detailed in Section 3.2; (3) a comprehensive presentation of evaluation results across multiple
benchmarks, along with an analysis, in Section 3.3.
3.1. Data Collection and Filtering
We create our dataset for Marco-LLM training from a variety of data sources containing knowledge
up to the end of April 2024. We apply several data cleaning and de-duplication methods to
each data source to obtain high-quality tokens.
3.1.1. Multilingual Web Data Curation
To produce high-quality multilingual training data, we designed a cascaded data-processing
pipeline. Similar to prior processing pipelines that focus on English (e.g., CCNet [Wenzek et al., 2020],
RefinedWeb [Penedo et al., 2023], and Llama [Touvron et al., 2023a]), our multilingual
pipeline features a series of data-cleaning strategies targeting text quality and information distribution.
Document Preparation. Our pipeline starts with target document preparation, which is meant to
preserve the information distribution. First, we extract the main content from the raw Common Crawl
(WARC) files. At the same time, we apply URL filtering and language identification to avoid
computationally expensive processing later. The former targets fraudulent and/or adult websites (e.g.,
predominantly pornographic, violent, or gambling-related), while the latter focuses on the target
languages. Specifically, we classify documents according to their primary language and remove those
with low classification confidence, leveraging inexpensive n-gram models (e.g., fastText [Joulin
et al., 2016]).
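The document-preparation step can be sketched as two cheap gates applied before any expensive processing. This is a minimal illustration, not the authors' implementation: the blocklist, the confidence threshold, and the `toy_identify_language` stand-in (used here in place of a real classifier such as fastText) are all assumptions.

```python
from typing import Callable, Iterable, Iterator, Set, Tuple
from urllib.parse import urlparse

# Hypothetical values: the paper gives neither a blocklist nor a threshold.
BLOCKED_DOMAINS = {"casino.example", "adult.example"}
MIN_LANG_CONFIDENCE = 0.65


def url_allowed(url: str) -> bool:
    """Return False if the host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def prepare_documents(
    docs: Iterable[Tuple[str, str]],
    identify_language: Callable[[str], Tuple[str, float]],
    target_languages: Set[str],
) -> Iterator[Tuple[str, str]]:
    """Yield (language, text) for documents that pass both gates."""
    for url, text in docs:
        if not url_allowed(url):
            continue  # URL filter: drop blocked sites first, it is the cheapest check
        lang, confidence = identify_language(text)
        if lang in target_languages and confidence >= MIN_LANG_CONFIDENCE:
            yield lang, text


def toy_identify_language(text: str) -> Tuple[str, float]:
    """Illustrative stand-in for a real language classifier."""
    return ("en", 0.9) if " the " in f" {text.lower()} " else ("und", 0.2)


docs = [
    ("https://news.example/a", "The cat sat on the mat."),
    ("https://casino.example/b", "The best odds in the world."),
    ("https://blog.example/c", "zzz qqq xxx"),
]
kept = list(prepare_documents(docs, toy_identify_language, {"en"}))
```

In a real pipeline the classifier callable would wrap a trained language-identification model, and the threshold would be tuned per language.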
Quality Filtering. The filters in this part aim to remove low-quality text. We filter out
documents using heuristic rules based on: (1) word blocklists and garbled-text detection;
(2) document length, the ratio of special symbols, the ratio of stop words, and the ratio of short,
consecutive, or incomplete lines; (3) repeated words and n-grams. Inspired by CulturaX [Nguyen et al.,
2024], we set the filtering thresholds from a statistical analysis of large document samples. To
strengthen the filtering further, we use the KenLM library [Wenzek et al., 2020] to score a large
number of web documents, and remove documents whose perplexity is far above average.
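A few of the heuristic rules above can be sketched as simple ratio checks. The stop-word list and every threshold below are hypothetical placeholders (the paper derives its thresholds from corpus statistics and does not publish them), and the KenLM perplexity filter is left out.

```python
import re
from collections import Counter
from typing import List

# Toy English stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def symbol_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text)


def stop_word_ratio(words: List[str]) -> float:
    """Fraction of words that are stop words; natural prose is rarely near zero."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in STOP_WORDS) / len(words)


def repeated_ngram_ratio(words: List[str], n: int = 3) -> float:
    """Fraction of n-gram occurrences that belong to an n-gram seen more than once."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c for c in counts.values() if c > 1) / len(ngrams)


def passes_quality(
    text: str,
    min_words: int = 20,            # hypothetical threshold
    max_symbol_ratio: float = 0.3,  # hypothetical threshold
    min_stop_ratio: float = 0.05,   # hypothetical threshold
    max_repeat_ratio: float = 0.2,  # hypothetical threshold
) -> bool:
    words = re.findall(r"\w+", text.lower())
    return (
        len(words) >= min_words
        and symbol_ratio(text) <= max_symbol_ratio
        and stop_word_ratio(words) >= min_stop_ratio
        and repeated_ngram_ratio(words) <= max_repeat_ratio
    )
```

For a multilingual corpus, the stop-word list and thresholds would have to be set separately for each language.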
Deduplication. After filtering, we implement a comprehensive deduplication pipeline following
the procedure in RefinedWeb [Penedo et al., 2023]. This pipeline integrates document-level MinHash
deduplication and sub-document exact-match deduplication, identifying and removing
duplicate content within and across documents.
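To make the document-level step concrete, the following is a minimal MinHash sketch in pure Python. The shingle size, number of hash functions, and similarity threshold are assumptions. The pairwise comparison in `deduplicate` is only suitable for small inputs, since production pipelines typically bucket signatures with locality-sensitive hashing. The sub-document exact-match step is not shown.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64, n: int = 5) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The share of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Greedy O(n^2) pass: keep a document only if it is not close to a kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```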
Table 1 | Overview of corpus utilization rates across the languages and data categories of our corpus.
We show the total number of tokens (in billions) available and used for high-resource and
low-resource languages, as well as for other data sources such as synthetic data.
Category / Language 	Total Tokens (B) 	Used Tokens (B) 	Utilization Rate (%)
High-Resource Languages
English (en) 	1,459.7 	90.4 	6.2
Chinese (zh) 	214.7 	48.2 	22.4
Arabic (ar) 	45.8 	10.6 	23.0
German (de) 	442.8 	10.6 	2.4
Spanish (es) 	397.8 	10.6 	2.7
French (fr) 	320.8 	10.6 	3.3
Korean (ko) 	41.8 	10.6 	25.2
Japanese (ja) 	224.2 	10.6 	4.7
Portuguese (pt) 	145.3 	10.6 	7.3
Turkish (tr) 	80.6 	10.6 	13.1
Low-Resource Languages
Bengali (bn) 	6.5 	1.9 	28.7
Hebrew (he) 	11.3 	1.9 	16.4
Indonesian (id) 	23.8 	1.9 	7.8
Italian (it) 	56.3 	1.9 	3.3
Malay (ms) 	2.9 	1.9 	64.4
Dutch (nl) 	31.3 	1.9 	6.0
Polish (pl) 	45.3 	1.9 	4.1
Russian (ru) 	251.0 	1.9 	0.7
Thai (th) 	8.3 	1.9 	22.4
Ukrainian (uk) 	18.4 	1.9 	10.1
Urdu (ur) 	8.7 	1.9 	21.4
Vietnamese (vi) 	24.4 	1.9 	7.6
Czech (cs) 	270.2 	1.9 	0.7
Greek (el) 	376.8 	1.9 	0.5
Hungarian (hu) 	214.4 	1.9 	0.9
Kazakh (kk) 	16.8 	1.9 	11.1
Romanian (ro) 	160.0 	1.9 	1.2
Azerbaijani (az) 	19.4 	1.9 	9.6
Nepali (ne) 	22.6 	1.9 	8.2
Other Data Sources
Parallel Data 	103.0 	20.8 	20.2
High-quality Knowledge Data 	65.0 	16.4 	25.2
Synthetic Data 	6.0 	4.4 	73.3
Total 	5,115.9 	300.0 	5.9
Note that we perform quality filtering and deduplication separately within the data for each
language. At this stage, we obtain a large amount of high-quality monolingual data in low-resource
languages, which we refer to as multilingual data; its statistics are detailed in Table 1. We determine
the number of multilingual tokens used in continual pretraining experimentally (see Section 3.2),
balancing model performance on Chinese, English, and multilingual benchmarks.
3.1.2. Parallel Data
Following prior work such as PaLM2 [Anil et al., 2023], Skywork [Wei et al., 2023a], and PolyLM [Wei
et al., 2023b], we incorporate parallel data into our continual pretraining dataset to further improve
the cross-lingual and multilingual ability of Marco-LLM. Each example pairs a complete
source-language paragraph with its corresponding target-language counterpart, so that the model
sees the two languages aligned.
Figure 3 | The number of tokens per category in our multilingual continual pretraining corpus for
Marco-LLM. (Bar chart of total and used tokens, in billions, for each language and data source.)
We mainly focus on open-source parallel data: OPUS [Tiedemann, 2012, Zhang et al., 2020] and
CCAligned [Chaudhary et al., 2019, El-Kishky et al., 2020]. The former covers 100 languages and is
English-centric, meaning that all training pairs include English on either the source or target side,
while the latter consists of parallel or comparable web-document pairs in 137 languages aligned with
English. These open-source data contain many bad cases, such as translation errors, so we process
them with the following pipeline.
Heuristic Filtering. To obtain high-quality parallel corpora, we develop heuristics to remove
additional low-quality documents, outliers, and documents with excessive repetitions. Some examples
of heuristics include:
• We filter sentence pairs using the ratio of special symbols, the ratio of stop words, the ratio
of digits, and the ratio of repeated words in the source sentence;
• We compute similarity scores between the LASER embeddings∗ of each sentence pair and filter out
pairs with low scores.
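The embedding-based filter in the second bullet can be sketched as follows. The sketch assumes plain cosine similarity and a hypothetical threshold of 0.7. The paper does not say which similarity score or cutoff it uses, and the embeddings here are toy vectors passed in directly rather than produced by LASER.

```python
import math
from typing import List, Sequence, Tuple

# (source sentence, target sentence, source embedding, target embedding)
ScoredPair = Tuple[str, str, Sequence[float], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_parallel_pairs(
    pairs: List[ScoredPair],
    min_score: float = 0.7,  # hypothetical threshold
) -> List[Tuple[str, str]]:
    """Keep only the sentence pairs whose embeddings are sufficiently similar."""
    return [
        (src, tgt)
        for src, tgt, src_vec, tgt_vec in pairs
        if cosine(src_vec, tgt_vec) >= min_score
    ]


pairs = [
    ("Hello", "Bonjour", [1.0, 0.0], [1.0, 0.0]),    # aligned toy vectors
    ("Hello", "Au revoir", [1.0, 0.0], [0.0, 1.0]),  # orthogonal toy vectors
]
clean = filter_parallel_pairs(pairs)
```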
Diverse Translation Templates. After filtering, translation templates are used to
concatenate the parallel corpora. Prior work [Üstün et al., 2024] has shown the importance of diverse
wording, templates, and task types in helping models generalize to different natural inputs. We
therefore use diverse translation templates to increase input diversity and enhance semantic alignment
between multilingual parallel sentences; the details can be found in Appendix A.1.1.
Empirically, the parallel data proves important for enhancing the cross-lingual and multilingual
ability of Marco-LLM, especially on downstream NLP tasks such as machine translation. More
experimental details are presented in Section 3.5.
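As an illustration of template-based concatenation, the sketch below renders a filtered sentence pair with a randomly chosen template. The three templates are invented for this example; the templates the authors actually used are listed in Appendix A.1.1 of the paper.

```python
import random

# Hypothetical templates; the real ones are given in Appendix A.1.1 of the paper.
TEMPLATES = [
    "Translate the following {src_lang} text into {tgt_lang}.\n"
    "{src_lang}: {src}\n{tgt_lang}: {tgt}",
    "{src_lang}: {src}\n{tgt_lang}: {tgt}",
    "What is the {tgt_lang} translation of \"{src}\"?\n{tgt}",
]


def render_pair(src: str, tgt: str, src_lang: str, tgt_lang: str,
                rng: random.Random) -> str:
    """Turn one sentence pair into a training string using a random template."""
    template = rng.choice(TEMPLATES)
    # str.format ignores keyword arguments that a template does not use.
    return template.format(src=src, tgt=tgt, src_lang=src_lang, tgt_lang=tgt_lang)


rng = random.Random(0)  # seeded so that the rendered corpus is reproducible
example = render_pair("Hello", "Bonjour", "English", "French", rng)
```

Sampling the template per pair rather than per corpus spreads every wording across all language pairs.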
∗https://github.com/facebookresearch/LASER
3.1.3. High-quality Knowledge Data
Prior work, such as MiniCPM [Hu et al., 2024], Skywork [Wei et al., 2023a], and Llama3 [Dubey
and et al, 2024], has shown that introducing high-quality knowledge data into the pretraining
stage enhances model capabilities. Similarly, we mix some high-quality knowledge data into continual
pretraining. Our high-quality knowledge data consists of mQA (multilingual Question Answer),
STEM, and some ability-oriented SFT data, such as math and coding. All of it is collected from the
open-source website Huggingface†. Similar to our pipeline for parallel multilingual data described
above, we implement filters to remove low-quality data and data from common benchmarks.
In addition to n-gram deduplication, we employ semantic deduplication.
Note that we do not include any training sets from commonly used benchmarks in our high-quality
knowledge data.
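One common way to remove benchmark data is an n-gram overlap check, sketched below. Whether the authors use exactly this rule is not stated, and the n-gram length is an assumption (the tiny examples use `n=4` so that the overlap is visible).

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lower-cased word n-grams of a text (empty if the text has fewer than n words)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(samples: Iterable[str], index: Set[NGram], n: int = 8) -> List[str]:
    """Drop every training sample that shares at least one n-gram with a benchmark."""
    return [s for s in samples if ngrams(s, n).isdisjoint(index)]


benchmark = ["what is the capital of france and its population"]
samples = [
    "quiz: what is the capital of france today",
    "the model is trained on web data from many sources",
]
index = build_benchmark_index(benchmark, n=4)
clean = decontaminate(samples, index, n=4)
```

Samples shorter than `n` words produce no n-grams and are therefore always kept, so a real pipeline would need a separate rule for very short items.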
3.1.4. Synthetic Data
Books and papers are high-quality knowledge data, but they are scarce. The phi models [Gunasekar
et al., 2023, Li et al., 2023] are trained on synthetic textbooks to enhance performance. Recent work
[Whitehouse et al., 2023] suggests that multilingual synthetic data can also enhance cross-lingual
transfer. We therefore also mix some synthetic data into our continual pretraining. Our synthetic data
consists mainly of two parts: keywords-based explanation and story data, and ability-oriented SFT
data. We describe them below.
Keywords-based Explanation/Story Data. To synthesize Wikipedia-style and book-style data, we first use
GPT-4 to generate a large number of keywords covering diverse topics such as mathematics, history,
and physics, which results in a keyword pool of 100k entries. Then, we sample several target keywords
from the keyword pool and prompt other LLMs to generate explanations of the target keywords and
textbook stories, respectively. Similar to AttrPrompt [Yu et al., 2023], the prompts contain several
attributes, and we vary the attribute values to generate diverse samples. An
example of such a prompt can be found in Appendix A.1.2, where the LLMs are instructed to generate
training data based on attributes such as keywords and subject.
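The attribute-varying prompt construction can be sketched as below. The attribute names, their values, and the prompt wording are all hypothetical; the actual prompt is given in Appendix A.1.2 of the paper.

```python
import random
from typing import Dict, List

# Hypothetical attributes and values, for illustration only.
ATTRIBUTES: Dict[str, List[str]] = {
    "subject": ["mathematics", "history", "physics"],
    "style": ["encyclopedia entry", "textbook story"],
    "audience": ["high-school students", "general readers"],
}


def build_prompt(keyword_pool: List[str], rng: random.Random,
                 num_keywords: int = 2) -> str:
    """Sample target keywords and one value per attribute, then fill the prompt."""
    keywords = rng.sample(keyword_pool, num_keywords)
    attrs = {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
    return (
        f"Write a {attrs['style']} about {attrs['subject']} "
        f"for {attrs['audience']}. "
        f"Explain and use these keywords: {', '.join(keywords)}."
    )


pool = ["entropy", "treaty", "prime number", "momentum"]
prompt = build_prompt(pool, random.Random(0))
```

Each generated prompt would then be sent to a generator LLM; varying the attribute values is what produces diverse samples from the same keyword pool.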
Ability-Oriented SFT Data. WizardLM [Xu et al., 2024], WizardMath [Luo et al., 2023], and
WizardCoder [Luo et al., 2024] are designed to synthesize diverse instructions for a target ability. We
follow them to enhance Marco-LLM's math and coding capabilities. To further improve the cross-lingual
and multilingual ability, we use in-context learning and translation to generate
multilingual data. For the former, we use a few-shot prompt to generate a new sample in a
similar domain; the prompts are listed in Appendix A.1.2. For the latter, we randomly translate
the original instruction, question, or answer into another language, resulting in cross-lingual QA.
To enhance the diversity of the synthetic data, we use several strong models (such as GPT-4,
Deepseek-v2 [DeepSeek-AI et al., 2024], DBRX‡, and Command-R Plus [Cohere For AI, 2024]) as
generators, since their generations are of higher quality.
3.1.5. Data Statistics
The composition of the continual pretraining dataset used for Marco-LLM is detailed in Table 1.
The multilingual data mainly covers the following languages: English (en), Chinese (zh), Arabic (ar),
German (de), Spanish (es), French (fr), Korean (ko), Japanese (ja), Portuguese (pt), Turkish (tr),
Azerbaijani (az), Bengali (bn), Czech (cs), Greek (el), Hebrew (he), Hungarian (hu), Indonesian (id),
Italian (it), Kazakh (kk), Malay (ms), Nepali (ne), Dutch (nl), Polish (pl), Romanian (ro), Russian (ru),
Thai (th), Ukrainian (uk), Urdu (ur), Vietnamese (vi). The data distribution across languages
is extremely uneven, as shown in Figure 3. We group the languages into a rough taxonomy of
lower-resourced (LR) and higher-resourced (HR). This yields a split of the 29 languages in our training
mixture into 10 HR and 19 LR languages. Note that our dataset contains 5.1T tokens in total, but only
about 300B tokens are used to develop Marco-LLM. We discuss the utilization of data in Section
3.1.6.
†https://huggingface.co/
‡https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
3.1.6. Data Utilization
Although we gathered a total of 5.1T tokens of training data, the amount actually used for continual
pretraining is 300B tokens, so the overall data utilization rate is only 5.9%. Table 1 presents a
detailed breakdown of the utilization of each language and data source. We have the following findings:
• It is evident that, overall, the data utilization rates for both high-resource and low-resource
languages are low. The more data we collect, the less effectively we utilize it. The corpus of Malay
is only 2.9B, resulting in its utilization rate is up to 64.4%.
• The parallel data, high-quality knowledge data, and synthetic data have higher utilization rate.
These data are of high quality and focused on specific tasks; moreover, the cost of acquisition is
substantial. Therefore, these data are well used. Specifically, the parallel data is used to enhance
the down-stream machine translation, while the others are employed to enhance oriented ability
of the LLM. The quality and diversity of synthetic data enhance Marco-LLM's performance, so its
utilization rate is as high as 73.3%.
• Multilingual capacity depends on the number of tokens used. Intuitively, to improve performance in
specific languages, we should immerse the LLM in a more diverse corpus. The number of tokens
used for each high-resource language is 10.6B, compared with 1.9B for each low-resource language.
According to Table 6 and Table 7, the improvement in low-resource languages is greater than that
in high-resource languages. This indicates that open-source models (e.g., Qwen2/2.5 and Llama3/3.1)
primarily focus on high-resource languages, so continual pretraining on only 1.9B tokens
per low-resource language produces significant gains.
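The utilization figures in the findings above follow from simple ratios. The sketch below recomputes two of them from numbers quoted in this section. The function name is illustrative, and the code assumes that the 2.9B Malay figure is a token count like the other corpus sizes.

```python
def utilization_rate(tokens_used_b: float, tokens_collected_b: float) -> float:
    """Return the share of collected tokens that was actually used, in percent."""
    if tokens_collected_b <= 0:
        raise ValueError("tokens_collected_b must be positive")
    return 100.0 * tokens_used_b / tokens_collected_b


# Overall figures quoted in the text: 300B tokens used out of 5.1T collected.
overall = utilization_rate(300.0, 5100.0)

# Malay: 2.9B tokens collected at a 64.4% utilization rate.
malay_used_b = 2.9 * 64.4 / 100.0

print(f"overall utilization: {overall:.1f}%")
print(f"Malay tokens used: {malay_used_b:.1f}B")
```

The implied Malay count of roughly 1.9B tokens agrees with the per-language figure for low-resource languages stated in the third finding.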
Based on the above findings, we outline the following directions for further exploration:
• Further improving the quality and diversity of multilingual data. The overall data utilization rate
is only 5.9%, which means that we should further improve the quality of our training data for the next
model iterations. Moreover, taking the diversity of multilingual data into consideration, we will pay
more attention to collecting multilingual data with local features.
• Scaling synthetic data and improving self-improvement capability. We conclude that training
with synthetic data is effective. Therefore, we will continue to develop advanced
techniques that can control and manipulate specific attributes of the generated data, enabling the
creation of diverse and customizable synthetic datasets. In addition, we will explore methods that
integrate domain-specific knowledge to ensure that the generated data adheres to the underlying
constraints and patterns present in the target domain. Furthermore, we aim to unlock the potential
of emerging self-improvement capabilities by utilizing Marco-LLM to generate synthetic data, as we
believe it can potentially bootstrap its own performance by iteratively learning from the enhanced
synthetic data.
Table 2 | The proportion of high-quality and multilingual sources.
Data                          Stage-I   Stage-II
English (en)                  32%       28%
Chinese (zh)                  17%       15%
High-Resourced (Others)       30%       26%
Low-Resourced                 9%        15%
Parallel Multilingual Data    6%        8%
High-quality Knowledge Data   5%        6%
Synthetic Data                1%        2%
3.2. Two-Stage Continual Pretraining Strategy
We develop Marco-LLM based on Qwen2, which is designed to be multilingual-friendly
and has an extensive vocabulary of 150k tokens, ensuring a high compression rate for multilingual data.
Nonetheless, the primary challenge of continual pretraining lies in balancing adaptation (too little
of which leads to suboptimal performance on the new dataset) against catastrophic forgetting (which
causes significant capability loss on the previous dataset). In this paper, we propose a two-stage
continual pretraining approach designed to facilitate the transfer of commonsense knowledge,
primarily acquired in English and Chinese, to a variety of low-resource languages, as well as specific
NLP downstream tasks such as machine translation.
When it comes to continual pretraining, the data mixture and the learning rate are two crucial
hyper-parameters for optimizing Marco-LLM. In practice, we employ different hyper-parameters in the
two training stages. Specifically, in Stage-I we use data mixing to balance the adaptation of multilingual
capabilities against catastrophic forgetting, while the goal of Stage-II is to further
strengthen Marco-LLM's multilingual capabilities with a lower maximum learning rate. We describe
these stages below.
Data Mixture. Optimizing LLMs to learn knowledge encoded in multiple languages simultaneously
is a significant challenge. We formulate this problem as transferring general knowledge to
low-resource languages while maintaining the base model's advantage in its original languages. To
address this issue, we keep 32% English and 17% Chinese data in Stage-I training to avoid
catastrophic forgetting, as shown in Table 2. Meanwhile, the proportions of other high-resourced
(HR, apart from Chinese and English) and low-resourced (LR) languages are about 30% and 9%, respectively,
to develop Marco-LLM's multilingual capabilities. The intuition is that HR languages have a higher priority
and are more commonly used. In Stage-II, we raise the proportion of LR data from 9% to
15% to further strengthen Marco-LLM's multilingual capabilities in low-resourced languages. Moreover, the
proportions of parallel data, high-quality knowledge data, and synthetic data are increased, while those of
the English and Chinese corpora are reduced. Notably, the proportions of the different corpora were determined
through a large number of experiments on Marco-1.5B with a constant learning rate, rather than chosen by hand.
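A minimal sketch of how the Table 2 proportions could drive data sampling. The source labels, the helper names, and the use of weighted sampling per example are assumptions made for illustration; the paper does not describe its data loader.

```python
import random
from typing import Dict, List

# Percentages copied from Table 2; the dictionary keys are illustrative labels.
MIXTURE: Dict[str, Dict[str, int]] = {
    "stage1": {"en": 32, "zh": 17, "hr_other": 30, "lr": 9,
               "parallel": 6, "knowledge": 5, "synthetic": 1},
    "stage2": {"en": 28, "zh": 15, "hr_other": 26, "lr": 15,
               "parallel": 8, "knowledge": 6, "synthetic": 2},
}


def check_mixture(mix: Dict[str, int]) -> None:
    """Fail loudly if the percentages of a stage do not sum to 100."""
    total = sum(mix.values())
    if total != 100:
        raise ValueError(f"mixture sums to {total}, expected 100")


def sample_sources(stage: str, n: int, seed: int = 0) -> List[str]:
    """Draw n source labels with probability proportional to the stage mixture."""
    mix = MIXTURE[stage]
    check_mixture(mix)
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=n)


def stage_shift() -> Dict[str, int]:
    """Change in percentage points for each source from Stage-I to Stage-II."""
    return {k: MIXTURE["stage2"][k] - MIXTURE["stage1"][k] for k in MIXTURE["stage1"]}


print(stage_shift())
print(sample_sources("stage2", 5))
```

The shift makes the Stage-II intent explicit: low-resourced data gains 6 percentage points, while English, Chinese, and the other high-resourced languages together give up 10.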
Lower Maximal Learning Rate. The learning rate is a crucial hyper-parameter in neural network
models that controls the magnitude of parameter updates [Ibrahim et al., 2024]. However, most
performant open-source LLMs [Yang et al., 2024a, Dubey and et al, 2024] decay their learning rate
to a small value (i.e., ∼1e-7) by the end of training. In practice, we re-warm and re-decay the
learning rate to improve adaptation per unit of compute spent when continually pretraining
on a new distribution. The key challenge is to optimize the maximum learning rate. Empirically,
decreasing the schedule's maximum learning rate can help reduce forgetting, whereas increasing it
can improve adaptation. We refer to Section 3.6 for learning rate tuning. After conducting fine-grained
learning rate experiments with the data mixture fixed, we set the maximal learning rate to
1e-5 in Stage-I to strike a balance between the acquisition of multilingual capabilities and catastrophic
forgetting. Building on the HR multilingual capacities acquired in Stage-I training, we proceed
to train on the LR corpus with a smaller learning rate of 6e-6. As described in Table 3, Stage-I uses
about 160B training tokens, while Stage-II introduces an additional 140B tokens. Note that
an attempt to apply the Warmup-Stable-Decay (WSD) learning rate scheduler [Hu et al., 2024]
resulted in subpar performance. We suspect that the stable stage does not necessarily benefit
continual pretraining.
Training. The Marco-LLM models were trained using Pai-Megatron-LM§ on a cluster of 512 A100 GPU
(64x80G) servers. To accelerate multi-node training of large models, tensor parallelism and pipeline
parallelism were both set to 8 to maximize the model FLOPs utilization (MFU) of the NVIDIA GPUs. Specifically, we
employ 512 GPUs for Marco-72B and 256 GPUs for Marco-1.5B/7B.
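As a rough illustration of what these settings imply, the sketch below derives the data-parallel size from the GPU count and the two model-parallel degrees. It assumes the usual Megatron-style decomposition (world size = tensor parallel × pipeline parallel × data parallel) with no other parallel dimensions, and it applies the setting of 8 only to Marco-72B; the text does not say whether the smaller models use the same degrees.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas when each replica spans TP x PP GPUs."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError("world_size must be divisible by tensor_parallel * pipeline_parallel")
    return world_size // gpus_per_replica


# Marco-72B: 512 GPUs with tensor parallel = pipeline parallel = 8 (values from the text).
replicas_72b = data_parallel_size(world_size=512, tensor_parallel=8, pipeline_parallel=8)
print(f"GPUs per model replica: {8 * 8}, data-parallel replicas: {replicas_72b}")
```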
To keep the training distribution consistent across model sizes, we conduct experiments on Marco-1.5B to
optimize the data mixture and learning rate, and then scale these settings up to Marco-7B and Marco-72B.
We adopt most of the pretraining settings and model architectures from Qwen2. During training,
all Marco-LLM models were continually pretrained on 300B tokens using the Adam optimizer (β1 = 0.9, β2 = 0.95).
We use a context window length of 32768 for Marco-1.5B and 7B, while the sequence
length is 8192 for Marco-72B. At each stage, we warm up the learning rate from 0 to the maximum
learning rate over the first 10B tokens, and then decay it to 10% of the maximal learning rate using a
cosine schedule. We use a weight decay of 0.1 and gradient clipping of 1.0.
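The per-stage schedule described above can be sketched as a function of tokens seen. This is a minimal illustration rather than the authors' implementation: it assumes the warm-up is linear and that the cosine decay spans the remainder of each stage, neither of which the text states explicitly. The stage lengths and maximum learning rates are the ones given earlier in this section.

```python
import math


def learning_rate(tokens_seen: float, stage_tokens: float, max_lr: float,
                  warmup_tokens: float = 10e9, final_ratio: float = 0.1) -> float:
    """Warm up linearly from 0 to max_lr, then cosine-decay to final_ratio * max_lr."""
    if tokens_seen < warmup_tokens:
        return max_lr * tokens_seen / warmup_tokens
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (tokens_seen - warmup_tokens) / (stage_tokens - warmup_tokens))
    min_lr = final_ratio * max_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# (name, stage length in tokens, maximum learning rate)
STAGES = [("Stage-I", 160e9, 1e-5), ("Stage-II", 140e9, 6e-6)]

for name, stage_tokens, max_lr in STAGES:
    for tokens in (0.0, 10e9, stage_tokens / 2, stage_tokens):
        lr = learning_rate(tokens, stage_tokens, max_lr)
        print(f"{name} at {tokens / 1e9:.0f}B tokens: lr = {lr:.2e}")
```

Because the schedule restarts at each stage, the learning rate is re-warmed from 0 at the start of Stage-II and peaks at the lower maximum of 6e-6.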
After the two-stage continual pretraining, we obtain our multilingual large model, Marco.
Our two-stage continual pretraining is both effective and efficient, and it can be extended to newly
added low-resourced languages; we leave this exploration to future model iterations.
Table 3 | Training tokens and maximum learning rate in the two-stage continual pretraining.
Stage      Training Tokens (B)   Max Learning Rate
Stage-I    160                   1e-5
Stage-II   140                   6e-6
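Combining the stage sizes in Table 3 with the proportions in Table 2 gives an approximate token budget per data source. The sketch below assumes the Table 2 percentages are shares of training tokens in each stage, which the paper does not state explicitly; the source labels are illustrative.

```python
from typing import Dict

STAGE_TOKENS_B = {"Stage-I": 160, "Stage-II": 140}  # Table 3, in billions of tokens

MIXTURE_PCT = {  # Table 2
    "Stage-I": {"en": 32, "zh": 17, "hr_other": 30, "lr": 9,
                "parallel": 6, "knowledge": 5, "synthetic": 1},
    "Stage-II": {"en": 28, "zh": 15, "hr_other": 26, "lr": 15,
                 "parallel": 8, "knowledge": 6, "synthetic": 2},
}


def token_budget(stage: str) -> Dict[str, float]:
    """Tokens (in billions) allotted to each source in one stage."""
    total = STAGE_TOKENS_B[stage]
    return {source: total * pct / 100.0 for source, pct in MIXTURE_PCT[stage].items()}


def total_by_source() -> Dict[str, float]:
    """Tokens (in billions) per source summed over both stages."""
    totals: Dict[str, float] = {}
    for stage in STAGE_TOKENS_B:
        for source, tokens in token_budget(stage).items():
            totals[source] = totals.get(source, 0.0) + tokens
    return totals


for source, tokens in total_by_source().items():
    print(f"{source:10s} {tokens:6.1f}B")
```

Under this assumption, low-resourced data receives more tokens in Stage-II than in Stage-I even though Stage-II is the shorter stage.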
3.3. Evaluation Results
We aim to assess the capabilities of Marco-LLM from various perspectives: 1) the ability of LLMs to
understand and generate natural language, as well as the ability to grasp world knowledge; 2) the
performance of LLMs across different languages; and 3) their capacity to handle cross-lingual tasks
such as machine translation. Following the experiment design of previous work [Wei et al., 2023b],
we gather a subset of datasets from previous NLP tasks to construct a multilingual benchmark. Table
4 summarizes the evaluation tasks and datasets, together with their language coverage.
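The shot counts listed in Table 4 (for example 5-shot or 25-shot) refer to the number of solved demonstrations placed before the test question. The sketch below shows one way such a prompt could be assembled for a multiple-choice item. The `Item` class, the prompt template, and the answer letters are hypothetical; the paper does not specify its prompt format.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCD"


@dataclass
class Item:
    question: str
    options: List[str]
    answer: int  # index into options


def format_item(item: Item, with_answer: bool) -> str:
    """Render one question; demonstrations include the gold answer letter."""
    lines = [f"Question: {item.question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, item.options)]
    lines.append(f"Answer: {LETTERS[item.answer]}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_k_shot_prompt(demos: List[Item], query: Item, k: int) -> str:
    """Concatenate k solved demonstrations followed by the unsolved query."""
    if k > len(demos):
        raise ValueError("not enough demonstrations for the requested shot count")
    blocks = [format_item(demo, with_answer=True) for demo in demos[:k]]
    blocks.append(format_item(query, with_answer=False))
    return "\n\n".join(blocks)


demos = [Item("2 + 2 = ?", ["3", "4", "5", "6"], 1)]
query = Item("3 + 3 = ?", ["5", "6", "7", "8"], 1)
print(build_k_shot_prompt(demos, query, k=1))
```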
3.3.1. Benchmarks and Evaluation Protocol
All the datasets in the above multilingual benchmark can be divided into four groups: Natural
Language Understanding, Knowledge, Question Answering, and Machine Translation. The details of
each dataset that we use for benchmarking are given below.
AGIEval: AGIEval [Zhong et al., 2023] is a benchmark dataset designed to evaluate the reasoning
and problem-solving abilities of artificial intelligence models on tasks that mimic human examinations.
§https://github.com/alibaba/Pai-Megatron-Patch/
Table 4 | Evaluation benchmarks overview. The table presents the comprehensive evaluation suite
used in our experiments, spanning four major task categories: general knowledge, multilingual
understanding, question answering, and machine translation. Each dataset is evaluated using either
accuracy (Acc.), F1 score, or BLEU metric, covering a diverse range of languages from single-language
(en, zh) to multilingual scenarios.
Task                         Dataset       Split     Metric   #Languages     #-shots
General Knowledge            CEVAL         Val       Acc.     One            5-shot
General Knowledge            AGIEval       Test      Acc.     One            5-shot
General Knowledge            ARC           Test      Acc.     One            25-shot
General Knowledge            MMLU          Test      Acc.     One            5-shot
Multilingual Understanding   XCOPA         Val       Acc.     Six            5-shot
Multilingual Understanding   X-MMLU        Val       Acc.     Thirteen       5-shot
Multilingual Understanding   XStoryCloze   Val       Acc.     Six            5-shot
Question Answering           TyDiQA        Val       F1       Six            1-shot
Question Answering           Belebele      Test      Acc.     Twenty-Eight   1-shot
Machine Translation          Flores        Devtest   BLEU     Twenty-Eight   1-shot
Machine Translation          WMT-16        Test      BLEU     Three          1-shot
It includes thousands of multiple-choice questions sourced from real-world standardized tests such as
the GRE, GMAT, LSAT, and other professional certification exams. The questions cover a wide range
of subjects, including mathematics, logical reasoning, reading comprehension, and analytical writing.
ARC (AI2 Reasoning Challenge): The ARC dataset [Clark et al., 2018] consists of 7,787 natural
language questions designed to assess an AI model's ability to answer grade-school-level science
questions. It is divided into two subsets: the Easy Set, containing 5,197 questions that can often
be answered with surface-level information, and the Challenge Set, comprising 2,590 more difficult
questions that require reasoning and background knowledge beyond simple retrieval. Questions are
multiple-choice, typically with four options each, covering topics like biology, physics, chemistry, and earth
science.
Belebele: Belebele [Bandarkar et al., 2024] is a multilingual multiple-choice reading comprehension
dataset spanning 122 language variants, including low-resource and typologically diverse languages.
It provides a benchmark for evaluating machine reading comprehension across a wide linguistic
spectrum. Each question includes a passage, a query, and four answer choices.
CEVAL: CEVAL [Huang et al., 2023] is a comprehensive evaluation suite designed to assess the
capabilities of Chinese language models. It includes over 13,000 multiple-choice questions sourced
from real Chinese national college entrance examinations and professional qualification tests. The
dataset covers 52 subjects across disciplines such as mathematics, physics, law, medicine, and language
arts. Each question has four options.
Flores: The Flores (Facebook Low Resource Languages for Emergent Situations) dataset [Team
et al., 2022] is a multilingual machine translation benchmark focused on low-resource languages. It
includes parallel sentences in over 100 languages, with a particular emphasis on underrepresented
and low-resource languages. The dataset provides high-quality, professionally translated sentences,
allowing for accurate evaluation of machine translation models.
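Table 4 lists BLEU as the metric for the translation benchmarks. The sketch below is a simplified corpus-level BLEU with a single reference per sentence, whitespace tokenization, and no smoothing. It is meant only to show how the metric is formed; the paper does not say which BLEU implementation or tokenization it uses, and scores from this sketch are not comparable to those of standard tools.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus BLEU on a 0-100 scale: clipped n-gram precision times brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any empty n-gram order gives a score of 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))
print(corpus_bleu(["the cat sat on a mat"], reference))
```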
HellaSwag: HellaSwag [Zellers et al., 2019] is a benchmark dataset designed to test a model’s
ability to perform commonsense reasoning and natural language inference. It contains 70,000
multiple-choice questions generated from descriptions of everyday activities. Each question provides
a context and four possible endings, with one correct continuation and three distractors that are
misleading and adversarially constructed. For example: Context: "A person is cooking onions on a stove.
They begin to cry because..." Options: 1. "the onions release gas that irritates the eyes." (Correct) 2. "they
are listening to sad music." 3. "the stove is very hot." 4. "they forgot to buy garlic."
MMLU (Massive Multitask Language Understanding): The MMLU benchmark [Hendrycks et al.,
2021] is designed to evaluate the broad knowledge and problem-solving abilities of AI models across
57 subjects. It includes nearly 16,000 multiple-choice questions from high school, undergraduate, and
professional levels. Subjects span various domains, including history, mathematics, law, medicine,
computer science, and more. Each question has four options, and the tasks often require reasoning,
calculation, or application of specialized knowledge.
TyDiQA: TyDiQA (Typologically Diverse Question Answering) is a benchmark dataset [Clark et al.,
2020] for information-seeking question answering in 11 typologically diverse languages. It includes
over 200,000 question-answer pairs with passages from Wikipedia in languages such as Arabic,
Bengali, Finnish, Japanese, Korean, Russian, Swahili, Telugu, and others. The dataset is designed to
test a model’s ability to understand and generate answers in different languages without relying on
English translations. TyDiQA has two primary tasks: Gold Passage Task, where models are provided
with the correct passage containing the answer, and Minimal Answer Task, where models must find
the minimal span in the passage that answers the question.
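Table 4 reports F1 for TyDiQA. A common way to score extractive answers is token-level F1 between the predicted and gold spans, sketched below. This is an assumption about the scoring, not the paper's evaluation script, and the whitespace tokenization used here is a simplification that would not suit languages written without spaces, such as Japanese.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("in Helsinki", "Helsinki"))
print(token_f1("Helsinki", "Helsinki"))
```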
WMT16: The WMT16 [Bojar et al., 2016] datasets are part of the Conference on Machine Translation's
annual shared tasks, which provide standard benchmark datasets for evaluating machine
translation systems. WMT16 includes parallel corpora for language pairs such as English-German,
English-French, and English-Russian, and expands on previous years by adding languages like
Romanian and incorporating more challenging test sets.
XCOPA: XCOPA [Ponti et al., 2020] is a multilingual dataset for evaluating causal commonsense
reasoning in AI systems. It extends the original COPA (Choice of Plausible
Alternatives) dataset to 11 languages, including Haitian Creole, Quechua, and Swahili.
Each question consists of a premise and two alternative causes or effects, and the task is to select
the more plausible one. For example: Premise: "The ground is wet." Options: 1. "It rained last night."
(Cause) 2. "The sun is shining."
X-MMLU: X-MMLU [Dac Lai et al., 2023] is an extension of the MMLU benchmark for evaluating
multilingual models. It includes translated versions of the original MMLU tasks into multiple languages.
The dataset aims to assess a model’s ability to perform multitask language understanding across
diverse languages, testing both its knowledge and reasoning skills in non-English contexts. X-MMLU
covers subjects like mathematics, science, and humanities, requiring models to demonstrate proficiency
comparable to educated human speakers in various languages.
XStoryCloze: XStoryCloze [Lin et al., 2021] is a multilingual version of the Story Cloze Test, designed
to evaluate a model’s ability to understand and reason about narratives in different languages. The
dataset provides short, four-sentence stories followed by two possible endings, and the task is to
choose the coherent conclusion. It includes translations of the stories into multiple languages, testing
narrative understanding and commonsense reasoning. For example: Story: 1. "Maria woke up early
on Saturday." 2. "She was excited about the trip." 3. "She packed her bags quickly." 4. "She grabbed her
keys and left the house." Endings: A. "She arrived at the airport just in time for her flight." (Correct) B.
"She decided to go back to sleep because it was raining."
3.3.2. Baseline LLMs
Llama3 and Llama3.1. The Llama 3 series includes the Llama3-8B/70B and Llama3.1-8B/70B
models. These models have been pretrained on over 15 trillion tokens from publicly available sources,
achieving superior performance on various benchmarks [Dubey and et al, 2024].
Qwen2 and Qwen2.5. We compare with the Qwen2 [Yang et al., 2024b] series of LLMs, including
the Qwen2-7B/72B and Qwen2.5-7B/72B models. Qwen2 is pretrained on 7 trillion tokens,
and Qwen2.5 is a further optimized version of Qwen2 that is pretrained on 18 trillion tokens and
achieves state-of-the-art performance on multiple evaluation benchmarks.
3.3.3. Experimental Results
Table 5 | Performance comparison of LLMs across various benchmarks: Results for LLMs with parameters
of both 7B and 70B. The best performance in each benchmark is in bold.
7B Models
Model 	AGIEval Belebele CEval Flores MMLU TyDiQA WMT16 XCOPA XMMLU XStoryCloze
Qwen2-7B 	64.6 73.4 83.0 27.1 71.9 52.3 18.1 70.6 60.2 	70.6
Qwen2.5-7B 	66.5 72.3 81.4 27.2 75.4 59.9 18.2 73.6 62.6 	70.3
Llama3-8B 	24.3 55.3 37.5 33.1 53.6 50.5 24.6 71.7 49.7 	66.5
Llama3.1-8B 44.9 63.3 52.8 33.4 66.2 57.0 25.8 71.6 49.2 	71.7
Marco-7B 	68.8 78.8 83.5 35.0 74.4 60.8 29.0 76.6 61.2 	71.9
70B+ Models
Qwen2-72B 	78.2 86.5 90.4 38.7 83.8 58.7 30.2 80.9 78.5 	77.1
Qwen2.5-72B 80.8 87.6 90.6 35.0 86.3 63.7 31.0 84.7 79.9 	76.3
Llama3-70B 	60.6 85.5 66.8 37.4 79.2 64.3 34.3 81.1 72.0 	76.9
Llama3.1-70B 61.7 86.2 67.3 36.9 78.8 62.8 35.0 83.0 71.4 	75.4
Marco-72B 	84.4 90.0 93.7 45.0 86.3 62.7 35.1 85.7 81.2 	78.7
Results divided by benchmarks. Our proposed models, Marco-7B and Marco-72B, demonstrate
superior multilingual capabilities across a variety of benchmarks compared to baseline models of
similar sizes (see Table 5). On multilingual understanding and reasoning tasks such as Belebele,
CEVAL, MMLU, XMMLU, XCOPA, and XStoryCloze, the Marco-LLM models generally outperform the
other strong LLMs, indicating enhanced proficiency in comprehension and commonsense reasoning
across diverse languages. For the 7B models, Marco-7B achieves the best performance across several
benchmarks, obtaining the highest scores in AGIEval (68.8), Belebele (78.8), CEval (83.5), Flores
(35.0), and TyDiQA (60.8). These results highlight its proficiency in handling diverse tasks and
datasets, outperforming Qwen2.5-7B by 2.3, 6.5, 2.1, 7.8, and 0.9 points,
respectively. The 72B models further amplify these strengths. Marco-72B achieves the highest scores
in multiple benchmarks, including AGIEval (84.4), Belebele (90.0), CEval (93.7), Flores (45.0), and
XMMLU (81.2), showcasing its exceptional capacity to handle complex linguistic tasks. Compared
to Qwen2.5-72B, Marco-72B is higher by 3.6 points in AGIEval, 2.4 points in Belebele, and 10.0
points in Flores. Notably, Marco-72B achieves the highest scores in benchmarks like CEVAL (93.7)
and XMMLU (81.2), underscoring its advanced multilingual understanding and reasoning abilities.
In addition, the Marco-LLM models exhibit strong performance in multilingual translation and generation
tasks, as evidenced by competitive scores on the Flores and WMT16 benchmarks. Their strong
results in TyDiQA further highlight their capacity for multilingual question answering and knowledge
retrieval. Overall, these findings support that our focus on enhancing the multilingual abilities of
LLMs has led to models that are highly effective in understanding, reasoning, and generating text
across multiple languages, thereby validating the efficacy of our approach.
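The per-benchmark comparison can be reproduced mechanically from Table 5. The sketch below copies the 7B rows of that table and reports which model leads each benchmark; the variable and function names are illustrative.

```python
from typing import Dict, List

BENCHMARKS = ["AGIEval", "Belebele", "CEval", "Flores", "MMLU",
              "TyDiQA", "WMT16", "XCOPA", "XMMLU", "XStoryCloze"]

# 7B rows of Table 5, in the column order of BENCHMARKS.
SCORES_7B: Dict[str, List[float]] = {
    "Qwen2-7B":    [64.6, 73.4, 83.0, 27.1, 71.9, 52.3, 18.1, 70.6, 60.2, 70.6],
    "Qwen2.5-7B":  [66.5, 72.3, 81.4, 27.2, 75.4, 59.9, 18.2, 73.6, 62.6, 70.3],
    "Llama3-8B":   [24.3, 55.3, 37.5, 33.1, 53.6, 50.5, 24.6, 71.7, 49.7, 66.5],
    "Llama3.1-8B": [44.9, 63.3, 52.8, 33.4, 66.2, 57.0, 25.8, 71.6, 49.2, 71.7],
    "Marco-7B":    [68.8, 78.8, 83.5, 35.0, 74.4, 60.8, 29.0, 76.6, 61.2, 71.9],
}


def leaders(scores: Dict[str, List[float]]) -> Dict[str, str]:
    """Map each benchmark to the model with the highest score."""
    return {
        bench: max(scores, key=lambda model: scores[model][i])
        for i, bench in enumerate(BENCHMARKS)
    }


best = leaders(SCORES_7B)
marco_wins = [bench for bench, model in best.items() if model == "Marco-7B"]
print(f"Marco-7B leads {len(marco_wins)} of {len(BENCHMARKS)} benchmarks")
print({bench: model for bench, model in best.items() if model != "Marco-7B"})
```

On these numbers Marco-7B leads on eight of the ten benchmarks, with Qwen2.5-7B ahead on MMLU and XMMLU.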
Table 6 | Average performance on multilingual benchmarks shown in Table 4 for five 7B-parameter
LLMs divided by 29 languages. The best performance for each language is highlighted in bold.
Language 	Llama3-8B Llama3.1-8B Qwen2-7B Qwen2.5-7B Marco-7B
Chinese (zh) 	55.1 	63.5 	75.8 	75.5 	76.0
English (en) 	69.6 	74.2 	77.4 	78.0 	77.9
Arabic (ar) 	40.5 	52.6 	61.3 	64.5 	66.0
German (de) 	47.3 	59.8 	69.5 	69.0 	72.9
Spanish (es) 	57.0 	64.9 	70.4 	71.9 	72.6
French (fr) 	56.0 	63.5 	69.8 	71.2 	72.5
Japanese (ja) 	63.4 	74.9 	76.7 	76.4 	77.3
Korean (ko) 	43.2 	63.8 	76.0 	79.7 	78.3
Portuguese (pt) 	56.8 	64.7 	70.7 	72.3 	72.3
Turkish (tr) 	51.1 	62.9 	66.0 	66.6 	73.4
Azerbaijani (az) 	39.1 	48.2 	52.2 	53.2 	69.4
Bengali (bn) 	45.9 	49.8 	63.9 	62.3 	68.9
Hebrew (he) 	50.2 	56.6 	69.8 	68.6 	77.3
Indonesian (id) 	64.0 	66.1 	77.3 	77.8 	82.3
Italian (it) 	68.1 	69.4 	80.3 	79.4 	83.4
Polish (pl) 	62.2 	60.7 	77.2 	76.3 	80.9
Malay (ms) 	56.8 	57.7 	73.2 	73.9 	80.2
Dutch (nl) 	61.0 	65.2 	80.2 	71.6 	82.1
Romanian (ro) 	65.6 	67.4 	77.3 	72.6 	80.8
Russian (ru) 	69.2 	70.7 	81.2 	77.1 	83.2
Thai (th) 	54.8 	53.3 	69.1 	73.7 	72.9
Ukrainian (uk) 	60.9 	60.4 	72.7 	70.4 	79.9
Urdu (ur) 	50.0 	56.7 	63.9 	59.9 	71.3
Vietnamese (vi) 	67.6 	70.3 	76.1 	79.0 	81.2
Czech (cs) 	59.2 	63.6 	76.9 	70.0 	78.2
Greek (el) 	68.1 	67.7 	65.0 	67.2 	77.1
Hungarian (hu) 	59.8 	61.0 	63.3 	57.9 	69.7
Kazakh (kk) 	41.3 	44.0 	43.1 	45.9 	66.1
Nepali (ne) 	37.0 	41.56 	36.33 	42.1 	65.8
Avg. Scores 	55.9 	61.2 	69.4 	69.1 	75.5
Results divided by languages. The experiments evaluate the performance of both Marco-LLM variants,
7B and 72B, against various strong open-source LLMs across a diverse set of languages on the
benchmarks described in Section 3.3.1. Both Marco-LLM models are built by continual pretraining of Qwen2
on massive multilingual data. The per-language results in Table 6 reveal that
Marco-7B achieves a leading average score of 75.5 among the 7B-parameter models. In low-resource
languages such as Nepali and Kazakh, Marco-7B obtains strong scores of 65.8 and
66.1, respectively, outperforming Qwen2-7B by substantial margins of 29.47 and 23.0 points. This
improvement underscores the benefits of our continual pretraining approach using a vast multilingual
Table 7 | Average performance on multilingual benchmarks shown in Table 4 for five 70B-parameter
LLMs divided by 29 languages. The best performance for each language is highlighted in bold.
Language 	Llama3-70B Llama3.1-70B Qwen2-72B Qwen2.5-72B Marco-72B
Chinese (zh) 	75.2 	75.0 	83.7 	84.8 	86.4
English (en) 	82.7 	83.2 	84.6 	85.1 	86.0
Arabic (ar) 	73.0 	73.4 	76.2 	77.4 	79.7
German (de) 	80.8 	80.9 	84.0 	85.0 	87.0
Spanish (es) 	80.7 	79.4 	82.5 	83.1 	84.8
French (fr) 	78.9 	80.7 	80.4 	82.7 	82.8
Japanese (ja) 	82.5 	84.5 	86.6 	86.1 	86.2
Korean (ko) 	87.1 	87.0 	88.8 	88.9 	90.6
Portuguese (pt) 	81.1 	80.8 	83.6 	83.8 	85.5
Turkish (tr) 	80.8 	81.3 	79.0 	81.5 	84.4
Azerbaijani (az) 	77.2 	77.9 	78.8 	79.1 	85.8
Bengali (bn) 	79.7 	79.7 	83.2 	82.9 	86.7
Hebrew (he) 	80.1 	82.3 	83.9 	84.1 	85.1
Indonesian (id) 	87.7 	88.0 	87.6 	88.4 	93.0
Italian (it) 	87.1 	87.9 	88.1 	89.9 	91.0
Polish (pl) 	87.2 	87.1 	88.6 	88.8 	88.2
Malay (ms) 	85.2 	87.2 	83.4 	87.9 	90.7
Dutch (nl) 	88.9 	88.8 	89.0 	90.3 	93.0
Romanian (ro) 	88.4 	88.2 	88.2 	87.3 	90.4
Russian (ru) 	87.9 	88.3 	89.9 	91.0 	90.3
Thai (th) 	80.2 	80.4 	85.7 	87.6 	88.3
Ukrainian (uk) 	89.0 	88.6 	88.2 	90.2 	91.7
Urdu (ur) 	80.6 	81.2 	81.4 	82.1 	87.9
Vietnamese (vi) 	87.1 	89.6 	88.7 	90.0 	90.8
Czech (cs) 	87.1 	87.6 	87.8 	90.0 	91.4
Greek (el) 	88.3 	89.6 	89.4 	89.0 	91.1
Hungarian (hu) 	83.8 	83.0 	80.1 	88.1 	88.3
Kazakh (kk) 	74.7 	77.9 	70.3 	72.2 	84.8
Nepali (ne) 	70.1 	73.8 	67.1 	74.1 	86.7
Avg. Scores 	82.5 	83.2 	83.8 	85.2 	87.9
corpus, which enhances the model’s ability to generalize across languages with limited resources.
Marco-7B also shows competitive performance in high-resource languages like Arabic and German,
surpassing Qwen2.5-7B by 1.5 and 3.9 points, respectively. These results highlight the effectiveness
of our continual pretraining approach, which improves the model’s adaptability to different linguistic
structures and complexities.
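The margins quoted for Marco-7B can be checked directly against Table 6. The sketch below copies the relevant rows and recomputes each difference; the dictionary layout and helper name are illustrative.

```python
# Rows copied from Table 6: language -> scores for the models compared in the text.
TABLE6 = {
    "Nepali": {"Qwen2-7B": 36.33, "Qwen2.5-7B": 42.1, "Marco-7B": 65.8},
    "Kazakh": {"Qwen2-7B": 43.1, "Qwen2.5-7B": 45.9, "Marco-7B": 66.1},
    "Arabic": {"Qwen2-7B": 61.3, "Qwen2.5-7B": 64.5, "Marco-7B": 66.0},
    "German": {"Qwen2-7B": 69.5, "Qwen2.5-7B": 69.0, "Marco-7B": 72.9},
}


def margin(language: str, baseline: str) -> float:
    """Marco-7B score minus the baseline score, rounded to two decimals."""
    row = TABLE6[language]
    return round(row["Marco-7B"] - row[baseline], 2)


# Low-resource margins are quoted against Qwen2-7B, high-resource ones against Qwen2.5-7B.
for language, baseline in [("Nepali", "Qwen2-7B"), ("Kazakh", "Qwen2-7B"),
                           ("Arabic", "Qwen2.5-7B"), ("German", "Qwen2.5-7B")]:
    print(f"{language}: +{margin(language, baseline)} over {baseline}")
```

Note that the two pairs of margins use different baselines, so the low-resource and high-resource gains are not measured against the same model.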
Similarly, Marco-72B exhibits remarkable performance (shown in Table 7), achieving the highest
average score of 87.9 across 29 languages among the LLMs with 70B+ parameters. It demonstrates remarkable
performance in low-resource languages, achieving 86.7 in Nepali and 84.8 in Kazakh,
improvements over Qwen2.5-72B of 12.6 points in each case. This indicates the efficacy of
scaling model size alongside multilingual training data to achieve superior performance in challenging
linguistic contexts. Additionally, Marco-72B remains highly competitive in high-resource languages,
achieving scores of 79.7 in Arabic and 87.0 in German, which are improvements of 2.3 and 2.0
points over Qwen2.5-72B. Compared with Qwen2, the base LLM on which we conduct continual
pretraining, Marco-LLM achieves substantial improvements on the languages shown in Table 1.
The largely consistent outperformance of Marco-LLM across both high-resource and low-resource languages
underscores the pivotal role of leveraging a vast multilingual corpus during pretraining, as shown in
Figure 3. This approach not only enhances the models' ability to generalize to low-resource languages,
but also maintains strong performance in high-resource languages such as English and Chinese. These
findings suggest that the comprehensive training methodologies employed in the Marco-LLM are key
factors contributing to their leading performance in multilingual settings.
[Figure 4: four panels plotting performance against billions of training tokens (0 to 300). (a) XStoryCloze accuracy for zh, en, ar, es; (b) XWinograd accuracy for zh, en, fr, ja, pt; (c) Belebele accuracy for zh, en, ar, de, es, fr, ko, pt, tr; (d) Flores BLEU for zh, ar, de, es, fr, ko, ja, pt, tr.]
Figure 4 | Evolution of performance on question answering and machine translation during continual
pretraining in Marco-1.5B.
3.4. Evolution of performance during continual pretraining
During continual pretraining, we track the performance of our models on several question answering
and machine translation benchmarks. We report the performance of Marco-1.5B in Figure 4. On the
question answering benchmarks, performance in the other languages, such as Arabic (ar),
German (de), Spanish (es), French (fr), Korean (ko), Japanese (ja), Portuguese (pt), and Turkish
(tr), improves steadily, while capabilities in Chinese (zh) and English (en) are maintained and
even slightly increased. On Flores, performance improves consistently for all languages.
Notably, Figure 4(d) reports the averages over both the English-to-multilingual (EN→XX)
and multilingual-to-English (XX→EN) directions. These trends are consistent with our goal
of enhancing multilingual capabilities.
3.5. Data ablations on Parallel Data
Empirically, introducing parallel corpora can enhance cross-lingual semantic alignment.
In this section, we explore the impact of parallel corpora on specific downstream NLP tasks, such
as machine translation. We mainly focus on the quality of the parallel data. Open-source parallel
data contains many bad cases; what happens if we do not filter them out during continual pretraining?
We report our findings in Figure 5, where Marco-w/o-parallel-data-filtering denotes a Marco-LLM
continually pre-trained from Qwen2 without any filtering of the parallel data. We have the following
findings:
• Phenomena vary across model scales. Compared with the base model Qwen2,
Marco-w/o-parallel-data-filtering improves at the smaller scales (1.5B and 7B),
whereas the 72B model degrades. We elaborate on this phenomenon as follows. First,
Marco's machine translation capability benefits from both monolingual and parallel data, so
using low-quality parallel data during continual pretraining leads to gradient conflicts in the
smaller models. Because the monolingual data makes up the dominant proportion, it plays a crucially
positive role, which explains the improvements observed with Marco-w/o-parallel-data-filtering.
Second, owing to parameter redundancy, the 72B model contains a higher number of monosemantic
neurons [Gurnee et al., 2023], and the low-quality parallel data interferes negatively with the
machine translation task. This finding yields an important insight: conclusions obtained on small
models do not necessarily carry over to larger models. Furthermore, beyond the issue of parallel
data, we suspect that cross-language gradient conflicts may also exist during multilingual training,
which we will explore in future research.
• High-quality data enhances model performance. Compared with the base model Qwen2,
Marco-LLM, continually pre-trained with the processed parallel data, shows
significant improvements at the 1.5B, 7B, and 72B scales. We attribute this primarily to the data
quality resulting from our data-engineering efforts. We will further enhance the quality of
our training data in upcoming model iterations.
[Figure 5: bar chart of Flores BLEU (axis 0 to 35) at the 1.5B, 7B, and 72B scales for Qwen2,
Marco-w/o-parallel-data-filtering, and Marco.]
Figure 5 | The performance of different model sizes on the Flores benchmark. Marco-w/o-parallel-data-filtering
denotes that we continually pre-trained Marco-LLM from Qwen2 without applying any
filtering to the parallel data.
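To make the notion of "bad cases" in parallel data concrete, the following is a minimal sketch of
rule-based pair filtering. It is not the filtering pipeline used for Marco-LLM, which this section
does not specify; the function name keep_pair, the max_len_ratio parameter, and its value of 3.0 are
assumptions chosen for illustration only.

```python
def keep_pair(src: str, tgt: str, max_len_ratio: float = 3.0) -> bool:
    """Return True if a (source, target) sentence pair passes basic sanity checks.

    Hypothetical heuristics, not the authors' method:
    - both sides must be non-empty,
    - the target must not be a verbatim copy of the source,
    - the character-length ratio must not exceed max_len_ratio.
    """
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False  # one side is missing
    if src == tgt:
        return False  # likely an untranslated copy
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_len_ratio


pairs = [
    ("Hello world.", "Bonjour le monde."),
    ("Hello world.", ""),
    ("Hello world.", "Hello world."),
    ("Hi.", "Ceci est une phrase beaucoup trop longue pour la source."),
]
filtered = [p for p in pairs if keep_pair(*p)]
print(filtered)
```

A character-length ratio is a crude signal, particularly for pairs that mix scripts of very
different density (for example Chinese and English), so a practical threshold would likely need to
be set per language pair.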
3.6. Effect of Learning Rate
In this section, we explore the effect of the learning rate, a crucial hyper-parameter during
continual pretraining. We select the learning rate (lr) from {1e-5, 2e-5, 3e-5} and conduct
continual pretraining on a 200B-token corpus under a fixed data mixture. Figure 6 shows how the
average score on question answering benchmarks evolves during training for each learning rate.
Specifically, Figure 6(a) shows that the forgetting of Chinese and English ability is aggravated as
the learning rate increases, while Figure 6(b) shows that multilingual ability is generally enhanced.
Interestingly, with a learning rate of 3e-5 the average accuracy in the multilingual languages first
rises and then falls. We believe that this is because the loss of primary-language ability reduces
natural language understanding overall. Notably, machine translation ability improves as the
learning rate increases, which is consistent with Figure 4(d). Weighing these effects, we set the
peak learning rate to 1e-5 in our experiments, as it strikes a balance between the acquisition of
the multilingual languages and the forgetting of English and Chinese.
[Figure 6: two line plots of average accuracy versus billions of training tokens (0 to 200) for
lr=1e-5, lr=2e-5, and lr=3e-5. (a) The average accuracy in English and Chinese with different
learning rates. (b) The average accuracy in multilingual languages with different learning rates.]
Figure 6 | The average performance on question answering benchmarks with different learning rates
during continual pretraining in Marco-1.5B.
4. Extensive Multilingual Post-training for Large Language Models
After conducting continual pre-training on up to 29 languages, we proceeded with post-training
of the Marco model. This phase of training primarily comprises two stages: Supervised Fine-Tuning
(SFT) [Wei et al., 2022, Ouyang et al., 2022] and Direct Preference Optimization (DPO) [Rafailov
et al., 2023]. The main objective of the SFT stage is to activate and enhance
the model's multilingual capabilities across various domains, including commonsense reasoning,
dialogue-based question answering, precise instruction following, mathematical and logical reasoning,
multilingual comprehension and translation, and coding. During this stage, our research particularly
focuses on (1) the automatic generation and cost-effective collection of high-quality multilingual data,
and (2) the transfer of extensive domain knowledge from high-resource languages, such as English,
Chinese, and French, to low-resource languages. The DPO stage aims to ensure that the content
generated by the model aligns with the specified preferences.
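For readers unfamiliar with DPO, the sketch below computes the standard DPO loss of Rafailov et al.
[2023] for a single preference pair, given sequence log-probabilities under the policy and under a
frozen reference model. It illustrates the objective only: the function name, the argument names,
and the value beta=0.1 are assumptions, and the settings actually used for Marco-LLM are not stated
in this section.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x))])
    beta=0.1 is an illustrative value, not a setting taken from the paper.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# When the policy prefers the chosen answer more than the reference does, the loss drops.
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(loss_at_init, loss_improved)
```

In practice the log-probabilities would be summed over the response tokens of each sequence and the
loss averaged over a batch; both steps are omitted here for brevity.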
4.1. Multilingual Supervised Fine-tuning
4.1.1. Data Construction
The Supervised Fine-Tuning (SFT) dataset is composed of several components:
• A limited set of detoxification data annotated by experts, along with multilingual self-cognition
enhancement data.
• Synthetic data and parallel corpora, which include precise instruction enhancement data, multi-
lingual Alpaca dialogue data, as well as synthetic data for specific tasks such as comprehension,
generation, and translation.
• Open-source instruction data, including Chain-of-Thought (CoT) enhancement data and general
instruction data like Aya collection [Singh et al., 2024b]. This includes multi-turn dialogues,
coding, mathematics, and more, with examples such as UltraChat, Glaive-Code, MetaMathQA,
MathInstruct, Belle, and Orca.
• We utilize parallel data for multilingual machine translation tasks, enhancing language diversity
and quality within the dataset. Specifically, we use the dev sets from WMT-14 to WMT-20 [Barrault
et al., 2020] and the WikiMatrix [Schwenk et al., 2021].
Data Collection and Processing Our data collection endeavors encompass two primary facets. On
one hand, we engage in the aggregation and cleansing of open-source data. On the other hand, we
employ data synthesis and the translation of parallel corpora to augment the dataset. Building upon
these two approaches, we have developed both multilingual Supervised Fine-Tuning (SFT) datasets
and multilingual Direct Preference Optimization (DPO) datasets. The following sections provide a
detailed examination of the composition of these datasets and the experiments conducted to assess
data distribution proportions.
Data Cleaning Given that the majority of our dataset is derived from open-source, synthetic,
and parallel corpus data, we implemented a comprehensive data processing strategy to ensure
quality. Initially, we applied regular expression filtering to remove inconsistencies such as merged
multi-turn dialogues, incorrect numbering in segmented outputs, HTML format outputs, emoji data,
and hyperlinks or URL references. Additionally, regular expressions and Python calculations were
employed to capture and validate mathematical equations, with incorrect mathematical results being
discarded.
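The cleaning rules above can be sketched as follows. This is an illustrative approximation rather
than the actual cleaning code: the regular expressions, the function names has_unwanted_markup and
equations_are_correct, and the restriction to simple "a op b = c" equations are all assumptions.

```python
import operator
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")
# Rough emoji coverage (pictographs plus misc symbols/dingbats); not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

NUM = r"\d+(?:\.\d+)?"
# Matches only stand-alone binary equations such as "12 + 30 = 42".
# The lookbehinds skip operands that are part of a longer chain ("1 + 2 + 3 = 6").
EQUATION_RE = re.compile(
    rf"(?<![\d.+\-*/])(?<![+\-*/]\s)({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*(-?{NUM})"
)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def has_unwanted_markup(text: str) -> bool:
    """True if the text contains a URL, an HTML tag, or an emoji."""
    return any(p.search(text) for p in (URL_RE, HTML_TAG_RE, EMOJI_RE))


def equations_are_correct(text: str, tol: float = 1e-6) -> bool:
    """Recompute every simple equation found in the text; False if any is wrong."""
    for a, op, b, claimed in EQUATION_RE.findall(text):
        a, b, claimed = float(a), float(b), float(claimed)
        if op == "/" and b == 0:
            return False  # division by zero cannot be validated
        if abs(OPS[op](a, b) - claimed) > tol:
            return False
    return True


samples = ["12 + 30 = 42", "2 + 2 = 5", "See https://example.com", "<div>hi</div>"]
clean = [s for s in samples if not has_unwanted_markup(s) and equations_are_correct(s)]
print(clean)
```

Only the first sample survives both checks. Longer expressions, negative operands, and equations
written in LaTeX are outside the scope of this sketch and pass through unvalidated.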
Data Filtering To enhance the overall performance of the model, we employed a comprehensive
pipeline based on seminal works for data quality filtering, effectively removing low-quality training
samples:
• Quality scoring: For English-language data, we utilized the Deita model to score the
raw data within a range of 1 to 6. High-scoring data were selected to construct parallel corpora,
with GPT-4 further filtering the translated data.
• QA similarity: In assessing QA relevance, we evaluated the semantic similarity between the input
and output fields. Data with similarity below a certain threshold were considered irrelevant and
subsequently removed.
• Mathematics grading: For mathematical data, we deployed models from the open-source project
Open-Web-Math to score the data, retaining only those with higher scores for training purposes.
• Multilingual difficulty scoring: For extensive multilingual parallel datasets, assessing
translation quality through open-source models poses challenges. We employed the
Instruction-Following Difficulty (IFD) method. By examining the ratio of the Conditioned Answer
Score (CA) to the Direct Answer Score (DA) during the initial iterations, we filtered for data that
significantly benefited the model.
• Semantic deduplication: Initially, we traversed the dataset to remove duplicate instructions. We
then applied MinHash and SimHash techniques for further deduplication. Lastly, embeddings
were extracted using open-source models, retaining data with embedding similarity below a 0.7
threshold.
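Two of the filters above lend themselves to a short sketch: the IFD ratio and the embedding-based
deduplication at the 0.7 threshold. The code is a simplified illustration under stated assumptions:
the per-token losses and the embedding vectors are taken as given inputs (in practice they would
come from a language model and an embedding model), the function names are hypothetical, and the
greedy keep-first strategy is one possible reading of "retaining data with embedding similarity
below a 0.7 threshold".

```python
import math
from typing import List, Sequence


def ifd_score(cond_answer_losses: Sequence[float],
              direct_answer_losses: Sequence[float]) -> float:
    """Ratio CA / DA of mean per-token answer losses.

    CA: loss of the answer tokens given the instruction.
    DA: loss of the same answer tokens without the instruction.
    """
    ca = sum(cond_answer_losses) / len(cond_answer_losses)
    da = sum(direct_answer_losses) / len(direct_answer_losses)
    return ca / da


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedup_by_embedding(embeddings: List[Sequence[float]],
                       threshold: float = 0.7) -> List[int]:
    """Greedy pass: keep an item only if it is below the threshold to all kept items."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


score = ifd_score([1.0, 1.0], [2.0, 2.0])
kept = dedup_by_embedding([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(score, kept)
```

The greedy pass compares each item against every kept item, which is quadratic in the dataset size;
at the scale of millions of examples an approximate nearest-neighbour index would typically replace
the inner loop.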
4.1.2. Training Setup
In our instruction fine-tuning, neat packing was employed to train the CT model with a context length
of 16,384 on our supervised fine-tuning data, which totals 5.7 million examples. The training utilized
the Adam optimizer with a cosine learning rate schedule. We observed that setting a large learning
rate during full model fine-tuning could severely affect the general knowledge acquired during the
pre-training and CT phases, leading to performance degradation. The optimal instruction fine-tuning
learning rate was determined by adjusting the minimum pre-training learning rate in accordance
with the batch size, with the maximum and minimum fine-tuning learning rates identified as 6e-6
and 6e-7, respectively.
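A cosine schedule between the stated maximum (6e-6) and minimum (6e-7) learning rates can be written
as below. This is a minimal sketch: the function name cosine_lr, the absence of a warmup phase, and
the step counts in the example are assumptions, since the section gives only the two endpoint values.

```python
import math


def cosine_lr(step: int, total_steps: int,
              max_lr: float = 6e-6, min_lr: float = 6e-7) -> float:
    """Cosine decay from max_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
schedule = [cosine_lr(s, total) for s in (0, total // 2, total)]
print(schedule)
```

The three sampled values are the maximum rate, the midpoint between the two rates, and the minimum
rate, which is the expected shape of a half-cosine decay.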
4.1.3. Evaluation Benchmarks
In the evaluation experiments, we employ TyDiQA, AGIEval, CEVAL, Belebele, and a multilingual
version of the original MMLU:
Multilingual MMLU (MMMLU) The Multilingual Massive Multitask Language Understanding (MMMLU)
dataset¶ is an extension of the MMLU benchmark [Hendrycks et al., 2021] into multiple languages.
MMMLU includes translations of the original 57 subjects into 14 languages: Arabic, Bengali,
German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Brazilian Portuguese,
Swahili, Yoruba, and Simplified Chinese, covering areas such as STEM, humanities, social sciences,
and more. Each language version contains approximately 15,908 multiple-choice questions, mirroring
the structure of the original MMLU dataset.
¶https://huggingface.co/datasets/openai/MMMLU
4.1.4. Baseline LLMs
The LLMs compared in this section are all Instruct models by default. We employ Llama3 and
Llama3.1 as well as Qwen2 and Qwen2.5. Additionally, we use Aya-23 and Aya-expanse:
Aya-23 and Aya-expanse The Aya-23 and Aya-expanse LLMs [Aryabumi et al., 2024] include
the Aya-23-8B, Aya-23-35B, Aya-expanse-8B, and Aya-expanse-32B models, which are open-source
multilingual language models supporting 23 languages. Aya-23/Aya-expanse are based on
Command-R∗∗ with sophisticated multilingual supervised finetuning.
Table 8 | Performance comparison of LLMs across multiple major benchmarks. Best performance in
each benchmark is marked in bold.
Model            MMMLU  TydiQA  AGIEval  CEval  Belebele
7B Models
Aya-23-8B        41.0   47.2    37.1     43.9   52.5
Aya-expanse-8B   48.2   28.3    36.7     48.5   64.3
Llama3-8B        46.6   39.7    43.4     50.8   50.7
Llama3.1-8B      49.2   53.0    41.8     55.6   63.9
Qwen2-7B         52.2   29.2    57.1     81.8   69.4
Qwen2.5-7B       56.0   39.0    59.0     77.9   70.0
Marco-Chat-7B    60.1   57.7    61.5     86.4   79.3
70B Models
Aya-23-35B       50.1   50.2    44.4     53.6   66.3
Aya-expanse-32B  58.9   30.0    45.7     56.9   72.7
Llama3-70B       64.3   52.0    57.1     66.7   76.2
Llama3.1-70B     71.7   53.1    55.0     71.6   84.4
Qwen2-72B        69.2   40.3    66.0     90.6   85.3
Qwen2.5-72B      69.0   48.4    67.5     88.2   88.9
Marco-72B        76.1   61.0    72.7     94.5   89.6
4.1.5. Results and Discussion
Evaluation Results divided by benchmarks Table 8 presents the average scores of our Marco-Chat-7B
and Marco-72B in comparison with several baseline models across five major benchmarks:
MMMLU, TydiQA, AGIEval, CEval, and Belebele. These benchmarks are designed to evaluate language
models on a diverse range of tasks and languages, highlighting their multilingual comprehension and
reasoning abilities.
Our Marco-Chat-7B model consistently achieves the highest scores among the 7B-parameter
models across all benchmarks. Specifically, it significantly outperforms the baselines on CEval and
Belebele, which focus on Chinese educational subjects and on multilingual reading comprehension
across a wide variety of languages, respectively.
On CEval, Marco-Chat-7B attains a score of 86.4, surpassing the next best model, Qwen2-7B, by
a substantial margin of 4.6 points. This indicates our model's strong capability in understanding
and processing Chinese language content. Similarly, on Belebele, which covers many
underrepresented languages, Marco-Chat-7B achieves a score of 79.3, outperforming the next best
model by more than 9 points. This demonstrates the effectiveness of our multilingual training
approach in capturing linguistic nuances across diverse languages, including those with limited
available data. Among the 70B-size LLMs, Marco-72B leads the group on all benchmarks, in most cases
by a large margin. It achieves a
score of 76.1 on MMMLU, which assesses multitask language understanding across various subjects
and languages. This result is 4.4 points higher than the next best model, Llama3.1-70B, highlighting
our model's superior general language understanding capabilities. On TydiQA, a benchmark for
typologically diverse question answering, Marco-72B attains a score of 61.0, outperforming the
second-best model by 7.9 points. This suggests that our model performs strongly on various
tasks across a wide range of languages with different grammatical structures and scripts. Additionally,
Marco-72B achieves an impressive score of 94.5 on CEval, indicating excellent proficiency in Chinese
across various academic subjects. On Belebele, it reaches 89.6, showcasing strong performance in
languages that are often underrepresented in training data.
https://cohere.com/blog/aya-expanse-connecting-our-world
∗∗https://cohere.com/command
Table 9 | MMMLU Results: Performance of various LLMs across different languages in the MMMLU
dataset for both 7B and 70B models. The best performance in each language is highlighted in bold.
Language     Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat  GPT-4
7B Models
Arabic       50.9   56.6     40.5    42.2      42.1    48.8         60.6        62.7
Bengali      42.6   45.3     36.4    39.8      27.4    33.4         54.4        60.0
German       57.3   62.3     53.5    55.6      43.3    53.9         65.9        68.0
Spanish      60.3   65.3     55.8    59.1      47.9    56.1         67.7        68.2
French       61.1   65.0     55.8    58.9      46.9    55.5         67.6        67.5
Hindi        44.5   46.6     41.4    45.8      38.9    46.2         54.3        62.2
Indonesian   56.6   61.4     51.0    54.3      46.7    53.3         62.3        66.1
Italian      60.2   64.7     53.3    56.3      47.1    55.3         65.4        68.3
Japanese     56.3   61.0     42.3    52.1      45.6    51.5         64.2        64.4
Korean       54.1   59.1     46.5    50.8      43.6    50.7         63.0        63.6
Chinese      62.0   64.3     51.4    55.5      45.7    52.5         66.5        65.9
Portuguese   59.9   64.4     55.5    59.0      46.9    55.8         67.6        68.8
Swahili      34.9   35.7     37.5    40.3      26.2    32.0         43.9        53.1
Yoruba       30.5   32.8     31.0    31.4      26.4    29.9         37.2        38.0
Avg. Score   52.2   56.0     46.6    50.1      41.0    48.2         60.1        62.6
70B Models
Arabic       72.0   74.3     60.6    71.1      51.8    61.6         79.3        71.1
Bengali      68.3   67.2     53.8    66.5      32.9    43.9         76.6        64.8
German       74.4   72.5     71.4    77.0      55.5    64.7         80.7        75.7
Spanish      77.0   77.5     74.3    79.3      58.0    67.5         82.6        76.8
French       75.6   76.0     73.1    77.9      58.1    67.5         80.7        75.8
Hindi        69.9   69.1     65.0    72.7      47.6    58.8         76.9        70.1
Indonesian   73.1   73.3     70.6    75.7      55.5    65.4         79.0        73.7
Italian      75.3   72.5     73.3    77.8      57.8    66.5         81.6        75.8
Japanese     74.1   74.7     65.6    73.8      54.5    64.3         81.6        71.6
Korean       72.3   71.8     64.5    72.7      53.7    62.4         78.8        71.3
Chinese      77.5   76.7     69.5    74.8      54.1    63.4         82.0        72.5
Portuguese   76.8   76.9     73.7    78.9      58.3    67.2         81.7        76.2
Swahili      47.3   48.8     51.1    64.0      33.6    38.4         63.7        68.1
Yoruba       34.6   35.5     33.6    41.2      30.4    33.4         44.0        47.3
Avg. Score   69.2   69.0     64.3    71.7      50.1    58.9         76.1        70.8
These results highlight the effectiveness of our approach in building a truly multilingual LLM. By
consistently achieving top performance across benchmarks that evaluate different languages and
tasks, our models demonstrate robust multilingual proficiency and adaptability. This underscores
the importance of incorporating a diverse and comprehensive multilingual dataset during training
to enhance the language understanding abilities of large language models. Our observations also
reveal that models with a higher number of parameters, like Marco-72B, substantially benefit from
our multilingual training strategy, achieving significant improvements over other large models.
Furthermore, the consistent gains across both high-resource languages (like English and Chinese) and
low-resource languages (as represented in Belebele) indicate that our models do not merely rely on
the abundance of data in certain languages but truly learn to generalize across linguistic boundaries.
Table 10 | Performance comparison of language models on Belebele benchmark [Bandarkar et al.,
2024] across different languages.
Language     Qwen2  Qwen2.5  Llama-3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat
7B Models
Azerbaijani  59.9   60.4     41.3     57.7      34.1    49.9         72.3
Bengali      64.9   64.2     46.8     57.2      28.7    41.6         75.1
Czech        75.4   75.9     54.3     73.4      61.6    76.9         84.4
Greek        68.7   75.2     59.0     74.3      65.2    80.3         81.4
Hebrew       74.2   72.7     45.9     59.3      61.7    77.9         82.1
Hungarian    57.3   63.0     45.1     52.4      35.0    44.0         68.0
Indonesian   77.2   78.4     61.9     75.0      64.7    77.6         82.8
Italian      78.2   81.9     60.6     71.8      67.0    75.7         86.8
Japanese     78.0   69.8     49.9     64.7      64.2    74.1         83.1
Kazakh       48.0   51.2     36.0     49.1      28.3    38.0         73.1
Malay        78.9   77.1     57.9     67.4      53.2    73.3         83.4
Dutch        58.7   74.6     56.3     70.1      66.2    72.1         85.3
Nepali       44.3   49.4     38.9     46.7      32.9    40.8         70.0
Polish       78.4   70.3     55.0     65.0      61.1    73.9         73.3
Romanian     72.4   74.0     55.2     68.7      65.9    74.8         73.2
Russian      79.7   73.3     52.7     70.2      64.1    74.3         87.9
Ukrainian    76.1   73.2     51.1     69.8      61.8    76.2         83.4
Urdu         64.9   64.4     49.0     59.2      35.6    50.2         76.2
Thai         72.3   74.5     43.0     57.2      37.6    41.8         76.9
Vietnamese   80.0   77.0     53.8     68.3      61.4    73.2         86.3
Avg. Scores  69.4   70.0     50.7     63.9      52.5    64.3         79.3
70B Models
Azerbaijani  79.9   81.6     63.3     79.7      51.9    58.3         85.6
Bengali      84.9   87.3     75.3     81.4      42.1    64.8         89.2
Czech        89.7   91.9     79.0     86.8      79.6    85.8         91.8
Greek        89.2   92.6     87.0     89.4      80.1    83.6         91.9
Hebrew       85.0   86.9     75.9     78.9      77.0    83.1         86.0
Hungarian    72.8   89.3     58.0     74.1      53.6    50.8         87.0
Indonesian   88.9   91.7     82.6     87.3      77.8    81.1         93.1
Italian      89.8   90.4     83.8     87.9      81.1    77.7         91.1
Japanese     87.8   90.2     82.2     86.9      73.7    78.2         90.1
Kazakh       73.6   76.0     54.1     78.2      40.7    55.1         81.7
Malay        88.7   91.2     87.7     88.7      74.8    76.4         92.1
Dutch        90.9   93.2     80.4     87.1      80.8    85.7         94.4
Nepali       70.6   80.1     55.7     76.0      39.9    50.1         84.4
Polish       86.4   89.7     75.0     88.7      73.9    83.7         90.6
Romanian     88.4   92.1     73.4     86.3      75.7    77.7         90.6
Russian      90.0   94.1     86.1     87.2      73.2    84.6         92.7
Ukrainian    90.9   93.4     84.1     90.2      74.3    79.8         93.0
Urdu         83.1   86.4     78.9     90.2      47.2    63.7         88.2
Thai         86.2   87.1     79.7     82.7      53.6    63.8         87.6
Vietnamese   88.8   92.0     81.7     83.9      75.8    69.2         92.2
Avg. Scores  85.3   88.9     76.2     84.4      66.3    72.7         89.6
Table 11 | Performance comparison of various LLMs and MT systems on the Flores benchmark. The
best performance in each column is highlighted in bold.
Direction    GPT4  DeepL  Google  Aya-32B  Aya-35B  Qwen2-72B  Qwen2.5-72B  Llama3-70B  Llama3.1-70B  Marco-7B  Marco-72B
En→XX
En2Ar        40.4  48.1   50.0    31.5     24.4     17.1       29.9         29.2        33.5          41.5      61.2
En2De        45.9  48.7   49.3    32.3     33.1     37.7       40.4         43.0        44.7          41.6      47.7
En2Es        33.1  32.9   34.6    28.4     20.6     32.0       31.7         31.5        32.5          33.1      37.2
En2Fr        54.4  59.1   57.9    50.5     34.5     52.3       53.2         52.6        55.2          54.9      58.8
En2It        37.2  41.5   39.1    32.4     25.3     34.2       34.8         34.8        36.6          36.2      40.6
En2Ja        34.6  36.8   41.1    31.0     9.6      29.6       33.0         14.3        33.0          36.9      37.7
En2Ko        28.5  32.9   33.7    27.2     12.3     19.2       24.8         0.1         27.7          29.0      31.6
En2Nl        34.8  37.0   36.3    25.4     24.2     28.4       30.9         32.0        34.7          33.0      35.9
En2Pl        30.3  33.4   33.7    24.8     16.5     20.4       26.0         28.6        29.5          28.3      30.6
En2Pt        54.8  45.7   56      44.0     41.7     50.1       52.6         52.0        54.8          54.5      57.9
En2Ru        36.8  40.5   40.9    32.3     23.8     33.6       36.1         35.6        37.9          37.7      43.4
En2Tr        36.9  45.0   44.2    33.1     26.4     22.8       30.4         32.7        36.8          35.8      37.7
En2Uk        37.0  42.9   41.6    33.5     17.0     25.2       30.0         36.5        36.8          36.3      44.3
En2Zh        44.2  48.6   50.6    26.0     15.6     28.0       33.9         13.3        31.3          45.3      48.6
Avg. Scores  39.2  42.4   43.5    32.3     23.2     30.8       34.8         31.2        37.5          38.9      43.8
XX→En
Ar2En        42.7  47.7   46.8    33.3     41.1     41.5       44.6         37.1        46.1          48.0      58.0
De2En        47.7  51.0   51.3    30.8     40.9     46.9       48.4         46.3        49.4          50.6      54.7
Es2En        34.3  36.9   36.3    24.8     33.8     34.5       35.3         33.7        35.0          40.2      47.2
Fr2En        48.9  50.8   52.7    27.7     45.1     48.8       49.5         47.5        50.7          51.5      56.8
It2En        36.7  40.2   40.2    28.6     37.5     36.7       38.2         36.5        38.4          42.9      49.2
Ja2En        30.4  37.0   36.7    20.5     22.6     29.8       31.9         26.2        32.3          36.3      49.5
Ko2En        33.3  39.3   38.2    21.9     25.9     32.1       34.5         28.7        33.8          37.0      49.0
Nl2En        36.0  37.7   38.7    23.3     32.8     35.9       36.3         34.8        37.0          39.8      46.4
Pl2En        33.5  35.8   37.0    19.6     27.6     33.7       34.7         32.1        35.3          38.4      45.9
Pt2En        53.1  55.8   56.3    37.9     50.7     53.0       53.5         51.8        54.9          54.5      60.3
Ru2En        38.7  43.3   42.9    23.6     36.2     39.0       40.2         37.9        41.0          43.8      49.2
Tr2En        42.6  48.5   47.7    31.1     36.5     39.9       42.3         37.9        43.1          43.4      52.0
Uk2En        43.4  47.2   47.3    27.4     38.8     41.6       43.7         40.5        44.9          46.3      58.8
Zh2En        31.3  36.8   37.7    24.3     26.7     31.0       35.4         29.0        34.7          38.2      45.5
Avg. Scores  36.8  43.4   43.6    26.8     35.4     38.9       40.6         37.2        41.2          43.6      51.6
MMMLU Results We also highlight the results on the recently released multilingual version of
MMLU from OpenAI††, which are shown in Table 9. The MMMLU dataset evaluates language models
across a diverse set of languages and tasks. This benchmark is crucial for assessing the multilingual
capabilities of language models, particularly their ability to process and understand languages beyond
English.
††https://huggingface.co/datasets/openai/MMMLU
Among the 7B LLMs, Marco-7B consistently outperforms the baseline models across a wide range of
languages. It achieves the highest scores among them in languages such as Arabic (60.6), Bengali
(54.4), German (65.9), and Spanish (67.7), demonstrating a significant advantage over other models.
This performance is particularly noteworthy in Bengali, where Marco-7B surpasses the second best
model (Qwen2.5-7B, 45.3) by a substantial margin of 9.1 points, highlighting its ability to handle
languages with less training data effectively. The model's strong results in languages like Hindi
(+7.7 compared to the second best model) and Korean further emphasize its robust multilingual
capabilities, making it a versatile
tool for diverse linguistic contexts. Among the 70B LLMs, Marco-72B achieves the highest average score
across all languages, showcasing its superior multilingual understanding. It demonstrates competitive
performance in high-resource languages such as German (80.7), Spanish (82.6), and Chinese (82.0),
while also delivering strong performance in lower-resource languages like Swahili (63.7) and Yoruba
(44.0). These results underscore the model's extensive language processing abilities, positioning it as
a leading solution for multilingual applications. It is worth noting that our Marco-72B outperforms
GPT-4 (for 7B LLMs we compare with GPT-4o-mini and for 70B LLMs we compare with GPT-4) [Hurst
et al., 2024] in many languages.
Belebele Results The results on the Belebele [Bandarkar et al., 2024], presented in Table 10,
illustrate the strong multilingual capabilities of the Marco-Chat models across both 7B and 70B
parameter scales.
For the 7B models, Marco-Chat consistently achieves the highest scores across most languages, with
an average score of 79.3, significantly outperforming other models such as Qwen2 (69.4) and Qwen2.5
(70.0). This performance is particularly impressive in low-resource languages like Kazakh and Nepali,
where Marco-Chat scores 73.1 and 70.0, respectively, showcasing improvements of 25.1 and 25.7
points over Qwen2. These substantial gains underscore the success of our continual pretraining
approach, which effectively leverages a vast multilingual corpus to enhance generalization capabilities
across languages with limited resources. Additionally, Marco-Chat remains highly competitive in
high-resource languages such as Italian (86.8) and Japanese (83.1), further demonstrating its versatility
and robustness. The 70B models exhibit similar patterns, with Marco-Chat achieving an average score
of 89.6, leading the performance across nearly all languages. Notable improvements are observed in
Bengali (89.2) and Indonesian (93.1), surpassing Qwen2.5 by 1.9 and 1.4 points, respectively. The
model’s ability to handle complex linguistic tasks is further exemplified in high-resource languages
such as Dutch (94.4) and Russian (92.7), where it achieves top scores. These results highlight the
strategic advantage of our approach, which effectively captures linguistic nuances and complexities,
benefiting from extensive multilingual pretraining. Overall, the results on the Belebele benchmark
validate our focus on enhancing multilingual capabilities.
English-pivot Translation Results The English-pivot translation results on the Flores benchmark [Goyal
et al., 2021, Team et al., 2022], as detailed in Table 11, highlight the strengths of the Marco-Chat
LLMs in translation tasks across a variety of languages. This benchmark is designed to evaluate
translation quality from English to multiple target languages (EN→XX) and vice versa (XX→EN),
offering insights into the multilingual capabilities of different models.
In the EN→XX translation tasks, the Marco-72B model achieves an average score of 43.8, surpass-
ing the second-best model, Google Translate, by a margin of 0.3 points. Notably, Marco-72B shows
strong performance in translating English into Arabic (En2Ar) with a BLEU score of 61.2, outperforming
Google by 11.2 points, and in Portuguese (En2Pt) with a BLEU score of 57.9, leading by 1.9 points.
These results highlight the model’s ability to handle both high-resource languages, such as French
Table 12 | Any2Any translation performance on the Flores benchmark across various language pairs for
LLMs. The table compares results from the Instruct models of Qwen2, Qwen2.5, Llama3, Llama3.1,
and Marco-LLM, together with Aya-expanse and Aya-23, with the best performance highlighted in bold for each translation direction.
Trans. Dir. 	Qwen2-7B 	Qwen2.5-7B 	Llama3-8B 	Llama3.1-8B 	Aya-expanse-8B 	Aya-23-8B 	Marco-7B
Ar2Ja 	16.2 	14.3 	13.1 	17.5 	16.9 	17.1 	21.8
Es2Ja 	17.2 	10.7 	11.2 	18.1 	18.3 	19.9 	22.4
Fr2Ja 	19.9 	14.0 	14.1 	21.6 	17.6 	22.3 	25.8
Hu2Ja 	15.2 	9.6 	12.3 	16.0 	10.7 	12.6 	20.3
Hu2Ko 	10.5 	6.9 	9.4 	10.2 	9.2 	11.0 	14.0
Ja2Ar 	9.8 	7.4 	8.6 	11.1 	12.3 	9.6 	15.5
Ja2Es 	16.1 	15.0 	15.9 	16.7 	12.9 	16.2 	19.1
Ja2Ko 	16.3 	12.1 	12.2 	17.3 	18.7 	17.2 	22.6
Ja2Th 	11.7 	11.4 	11.1 	11.6 	0.8 	2.0 	16.4
Ja2Zh 	19.5 	17.6 	6.9 	10.2 	14.9 	12.7 	22.9
Kk2Ar 	7.3 	5.5 	7.1 	8.6 	2.3 	5.7 	11.6
Kk2Fr 	13.8 	8.9 	17.9 	14.1 	5.6 	10.6 	20.5
Kk2Ja 	11.7 	6.0 	10.4 	12.2 	4.1 	9.6 	16.8
Kk2Ko 	8.3 	4.7 	9.6 	9.3 	4.5 	7.5 	13.1
Kk2Pt 	12.7 	8.8 	15.2 	10.2 	4.2 	9.7 	16.9
Kk2Th 	6.7 	7.1 	8.9 	10.4 	0.4 	1.2 	12.7
Kk2Zh 	11.9 	10.8 	10.2 	13.1 	3.0 	7.6 	18.5
Ko2Ja 	22.4 	21.2 	17.5 	22.3 	23.9 	19.3 	26.2
Ko2Th 	11.9 	9.7 	11.2 	12.2 	0.8 	2.0 	14.7
Ko2Zh 	20.0 	20.3 	10.2 	16.2 	16.6 	14.7 	22.7
Th2Ar 	11.3 	9.1 	8.7 	5.5 	1.7 	6.8 	15.4
Th2Es 	16.4 	15.1 	16.4 	9.1 	1.5 	10.0 	18.6
Th2Fr 	22.2 	20.3 	19.3 	12.4 	5.1 	14.2 	25.2
Th2Ja 	16.5 	15.4 	12.6 	14.5 	0.0 	5.5 	22.7
Th2Kk 	2.2 	1.7 	4.4 	5.4 	0.3 	0.8 	9.3
Th2Ko 	12.2 	9.4 	8.3 	7.6 	0.8 	5.1 	16.5
Th2Zh 	18.8 	18.0 	9.5 	9.2 	0.0 	6.4 	22.3
Tr2Ja 	17.7 	7.3 	11.9 	20.0 	19.6 	21.4 	23.2
Uk2Fr 	27.5 	24.3 	31.2 	33.0 	31.5 	33.0 	34.6
Uk2Ja 	17.3 	13.0 	14.1 	20.1 	16.9 	21.8 	26.6
Uk2Kk 	3.4 	2.7 	7.9 	8.3 	0.7 	1.8 	12.4
Uk2Ko 	13.2 	8.9 	10.7 	15.8 	16.0 	17.3 	18.9
Uk2Th 	12.4 	11.2 	13.7 	15.2 	1.1 	2.9 	19.9
Uk2Zh 	20.7 	18.4 	11.6 	9.8 	16.1 	17.8 	25.6
Ur2Ar 	8.0 	7.4 	5.2 	4.8 	3.4 	6.4 	10.7
Ur2Ko 	8.7 	6.8 	8.5 	9.3 	4.5 	9.0 	12.8
Zh2Ar 	11.9 	9.8 	10.6 	13.5 	14.9 	13.6 	17.0
Zh2Fr 	24.5 	21.7 	21.8 	25.7 	24.5 	21.3 	28.9
Zh2Ja 	19.2 	15.0 	10.9 	18.4 	14.4 	17.3 	27.7
Zh2Ko 	14.2 	10.4 	9.7 	14.8 	15.7 	15.4 	21.2
Zh2Pt 	22.6 	20.9 	19.5 	20.5 	21.0 	21.5 	25.4
Zh2Th 	13.3 	10.8 	12.8 	13.9 	1.0 	2.5 	19.8
Avg. Score 	14.6 	11.9 	12.2 	13.9 	9.7 	11.9 	19.7
(En2Fr, 58.8), and low-resource languages, such as Ukrainian (En2Uk, 44.3), demonstrating its
strong capability and robustness. Moreover, for language pairs such as En2De, En2Ja and En2Tr, where
Marco-LLM underperforms competitive commercial MT systems, it still achieves the best translation
performance among the open-source LLMs. For XX→EN translations, Marco-72B
achieves an average score of 51.6, leading the second-best system, Google Translate, by a margin of 8.0
points. The model shows outstanding performance in translating
from Italian (It2En, 49.2) and Korean (Ko2En, 49.0), reflecting its advanced capacity to capture
nuanced linguistic features and deliver high-quality translations. The consistent outperformance
[Figure 7: line plot titled "Accuracy Trends Across Different Training Steps"; x-axis: Training Steps (0 to 1150); y-axis: Accuracy (0.2 to 0.7); series: Yoruba, Chinese, Arabic, Italian, Average.]
Figure 7 | Accuracy trends across training checkpoints for different languages on MMMLU. The model
shows rapid initial learning (0-230 steps) followed by performance stabilization. High-resource
languages (ZH-CN, IT-IT) consistently outperform low-resource ones (YO-NG), with a persistent
performance gap of 29%.
across both high-resource languages (e.g., French to English, 56.8) and low-resource languages (e.g.,
Ukrainian to English, 58.8) underscores the effectiveness of our multilingual approach. The results
on the Flores benchmark [Team et al., 2022] validate the effectiveness of focusing on enhancing
multilingual capabilities.
Non-English-pivot Translation Results The non-English-pivot translation results (Any2Any translation,
i.e., translation from any language to any other language) on the Flores benchmark, as shown in Table 12,
highlight the superior performance of the Marco-7B model in Any2Any translation tasks.
The Marco-7B model achieves the highest average score of 19.7, a substantial margin
above the second-best model, Qwen2-7B, with an average score of 14.6. This represents a notable
improvement of 5.1 points. In some specific language directions, Marco-7B exhibits remarkably
strong performance. For example, Marco-7B outperforms Qwen2-7B in Chinese to Japanese (Zh2Ja)
translation by 8.5 points. Similarly, in the Arabic to Japanese (Ar2Ja) translation, Marco-7B achieves
a score of 21.8, outperforming Llama3.1-8B by 4.3 points. Moreover, in the French to Japanese
(Fr2Ja) translation, Marco-7B scores 25.8, which is 4.2 points higher than Llama3.1-8B. These
results underscore the model’s capability to manage complex linguistic structures, particularly in
non-English-pivot language pairs. The Marco-7B model’s superior performance is evident across
both high-resource and low-resource languages, demonstrating its versatility and robustness. This
is particularly important for enhancing the multilingual/cross-lingual capabilities of large language
models, as it ensures consistent translation quality across a wide range of languages beyond traditional
English-centric translation [Fan et al., 2020].
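The per-direction margins quoted above can be recomputed directly from Table 12. The following is a minimal sketch that copies three rows of the table; the helper names `margin_over` and `best_baseline` are illustrative and not part of any released evaluation code.

```python
# Scores copied from three rows of Table 12 (Any2Any translation on Flores).
TABLE_12 = {
    "Ar2Ja": {"Qwen2-7B": 16.2, "Qwen2.5-7B": 14.3, "Llama3-8B": 13.1,
              "Llama3.1-8B": 17.5, "Aya-expanse-8B": 16.9, "Aya-23-8B": 17.1,
              "Marco-7B": 21.8},
    "Fr2Ja": {"Qwen2-7B": 19.9, "Qwen2.5-7B": 14.0, "Llama3-8B": 14.1,
              "Llama3.1-8B": 21.6, "Aya-expanse-8B": 17.6, "Aya-23-8B": 22.3,
              "Marco-7B": 25.8},
    "Zh2Ja": {"Qwen2-7B": 19.2, "Qwen2.5-7B": 15.0, "Llama3-8B": 10.9,
              "Llama3.1-8B": 18.4, "Aya-expanse-8B": 14.4, "Aya-23-8B": 17.3,
              "Marco-7B": 27.7},
}

SYSTEM = "Marco-7B"


def margin_over(row, baseline, system=SYSTEM):
    """Score difference between `system` and one named baseline, in points."""
    return round(row[system] - row[baseline], 1)


def best_baseline(row, system=SYSTEM):
    """Name and score of the strongest model in the row other than `system`."""
    name = max((m for m in row if m != system), key=row.get)
    return name, row[name]


for direction, row in TABLE_12.items():
    name, score = best_baseline(row)
    print(f"{direction}: best baseline {name} ({score}), "
          f"margin {round(row[SYSTEM] - score, 1)}")
```

Note that for Fr2Ja the strongest baseline in Table 12 is Aya-23-8B (22.3), so the margin over the best baseline is 3.5 points, whereas the 4.2-point figure in the text is measured against Llama3.1-8B specifically.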
Performance Across Different Training Steps Our analysis of the performance of Marco-7B model
across different training steps (0-1150) on the MMMLU benchmark is shown in Figure 7. The most
significant improvements occur during the early training phase (0-230 steps), where the overall
average accuracy increases dramatically from 40.81% to 59.29%. After 460 steps, the performance
stabilizes across most languages, with the final overall accuracy reaching 60.05%. High-resource
languages like Chinese (ZH-CN) and Italian (IT-IT) achieve and maintain higher accuracy levels
(>65%) compared to low-resource languages, with Yoruba (YO-NG) plateauing around 37%. This
persistent performance gap of approximately 29 percentage points between high- and low-resource languages suggests
that while the model effectively captures multilingual knowledge early in training, achieving better
performance across languages remains a challenge that may require more monolingual data during
the pretraining phase rather than simply extending training steps for SFT.
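To make the "early gains, then plateau" observation concrete, the short calculation below uses only the three average-accuracy values quoted in this paragraph (step 0, step 230, and the final checkpoint). It is an illustrative sketch; the variable names are ours.

```python
# Average MMMLU accuracy (%) quoted in the text for Marco-7B checkpoints.
initial_acc = 40.81    # step 0
after_230_acc = 59.29  # end of the early phase (step 230)
final_acc = 60.05      # final checkpoint (step 1150)

early_gain = round(after_230_acc - initial_acc, 2)  # gain during steps 0-230
late_gain = round(final_acc - after_230_acc, 2)     # gain during steps 230-1150
early_share = early_gain / (final_acc - initial_acc)

print(f"early gain: {early_gain} points")
print(f"late gain:  {late_gain} points")
print(f"share of total gain reached by step 230: {early_share:.1%}")
```

Under these numbers, roughly 96% of the total improvement is already obtained in the first 230 steps, which is consistent with the conclusion that further steps alone are unlikely to close the gap for low-resource languages.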
4.2. Multilingual Preference Alignment
Preference alignment is critical for ensuring that an LLM’s outputs are consistent with human expectations
and values. However, most LLMs are aligned predominantly on English preference data,
leading to a disparity in performance when applied to other languages. In a multilingual setting, this
alignment becomes even more essential due to the variations in language structures [She et al., 2024],
idiomatic expressions, and cultural references. By focusing on multilingual preference alignment, we
aim to enhance Marco’s ability to generate responses that are not only grammatically correct but also
culturally appropriate and contextually relevant in multiple languages.
Moreover, multilingual preference alignment helps mitigate biases that may arise from training
on datasets that lack linguistic diversity. It promotes fairness and inclusivity, enabling the model
to cater to a broader user base. By aligning the model’s preferences across different languages, we
ensure that Marco-LLM can effectively understand and respond to users worldwide, fostering better
communication and understanding across linguistic boundaries.
4.2.1. Dataset Construction from Existing Preference Data
To construct a comprehensive multilingual preference dataset, we began with the LMSYS Arena
Human Preference dataset [Chiang et al., 2024] ‡‡. This dataset comprises 57.5k high-quality human
preference annotations for various prompts and responses in English. We selected a subset of high-
quality examples based on criteria such as clarity, relevance, and diversity of topics to ensure a robust
foundation for multilingual preference alignment. The selected examples were then translated into
the 28 target languages. This translation step was critical to extend the language coverage of the
original data. By leveraging existing English preference data and extending it to multiple languages,
we aim to improve Marco-LLM’s preference alignment in languages
beyond English.
4.2.2. Multilingual Preference Data Generation and Translation
In addition to the translated data, we expanded our preference dataset by incorporating prompts from
the UltraFeedback dataset [Cui et al., 2023], which are also translated into the 28 languages. For each
prompt, we utilized Marco-LLM to generate at least two distinct responses with different generation
configurations. This approach allowed us to capture the model’s inherent variability in generating
responses across different languages. To establish preferences between the generated responses,
we employed another LLM to evaluate and select the better response based on predefined criteria
such as relevance, coherence, and adherence to the prompt. This process effectively created a set of
preference pairs that reflect the model’s capabilities and the desired outcomes in various languages.
By generating and evaluating responses within the target languages, we ensured that the preference
data was culturally and linguistically appropriate.
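The generate-then-judge procedure described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: `generate` and `judge` are hypothetical callables standing in for the policy model and the judging LLM, and the `PreferencePair` record layout and the stand-in functions at the bottom are ours.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

GenConfig = Dict[str, float]  # e.g. {"temperature": 0.2}; keys are illustrative


@dataclass(frozen=True)
class PreferencePair:
    language: str
    prompt: str
    chosen: str
    rejected: str


def build_preference_pairs(
    prompts: List[Tuple[str, str]],             # (language code, translated prompt)
    generate: Callable[[str, GenConfig], str],  # hypothetical policy-model call
    judge: Callable[[str, str, str], int],      # hypothetical judge: 0 if first is better, else 1
    configs: List[GenConfig],
) -> List[PreferencePair]:
    pairs = []
    for language, prompt in prompts:
        # One response per generation configuration, de-duplicated in order.
        candidates = list(dict.fromkeys(generate(prompt, cfg) for cfg in configs))
        if len(candidates) < 2:
            continue  # identical outputs give no preference signal
        first, second = candidates[:2]
        winner = judge(prompt, first, second)
        chosen, rejected = (first, second) if winner == 0 else (second, first)
        pairs.append(PreferencePair(language, prompt, chosen, rejected))
    return pairs


# Deterministic stand-ins so the sketch runs without any model.
def fake_generate(prompt: str, cfg: GenConfig) -> str:
    return f"{prompt} -> answer(T={cfg['temperature']})"


def fake_judge(prompt: str, a: str, b: str) -> int:
    return 0 if len(a) >= len(b) else 1


configs = [{"temperature": 0.2}, {"temperature": 1.0}]
pairs = build_preference_pairs(
    [("kk", "prompt-kk"), ("ne", "prompt-ne")], fake_generate, fake_judge, configs
)
```

Prompts whose sampled responses are identical are skipped in this sketch, since a pair with the same chosen and rejected text carries no preference signal.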
‡‡https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k
Marco-LLM vs Baseline Models Win Rates on Multilingual MT-bench by Language
Language 	Win 	Loss 	Tie
ar 	43.85% 	27.50% 	28.65%
az 	50.52% 	25.00% 	24.48%
bn 	51.56% 	25.00% 	23.44%
cs 	29.17% 	43.13% 	27.71%
de 	35.73% 	31.77% 	32.50%
el 	22.29% 	53.54% 	24.17%
es 	42.60% 	25.21% 	32.19%
fr 	36.98% 	29.27% 	33.75%
he 	41.56% 	31.56% 	26.88%
hu 	48.23% 	28.23% 	23.54%
id 	34.48% 	35.00% 	30.52%
it 	34.17% 	32.08% 	33.75%
ja 	32.92% 	30.83% 	36.25%
kk 	66.04% 	17.50% 	16.46%
ko 	37.08% 	31.77% 	31.15%
ms 	44.58% 	27.08% 	28.33%
ne 	48.33% 	29.79% 	21.88%
nl 	37.81% 	26.15% 	36.04%
pl 	34.62% 	33.05% 	32.33%
pt 	39.79% 	26.15% 	34.06%
ro 	36.67% 	32.08% 	31.25%
ru 	37.71% 	25.63% 	36.67%
th 	52.08% 	24.90% 	23.02%
tr 	40.73% 	26.15% 	33.12%
uk 	38.92% 	32.31% 	28.77%
ur 	57.08% 	21.88% 	21.04%
vi 	39.58% 	29.17% 	31.25%
zh 	37.29% 	27.40% 	35.31%
Figure 8 | Performance comparison of Marco-LLM against baseline models across 28 languages on
multilingual MT-bench. Each subplot shows the win rate (blue), loss rate (green), and tie rate (red) for
a specific language. Win rates indicate Marco-LLM’s superior responses, loss rates represent baseline
models’ better performance, and tie rates show equivalent quality responses.
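As a quick consistency check on Figure 8, the sketch below counts the languages in which the win rate exceeds the loss rate. The values are read off the bar labels of the figure, assuming the three labels in each panel follow the panel's Win, Loss, Tie order; the dictionary name is illustrative.

```python
# (win %, loss %, tie %) per language, read from the bar labels in Figure 8.
# Assumption: labels in each panel are listed in Win, Loss, Tie order.
FIGURE_8 = {
    "ar": (43.85, 27.50, 28.65), "az": (50.52, 25.00, 24.48),
    "bn": (51.56, 25.00, 23.44), "cs": (29.17, 43.13, 27.71),
    "de": (35.73, 31.77, 32.50), "el": (22.29, 53.54, 24.17),
    "es": (42.60, 25.21, 32.19), "fr": (36.98, 29.27, 33.75),
    "he": (41.56, 31.56, 26.88), "hu": (48.23, 28.23, 23.54),
    "id": (34.48, 35.00, 30.52), "it": (34.17, 32.08, 33.75),
    "ja": (32.92, 30.83, 36.25), "kk": (66.04, 17.50, 16.46),
    "ko": (37.08, 31.77, 31.15), "ms": (44.58, 27.08, 28.33),
    "ne": (48.33, 29.79, 21.88), "nl": (37.81, 26.15, 36.04),
    "pl": (34.62, 33.05, 32.33), "pt": (39.79, 26.15, 34.06),
    "ro": (36.67, 32.08, 31.25), "ru": (37.71, 25.63, 36.67),
    "th": (52.08, 24.90, 23.02), "tr": (40.73, 26.15, 33.12),
    "uk": (38.92, 32.31, 28.77), "ur": (57.08, 21.88, 21.04),
    "vi": (39.58, 29.17, 31.25), "zh": (37.29, 27.40, 35.31),
}

# Languages where Marco-LLM wins more often than it loses, and the rest.
ahead = sorted(lang for lang, (win, loss, _) in FIGURE_8.items() if win > loss)
behind = sorted(lang for lang, (win, loss, _) in FIGURE_8.items() if win <= loss)

print(f"{len(ahead)} of {len(FIGURE_8)} languages have win rate > loss rate")
print("languages where the loss rate is higher:", behind)
```

Under this reading, the three languages with a higher loss rate are Czech (cs), Greek (el) and Indonesian (id).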
4.2.3. Evaluation Results
To evaluate Marco-LLM’s ability to capture preferences in different languages, we
translated the original English MT-Bench benchmark [Chiang et al., 2024] into the 28 target languages.
We then compare the responses generated by the LLMs in a pairwise manner using GPT-4o-mini as the judge.
Specifically, we compare the responses of Marco-LLM (7B) with the responses from six
baseline LLMs: Qwen2, Qwen2.5, Llama3, Llama3.1, Aya-23, and Aya-expanse (all in 7/8B size).
The results from the multilingual MT-bench, as illustrated in Figure 8, reveal that Marco-chat-7B
model outperforms baseline models in 25 out of 28 languages. Win rates, loss rates, and tie rates
(Marco-LLM vs. baseline) are averaged over the baseline models mentioned in Section 4.1.4
for each language. Marco-chat-7B achieves
better generation quality in low-resource languages. For instance, in Azerbaijani (az), the model
achieves a win rate of 50.52% compared to a loss rate of 25.00%; in Bengali (bn), the win rate
is 51.56% against a loss rate of 25.00%; and in Kazakh (kk), Marco-LLM obtains a win
rate of 66.04%. These results indicate a clear advantage over the baseline models. Hebrew (he)
also demonstrates strong performance, with a win rate of 41.56% surpassing the loss rate of
31.56%. In certain high-resource languages, Marco-chat-7B maintains competitive performance. For
example, in French (fr), our model achieves a win rate of 36.98% against a loss rate of 29.27%, and
in Chinese (zh), it achieves a win rate of 37.29% compared to a loss rate of 27.40%. These results
reflect the model’s effective handling of languages with extensive linguistic data. The model also
performs consistently well across various language families, maintaining win rates around 35% for
Indo-European languages such as Italian (it) with a win rate of 34.17%, and Dutch (nl) with a win
rate of 37.81%. These results suggest balanced performance across diverse linguistic structures. While
Marco-LLM achieves a higher win rate than loss rate in 25 of the 28 languages, it still underperforms
in Czech, Greek and Indonesian. This suggests the need for further refinement in
handling complex grammatical structures. Overall, the experimental results highlight Marco-chat-7B’s
strong multilingual performance, particularly in languages where it achieves a higher win rate than
loss rate. The model effectively addresses the challenges posed by both high-resource and low-resource
languages, demonstrating its capability and potential for deployment in a wide range of linguistic
environments.
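The per-language aggregation described in this subsection (pairwise verdicts against each baseline, then an average over baselines) might be computed as in the sketch below. The verdict lists are made-up placeholders, and the function names are ours; only the averaging scheme follows the description in the text.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List

OUTCOMES = ("win", "loss", "tie")


def rates(verdicts: List[str]) -> Dict[str, float]:
    """Percentage of win/loss/tie verdicts against a single baseline."""
    counts = Counter(verdicts)
    return {k: 100.0 * counts[k] / len(verdicts) for k in OUTCOMES}


def average_over_baselines(per_baseline: Dict[str, List[str]]) -> Dict[str, float]:
    """Unweighted mean of the per-baseline rates for one language."""
    per_model = [rates(v) for v in per_baseline.values()]
    return {k: mean(r[k] for r in per_model) for k in OUTCOMES}


# Hypothetical judge verdicts for one language (Marco-LLM vs. each baseline).
verdicts_for_language = {
    "Qwen2": ["win", "win", "tie", "loss"],
    "Llama3": ["win", "loss", "loss", "tie"],
}
summary = average_over_baselines(verdicts_for_language)
print(summary)
```

Averaging the per-baseline rates, rather than pooling all verdicts, gives each baseline equal weight even if the number of judged prompts differs between baselines; whether the original evaluation made the same choice is not stated in the text.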
5. Conclusion and Future Work
In this paper, we introduced Marco-LLM, a multilingual LLM specifically designed to address the
challenges posed by low-resource languages. By leveraging a large and diverse multilingual dataset,
we conducted extensive multilingual continual pre-training and post-training, including supervised
finetuning and preference alignment, based on the Qwen2 model. Our comprehensive evaluations on
benchmarks such as MMMLU, Flores, Belebele, CEVAL, TydiQA, and multilingual MT-bench validated
that Marco-LLM obtained excellent performance in multilingual tasks. The results demonstrate that
focusing on low-resource languages can bridge existing performance gaps and extend the benefits
of LLMs to a wider range of linguistic communities. Our work highlights the importance of data
diversity and targeted training strategies in enhancing model performance across diverse languages.
Several directions remain for future work. One promising direction is to extend
Marco-LLM’s capabilities to include more languages, further enriching the linguistic diversity it can
handle. Additionally, exploring the integration of multilingual reasoning capabilities could enhance
the model’s ability to understand and generate more complex language structures. Furthermore,
improving model efficiency and scalability will be essential for deploying these systems in real-world
applications, particularly in resource-constrained environments.
References
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey,
Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira,
M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Xu, Y. Zhang, G. H. Ábrego, J. Ahn, J. Austin,
P. Barham, J. A. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catasta, Y. Cheng, C. Cherry, C. A.
Choquette-Choo, A. Chowdhery, C. Crepy, S. Dave, M. Dehghani, S. Dev, J. Devlin, M. Díaz, N. Du,
E. Dyer, V. Feinberg, F. Feng, V. Fienber, M. Freitag, X. Garcia, S. Gehrmann, L. Gonzalez, and et al.
Palm 2 technical report. CoRR, abs/2305.10403, 2023.
M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. In International
Conference on Learning Representations, 2018.
V. Aryabumi, J. Dang, D. Talupuru, S. Dash, D. Cairuz, H. Lin, B. Venkitesh, M. Smith, J. A. Campos, Y. C.
Tan, K. Marchisio, M. Bartolo, S. Ruder, A. Locatelli, J. Kreutzer, N. Frosst, A. Gomez, P. Blunsom,
M. Fadaee, A. Üstün, and S. Hooker. Aya 23: Open weight releases to further multilingual progress,
2024. URL https://arxiv.org/abs/2405.15032.
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier
large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
L. Bandarkar, D. Liang, B. Muller, M. Artetxe, S. N. Shukla, D. Husa, N. Goyal, A. Krishnan, L. Zettle-
moyer, and M. Khabsa. The belebele benchmark: a parallel reading comprehension dataset in
122 language variants. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the
62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 749–775, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi:
10.18653/v1/2024.acl-long.44. URL https://aclanthology.org/2024.acl-long.44.
Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, et al.
A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and
interactivity. arXiv preprint arXiv:2302.04023, 2023.
L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser,
Y. Graham, P. Guzman, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, A. Martins, M. Morishita,
C. Monz, M. Nagata, T. Nakazawa, and M. Negri, editors. Proceedings of the Fifth Conference on
Machine Translation, Online, Nov. 2020. Association for Computational Linguistics. URL https:
//aclanthology.org/2020.wmt-1.0.
O. Bojar, C. Buck, R. Chatterjee, C. Federmann, L. Guillou, B. Haddow, M. Huck, A. J. Yepes, A. Névéol,
M. Neves, P. Pecina, M. Popel, P. Koehn, C. Monz, M. Negri, M. Post, L. Specia, K. Verspoor,
J. Tiedemann, and M. Turchi, editors. Proceedings of the First Conference on Machine Translation:
Volume 2, Shared Task Papers, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.
doi: 10.18653/v1/W16-2300. URL https://aclanthology.org/W16-2300.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry,
A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu,
C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish,
A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle,
M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.
neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
V. Chaudhary, Y. Tang, F. Guzmán, H. Schwenk, and P. Koehn. Low-resource corpus filtering using
multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation
(Volume 3: Shared Task Papers, Day 2), pages 263–268, Florence, Italy, August 2019. Association
for Computational Linguistics. URL http://www.aclweb.org/anthology/W19-5435.
W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E.
Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference,
2024.
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung,
C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay,
N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard,
G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra,
K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi,
D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira,
R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei,
K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. Palm: scaling language modeling with
pathways. J. Mach. Learn. Res., 24(1), Mar. 2024. ISSN 1532-4435.
J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. TyDi
QA: A benchmark for information-seeking question answering in typologically diverse languages.
Transactions of the Association for Computational Linguistics, 8:454–470, 2020. doi: 10.1162/tacl_
a_00317. URL https://aclanthology.org/2020.tacl-1.30.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have
solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/
abs/1803.05457.
Cohere For AI. c4ai-command-r-plus-08-2024, 2024. URL https://huggingface.co/CohereForAI/
c4ai-command-r-plus-08-2024.
A. Conneau and G. Lample. Cross-lingual language model pretraining. Advances in Neural Information
Processing Systems, 32:7059–7069, 2019.
A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov. XNLI:
Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium, Oct.-
Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269. URL
https://aclanthology.org/D18-1269.
G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun. Ultrafeedback: Boosting
language models with high-quality feedback, 2023.
V. Dac Lai, C. Van Nguyen, N. T. Ngo, T. Nguyen, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. Okapi:
Instruction-tuned large language models in multiple languages with reinforcement learning from
human feedback. arXiv e-prints, pages arXiv–2307, 2023.
DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Guo,
D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang,
H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Chen, J. Yuan,
J. Qiu, J. Song, K. Dong, K. Gao, K. Guan, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Zhang,
M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Zhu,
Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge, R. Pan, R. Xu, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen,
S. Wu, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Zheng, T. Wang, T. Pei, T. Yuan, T. Sun,
W. L. Xiao, W. Zeng, W. An, W. Liu, W. Liang, W. Gao, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi,
X. Liu, X. Wang, X. Shen, X. Chen, X. Chen, X. Nie, and X. Sun. Deepseek-v2: A strong, economical,
and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024.
A. Dubey et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
A. El-Kishky, V. Chaudhary, F. Guzmán, and P. Koehn. CCAligned: A massive collection of
cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2020), pages 5960–5969, Online, November 2020. Association
for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.480. URL
https://www.aclweb.org/anthology/2020.emnlp-main.480.
A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek,
V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin. Beyond
english-centric multilingual machine translation, 2020. URL https://arxiv.org/abs/2010.11125.
N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán,
and A. Fan. The flores-101 evaluation benchmark for low-resource and multilingual machine
translation. 2021.
S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D. Giorno, S. Gopi, M. Javaheripi, P. Kauffmann,
G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai,
Y. T. Lee, and Y. Li. Textbooks are all you need. CoRR, abs/2306.11644, 2023.
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong,
and W. Liang. Deepseek-coder: When the large language model meets programming – the rise of
code intelligence, 2024. URL https://arxiv.org/abs/2401.14196.
W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a
haystack: Case studies with sparse probing. Trans. Mach. Learn. Res., 2023, 2023.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive
multitask language understanding. In ICLR. OpenReview.net, 2021.
S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, X. Zhang, Z. L.
Thai, K. Zhang, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, D. Li,
Z. Liu, and M. Sun. Minicpm: Unveiling the potential of small language models with scalable
training strategies. CoRR, abs/2404.06395, 2024.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and
J. He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In
Advances in Neural Information Processing Systems, 2023.
B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, R. Men,
F. Huang, X. Ren, X. Ren, J. Zhou, and J. Lin. Qwen2.5-coder technical report, 2024. URL
https://arxiv.org/abs/2409.12186.
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes,
A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
A. Ibrahim, B. Thérien, K. Gupta, M. L. Richter, Q. G. Anthony, E. Belilovsky, T. Lesort, and I. Rish.
Simple and scalable strategies to continually pre-train large language models. Trans. Mach. Learn.
Res., 2024, 2024.
W. Jiao, W. Wang, J.-t. Huang, X. Wang, and Z. Tu. Is chatgpt a good translator? a preliminary study.
arXiv preprint arXiv:2301.08745, 1(10), 2023.
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing
text classification models. arXiv: Computation and Language, Nov 2016.
Z. Ke, Y. Shao, H. Lin, T. Konishi, G. Kim, and B. Liu. Continual pre-training of language models,
2023. URL https://arxiv.org/abs/2302.03241.
V. D. Lai, N. T. Ngo, A. P. B. Veyseh, H. Man, F. Dernoncourt, T. Bui, and T. H. Nguyen. Chatgpt
beyond english: Towards a comprehensive evaluation of large language models in multilingual
learning. arXiv preprint arXiv:2304.05613, 2023.
Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, and Y. T. Lee. Textbooks are all you need II:
phi-1.5 technical report. CoRR, abs/2309.05463, 2023.
X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du,
R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary, B. O’Horo, J. Wang, L. Zettlemoyer, Z. Kozareva,
M. T. Diab, V. Stoyanov, and X. Li. Few-shot learning with multilingual language models. CoRR,
abs/2112.10668, 2021. URL https://arxiv.org/abs/2112.10668.
H. Lovenia, R. Mahendra, S. M. Akbar, L. J. V. Miranda, J. Santoso, E. Aco, A. Fadhilah, J. Mansurov,
J. M. Imperial, O. P. Kampman, et al. Seacrowd: A multilingual multimodal data hub and benchmark
suite for southeast asian languages. arXiv preprint arXiv:2406.10118, 2024.
H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath:
Empowering mathematical reasoning for large language models via reinforced evol-instruct. CoRR,
abs/2308.09583, 2023.
Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. Wizardcoder:
Empowering code large language models with evol-instruct. In ICLR. OpenReview.net, 2024.
T. Nguyen, C. V. Nguyen, V. D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H.
Nguyen. Culturax: A cleaned, enormous, and multilingual dataset for large language models in
167 languages. In LREC/COLING, pages 4226–4237. ELRA and ICCL, 2024.
OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
A. Gray, et al. Training language models to follow instructions with human feedback. In Advances
in Neural Information Processing Systems, 2022.
G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, H. Alobeidli, A. Cappelli, B. Pannier, E. Almazrouei,
and J. Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web
data only. In NeurIPS, 2023.
T. Pires, E. Schlinger, and D. Garrette. How multilingual is multilingual bert? In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, 2019.
E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, and A. Korhonen. XCOPA: A multilingual
dataset for causal commonsense reasoning. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors,
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 2362–2376, Online, Nov. 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference
optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on
Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.
H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán. WikiMatrix: Mining 135M parallel
sentences in 1620 language pairs from Wikipedia. In P. Merlo, J. Tiedemann, and R. Tsarfaty,
editors, Proceedings of the 16th Conference of the European Chapter of the Association for
Computational Linguistics: Main Volume, pages 1351–1361, Online, Apr. 2021. Association for
Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.115. URL
https://aclanthology.org/2021.eacl-main.115.
S. She, W. Zou, S. Huang, W. Zhu, X. Liu, X. Geng, and J. Chen. MAPO: Advancing multilingual
reasoning through multilingual-alignment-as-preference optimization. In L.-W. Ku, A. Martins, and
V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 10015–10027, Bangkok, Thailand, Aug. 2024. Association
for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.539. URL
https://aclanthology.org/2024.acl-long.539.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou,
D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh
International Conference on Learning Representations, 2023. URL
https://openreview.net/forum?id=fR3wGCk-IXp.
H. Singh, N. Gupta, S. Bharadwaj, D. Tewari, and P. Talukdar. Indicgenbench: A multilingual
benchmark to evaluate generation capabilities of llms on indic languages, 2024a. URL
https://arxiv.org/abs/2404.16816.
S. Singh, F. Vargus, D. Dsouza, B. F. Karlsson, A. Mahendiran, W.-Y. Ko, H. Shandilya, J. Patel,
D. Mataciunas, L. OMahony, M. Zhang, R. Hettiarachchi, J. Wilson, M. Machado, L. S. Moura,
D. Krzemiński, H. Fadaei, I. Ergün, I. Okoh, A. Alaagib, O. Mudannayake, Z. Alyafeai, V. M. Chien,
S. Ruder, S. Guthikonda, E. A. Alghamdi, S. Gehrmann, N. Muennighoff, M. Bartolo, J. Kreutzer,
A. Üstün, M. Fadaee, and S. Hooker. Aya dataset: An open-access collection for multilingual
instruction tuning, 2024b. URL https://arxiv.org/abs/2402.06619.
N. Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi,
J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault,
G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran,
P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn,
A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang. No language left behind: Scaling
human-centered machine translation, 2022. URL https://arxiv.org/abs/2207.04672.
J. Tiedemann. Parallel data, tools and interfaces in OPUS. In N. Calzolari, K. Choukri, T. Declerck,
M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings
of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages
2214–2218, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). URL
http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro,
F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation
language models. CoRR, abs/2302.13971, 2023a.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient
foundation language models, 2023b. URL https://arxiv.org/abs/2302.13971.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava,
S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu,
W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan,
M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril,
J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton,
J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur,
S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and
fine-tuned chat models, 2023c. URL https://arxiv.org/abs/2307.09288.
A. Üstün, V. Aryabumi, Z. X. Yong, W. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H. Ooi,
A. Kayid, F. Vargus, P. Blunsom, S. Longpre, N. Muennighoff, M. Fadaee, J. Kreutzer, and S. Hooker.
Aya model: An instruction finetuned open-access multilingual language model. In ACL (1), pages
15894–15939. Association for Computational Linguistics, 2024.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.
Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu. Document-level machine translation with
large language models. arXiv preprint arXiv:2304.02210, 2023.
J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned
language models are zero-shot learners. In International Conference on Learning Representations,
2022.
T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang, B. Li, C. Cheng, W. Lü, R. Hu, C. Li, L. Yang,
X. Luo, X. Wu, L. Liu, W. Cheng, P. Cheng, J. Zhang, X. Zhang, L. Lin, X. Wang, Y. Ma, C. Dong,
Y. Sun, Y. Chen, Y. Peng, X. Liang, S. Yan, H. Fang, and Y. Zhou. Skywork: A more open bilingual
foundation model. CoRR, abs/2310.19341, 2023a.
X. Wei, H. Wei, H. Lin, T. Li, P. Zhang, X. Ren, M. Li, Y. Wan, Z. Cao, B. Xie, T. Hu, S. Li, B. Hui, B. Yu,
D. Liu, B. Yang, F. Huang, and J. Xie. Polylm: An open source polyglot large language model. CoRR,
abs/2307.06018, 2023b.
G. Wenzek, M. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave. Ccnet:
Extracting high quality monolingual datasets from web crawl data. In LREC, pages 4003–4012.
European Language Resources Association, 2020.
C. Whitehouse, M. Choudhury, and A. F. Aji. Llm-powered data augmentation for enhanced
cross-lingual performance. In EMNLP, pages 671–686. Association for Computational Linguistics, 2023.
C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang. Wizardlm: Empowering
large pre-trained language models to follow complex instructions. In ICLR. OpenReview.net, 2024.
L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. mT5:
A massively multilingual pre-trained text-to-text transformer. In K. Toutanova, A. Rumshisky,
L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou,
editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 483–498, Online, June 2021.
Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL
https://aclanthology.org/2021.naacl-main.41.
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei,
H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin,
K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao,
R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang,
X. Wei, X. Ren, X. Liu, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Guo, and
Z. Fan. Qwen2 technical report. CoRR, abs/2407.10671, 2024a.
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei,
H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin,
K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao,
R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang,
X. Wei, X. Ren, X. Liu, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Guo, and
Z. Fan. Qwen2 technical report, 2024b. URL https://arxiv.org/abs/2407.10671.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate
problem solving with large language models. Advances in Neural Information Processing Systems,
36, 2024.
Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. J. Ratner, R. Krishna, J. Shen, and C. Zhang. Large language
model as attributed training data generator: A tale of diversity and bias. In NeurIPS, 2023.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your
sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
2019.
B. Zhang, P. Williams, I. Titov, and R. Sennrich. Improving massively multilingual neural machine
translation and zero-shot translation. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors,
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
1628–1639, Online, July 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.acl-main.148. URL https://aclanthology.org/2020.acl-main.148.
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A
human-centric benchmark for evaluating foundation models, 2023. URL
https://arxiv.org/abs/2304.06364.
Çağatay Yıldız, N. K. Ravichandran, P. Punia, M. Bethge, and B. Ermis. Investigating continual
pretraining in large language models: Insights and implications, 2024. URL
https://arxiv.org/abs/2402.17400.
A. Üstün, V. Aryabumi, Z.-X. Yong, W.-Y. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H.-L.
Ooi, A. Kayid, F. Vargus, P. Blunsom, S. Longpre, N. Muennighoff, M. Fadaee, J. Kreutzer, and
S. Hooker. Aya model: An instruction finetuned open-access multilingual language model, 2024.
URL https://arxiv.org/abs/2402.07827.
A. Appendix
A.1. Dataset Preprocessing
A.1.1. Translation Templates
Table 13 | Translation templates used in our experiments. Note: The translation templates are
used to construct our parallel data, where ^ indicates the position of a line break. The placeholders
<src_lang>, <tgt_lang>, <input>, and <output> represent the source language name, target language
name, source text, and target text in the parallel pair, respectively.
ID Template (in English)
A <src_lang> phrase: <input> ^ <tgt_lang> phrase: <output>
B <src_lang> text: <input> ^ <tgt_lang> text: <output>
C Translate the text from <src_lang> to <tgt_lang>: ^ <src_lang> text: <input> ^ <tgt_lang> text: <output>
D Translate the words from <src_lang> to <tgt_lang>: ^ <src_lang> words: <input> ^ <tgt_lang> words: <output>
E Convert the phrase from <src_lang> to <tgt_lang>: ^ <src_lang> phrase: <input> ^ <tgt_lang> phrase: <output>
F Render the <src_lang> sentence <input> to <tgt_lang>: <output>
G Provide the translation of the sentence <input> from <src_lang> to <tgt_lang>: <output>
H Change the phrase <input> to <tgt_lang>, the translated phrase is: <output>
I Please change the sentence <input> to <tgt_lang>, and the resulting translation is: <output>
J Change the phrase <input> to <tgt_lang>, resulting in: <output>
K The sentence <input> in <src_lang> means <output> in <tgt_lang>
To standardize the translation process, we designed a diverse set of translation templates as
shown in Table 13. These templates serve multiple purposes: they provide consistent formatting
for the parallel data, enable clear instruction-following capabilities, and help maintain structural
consistency across different language pairs. The templates range from simple direct translation
formats (Templates A and B) to more elaborate instructional patterns (Templates C through K), each
designed to capture different aspects of the translation task. By incorporating various phrasings
and structures, these templates help improve the model’s robustness and ability to handle diverse
translation requests. The symbol ^ in the templates is a line break, which helps maintain clear visual
separation between different parts of the translation pair.
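As an illustration of how the templates in Table 13 could be applied, the following minimal Python
sketch fills a template with one parallel pair and expands ^ into a line break. The helper names,
the three-template subset, and the uniform random choice of template are assumptions made for this
example; the paper does not describe its implementation or how templates are sampled.

```python
import random
import re

# Three of the Table 13 templates, copied verbatim; "^" marks a line break.
TEMPLATES = {
    "A": "<src_lang> phrase: <input> ^ <tgt_lang> phrase: <output>",
    "C": ("Translate the text from <src_lang> to <tgt_lang>: ^ "
          "<src_lang> text: <input> ^ <tgt_lang> text: <output>"),
    "K": "The sentence <input> in <src_lang> means <output> in <tgt_lang>",
}

PLACEHOLDER = re.compile(r"<(src_lang|tgt_lang|input|output)>")


def render(template, src_lang, tgt_lang, source, target):
    """Fill one template with a parallel pair (hypothetical helper)."""
    values = {"src_lang": src_lang, "tgt_lang": tgt_lang,
              "input": source, "output": target}
    # Split on "^" before substitution so a "^" inside the data stays literal.
    parts = [part.strip() for part in template.split("^")]
    # Single-pass substitution: placeholders inside the data are not re-expanded.
    filled = [PLACEHOLDER.sub(lambda m: values[m.group(1)], part) for part in parts]
    return "\n".join(filled)


def sample_example(pair, rng):
    """Pick a template uniformly at random (an assumption) and render the pair."""
    template_id = rng.choice(sorted(TEMPLATES))
    return template_id, render(TEMPLATES[template_id], *pair)


if __name__ == "__main__":
    pair = ("English", "German", "Good morning.", "Guten Morgen.")
    print(render(TEMPLATES["C"], *pair))
    print(sample_example(pair, random.Random(0)))
```

Substituting all placeholders in a single pass keeps source or target text that happens to contain
a placeholder string or a ^ character from being altered.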
Table 14 | Prompt Templates for Synthetic Data.
Method 	Prompt
Keywords-based Explanation
Suppose that you are a/an {role_1} in {subject}. Please explain the following keywords and
meet the following requirements:
(1) The keywords: {keywords};
(2) Each keyword explanation should contain at least three sentences. You can generate a story
about the keyword for better explanation;
(3) The explanations suit {role_2} students;
(4) Summarize the explanations.
Your answer should be a list of keywords. Make the explanations correct, useful, understandable,
and diverse.
Keywords-based Story
Assume that you are a/an {role_1} in {subject}. Before you teach students new vocabulary,
please write a {type_passage} about the new knowledge and meet the following requirements:
(1) It must contain keywords: {keywords};
(2) Its setting should be {scene};
(3) Should be between {min_length} and {max_length} words in length;
(4) The writing style should be {style};
(5) The suitable audience is {role_2};
(6) Should end with {ending};
(7) Should be written in {language}.
Few-shot Based SFT Data
I want you to act as a Sample Generator. Your goal is to draw inspiration from the Given
Sample to create a brand new sample. This new sample should belong to the same domain
as the Given Sample but be even rarer. The length and complexity of the Created Sample
should be similar to that of the Given Sample. The Created Sample must be reasonable and
understandable by humans. The terms Given Sample, Created Sample, ’given sample’, and
’created sample’ are not allowed to appear in the Created Sample.
Given Sample:
(1) Sample doc 1
(2) Sample doc 2
(3) Sample doc 3
(4) ...
Created Sample:
A.1.2. Prompt Templates
Table 14 summarizes the prompt templates for synthetic data.
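The placeholders in Table 14 can be filled programmatically. The sketch below is a minimal Python
illustration for the keywords-based story prompt and for the layout of the few-shot prompt. The
helper names and all example values (role, subject, keywords, and so on) are hypothetical; the
paper does not specify how the prompts were instantiated.

```python
# Keywords-based story prompt from Table 14, with its placeholders kept as-is.
STORY_PROMPT = (
    "Assume that you are a/an {role_1} in {subject}. Before you teach students new "
    "vocabulary, please write a {type_passage} about the new knowledge and meet the "
    "following requirements:\n"
    "(1) It must contain keywords: {keywords};\n"
    "(2) Its setting should be {scene};\n"
    "(3) Should be between {min_length} and {max_length} words in length;\n"
    "(4) The writing style should be {style};\n"
    "(5) The suitable audience is {role_2};\n"
    "(6) Should end with {ending};\n"
    "(7) Should be written in {language}."
)


def build_story_prompt(keywords, **fields):
    """Fill the story prompt; raises KeyError if a placeholder value is missing."""
    return STORY_PROMPT.format(keywords=", ".join(keywords), **fields)


def build_few_shot_prompt(instruction, samples):
    """Lay out the few-shot prompt: instruction, numbered samples, open slot."""
    numbered = [f"({i}) {doc}" for i, doc in enumerate(samples, start=1)]
    return "\n".join([instruction, "Given Sample:", *numbered, "Created Sample:"])


if __name__ == "__main__":
    # All values below are hypothetical examples, not taken from the paper.
    print(build_story_prompt(
        ["photosynthesis", "chlorophyll"],
        role_1="teacher", subject="biology", type_passage="short story",
        scene="a school garden", min_length=150, max_length=300,
        style="narrative", role_2="middle school students",
        ending="a question for the reader", language="Kazakh",
    ))
    # The instruction text is passed in by the caller; a stand-in is used here.
    print(build_few_shot_prompt("Instruction text from Table 14 goes here.",
                                ["Sample doc 1", "Sample doc 2", "Sample doc 3"]))
```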
A.2. More Evaluation Results for Instruct LLMs
We show more evaluation results for the post-trained LLMs in this section. On the TydiQA
benchmark [Clark et al., 2020], shown in Table 15, Marco-Chat obtains the highest average score in
both the 7B and 70B model categories: 57.7 for the 7B models and 61.0 for the 70B models. It
achieves the top scores in languages such as Arabic and Bengali, which indicates that it handles
typologically diverse languages well. The CEVAL benchmark [Huang et al., 2023] results in Table 16
show that Marco achieves the highest average score in both categories, 86.4 among the 7B models
and 94.5 among the 70B models. Particularly noteworthy is Marco’s performance in demanding
categories such as ’Hard’ and ’STEM’, where it outscores all baselines on challenging questions and
quantitative reasoning. The AGIEval benchmark [Zhong et al., 2023] results in Table 17 highlight
Marco-Chat’s multilingual and reasoning capabilities, particularly in the 7B model category, where
it performs strongly on Chinese and Gaokao-related tasks. It achieves the top scores in categories
such as Gaokao-English and Gaokao-History, which reflects its ability to handle diverse linguistic
challenges and contextual comprehension. In the 70B model category, Marco-Chat continues to lead
with an average score of 72.7.
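The averages quoted above can be checked against the per-language scores. The following Python
sketch assumes that the Avg. Scores row of Table 15 is an unweighted mean over the eleven
languages (the paper does not state how the row is computed) and recomputes the 7B Marco-Chat
value from the scores listed in the table.

```python
# Marco-Chat 7B scores per language, copied from Table 15.
MARCO_CHAT_7B = {
    "Arabic": 78.4, "Bengali": 74.6, "English": 44.4, "Finnish": 51.5,
    "Indonesian": 48.1, "Japanese": 70.5, "Korean": 77.9, "Russian": 46.6,
    "Swahili": 31.3, "Telugu": 35.2, "Thai": 76.2,
}


def macro_average(scores):
    """Unweighted mean over languages (assumed definition of 'Avg. Scores')."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    print(round(macro_average(MARCO_CHAT_7B), 1))  # prints 57.7
```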
Table 15 | Performance comparison of language models on TydiQA benchmark [Clark et al., 2020]
across different languages.
Language     Qwen2  Qwen2.5  Llama-3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat
7B Models
Arabic       47.5   48.0     58.8     68.3      67.2    45.0         78.4
Bengali      42.3   61.6     56.8     64.3      50.6    21.3         74.6
English      29.0   42.7     24.4     41.6      53.6    49.6         44.4
Finnish      27.1   48.4     54.8     47.3      44.7    23.5         51.5
Indonesian   28.6   38.4     34.5     31.8      33.4    30.2         48.1
Japanese     0.8    8.7      17.6     31.8      71.0    41.7         70.5
Korean       38.2   45.6     25.4     57.9      76.0    20.3         77.9
Russian      22.4   33.7     31.2     40.8      49.9    25.7         46.6
Swahili      17.5   8.9      30.1     45.2      11.8    13.3         31.3
Telugu       28.7   22.8     48.8     83.3      5.5     11.3         35.2
Thai         39.7   55.7     54.3     70.8      55.2    29.3         76.2
Avg. Scores  29.2   39.0     39.7     53.0      47.2    28.3         57.7
70B Models
Arabic       42.5   61.6     72.7     62.4      67.5    40.9         78.4
Bengali      47.1   65.7     55.9     68.1      43.6    28.0         73.3
English      43.0   45.5     46.6     41.2      53.6    46.8         51.2
Finnish      55.8   39.0     57.0     50.9      52.5    26.0         57.0
Indonesian   39.9   47.8     44.0     31.9      35.8    27.6         46.5
Japanese     38.6   44.7     43.8     49.9      73.6    22.1         68.4
Korean       41.1   43.2     43.8     63.0      73.1    29.5         69.4
Russian      38.5   43.2     42.5     36.3      52.4    35.5         45.9
Swahili      18.6   21.3     46.0     33.8      20.4    12.5         49.2
Telugu       36.9   41.4     45.5     80.4      26.6    22.9         62.8
Thai         41.5   68.2     74.0     66.4      52.7    38.7         76.5
Avg. Scores  40.3   48.4     52.0     53.1      50.2    30.0         61.0
Table 16 | Performance comparison on the CEVAL benchmark across different categories.
Category        Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco
7B Models
Average         81.8   77.9     50.8    55.6      43.9    48.5         86.4
Hard            63.1   51.8     33.9    36.9      32.6    32.4         79.9
Other           84.9   82.8     53.5    56.3      44.9    48.3         81.0
Humanities      84.3   79.6     47.2    53.8      43.2    50.3         84.8
Social Science  91.3   87.8     58.9    64.9      51.2    54.3         90.4
STEM            73.9   69.4     47.2    51.5      40.1    44.7         88.3
70B Models
Average         90.6   88.2     66.7    71.6      53.6    56.9         94.5
Hard            77.8   73.3     51.1    56.0      35.2    36.5         94.0
Other           92.0   88.4     63.1    69.2      52.8    54.7         92.5
Humanities      92.8   91.0     66.5    70.0      56.9    60.3         95.2
Social Science  94.8   91.5     75.6    82.0      61.8    66.4         95.7
STEM            88.4   84.9     64.2    68.5      48.1    51.4         94.6
Table 17 | Agieval 7B Results: Performance of various models across different categories of the Agieval
dataset. The best performance in each category is highlighted in bold.
Category            Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat
7B Models
Chinese             59.5   57.1     37.7    62.5      36.5    34.4         65.3
English             54.0   61.5     41.5    51.0      37.9    39.9         56.6
Gaokao              64.5   63.4     41.5    53.0      40.3    37.3         71.0
Gaokao-Chinese      77.6   67.9     47.6    58.5      46.3    42.3         80.5
Gaokao-English      89.5   85.3     79.1    89.5      82.0    67.0         93.1
Gaokao-Geography    81.4   80.4     49.8    75.9      50.8    50.3         86.9
Gaokao-History      86.0   82.1     55.7    74.9      51.1    49.8         89.8
Gaokao-Biology      81.9   80.5     52.9    76.2      43.8    35.7         87.6
Gaokao-Chemistry    56.5   55.6     32.4    46.9      29.5    30.4         74.4
Gaokao-MathQA       45.3   55.3     32.5    16.2      27.4    27.4         60.7
Gaokao-Physics      45.0   44.5     15.5    34.0      30.5    26.0         57.5
Gaokao-MathCloze    17.0   18.6     8.5     5.1       1.7     6.8          8.5
LogiQA-ZH           60.8   54.7     42.1    61.1      36.3    37.8         68.2
LSAT-AR             26.5   23.9     23.5    30.9      21.3    23.0         29.6
LSAT-LR             57.3   66.3     54.5    84.5      41.0    45.3         79.2
LSAT-RC             66.2   74.4     69.5    89.2      55.8    56.1         75.8
LogiQA-EN           46.7   50.4     43.5    59.6      36.7    35.3         61.4
SAT-Math            82.7   88.6     68.2    44.1      35.9    49.1         62.3
SAT-EN              83.0   84.5     82.0    91.3      77.2    64.1         84.0
SAT-EN w/o Passage  46.6   42.2     46.1    44.7      42.2    39.8         52.9
Math                82.7   48.5     20.0    57.6      7.1     13.6         12.8
AQUA-RAT            59.8   74.8     52.0    61.0      24.0    32.7         51.2
JEC-QA-KD           38.0   33.1     17.5    26.9      18.2    17.8         37.5
JEC-QA-CA           34.7   27.4     19.1    26.0      20.4    21.1         38.4
Average             57.1   59.0     43.4    41.8      37.1    36.7         61.5
70B Models
Chinese             66.3   66.0     51.0    49.3      43.5    42.1         74.5
English             65.7   69.4     65.3    62.5      45.5    50.6         70.4
Gaokao              71.8   71.9     54.1    53.0      47.6    46.6         80.6
Gaokao-Chinese      85.0   84.6     56.5    58.5      50.8    52.4         92.7
Gaokao-English      89.5   92.2     90.9    89.5      88.9    71.6         96.1
Gaokao-Geography    88.9   87.9     74.9    75.9      66.3    65.8         96.0
Gaokao-History      92.3   87.2     71.1    74.9      68.9    63.4         95.3
Gaokao-Biology      86.2   86.2     74.3    76.2      54.8    53.3         95.2
Gaokao-Chemistry    68.1   66.2     39.6    46.9      37.2    35.8         84.1
Gaokao-MathQA       60.7   66.1     45.0    16.2      29.1    36.8         74.6
Gaokao-Physics      53.5   56.5     23.0    34.0      31.5    32.0         75.0
Gaokao-MathCloze    22.0   20.3     11.9    5.1       0.9     8.5          16.1
LogiQA-ZH           70.2   70.1     61.1    61.1      46.2    46.2         82.3
LSAT-AR             32.2   27.4     31.7    30.9      19.6    20.9         40.0
LSAT-LR             73.1   88.0     80.8    84.5      60.0    59.4         94.5
LSAT-RC             77.7   85.5     84.8    89.2      72.9    62.1         91.1
LogiQA-EN           56.8   59.6     55.8    59.6      43.9    40.3         75.7
SAT-Math            90.5   84.1     86.4    44.1      43.6    66.8         74.6
SAT-EN              89.8   91.3     89.8    91.3      87.9    77.7         90.8
SAT-EN w/o Passage  49.0   49.5     55.3    44.7      44.7    45.6         69.4
Math                46.5   63.1     34.2    57.6      8.9     25.7         22.6
AQUA-RAT            75.2   76.0     69.3    61.0      28.0    32.0         75.2
JEC-QA-KD           39.0   35.5     34.4    26.9      23.2    19.3         41.4
JEC-QA-CA           39.9   20.3     28.9    26.0      24.1    19.5         44.6
Average             66.0   67.5     57.1    55.0      44.4    45.7         72.7
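
The bold highlighting mentioned in the Table 17 caption does not survive in a plain-text rendering. As a minimal sketch, the best model per category can be recomputed from the reported values. The names MODELS, ROWS, and best_models are illustrative and not part of the paper, and only four 7B rows from Table 17 are included here.

```python
# Column order as given in the Table 17 header.
MODELS = ["Qwen2", "Qwen2.5", "Llama3", "Llama3.1",
          "Aya-23", "Aya-expanse", "Marco-Chat"]

# A small subset of the 7B rows, copied from Table 17 (not the full table).
ROWS = {
    "Chinese": [59.5, 57.1, 37.7, 62.5, 36.5, 34.4, 65.3],
    "English": [54.0, 61.5, 41.5, 51.0, 37.9, 39.9, 56.6],
    "LSAT-RC": [66.2, 74.4, 69.5, 89.2, 55.8, 56.1, 75.8],
    "Gaokao-MathCloze": [17.0, 18.6, 8.5, 5.1, 1.7, 6.8, 8.5],
}


def best_models(scores):
    """Return every model that ties for the highest score in one row."""
    top = max(scores)
    return [model for model, score in zip(MODELS, scores) if score == top]


# Map each category to its best-scoring model(s).
best = {category: best_models(scores) for category, scores in ROWS.items()}

for category, winners in best.items():
    print(f"{category}: {', '.join(winners)}")
```

Returning a list rather than a single name keeps ties visible, which matters for rows where two models report the same score.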