Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
Summary
Marco-LLM is a multilingual large language model developed by Alibaba's MarcoPolo Team to address performance gaps in low-resource languages. Built upon the Qwen2 foundation, it undergoes massive multilingual continual pre-training using 300 billion tokens curated from 29 languages, including web data, parallel corpora, high-quality knowledge sets, and synthetic data. The training employs a two-stage strategy: Stage I balances adaptation and catastrophic forgetting with a mixed data distribution, while Stage II increases low-resource language proportions with a lower learning rate. Comprehensive evaluations on benchmarks such as MMMLU, Belebele, Flores, and AGIEval demonstrate that Marco-LLM significantly outperforms state-of-the-art models like Llama3 and Qwen2.5. Notably, Marco-7B and Marco-72B achieve leading scores in low-resource languages like Nepali and Kazakh, while maintaining strong performance in high-resource languages like English and Chinese. The model also excels in any-to-any machine translation tasks, surpassing commercial systems in several language pairs. Subsequent post-training includes multilingual supervised fine-tuning and preference alignment to enhance instruction following and cultural appropriateness. These results validate that targeted massive multilingual training effectively bridges the capability gap between high- and low-resource languages.
2024-12-6
Lingfeng Ming*, Bo Zeng*, Chenyang Lyu*, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang†, Weihua Luo, Kaifu Zhang
MarcoPolo Team, Alibaba International Digital Commerce

Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not only perform exceptionally well in multilingual tasks, including low-resource languages, but also maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring LLMs work accurately across various languages.
Figure 1 | Comparison of English-centric performance vs Multilingual performance on MMMLU and Flores. Our Marco-LLM demonstrates strong performance on both dimensions.

*Equal Contribution. †Corresponding Author: wanglongyue.wly@alibaba-inc.com
© 2024 Alibaba. All rights reserved
arXiv:2412.04003v1 [cs.CL] 5 Dec 2024

Contents
1 Introduction
2 Related Work
3 Massive Multilingual Continual Pretraining for Large Language Models
  3.1 Data Collection and Filtering
    3.1.1 Multilingual Web Data Curation
    3.1.2 Parallel Data
    3.1.3 High-quality Knowledge Data
    3.1.4 Synthetic Data
    3.1.5 Data Statistics
    3.1.6 Data Utilization
  3.2 Two-Stage Continual Pretraining Strategy
  3.3 Evaluation Results
    3.3.1 Benchmarks and Evaluation Protocol
    3.3.2 Baseline LLMs
    3.3.3 Experimental Results
  3.4 Evolution of performance during continual pretraining
  3.5 Data ablations on Parallel Data
  3.6 Effect of Learning Rate
4 Extensive Multilingual Post-training for Large Language Models
  4.1 Multilingual Supervised Fine-tuning
    4.1.1 Data Construction
    4.1.2 Training Setup
    4.1.3 Evaluation Benchmarks
    4.1.4 Baseline LLMs
    4.1.5 Results and Discussion
  4.2 Multilingual Preference Alignment
    4.2.1 Dataset Construction from Existing Preference Data
    4.2.2 Multilingual Preference Data Generation and Translation
    4.2.3 Evaluation Results
5 Conclusion and Future Work
A Appendix
  A.1 Dataset Preprocessing
    A.1.1 Translation Templates
    A.1.2 Prompt Templates
  A.2 More Evaluation Results for Instruct LLMs
1. Introduction

Large Language Models (LLMs) [Brown et al., 2020, OpenAI, 2023, Touvron et al., 2023b,c, Dubey et al., 2024, Guo et al., 2024] have transformed the field of Natural Language Processing (NLP) by achieving impressive results across a variety of tasks such as language understanding, generation, and translation [Bang et al., 2023, Wang et al., 2023, Jiao et al., 2023, Bai et al., 2023, Hui et al., 2024, Yao et al., 2024]. Models like GPT-4 [OpenAI, 2023], GPT-4o [Hurst et al., 2024], PaLM [Chowdhery et al., 2024], and LLaMA [Dubey et al., 2024] have demonstrated that scaling up model size and training data leads to significant performance gains. However, the majority of these advances have been centered around high-resource languages, predominantly English [Bang et al., 2023, Jiao et al., 2023, Lai et al., 2023]. This focus has resulted in a performance gap when these models are applied to multilingual tasks, especially involving low-resource languages.

Multilingual NLP faces unique challenges due to the diversity and imbalance of linguistic resources [Pires et al., 2019, Conneau and Lample, 2019]. Low-resource languages often lack the extensive textual data required for training large models, which hinders the development of effective language technologies for these languages. As a result, speakers of low-resource languages are underrepresented in the benefits brought by recent advancements in NLP.
To address this disparity, we have developed Marco-LLM, a multilingual language model trained with a focus on low-resource languages. An overview of the framework of our Marco-LLM is shown in Figure 2. By collecting a substantial amount of multilingual data covering a series of underrepresented languages, and through massive multilingual continual pre-training and extensive multilingual post-training (including multilingual supervised finetuning and multilingual preference alignment) using the Qwen2 model [Bai et al., 2023] as a foundation, Marco-LLM aims to bridge the performance gap in multilingual NLP tasks. Our main contributions can be summarized as follows:
• We compile and curate a large-scale multilingual dataset tailored for low-resource languages, enhancing the diversity and richness of training data.
• We perform massive multilingual continual pre-training and post-training on the Qwen2 model to develop Marco-LLM, a multilingual LLM that substantially improves performance on low-resource language tasks.
• We conduct comprehensive evaluations on benchmarks such as MMMLU, Flores, Belebele, AGIEval, multilingual MT-bench, etc., demonstrating that Marco-LLM outperforms state-of-the-art models in multilingual settings.

The rest of this paper is structured as follows:
• In Section 2, we give an overview of relevant literature regarding the development trajectory of LLMs, especially multilingual LLMs and continual pre-training for LLMs.
• We present the details of our continual pre-training experiments in Section 3, including the monolingual and multilingual data collection and curation, training setup, and evaluation results.
• Furthermore, in Section 4 we describe how we conduct post-training (including supervised finetuning and preference alignment) for the continually pre-trained Marco-LLM from Section 3, including how we construct our multilingual supervised finetuning data, how we formulate the task format, the supervised training details, and the corresponding evaluation results on multilingual benchmark datasets.

Figure 2 | An overview of the training and evaluation paradigm of our Marco-LLM. We conduct massive multilingual continual pre-training, multilingual supervised finetuning, and preference alignment. We further perform extensive evaluation on multilingual benchmarks to validate the efficacy of our Marco-LLM.

2. Related Work

In recent years, Large Language Models (LLMs) such as GPT-3 [Brown et al., 2020], GPT-4 [OpenAI, 2023], PaLM [Chowdhery et al., 2024], and LLaMA [Touvron et al., 2023b] have revolutionized the research paradigm of Natural Language Processing (NLP). These transformer-based [Vaswani et al., 2017] models have demonstrated exceptional capabilities in generating and understanding text, achieving state-of-the-art results in tasks like language translation, summarization, and question-answering [Bang et al., 2023, Wang et al., 2023]. However, their primary focus has been on high-resource languages, leaving a gap in multilingual performance [Jiao et al., 2023, Bang et al., 2023, Lai et al., 2023].
To bridge the gap in multilingual applications, multilingual models such as mBERT [Pires et al., 2019], XLM-R [Conneau and Lample, 2019], mT5 [Xue et al., 2021], and PolyLM [Wei et al., 2023b] have been developed. These models aim to provide cross-lingual capabilities by being trained on diverse language datasets. Despite their promise, they often struggle with low-resource languages due to insufficient training data and the inherent difficulty of balancing multiple languages within one model [Shi et al., 2023, Dac Lai et al., 2023, Singh et al., 2024a, Lovenia et al., 2024]. Efforts in this area have included data augmentation, transfer learning, and specialized models for specific languages or tasks [Conneau et al., 2018, Artetxe et al., 2018, Conneau and Lample, 2019, Goyal et al., 2021, Team et al., 2022]. These approaches, while beneficial, have not fully harnessed the potential of large-scale language models [Üstün et al., 2024].

Continual pre-training offers a viable solution to enhance the adaptability of LLMs by allowing models to incorporate new information without starting from scratch [Ke et al., 2023, Çağatay Yıldız et al., 2024]. This method enables models to improve performance in underrepresented areas, particularly for low-resource languages. By leveraging existing strengths and optimizing computational resources, continual pre-training presents a scalable approach to overcoming current limitations in multilingual and low-resource language processing.

Building on these insights, we introduce Marco-LLM, which leverages continual pre-training of the Qwen2 model with a focus on low-resource languages. Our approach integrates large-scale multilingual data collection and advanced training techniques to enhance the model's capabilities in multilingual NLP tasks.
3. Massive Multilingual Continual Pretraining for Large Language Models

In this section, we present the technical details of how we conduct continual pretraining on LLMs. Specifically, the process of continual pretraining for multilingual LLMs encompasses the following: (1) the curation and filtering of a large-scale training corpus, which will be introduced in Section 3.1; (2) simple and scalable strategies to efficiently conduct continual pretraining at large scale for LLMs, detailed in Section 3.2; (3) a comprehensive presentation of evaluation results across multiple benchmarks, along with an analysis, in Section 3.3.

3.1. Data Collection and Filtering

We create our dataset for Marco-LLM training from a variety of data sources containing knowledge until the end of 2024.4. We apply several data cleaning mechanisms and de-duplication methods on each data source to obtain high-quality tokens.

3.1.1. Multilingual Web Data Curation

To produce high-quality multilingual training data, we meticulously designed a cascaded data-processing pipeline. Similar to prior processing pipelines (e.g., CCNet [Wenzek et al., 2020], RefinedWeb [Penedo et al., 2023], Llama [Touvron et al., 2023a], etc.) focusing on English, our multilingual pipeline features a series of data-cleaning strategies targeting text quality and information distribution.

Document Preparation. Our pipeline starts from the target document preparation to keep information distribution. First of all, we extract the main contents from the raw Common Crawl (WARC) files. Meanwhile, we perform URL filtering and language identification to avoid subsequent computationally expensive processing.
The former targets fraudulent and/or adult websites (e.g., predominantly pornographic, violent, related to gambling, etc.), while the latter focuses on the target languages. Specifically, we classify documents according to their primary languages and remove those with low classification confidence, leveraging inexpensive n-gram models (e.g., fastText [Joulin et al., 2016]).

Quality Filtering. The filters in this part aim to remove low-quality text. We filter out documents based on heuristic rules: (1) word blocklists and garbled-text filters; (2) document length, the ratio of special symbols, the ratio of stop words, and the ratio of short, consecutive, or incomplete lines; (3) repeated words and n-grams. Inspired by CulturaX [Nguyen et al., 2024], the filtering thresholds are based on a statistical analysis of large document samples. To enhance our filtering process, we utilize the KenLM library [Wenzek et al., 2020] to evaluate a vast array of web documents. Documents with perplexity scores largely above average are subsequently removed.

Deduplication. After filtering, we implement a comprehensive deduplication pipeline following the procedure in RefinedWeb [Penedo et al., 2023]. This pipeline integrates document-level MinHash deduplication and sub-document exact-match deduplication, effectively identifying and removing duplicate content within and across documents.
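As a rough illustration of the quality-filtering and deduplication steps described above, the following Python sketch applies a few heuristic rules and a pure-Python MinHash near-duplicate check. All thresholds, function names, and the 5-word shingle size are illustrative assumptions rather than values reported by the authors; the language-identification, KenLM perplexity, and sub-document exact-match stages are not shown.

```python
import hashlib
from collections import Counter


def passes_heuristics(text, min_words=50, max_symbol_ratio=0.1, max_top_word_ratio=0.2):
    """Return True if a document survives simple length/symbol/repetition rules."""
    words = text.split()
    if not words or len(words) < min_words:  # document-length rule
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    if symbols / len(text) > max_symbol_ratio:  # special-symbol ratio rule
        return False
    top_count = Counter(w.lower() for w in words).most_common(1)[0][1]
    return top_count / len(words) <= max_top_word_ratio  # repeated-word rule


def shingles(text, n=5):
    """Set of overlapping n-word shingles (assumed granularity)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum salted hash of the shingles, per seed."""
    sh = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in sh
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Greedy document-level dedup: keep a doc only if no kept doc is too similar."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic in the number of documents; pipelines operating at web scale typically bucket signatures with locality-sensitive hashing instead.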
Table 1 | Overview of corpus utilization rates across various languages and categories of our corpus. We show the total number of tokens (in billions) available and used for high-resource and low-resource languages, as well as for other data sources such as synthetic data.

Language / Source               Total Tokens (B)   Used Tokens (B)   Utilization Rate (%)
High-Resource Languages
  English (en)                  1,459.7            90.4              6.2
  Chinese (zh)                  214.7              48.2              22.4
  Arabic (ar)                   45.8               10.6              23.0
  German (de)                   442.8              10.6              2.4
  Spanish (es)                  397.8              10.6              2.7
  French (fr)                   320.8              10.6              3.3
  Korean (ko)                   41.8               10.6              25.2
  Japanese (ja)                 224.2              10.6              4.7
  Portuguese (pt)               145.3              10.6              7.3
  Turkish (tr)                  80.6               10.6              13.1
Low-Resource Languages
  Bengali (bn)                  6.5                1.9               28.7
  Hebrew (he)                   11.3               1.9               16.4
  Indonesian (id)               23.8               1.9               7.8
  Italian (it)                  56.3               1.9               3.3
  Malay (ms)                    2.9                1.9               64.4
  Dutch (nl)                    31.3               1.9               6.0
  Polish (pl)                   45.3               1.9               4.1
  Russian (ru)                  251.0              1.9               0.7
  Thai (th)                     8.3                1.9               22.4
  Ukrainian (uk)                18.4               1.9               10.1
  Urdu (ur)                     8.7                1.9               21.4
  Vietnamese (vi)               24.4               1.9               7.6
  Czech (cs)                    270.2              1.9               0.7
  Greek (el)                    376.8              1.9               0.5
  Hungarian (hu)                214.4              1.9               0.9
  Kazakh (kk)                   16.8               1.9               11.1
  Romanian (ro)                 160.0              1.9               1.2
  Azerbaijani (az)              19.4               1.9               9.6
  Nepali (ne)                   22.6               1.9               8.2
Other Data Sources
  Parallel Data                 103.0              20.8              20.2
  High-quality Knowledge Data   65.0               16.4              25.2
  Synthetic Data                6.0                4.4               73.3
Total                           5,115.9            300.0             5.9

It's worth noting that we perform quality filtering and deduplication within the data for each language. At this stage, we obtain a large amount of high-quality monolingual data in low-resource languages, generally called multilingual data; the statistics are detailed in Table 1.
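The utilization rate in Table 1 is the share of available tokens actually used, i.e. used tokens divided by total tokens, expressed as a percentage. The short Python check below recomputes it for a handful of rows copied from the table; the function and variable names are illustrative only.

```python
# (total tokens in billions, used tokens in billions), copied from Table 1
ROWS = {
    "English (en)": (1459.7, 90.4),
    "Parallel Data": (103.0, 20.8),
    "High-quality Knowledge Data": (65.0, 16.4),
    "Synthetic Data": (6.0, 4.4),
    "Total": (5115.9, 300.0),
}


def utilization_rate(total_b, used_b):
    """Percentage of the available tokens that were used, to one decimal place."""
    return round(100.0 * used_b / total_b, 1)


rates = {name: utilization_rate(total, used) for name, (total, used) in ROWS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Not every row can be reproduced this way from the printed figures (for example, 1.9 / 2.9 for Malay gives about 65.5% rather than the listed 64.4%), presumably because the token counts shown in the table are themselves rounded.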
We determine the amount of multilingual tokens used in continual pretraining experimentally (shown in Section 3.2), balancing model performance on Chinese, English, and multilingual benchmarks.

Figure 3 | The amount of tokens per category in our multilingual continual pretraining corpus for Marco-LLM. (Bar chart, "Corpus Size by Category (Billion Tokens)": Total Tokens and Used Tokens for each language and data source listed in Table 1.)

3.1.2. Parallel Data

Following prior work such as PaLM2 [Anil et al., 2023], Skywork [Wei et al., 2023a], and PolyLM [Wei et al., 2023b], we incorporate parallel data into our continual pretraining dataset to further improve the cross-lingual and multilingual ability of Marco. This data is meticulously structured to pair a complete source-language paragraph with its corresponding target-language counterpart, ensuring a seamless alignment of linguistic capabilities between the two languages. We mainly focus on open-source parallel data, OPUS [Tiedemann, 2012, Zhang et al., 2020] and CCAligned [Chaudhary et al., 2019, El-Kishky et al., 2020]. The former covers 100 languages and is English-centric, meaning that all training pairs include English on either the source or target side, while the latter consists of parallel or comparable web-document pairs in 137 languages aligned with English.
There are many bad cases, such as translation errors, in these open-source data. We utilize the following pipeline to process them.

Heuristic Filtering. To obtain high-quality parallel corpora, we develop heuristics to remove additional low-quality documents, outliers, and documents with excessive repetitions. Some examples of heuristics include:
• We filter the sentence pairs using the ratio of special symbols, the ratio of stop words, the ratio of digits, and the ratio of repeated words in the source sentence;
• We use similarity scores of LASER embeddings* for sentence pairs to filter out sentence pairs with low scores.

Diverse Translation Templates. After filtering, translation templates are employed to concatenate the parallel corpora. Prior work [Üstün et al., 2024] has shown the importance of diverse wording, templates, and task types to aid generalization to different natural inputs. Therefore, we utilize diverse translation templates to diversify the input and enhance semantic alignment between multilingual parallel sentences; the details can be found in Appendix A.1.1.

Empirically, the parallel data has proven important for enhancing the cross-lingual and multilingual ability of Marco, as well as specific NLP downstream tasks such as machine translation. More experimental details are presented in Section 3.5.

*https://github.com/facebookresearch/LASER
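The parallel-data steps of Section 3.1.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the two templates are hypothetical stand-ins for those in Appendix A.1.1, the thresholds are invented, and the sentence embeddings are assumed to be precomputed (for instance with LASER) and passed in as plain lists of floats rather than produced by any library call here.

```python
import math
import random

# Hypothetical templates for illustration; the paper's own are in Appendix A.1.1.
TEMPLATES = [
    "Translate the following {src_lang} text into {tgt_lang}.\n{src}\n{tgt}",
    "{src_lang}: {src}\n{tgt_lang}: {tgt}",
]


def digit_ratio(sentence):
    """Share of characters in the sentence that are digits."""
    return sum(ch.isdigit() for ch in sentence) / max(1, len(sentence))


def repeated_word_ratio(sentence):
    """Share of words that repeat an earlier word in the sentence."""
    words = sentence.lower().split()
    return 1.0 - len(set(words)) / len(words) if words else 0.0


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_pair(src, src_vec, tgt_vec,
              max_digit_ratio=0.3, max_repeat_ratio=0.5, min_similarity=0.7):
    """Apply source-side ratio heuristics, then an embedding-similarity threshold."""
    if digit_ratio(src) > max_digit_ratio or repeated_word_ratio(src) > max_repeat_ratio:
        return False
    return cosine(src_vec, tgt_vec) >= min_similarity


def render(src, tgt, src_lang, tgt_lang, rng):
    """Concatenate a sentence pair with a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    return template.format(src=src, tgt=tgt, src_lang=src_lang, tgt_lang=tgt_lang)
```

Sampling the template per pair, rather than fixing one format, is what introduces the input diversity the section argues for.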
3.1.3. High-quality Knowledge Data

Prior works, such as MiniCPM [Hu et al., 2024], Skywork [Wei et al., 2023a], and Llama3 [Dubey et al., 2024], have shown that introducing high-quality knowledge data into the pretraining stage enhances model capabilities. Similarly, we mix some high-quality knowledge data into continual pretraining. Our high-quality knowledge data consists of mQA (multilingual Question Answer), STEM, and some ability-oriented SFT data, such as math and coding. All of them are collected from the open-source website Huggingface†. Similar to our pipelines for parallel multilingual data described above, we implement filters to remove low-quality data and data from common benchmarks. In addition to n-gram deduplication, we employ semantic deduplication to process them. Note that we do not include any training sets from commonly used benchmarks in our high-quality knowledge data.

3.1.4. Synthetic Data

Books and papers are high-quality knowledge data, but they are scarce. The phi models [Gunasekar et al., 2023, Li et al., 2023] are trained on synthetic textbooks to enhance performance. Recent work [Whitehouse et al., 2023] suggests that multilingual synthetic data can also enhance cross-lingual transfer. We therefore also mix some synthetic data into our continual pretraining. Our synthetic data mainly consists of two parts: keywords-based explanation and story data, and ability-oriented SFT data. We describe them below.

Keywords-based Explanation/Story Data. To synthesize Wikipedia- and book-style data, we first use GPT-4 to generate a large number of keywords covering diverse topics such as mathematics, history, physics, etc. This results in a keyword pool of 100k keywords. Then, we sample several target keywords from the keyword pool and employ other LLMs with the designed prompts to generate explanations of the target words and textbook stories, respectively.
Similar to AttrPrompt [Yu et al., 2023], the prompts contain several attributes, and we vary the attribute values to generate diverse samples. An example of such prompts can be found in Appendix A.1.2, where the LLMs are instructed to generate training data based on attributes such as key words and subject.

Ability-Oriented SFT Data. WizardLM [Xu et al., 2024], WizardMath [Luo et al., 2023], and WizardCoder [Luo et al., 2024] are designed to synthesize diverse instructions for a target ability. We follow them to enhance Marco-LLM's capacity in math and coding. To further improve the cross-lingual and multilingual ability, we use an in-context learning method and translation technology to generate multilingual data. For the former, we employ a few-shot prompt to generate a new sample in a similar domain; the prompts are listed in Appendix A.1.2. For the latter, we randomly translate the original instruction, question, or answer into another language, resulting in cross-lingual QA.

To enhance the diversity of synthetic data, we employ several strong models (like GPT-4, Deepseek-v2 [DeepSeek-AI et al., 2024], DBRX‡, Command-R Plus [Cohere For AI, 2024], etc.) to act as generators, as their generations are of higher quality.

3.1.5. Data Statistics

The composition of the continual pretraining dataset used for Marco-LLM is detailed in Table 1.
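Returning briefly to the keyword-based synthesis of Section 3.1.4, the attribute-varied prompting idea can be sketched in Python as below. The prompt wording, the attribute names and values, and the tiny keyword pool are all hypothetical; the authors' actual prompts are those in Appendix A.1.2, and the sketch stops at building prompt strings without calling any LLM.

```python
import itertools
import random

# Hypothetical prompt skeleton; the real prompts are given in Appendix A.1.2.
PROMPT = ("Write a {style} on the subject of {subject}. "
          "Make sure it covers these keywords: {keywords}.")

# Hypothetical attribute values to vary across prompts.
ATTRIBUTES = {
    "style": ["encyclopedia-style explanation", "textbook story"],
    "subject": ["mathematics", "history", "physics"],
}


def build_prompts(keyword_pool, attributes, num_prompts=4, keywords_per_prompt=2, seed=0):
    """Sample target keywords and one attribute combination per prompt."""
    rng = random.Random(seed)
    names = sorted(attributes)
    combos = list(itertools.product(*(attributes[name] for name in names)))
    prompts = []
    for _ in range(num_prompts):
        keywords = rng.sample(keyword_pool, keywords_per_prompt)
        values = dict(zip(names, rng.choice(combos)))
        prompts.append(PROMPT.format(keywords=", ".join(keywords), **values))
    return prompts


if __name__ == "__main__":
    pool = ["prime number", "Silk Road", "entropy", "photosynthesis"]
    for prompt in build_prompts(pool, ATTRIBUTES):
        print(prompt)
```

Each resulting string would then be sent to a generator model; varying both the sampled keywords and the attribute combination is what spreads the outputs across topics and styles.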
The multilingual data mainly covers the following languages: English (en), Chinese (zh), Arabic (ar), German (de), Spanish (es), French (fr), Korean (ko), Japanese (ja), Portuguese (pt), Turkish (tr), Azerbaijani (az), Bengali (bn), Czech (cs), Greek (el), Hebrew (he), Hungarian (hu), Indonesian (id),
Italian (it), Kazakh (kk), Malay (ms), Nepali (ne), Dutch (nl), Polish (pl), Romanian (ro), Russian (ru), Thai (th), Ukrainian (uk), Urdu (ur), Vietnamese (vi). The data distribution across different languages is extremely uneven, as shown in Figure 3. We group them into a rough taxonomy of lower-resourced (LR) and higher-resourced (HR) languages. This yields a split of the 29 languages in our training mixture into 10 HR and 19 LR languages. It's worth noting that our dataset contains 5.1T tokens in total, but only about 300B tokens are used to develop our Marco. We will discuss the utilization of data in Section 3.1.6.

† https://huggingface.co/
‡ https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

3.1.6. Data Utilization

Although we gathered a total of 5.1T tokens of training data, the actual amount used for continual pretraining is 300B. The overall data utilization rate is only 5.9%. Table 1 presents a detailed breakdown of the utilization of each language and data source. We have the following findings:
• It is evident that, overall, the data utilization rates for both high-resource and low-resource languages are low. The more data we collect, the less effectively we utilize it. The Malay corpus contains only 2.9B tokens, so its utilization rate reaches 64.4%.
• The parallel data, high-quality knowledge data, and synthetic data have higher utilization rates.
These data are of high quality and focused on specific tasks; moreover, the cost of acquisition is substantial. Therefore, these data are well used. Specifically, the parallel data is used to enhance downstream machine translation, while the others are employed to enhance targeted abilities of the LLM. The quality and diversity of synthetic data enhance Marco-LLM's performance, thus its utilization rate is up to 73.3%.
• The multilingual capacity relies on the tokens utilized. Intuitively, to improve performance in specific languages, we should immerse the LLM in a more diverse corpus. The number of tokens used for each high-resource language is 10.6B, compared to 1.9B for each low-resource language. According to Table 6 and Table 7, the improvement in low-resource languages is greater than that in high-resource languages. This indicates that open-source models (e.g., Qwen2/2.5 and Llama3/3.1) primarily focus on high-resource languages, and that continual pretraining on only 1.9B tokens for each low-resource language produces significant effects.

Based on the above findings, we outline the following directions for further exploration:
• Further improving quality and diversity of multilingual data. The overall data utilization rate is only 5.9%. This means that we should further improve the quality of our training data for the next model iterations. Moreover, taking the diversity of multilingual data into consideration, we will pay more attention to collecting multilingual data with local features.
• Synthetic data scaling and improving self-improvement capability. We conclude that training with synthetic data is effective. Therefore, we will continuously focus on developing advanced techniques that can control and manipulate specific attributes of the generated data, enabling the creation of diverse and customizable synthetic datasets.
to the underlying constraints and patterns present in the target domain. Furthermore, we aim to unlock the potential of emerging self-improvement capabilities by using Marco-LLM to generate synthetic data, as we believe it can potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.
Table 2 | The proportion of high-quality and multilingual sources.
Data                          Stage-I  Stage-II
English (en)                  32%      28%
Chinese (zh)                  17%      15%
High-Resourced (Others)       30%      26%
Low-Resourced                 9%       15%
Parallel Multilingual Data    6%       8%
High-quality Knowledge Data   5%       6%
Synthetic Data                1%       2%
3.2. Two-Stage Continual Pretraining Strategy
We develop Marco-LLM based on Qwen2, which is designed to be multilingual-friendly and has an extensive vocabulary of 150k tokens, ensuring a high compression rate for multilingual data. Nonetheless, the primary challenge of continual pretraining lies in balancing adaptation (too little of which leads to suboptimal performance on the new dataset) against catastrophic forgetting (which results in significant capability loss on the previous dataset). In this paper, we propose a two-stage continual pretraining approach designed to facilitate the transfer of commonsense knowledge, primarily acquired in English and Chinese, to a variety of low-resource languages, as well as to specific downstream NLP tasks such as machine translation. When it comes to continual pretraining, the data mixture and the learning rate are two crucial hyper-parameters for optimizing Marco. In our practice, we employ different
hyper-parameters in the two-stage training. Specifically, in Stage-I we employ data mixing to balance the adaptation of multilingual capabilities against catastrophic forgetting, while the goal of Stage-II is to further strengthen Marco-LLM's multilingual capabilities via a lower maximum learning rate. We describe these stages below.
Data Mixture. Optimizing LLMs to learn knowledge encoded in multiple languages simultaneously is a significant challenge. We formulate this problem as transferring general knowledge to low-resource languages while maintaining the advantage of the original languages in the base model. To address this issue, we keep 32% English and 17% Chinese corpus in Stage-I training to avoid catastrophic forgetting, as shown in Table 2. Meanwhile, the proportions of other high-resourced (HR, apart from Chinese and English) and low-resourced (LR) languages are about 30% and 9%, respectively, to develop Marco-LLM's multilingual capabilities. The intuition is that HR languages have a higher priority and are more commonly used. In Stage-II, we raise the proportion of LR data from 9% to 15% to further strengthen Marco-LLM's multilingual capabilities in low-resourced languages. Moreover, the proportions of parallel data, high-quality knowledge data, and synthetic data are increased, while those of the English and Chinese corpora are reduced. Notably, the proportions of the different corpora were determined by a large number of experiments on Marco-1.5B with a constant learning rate, not by manual experience.
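As an illustration of the stage-wise mixture in Table 2, the sketch below encodes the two sets of proportions as sampling weights and draws a data source for each training document. The source labels, the `sample_source` and `empirical_shares` helpers, and the choice of per-document sampling are assumptions made for this example; the paper does not describe how the mixture is implemented.

```python
import random

# Mixture proportions from Table 2, in percent.
# The keys are illustrative labels, not names used by the authors.
MIXTURE = {
    "stage1": {"en": 32, "zh": 17, "hr_other": 30, "lr": 9,
               "parallel": 6, "knowledge": 5, "synthetic": 1},
    "stage2": {"en": 28, "zh": 15, "hr_other": 26, "lr": 15,
               "parallel": 8, "knowledge": 6, "synthetic": 2},
}


def sample_source(stage: str, rng: random.Random) -> str:
    """Draw one data source with probability proportional to its share."""
    weights = MIXTURE[stage]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def empirical_shares(stage: str, n: int, seed: int = 0) -> dict:
    """Sample n sources and return the observed share of each, in percent."""
    rng = random.Random(seed)
    counts = dict.fromkeys(MIXTURE[stage], 0)
    for _ in range(n):
        counts[sample_source(stage, rng)] += 1
    return {name: 100.0 * c / n for name, c in counts.items()}


if __name__ == "__main__":
    for stage in MIXTURE:
        # Each column of Table 2 sums to 100%.
        assert sum(MIXTURE[stage].values()) == 100
        print(stage, empirical_shares(stage, 100_000))
```

With enough samples the observed shares approach the table values, for example close to 15% low-resourced data in Stage-II against 9% in Stage-I.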
Lower Maximal Learning Rate. The learning rate is a crucial hyper-parameter in neural network models that controls the magnitude of parameter updates [Ibrahim et al., 2024]. However, most performant open-source LLMs [Yang et al., 2024a, Dubey et al., 2024] decay their learning rate to a small value (i.e., around 1e-7) by the end of training. In our practice, the learning rate is re-warmed and re-decayed to improve adaptation per unit of compute spent when continually pretraining on a new distribution. The key challenge is to choose the maximum learning rate. Empirically, decreasing the schedule's maximum learning rate helps reduce forgetting, whereas increasing it improves adaptation. We refer to Section 3.6 for learning rate tuning. After conducting fine-grained learning rate experiments with the data mixture fixed, we set the maximal learning rate to 1e-5 in Stage-I to strike a balance between the acquisition of multilingual capabilities and catastrophic forgetting. Based on the HR multilingual capacities acquired in Stage-I training, we proceed to train on the LR corpus with a smaller learning rate of 6e-6. As described in Table 3, Stage-I uses about 160B training tokens, while Stage-II introduces an additional 140B tokens. Note that an attempt to apply the Warmup-Stable-Decay (WSD) learning rate scheduler [Hu et al., 2024] resulted in subpar performance. We suspect that the stable stage does not necessarily benefit continual pretraining.
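The re-warmed, re-decayed schedule can be sketched as a function of the tokens seen within a stage. The sketch uses the values reported in this section: peaks of 1e-5 (Stage-I, 160B tokens) and 6e-6 (Stage-II, 140B tokens), a warm-up from 0 over the first 10B tokens of each stage, and cosine decay to 10% of the peak. The function name `lr_at` and the linear shape of the warm-up are assumptions; the scheduler actually used in the training framework may differ in detail.

```python
import math

# Per-stage peak learning rate and token budget (Table 3).
STAGES = {
    "stage1": {"peak_lr": 1e-5, "total_tokens": 160e9},
    "stage2": {"peak_lr": 6e-6, "total_tokens": 140e9},
}
WARMUP_TOKENS = 10e9  # warm-up length at the start of each stage
FLOOR_RATIO = 0.1     # the decay ends at 10% of the peak


def lr_at(stage: str, tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` tokens of the given stage."""
    cfg = STAGES[stage]
    peak, total = cfg["peak_lr"], cfg["total_tokens"]
    tokens_seen = min(max(tokens_seen, 0.0), total)
    if tokens_seen < WARMUP_TOKENS:
        # Assumed linear warm-up from 0 to the peak.
        return peak * tokens_seen / WARMUP_TOKENS
    # Cosine decay from the peak down to FLOOR_RATIO * peak.
    progress = (tokens_seen - WARMUP_TOKENS) / (total - WARMUP_TOKENS)
    floor = FLOOR_RATIO * peak
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for stage in STAGES:
        for billions in (0, 5, 10, 80, 140):
            print(stage, billions, lr_at(stage, billions * 1e9))
```

Because each stage restarts from 0, Stage-II begins with a fresh warm-up to its lower peak instead of continuing from the end of the Stage-I decay.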
Training. Marco-LLM models were trained using Pai-Megatron-LM§ on a cluster of 512 A100 GPU (64x80G) servers. To accelerate multi-node training of large models, tensor parallelism and pipeline parallelism were set to 8 to maximize the model FLOPs utilization (MFU) of NVIDIA GPUs. Specifically, we employ 512 GPUs for Marco-72B and 256 GPUs for Marco-1.5B/7B. To ensure that the models learn a similar distribution, we conduct experiments on Marco-1.5B to optimize the data mixture and the learning rate, and then scale these settings up to Marco-7B and Marco-72B. We adopt most of the pretraining settings and model architectures from Qwen2. During training, all Marco-LLM models were continually pre-trained on 300B tokens using the Adam optimizer (β1 = 0.9, β2 = 0.95). We use a context window length of 32768 for Marco-1.5B and 7B, while the sequence length is 8192 for Marco-72B. At each stage, we warm up the learning rate from 0 to the maximum learning rate over the first 10B tokens, and then decay it to 10% of the maximal learning rate using a cosine schedule. We use a weight decay of 0.1 and gradient clipping of 1.0. After the two-stage continual pretraining, we obtain our multilingual large model, Marco. Our two-stage continual pretraining is both effective and efficient, and it can be extended to further, rarer low-resourced languages; we leave this exploration to future model iterations.
Table 3 | The training corpus tokens and learning rate in the two-stage continual pretraining.
Stage     Training Tokens (B)  LR
Stage-I   160                  1e-5
Stage-II  140                  6e-6
3.3. Evaluation Results
We aim to assess the capabilities of Marco-LLM from various perspectives: 1) the ability of LLMs to understand and generate natural language, as well as the ability to grasp world knowledge; 2) the performance of LLMs across different languages; and 3) their capacity to handle cross-lingual tasks such as machine translation. Following
the experiment design of previous work [Wei et al., 2023b], we gather a subset of datasets from previous NLP tasks to construct a multilingual benchmark. Table 4 summarizes the evaluation tasks and datasets, together with their language coverage.
3.3.1. Benchmarks and Evaluation Protocol
All the datasets in the above multilingual benchmark can be divided into four groups: Natural Language Understanding, Knowledge, Question Answering, and Machine Translation. The details of each dataset that we use for benchmarking are given below.
Table 4 | Evaluation benchmarks overview. The table presents the evaluation suite used in our experiments, spanning four major task categories: general knowledge, multilingual understanding, question answering, and machine translation. Each dataset is evaluated using accuracy (Acc.), F1 score, or BLEU, covering a range of languages from single-language (en, zh) to multilingual scenarios.
Task                        Dataset      Split    Metric  #Languages    #-shots
General Knowledge           CEVAL        Val      Acc.    One           5-shot
General Knowledge           AGIEval      Test     Acc.    One           5-shot
General Knowledge           ARC          Test     Acc.    One           25-shot
General Knowledge           MMLU         Test     Acc.    One           5-shot
Multilingual Understanding  XCOPA        Val      Acc.    Six           5-shot
Multilingual Understanding  X-MMLU       Val      Acc.    Thirteen      5-shot
Multilingual Understanding  XStoryCloze  Val      Acc.    Six           5-shot
Question Answering          TyDiQA       Val      F1      Six           1-shot
Question Answering          Belebele     Test     Acc.    Twenty-Eight  1-shot
Machine Translation         Flores       Devtest  BLEU    Twenty-Eight  1-shot
Machine Translation         WMT-16       Test     BLEU    Three         1-shot
§ https://github.com/alibaba/Pai-Megatron-Patch/
AGIEval: AGIEval [Zhong et al., 2023] is a benchmark dataset designed to evaluate the reasoning and problem-solving abilities of artificial intelligence models on tasks that mimic human examinations. It includes thousands of multiple-choice questions sourced from real-world standardized tests such as the GRE, GMAT, LSAT, and other professional certification exams. The questions cover a wide range of subjects, including mathematics, logical reasoning, reading comprehension, and analytical writing.
ARC (AI2 Reasoning Challenge): The ARC dataset [Clark et al., 2018] consists of 7,787 natural language questions designed to assess an AI model's ability to answer grade-school-level science questions. It is divided into two subsets: the Easy Set, containing 5,197 questions that can often be answered with surface-level information, and the Challenge Set, comprising 2,590 more difficult questions that require reasoning and background knowledge beyond simple retrieval. Questions are multiple-choice, typically with four options each, covering topics like biology, physics, chemistry, and earth science.
Belebele: Belebele [Bandarkar et al., 2024] is a multilingual multiple-choice reading comprehension dataset spanning 122 language variants, including low-resource and typologically diverse languages. It provides a benchmark for evaluating machine reading comprehension across a wide linguistic spectrum. Each question includes a passage, a query, and four answer choices.
CEVAL: CEVAL [Huang et al., 2023] is a comprehensive evaluation suite designed to assess the capabilities of Chinese language models. It includes over 13,000 multiple-choice questions sourced from real Chinese national college entrance examinations and professional qualification tests. The dataset covers
52 subjects across disciplines such as mathematics, physics, law, medicine, and language arts. Each question has four options.
Flores: The Flores (Facebook Low Resource Languages for Emergent Situations) dataset [Team et al., 2022] is a multilingual machine translation benchmark focused on low-resource languages. It includes parallel sentences in over 100 languages, with a particular emphasis on underrepresented and low-resource languages. The dataset provides high-quality, professionally translated sentences, allowing for accurate evaluation of machine translation models.
HellaSwag: HellaSwag [Zellers et al., 2019] is a benchmark dataset designed to test a model's ability to perform commonsense reasoning and natural language inference. It contains 70,000 multiple-choice questions generated from descriptions of everyday activities. Each question provides a context and four possible endings, with one correct continuation and three misleading, adversarially constructed distractors. For example:
Context: "A person is cooking onions on a stove. They begin to cry because..."
Options:
1. "the onions release gas that irritates the eyes." (Correct)
2. "they are listening to sad music."
3. "the stove is very hot."
4. "they forgot to buy garlic."
MMLU (Massive Multitask Language Understanding): The MMLU benchmark [Hendrycks et al., 2021] is designed to evaluate the broad knowledge and problem-solving abilities of AI models across 57 subjects. It includes roughly 16,000 multiple-choice questions from high school, undergraduate,
and professional levels. Subjects span various domains, including history, mathematics, law, medicine, computer science, and more. Each question has four options, and the tasks often require reasoning, calculation, or application of specialized knowledge.
TyDiQA: TyDiQA (Typologically Diverse Question Answering) is a benchmark dataset [Clark et al., 2020] for information-seeking question answering in 11 typologically diverse languages. It includes over 200,000 question-answer pairs with passages from Wikipedia in languages such as Arabic, Bengali, Finnish, Japanese, Korean, Russian, Swahili, Telugu, and others. The dataset is designed to test a model's ability to understand and generate answers in different languages without relying on English translations. TyDiQA has two primary tasks: the Gold Passage Task, where models are provided with the correct passage containing the answer, and the Minimal Answer Task, where models must find the minimal span in the passage that answers the question.
WMT16: The WMT16 [Bojar et al., 2016] datasets are part of the Conference on Machine Translation's annual shared tasks, which provide standard benchmark datasets for evaluating machine translation systems. WMT16 includes parallel corpora for language pairs such as English-German, English-French, and English-Russian, and expands on previous years by adding languages like Romanian and incorporating more challenging test sets.
XCOPA: XCOPA [Ponti et al., 2020] is a multilingual dataset for evaluating causal commonsense reasoning in AI systems across multiple languages. It extends the original COPA (Choice of Plausible Alternatives) dataset to 11 languages, including
languages like Haitian Creole, Quechua, and Swahili. Each question consists of a premise and two alternative causes or effects, and the task is to select the more plausible one. For example:
Premise: "The ground is wet."
Options:
1. "It rained last night." (Cause)
2. "The sun is shining."
X-MMLU: X-MMLU [Dac Lai et al., 2023] is an extension of the MMLU benchmark for evaluating multilingual models. It includes versions of the original MMLU tasks translated into multiple languages. The dataset aims to assess a model's ability to perform multitask language understanding across diverse languages, testing both its knowledge and reasoning skills in non-English contexts. X-MMLU covers subjects like mathematics, science, and humanities, requiring models to demonstrate proficiency comparable to educated human speakers in various languages.
XStoryCloze: XStoryCloze [Lin et al., 2021] is a multilingual version of the Story Cloze Test, designed to evaluate a model's ability to understand and reason about narratives in different languages. The dataset provides short, four-sentence stories followed by two possible endings, and the task is to choose the coherent conclusion. It includes translations of the stories into multiple languages, testing narrative understanding and commonsense reasoning. For example:
Story:
1. "Maria woke up early on Saturday."
2. "She was excited about the trip."
3. "She packed her bags quickly."
4. "She grabbed her keys and left the house."
Endings:
A. "She arrived at the airport just in time for her flight." (Correct)
B. "She
decided to go back to sleep because it was raining."
3.3.2. Baseline LLMs
Llama3 and Llama3.1. The Llama 3 series includes the Llama3-8B/70B and Llama3.1-8B/70B models. These models have been pretrained on over 15 trillion tokens from publicly available sources, achieving superior performance on various benchmarks [Dubey et al., 2024].
Qwen2 and Qwen2.5. We compare with the Qwen2 [Yang et al., 2024b] series of LLMs, including the Qwen2-7B/72B and Qwen2.5-7B/72B models. Qwen2 is pre-trained on 7 trillion tokens; Qwen2.5 is a further optimized version of Qwen2, pretrained on 18 trillion tokens, that achieved state-of-the-art performance on multiple evaluation benchmarks.
3.3.3. Experimental Results
Table 5 | Performance comparison of LLMs across various benchmarks: results for LLMs at both the 7B and 70B scales. The best performance in each benchmark is in bold.
Model         AGIEval  Belebele  CEval  Flores  MMLU  TyDiQA  WMT16  XCOPA  XMMLU  XStoryCloze
7B Models
Qwen2-7B      64.6     73.4      83.0   27.1    71.9  52.3    18.1   70.6   60.2   70.6
Qwen2.5-7B    66.5     72.3      81.4   27.2    75.4  59.9    18.2   73.6   62.6   70.3
Llama3-8B     24.3     55.3      37.5   33.1    53.6  50.5    24.6   71.7   49.7   66.5
Llama3.1-8B   44.9     63.3      52.8   33.4    66.2  57.0    25.8   71.6   49.2   71.7
Marco-7B      68.8     78.8      83.5   35.0    74.4  60.8    29.0   76.6   61.2   71.9
70B+ Models
Qwen2-72B     78.2     86.5      90.4   38.7    83.8  58.7    30.2   80.9   78.5   77.1
Qwen2.5-72B   80.8     87.6      90.6   35.0    86.3  63.7    31.0   84.7   79.9   76.3
Llama3-70B    60.6     85.5      66.8   37.4    79.2  64.3    34.3   81.1   72.0   76.9
Llama3.1-70B  61.7     86.2      67.3   36.9    78.8  62.8    35.0   83.0   71.4   75.4
Marco-72B     84.4     90.0      93.7   45.0    86.3  62.7    35.1   85.7   81.2   78.7
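The gaps between rows of Table 5 are plain per-benchmark differences. As a small worked example, the sketch below recomputes the margins of Marco-7B over Qwen2.5-7B from the table. The data layout and the `margins` helper are illustrative choices and are not part of the paper's evaluation code; the scores are copied from Table 5.

```python
BENCHMARKS = ["AGIEval", "Belebele", "CEval", "Flores", "MMLU",
              "TyDiQA", "WMT16", "XCOPA", "XMMLU", "XStoryCloze"]

# Two 7B rows of Table 5, in the column order given by BENCHMARKS.
SCORES_7B = {
    "Qwen2.5-7B": [66.5, 72.3, 81.4, 27.2, 75.4, 59.9, 18.2, 73.6, 62.6, 70.3],
    "Marco-7B":   [68.8, 78.8, 83.5, 35.0, 74.4, 60.8, 29.0, 76.6, 61.2, 71.9],
}


def margins(model: str, baseline: str, scores: dict) -> dict:
    """Per-benchmark score difference (model minus baseline), to one decimal."""
    return {
        name: round(m - b, 1)
        for name, m, b in zip(BENCHMARKS, scores[model], scores[baseline])
    }


if __name__ == "__main__":
    for name, delta in margins("Marco-7B", "Qwen2.5-7B", SCORES_7B).items():
        print(f"{name:12s} {delta:+.1f}")
```

Positive values mark benchmarks where Marco-7B leads; with these two rows, MMLU and XMMLU come out negative, so the comparison is not uniform across benchmarks.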
Results divided by benchmarks. Our proposed models, Marco-7B and Marco-72B, demonstrate superior multilingual capabilities across a variety of benchmarks compared to baseline models of similar sizes (see Table 5). On multilingual understanding and reasoning tasks such as Belebele, CEVAL, MMLU, XMMLU, XCOPA, and XStoryCloze, the Marco-LLM models outperform or match the other strong LLMs in most cases (at the 7B scale, Qwen2.5-7B remains ahead on MMLU and XMMLU), indicating enhanced proficiency in comprehension and commonsense reasoning across diverse languages.
For the 7B models, Marco-7B achieves the best performance across several benchmarks, obtaining the highest scores in AGIEval (68.8), Belebele (78.8), CEval (83.5), Flores (35.0), and TyDiQA (60.8). These results highlight its proficiency in handling diverse tasks and datasets, outperforming the strongest competitor, Qwen2.5-7B, by 2.3, 6.5, 2.1, 7.8, and 0.9 points, respectively.
The 72B models further amplify these strengths. Marco-72B achieves the highest scores in multiple benchmarks, including AGIEval (84.4), Belebele (90.0), CEval (93.7), Flores (45.0), and XMMLU (81.2), showcasing its capacity to handle complex linguistic tasks. Compared to Qwen2.5-72B, Marco-72B is ahead by 3.6 points in AGIEval, 2.4 points in Belebele, and 10.0 points in Flores. Notably, Marco-72B achieves the highest scores in benchmarks like CEVAL (93.7) and XMMLU (81.2), underscoring its advanced multilingual understanding and reasoning abilities.
In addition, the Marco-LLM models exhibit strong performance in multilingual translation and generation tasks, as evidenced by competitive scores on the Flores and WMT16 benchmarks. Their
strong results in TyDiQA (the best among the 7B models) further highlight their capacity for multilingual question answering and knowledge retrieval. Overall, these findings support the view that our focus on enhancing the multilingual abilities of LLMs has led to models that are highly effective in understanding, reasoning, and generating text across multiple languages, thereby validating the efficacy of our approach.
Table 6 | Average performance on the multilingual benchmarks shown in Table 4 for five 7B-parameter LLMs, broken down by 29 languages. The best performance for each language is highlighted in bold.
Language          Llama3-8B  Llama3.1-8B  Qwen2-7B  Qwen2.5-7B  Marco-7B
Chinese (zh)      55.1       63.5         75.8      75.5        76.0
English (en)      69.6       74.2         77.4      78.0        77.9
Arabic (ar)       40.5       52.6         61.3      64.5        66.0
German (de)       47.3       59.8         69.5      69.0        72.9
Spanish (es)      57.0       64.9         70.4      71.9        72.6
French (fr)       56.0       63.5         69.8      71.2        72.5
Japanese (ja)     63.4       74.9         76.7      76.4        77.3
Korean (ko)       43.2       63.8         76.0      79.7        78.3
Portuguese (pt)   56.8       64.7         70.7      72.3        72.3
Turkish (tr)      51.1       62.9         66.0      66.6        73.4
Azerbaijani (az)  39.1       48.2         52.2      53.2        69.4
Bengali (bn)      45.9       49.8         63.9      62.3        68.9
Hebrew (he)       50.2       56.6         69.8      68.6        77.3
Indonesian (id)   64.0       66.1         77.3      77.8        82.3
Italian (it)      68.1       69.4         80.3      79.4        83.4
Polish (pl)       62.2       60.7         77.2      76.3        80.9
Malay (ms)        56.8       57.7         73.2      73.9        80.2
Dutch (nl)        61.0       65.2         80.2      71.6        82.1
Romanian (ro)     65.6       67.4         77.3      72.6        80.8
Russian (ru)      69.2       70.7         81.2      77.1        83.2
Thai (th)         54.8       53.3         69.1      73.7        72.9
Ukrainian (uk)    60.9       60.4         72.7      70.4        79.9
Urdu (ur)         50.0       56.7         63.9      59.9        71.3
Vietnamese (vi)   67.6       70.3         76.1      79.0        81.2
Czech (cs)        59.2       63.6         76.9      70.0        78.2
Greek (el)        68.1       67.7         65.0      67.2        77.1
Hungarian (hu)    59.8       61.0         63.3      57.9        69.7
Kazakh (kk)       41.3       44.0         43.1      45.9        66.1
Nepali (ne)       37.0       41.56        36.33     42.1        65.8
Avg. Scores       55.9       61.2         69.4      69.1        75.5
Results divided by languages. We evaluate both the 7B and 72B variants of Marco-LLM against various strong open-source LLMs across a diverse set of languages on the benchmarks described in Section 3.3.1. Both Marco-LLM variants are built by continual pretraining of Qwen2 on massive multilingual data. The per-language results in Table 6 reveal that Marco-7B achieves a leading average score of 75.5 among the 7B-parameter models. In low-resource languages such as Nepali and Kazakh, Marco-7B obtains strong scores of 65.8 and 66.1, respectively, outperforming Qwen2-7B by substantial margins of 29.47 and 23.0 points. This improvement underscores the benefits of our continual pretraining approach using a vast multilingual corpus, which enhances the model's ability to generalize across languages with limited resources. Marco-7B also shows competitive performance in high-resource languages like Arabic and German, surpassing Qwen2.5-7B by 1.5 and 3.9 points, respectively. These results highlight the effectiveness of our continual pretraining approach, which improves the model's adaptability to different linguistic structures and complexities.
Table 7 | Average performance on the multilingual benchmarks shown in Table 4 for five 70B-parameter LLMs, broken down by 29 languages. The best performance for each language is highlighted in bold.
Language          Llama3-70B  Llama3.1-70B  Qwen2-72B  Qwen2.5-72B  Marco-72B
Chinese (zh)      75.2        75.0          83.7       84.8         86.4
English (en)      82.7        83.2          84.6       85.1         86.0
Arabic (ar)       73.0        73.4          76.2       77.4         79.7
German (de)       80.8        80.9          84.0       85.0         87.0
Spanish (es)      80.7        79.4          82.5       83.1         84.8
French (fr)       78.9        80.7          80.4       82.7         82.8
Japanese (ja)     82.5        84.5          86.6       86.1         86.2
Korean (ko)       87.1        87.0          88.8       88.9         90.6
Portuguese (pt)   81.1        80.8          83.6       83.8         85.5
Turkish (tr)      80.8        81.3          79.0       81.5         84.4
Azerbaijani (az)  77.2        77.9          78.8       79.1         85.8
Bengali (bn)      79.7        79.7          83.2       82.9         86.7
Hebrew (he)       80.1        82.3          83.9       84.1         85.1
Indonesian (id)   87.7        88.0          87.6       88.4         93.0
Italian (it)      87.1        87.9          88.1       89.9         91.0
Polish (pl)       87.2        87.1          88.6       88.8         88.2
Malay (ms)        85.2        87.2          83.4       87.9         90.7
Dutch (nl)        88.9        88.8          89.0       90.3         93.0
Romanian (ro)     88.4        88.2          88.2       87.3         90.4
Russian (ru)      87.9        88.3          89.9       91.0         90.3
Thai (th)         80.2        80.4          85.7       87.6         88.3
Ukrainian (uk)    89.0        88.6          88.2       90.2         91.7
Urdu (ur)         80.6        81.2          81.4       82.1         87.9
Vietnamese (vi)   87.1        89.6          88.7       90.0         90.8
Czech (cs)        87.1        87.6          87.8       90.0         91.4
Greek (el)        88.3        89.6          89.4       89.0         91.1
Hungarian (hu)    83.8        83.0          80.1       88.1         88.3
Kazakh (kk)       74.7        77.9          70.3       72.2         84.8
Nepali (ne)       70.1        73.8          67.1       74.1         86.7
Avg. Scores       82.5        83.2          83.8       85.2         87.9
Similarly, Marco-72B performs strongly (Table 7), achieving the highest average score among the LLMs with 70B+ parameters: 87.9 across 29
languages. It demonstrates remarkable performance in low-resource languages, achieving 86.7 in Nepali and 84.8 in Kazakh, improvements of 12.6 points each over Qwen2.5-72B. This indicates the efficacy of scaling model size alongside multilingual training data to achieve superior performance in challenging linguistic contexts. Additionally, Marco-72B remains highly competitive in high-resource languages, achieving scores of 79.7 in Arabic and 87.0 in German, which are improvements of 2.3 and 2.0 points over Qwen2.5-72B.
The broad outperformance of Marco-LLM across both high-resource and low-resource languages underscores the pivotal role of leveraging a vast multilingual corpus during pretraining, as shown in Figure 3. In particular, compared with Qwen2, the base LLM on which we conduct continual pretraining, Marco-LLM achieves substantial improvements on the languages shown in Table 1. This approach not only enhances the models' ability to generalize to low-resource languages, but also maintains strong performance in high-resource languages such as English and Chinese. These findings suggest that the comprehensive training methodologies employed in Marco-LLM are key factors contributing to its leading performance in multilingual settings.
[Figure 4: four line plots of performance against billions of training tokens (0 to 300): (a) XStoryCloze accuracy for zh, en, ar, es; (b) XWinograd accuracy for zh, en, fr, ja, pt; (c) Belebele accuracy for zh, en, ar, de, es, fr, ko, pt, tr; (d) Flores BLEU for zh, ar, de, es, fr, ko, ja, pt, tr.]
Figure 4 | Evolution of performance on question answering and machine translation during continual pretraining in Marco-1.5B.
3.4. Evolution of performance during continual pretraining
During continual pretraining, we track the performance of our models on a few question answering and machine translation benchmarks. We report the performance of Marco-1.5B in Figure 4. On question answering benchmarks, the performance for languages such as Arabic (ar), German (de), Spanish (es), French (fr), Korean (ko), Japanese (ja), Portuguese (pt), and Turkish (tr) improves steadily, while capabilities in Chinese (zh) and English (en) are maintained and even slightly increased. On Flores, the performance of all languages improves consistently. Notably, Figure 4(d) shows the averages over both the English-to-multilingual (EN→XX) and multilingual-to-English (XX→EN) directions. These trends are consistent with our goal of enhancing multilingual capabilities.
3.5. Data ablations on Parallel Data
Empirically, introducing parallel corpora can enhance semantic alignment across languages. In this section, we explore the impact of parallel corpora on specific downstream NLP tasks, such as machine translation. We mainly focus on the quality of parallel data. Open-source parallel data contains many bad cases; what happens if we leave them in during continual pretraining? We report our findings in Figure 5, where
is section, we will explore the impact of parallel corpora on downstream specific NLP tasks, such as machine translation. We mainly focus on the quality of parallel data. The open-source parallel data has many bad cases, what happens if we ignore them on continual pretraining? we report our findings in Figure 5, where Marco-w/o-parallel-data-filtering indicates that Marco-LLM were continuously pre-trained on Qwen2 without any filtering on parallel data. We have the following findings: ⢠Phenomena vary across different model scales. Compared to base model Qwen2, the smaller models, specifically the 1.5B and 7B models, Marco-w/o-parallel-data-filtering shows improvements, whereas the 72B model has performance degradation. We will elaborate on this phenomenon. Firstly, Marcoâs machine translation capability benefits from both monolingual data and parallel data, thus resulting in gradient conflicts utilizing low-quality parallel data during continual pretraining in smaller models. The monolingual data, due to its dominant proportion, plays a crucially positive role, which explains the improvements observed with Marco-w/o-parallel-data-filtering. Secondly, due to parameter redundancy, the 72B model contains a higher number of monosemantic neurons[Gurnee et al., 2023]. The low-quality parallel data negatively interferes with the machine translation task. This meaningful discovery reveals an important insights that some conclusions obtained on small models are not necessarily scaled to larger models. Furthermore, beyond the issue of parallel 18 -- 18 of 42 -- Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement 1.5B 7B 72B 0 5 10 15 20 25 30 35 Flores (BLEU) Qwen2 Marco-w/o-parallel-data-filtering Marco Figure 5 | The performance of different model size on Flores benchmark. Marco-w/o-parallel-data- filtering denotes that we continuously pre-trained Marco-LLM based on Qwen2 without applying any filtering to parallel data. data,
Chunk 33 · 1,992 chars
data, we suspect that cross-language gradient conflicts may also exist during multilingual training,
which we will explore in future research.
[Figure 5: bar chart of Flores (BLEU) scores, y-axis from 0 to 35, comparing Qwen2, Marco-w/o-parallel-data-filtering, and Marco at the 1.5B, 7B, and 72B scales.]
Figure 5 | The performance of different model sizes on the Flores benchmark. Marco-w/o-parallel-data-filtering
denotes that we continually pre-trained Marco-LLM based on Qwen2 without applying any
filtering to the parallel data.
• High-quality data enhances model performance. Compared to the base model Qwen2,
our Marco-LLM, which has been continually pre-trained with the processed parallel data, shows
significant improvements across the 1.5B, 7B, and 72B models. We attribute this primarily to the data
quality resulting from our data-engineering efforts. Furthermore, we will enhance the quality of
our training data for upcoming model iterations.
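The filtering rules applied to the parallel data are not described in this section. Purely as an illustration of what removing "bad cases" from a parallel corpus can look like, the sketch below applies three common heuristics: dropping very short segments, dropping untranslated copies, and dropping pairs with an implausible length ratio. All of these rules, the thresholds, and the names `SentencePair`, `keep_pair`, and `filter_parallel` are assumptions made for the example, not the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned source/target segment (hypothetical container)."""
    source: str
    target: str


def keep_pair(pair: SentencePair, max_len_ratio: float = 2.5, min_chars: int = 3) -> bool:
    """Return True if the pair passes simple, assumed quality heuristics."""
    src, tgt = pair.source.strip(), pair.target.strip()
    # Drop empty or very short segments.
    if len(src) < min_chars or len(tgt) < min_chars:
        return False
    # Drop untranslated pairs where the target merely copies the source.
    if src.lower() == tgt.lower():
        return False
    # Drop pairs whose character-length ratio suggests misalignment.
    # A character ratio is crude across scripts; the threshold is an assumption.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_len_ratio


def filter_parallel(pairs, **kwargs):
    """Apply keep_pair and remove exact duplicates, preserving input order."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair in seen or not keep_pair(pair, **kwargs):
            continue
        seen.add(pair)
        kept.append(pair)
    return kept
```

In practice such surface heuristics are usually a first pass only; a model-based quality score would typically follow, which this sketch does not attempt.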
3.6. Effect of Learning Rate
In this section, we explore the effect of a crucial hyper-parameter, the learning rate, during continual pretraining. We
select the learning rate (lr) from {1e-5, 2e-5, 3e-5} and conduct continual pretraining on a 200B-token corpus
under a fixed data mixture. Figure 6 shows how the average
score on question answering benchmarks evolves during training for each learning rate. Specifically, Figure 6(a) shows
that the forgetting of Chinese and English ability is aggravated as the learning rate increases,
while Figure 6(b) shows that multilingual ability is generally enhanced. Interestingly, the average accuracy in
multilingual languages first rises and then falls when the learning rate is 3e-5. We believe that this
is due to the loss of primary-language ability, which reduces natural language understanding.
Notably, machine translation ability improves as the learning rate increases, which is
consistent with Figure 4(d).
Chunk 34 · 1,979 chars
Therefore, we set the peak learning rate to 1e-5 in our experiments, as it plays a pivotal role in striking a balance between the acquisition of multilingual languages and the forgetting of English and Chinese.
4. Extensive Multilingual Post-training for Large Language Models
After conducting continual pre-training on up to 29 languages, we proceeded with post-training of the Marco model. This phase of training primarily comprises two stages: Supervised Fine-Tuning (SFT) [Wei et al., 2022, Ouyang et al., 2022] and Direct Preference Optimization (DPO) [Rafailov et al., 2023]. The main objective of the SFT stage is to activate and enhance the model's multilingual capabilities across various domains, including commonsense reasoning, dialogue-based question answering, precise instruction following, mathematical and logical reasoning, multilingual comprehension and translation, and coding. During this stage, our research particularly focuses on (1) the automatic generation and cost-effective collection of high-quality multilingual data, and (2) the transfer of extensive domain knowledge from high-resource languages, such as English,
[Figure 6: two line plots of average accuracy against billions of training tokens (0 to 200) for lr=1e-5, lr=2e-5, and lr=3e-5; panel (a) y-axis: AVG. Accuracy (En&Zh), panel (b) y-axis: AVG. Accuracy (Multilingual).]
(a) The average accuracy in English and Chinese with different learning rates.
(b) The average accuracy in multilingual languages with different learning rates.
Chunk 35 · 1,994 chars
Figure 6 | The average performance on question answering benchmarks with different learning rates during continual pretraining in Marco-1.5B.
Chinese, and French, to low-resource languages. The DPO stage aims to ensure that the content generated by the model aligns with specified preferences.
4.1. Multilingual Supervised Fine-tuning
4.1.1. Data Construction
The Supervised Fine-Tuning (SFT) dataset is composed of several components:
• A limited set of detoxification data annotated by experts, along with multilingual self-cognition enhancement data.
• Synthetic data and parallel corpora, which include precise instruction enhancement data, multilingual Alpaca dialogue data, as well as synthetic data for specific tasks such as comprehension, generation, and translation.
• Open-source instruction data, including Chain-of-Thought (CoT) enhancement data and general instruction data like the Aya collection [Singh et al., 2024b]. This includes multi-turn dialogues, coding, mathematics, and more, with examples such as UltraChat, Glaive-Code, MetaMathQA, MathInstruct, Belle, and Orca.
• Parallel data for multilingual machine translation tasks, enhancing language diversity and quality within the dataset. Specifically, we use the dev sets from WMT-14 to WMT-20 [Barrault et al., 2020] and WikiMatrix [Schwenk et al., 2021].
Data Collection and Processing Our data collection efforts encompass two primary facets. On one hand, we aggregate and clean open-source data. On the other hand, we employ data synthesis and the translation of parallel corpora to augment the dataset.
Chunk 36 · 1,998 chars
Building upon these two approaches, we have developed both multilingual Supervised Fine-Tuning (SFT) datasets and multilingual Direct Preference Optimization (DPO) datasets. The following sections provide a detailed examination of the composition of these datasets and the experiments conducted to assess data distribution proportions.
Data Cleaning Given that the majority of our dataset is derived from open-source, synthetic, and parallel corpus data, we implemented a comprehensive data processing strategy to ensure quality. Initially, we applied regular expression filtering to remove inconsistencies such as merged multi-turn dialogues, incorrect numbering in segmented outputs, HTML format outputs, emoji data, and hyperlinks or URL references. Additionally, regular expressions and Python calculations were employed to capture and validate mathematical equations, with incorrect mathematical results being discarded.
Data Filtering To enhance the overall performance of the model, we employed a comprehensive pipeline based on seminal works for data quality filtering, effectively removing low-quality training samples:
• Quality Scoring: For English-language data, we used the Deita model to score the raw data on a scale of 1 to 6. High-scoring data were selected to construct parallel corpora, with GPT-4 further filtering the translated data.
• QA similarity: In assessing QA relevance, we evaluated the semantic similarity between input and output fields. Data with similarity below a certain threshold were considered irrelevant and subsequently removed.
Chunk 37 · 1,999 chars
• Mathematics Grading: For mathematical data, we deployed models from the open-source project Open-Web-Math to score the data, retaining only those with higher scores for training purposes.
• Multilingual difficulty scoring: For extensive multilingual parallel datasets, assessing translation quality through open-source models poses challenges. We employed the Instruction-Following Difficulty (IFD) method. By examining the ratio of Conditioned Answer Score (CA) to Direct Answer Score (DA) during the initial iterations, we filtered data that significantly benefited the model.
• Semantic deduplication: Initially, we traversed the dataset to remove duplicate instructions. We then applied MinHash and SimHash techniques for further deduplication. Lastly, embeddings were extracted using open-source models, retaining data with embedding similarity below a 0.7 threshold.
4.1.2. Training Setup
In our instruction fine-tuning, neat packing was employed to train a CT model with a context length of 16,384 on our supervised fine-tuning data, totaling 5.7 million examples. The training utilized the Adam optimizer with a cosine learning-rate schedule. It was observed that setting a large learning rate during full model fine-tuning could severely affect the general knowledge acquired during the pre-training and CT phases, leading to performance degradation. The optimal instruction fine-tuning learning rate was determined by adjusting the minimum pre-training learning rate in accordance with the batch size, with the maximum and minimum fine-tuning learning rates identified as 6e-6 and 6e-7, respectively.
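Section 4.1.2 states only the optimizer, the shape of the schedule, and the two learning-rate endpoints (6e-6 and 6e-7). As a minimal sketch, assuming no warmup phase and a hypothetical total step count (neither is given in the text), a cosine schedule between those endpoints could be computed as follows; `cosine_lr` and `total_steps` are illustrative names.

```python
import math


def cosine_lr(step: int, total_steps: int,
              max_lr: float = 6e-6, min_lr: float = 6e-7) -> float:
    """Cosine decay from max_lr at step 0 to min_lr at total_steps.

    Assumes no warmup; steps outside [0, total_steps] are clamped.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    # The cosine factor moves smoothly from 1.0 (start) to 0.0 (end).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 250, 500, 750, 1000):
        print(s, f"{cosine_lr(s, total):.2e}")
```

In PyTorch, a comparable decay is available through `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min` set to the minimum learning rate; whether the authors used that scheduler is not stated.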
Chunk 38 · 1,999 chars
4.1.3. Evaluation Benchmarks
In the evaluation experiments, we employ TyDiQA, AGIEval, CEVAL, Belebele and a multilingual version of the original MMLU:
Multilingual MMLU (MMMLU) The Multilingual Massive Multitask Language Understanding (MMMLU) dataset¶ is an extension of the MMLU benchmark [Hendrycks et al., 2021] into multiple languages. MMMLU includes translations of the original 57 subjects into 14 languages including Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Brazilian Portuguese, Swahili, Yoruba, and Simplified Chinese, covering areas such as STEM, humanities, social sciences, and more. Each language version contains approximately 15,908 multiple-choice questions, mirroring the structure of the original MMLU dataset.
¶ https://huggingface.co/datasets/openai/MMMLU
4.1.4. Baseline LLMs
The LLMs we compare in this section are all Instruct models by default. We employ Llama3 and Llama3.1 as well as Qwen2 and Qwen2.5. Additionally, we also use Aya-23 and Aya-expanse:
Aya-23 and Aya-expanse The Aya-23 and Aya-expanse LLMs [Aryabumi et al., 2024] include the Aya-23-8B, Aya-23-35B, Aya-expanse-8B and Aya-expanse-32B models, which are open-source multilingual language models supporting 23 languages. Aya-23/Aya-expanse are based on Command-R∗∗ with sophisticated multilingual supervised fine-tuning.
Table 8 | Performance comparison of LLMs across multiple major benchmarks. Best performance in each benchmark is marked in bold.
Chunk 39 · 1,990 chars
Model  MMMLU  TydiQA  AGIEval  CEval  Belebele
7B Models
Aya-23-8B  41.0  47.2  37.1  43.9  52.5
Aya-expanse-8B  48.2  28.3  36.7  48.5  64.3
Llama3-8B  46.6  39.7  43.4  50.8  50.7
Llama3.1-8B  49.2  53.0  41.8  55.6  63.9
Qwen2-7B  52.2  29.2  57.1  81.8  69.4
Qwen2.5-7B  56.0  39.0  59.0  77.9  70.0
Marco-Chat-7B  60.1  57.7  61.5  86.4  79.3
70B Models
Aya-23-35B  50.1  50.2  44.4  53.6  66.3
Aya-expanse-32B  58.9  30.0  45.7  56.9  72.7
Llama3-70B  64.3  52.0  57.1  66.7  76.2
Llama3.1-70B  71.7  53.1  55.0  71.6  84.4
Qwen2-72B  69.2  40.3  66.0  90.6  85.3
Qwen2.5-72B  69.0  48.4  67.5  88.2  88.9
Marco-72B  76.1  61.0  72.7  94.5  89.6
https://cohere.com/blog/aya-expanse-connecting-our-world
∗∗ https://cohere.com/command
4.1.5. Results and Discussion
Evaluation Results divided by benchmarks Table 8 presents the average scores of our Marco-Chat-7B and Marco-72B, in comparison with several baseline models across five major benchmarks: MMMLU, TydiQA, AGIEval, CEval, and Belebele. These benchmarks are designed to evaluate language models on a diverse range of tasks and languages, highlighting their multilingual comprehension and reasoning abilities.
Our Marco-Chat-7B model consistently achieves the highest scores among the 7B parameter models across all benchmarks. Specifically, it significantly outperforms the baselines on CEval and Belebele, which focus on Chinese educational subjects and multilingual reading comprehension, respectively. On CEval, Marco-Chat-7B attains a score of 86.4, surpassing the next best model, Qwen2-7B, by a substantial margin of 4.6 points. This indicates our model's strong capability in understanding and processing Chinese language content. Similarly, on Belebele, which evaluates reading comprehension across many languages, including low-resource ones, Marco-Chat-7B achieves a score of 79.3, outperforming others
Chunk 40 · 1,996 chars
Table 9 | MMMLU Results: Performance of various LLMs across different languages in the MMMLU dataset for both 7B and 70B models. The best performance in each language is highlighted in bold.
Language  Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat  GPT-4
7B Models
Arabic  50.9  56.6  40.5  42.2  42.1  48.8  60.6  62.7
Bengali  42.6  45.3  36.4  39.8  27.4  33.4  54.4  60.0
German  57.3  62.3  53.5  55.6  43.3  53.9  65.9  68.0
Spanish  60.3  65.3  55.8  59.1  47.9  56.1  67.7  68.2
French  61.1  65.0  55.8  58.9  46.9  55.5  67.6  67.5
Hindi  44.5  46.6  41.4  45.8  38.9  46.2  54.3  62.2
Indonesian  56.6  61.4  51.0  54.3  46.7  53.3  62.3  66.1
Italian  60.2  64.7  53.3  56.3  47.1  55.3  65.4  68.3
Japanese  56.3  61.0  42.3  52.1  45.6  51.5  64.2  64.4
Korean  54.1  59.1  46.5  50.8  43.6  50.7  63.0  63.6
Chinese  62.0  64.3  51.4  55.5  45.7  52.5  66.5  65.9
Portuguese  59.9  64.4  55.5  59.0  46.9  55.8  67.6  68.8
Swahili  34.9  35.7  37.5  40.3  26.2  32.0  43.9  53.1
Yoruba  30.5  32.8  31.0  31.4  26.4  29.9  37.2  38.0
Avg. Score  52.2  56.0  46.6  50.1  41.0  48.2  60.1  62.6
70B Models
Arabic  72.0  74.3  60.6  71.1  51.8  61.6  79.3  71.1
Bengali  68.3  67.2  53.8  66.5  32.9  43.9  76.6  64.8
German  74.4  72.5  71.4  77.0  55.5  64.7  80.7  75.7
Spanish  77.0  77.5  74.3  79.3  58.0  67.5  82.6  76.8
French  75.6  76.0  73.1  77.9  58.1  67.5  80.7  75.8
Hindi  69.9  69.1  65.0  72.7  47.6  58.8  76.9  70.1
Indonesian  73.1  73.3  70.6  75.7  55.5  65.4  79.0  73.7
Italian  75.3  72.5  73.3  77.8  57.8  66.5  81.6  75.8
Japanese  74.1  74.7  65.6  73.8  54.5  64.3  81.6  71.6
Korean  72.3  71.8  64.5  72.7  53.7  62.4  78.8  71.3
Chinese  77.5  76.7  69.5  74.8  54.1  63.4  82.0  72.5
Portuguese  76.8  76.9  73.7  78.9  58.3  67.2  81.7  76.2
Chunk 41 · 1,998 chars
Swahili  47.3  48.8  51.1  64.0  33.6  38.4  63.7  68.1
Yoruba  34.6  35.5  33.6  41.2  30.4  33.4  44.0  47.3
Avg. Score  69.2  69.0  64.3  71.7  50.1  58.9  76.1  70.8
by more than 9 points. This demonstrates the effectiveness of our multilingual training approach in capturing linguistic nuances across diverse languages, including those with limited available data.
Among the 70B-scale LLMs, Marco-72B leads the group on all benchmarks, in most cases by a large margin. It achieves a score of 76.1 on MMMLU, which assesses multitask language understanding across various subjects and languages. This result is 4.4 points higher than the next best model, Llama3.1-70B, highlighting our model's superior general language understanding capabilities. On TydiQA, a benchmark for typologically diverse question answering, Marco-72B attains a score of 61.0, outperforming the second-best model by 7.9 points. This suggests that our model performs strongly on various tasks across a wide range of languages with different grammatical structures and scripts. Additionally, Marco-72B achieves an impressive score of 94.5 on CEval, indicating excellent proficiency in Chinese across various academic subjects. On Belebele, it reaches 89.6, showcasing strong performance in languages that are often underrepresented in training data.
These results highlight the effectiveness of our approach in building a truly multilingual LLM.
Chunk 42 · 1,994 chars
By consistently achieving top performance across benchmarks that evaluate different languages and tasks, our models demonstrate robust multilingual proficiency and adaptability. This underscores the importance of incorporating a diverse and comprehensive multilingual dataset during training
Table 10 | Performance comparison of language models on Belebele benchmark [Bandarkar et al., 2024] across different languages.
Language  Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat
7B Models
Azerbaijani  59.9  60.4  41.3  57.7  34.1  49.9  72.3
Bengali  64.9  64.2  46.8  57.2  28.7  41.6  75.1
Czech  75.4  75.9  54.3  73.4  61.6  76.9  84.4
Greek  68.7  75.2  59.0  74.3  65.2  80.3  81.4
Hebrew  74.2  72.7  45.9  59.3  61.7  77.9  82.1
Hungarian  57.3  63.0  45.1  52.4  35.0  44.0  68.0
Indonesian  77.2  78.4  61.9  75.0  64.7  77.6  82.8
Italian  78.2  81.9  60.6  71.8  67.0  75.7  86.8
Japanese  78.0  69.8  49.9  64.7  64.2  74.1  83.1
Kazakh  48.0  51.2  36.0  49.1  28.3  38.0  73.1
Malay  78.9  77.1  57.9  67.4  53.2  73.3  83.4
Dutch  58.7  74.6  56.3  70.1  66.2  72.1  85.3
Nepali  44.3  49.4  38.9  46.7  32.9  40.8  70.0
Polish  78.4  70.3  55.0  65.0  61.1  73.9  73.3
Romanian  72.4  74.0  55.2  68.7  65.9  74.8  73.2
Russian  79.7  73.3  52.7  70.2  64.1  74.3  87.9
Ukrainian  76.1  73.2  51.1  69.8  61.8  76.2  83.4
Urdu  64.9  64.4  49.0  59.2  35.6  50.2  76.2
Thai  72.3  74.5  43.0  57.2  37.6  41.8  76.9
Vietnamese  80.0  77.0  53.8  68.3  61.4  73.2  86.3
Avg. Scores  69.4  70.0  50.7  63.9  52.5  64.3  79.3
70B Models
Azerbaijani  79.9  81.6  63.3  79.7  51.9  58.3  85.6
Bengali  84.9  87.3  75.3  81.4  42.1  64.8  89.2
Czech  89.7  91.9  79.0  86.8  79.6  85.8  91.8
Greek  89.2  92.6  87.0  89.4  80.1  83.6  91.9
Hebrew  85.0  86.9  75.9  78.9  77.0  83.1  86.0
Hungarian  72.8  89.3  58.0  74.1  53.6  50.8  87.0
Indonesian  88.9  91.7  82.6  87.3  77.8  81.1  93.1
Chunk 43 · 1,994 chars
Italian  89.8  90.4  83.8  87.9  81.1  77.7  91.1
Japanese  87.8  90.2  82.2  86.9  73.7  78.2  90.1
Kazakh  73.6  76.0  54.1  78.2  40.7  55.1  81.7
Malay  88.7  91.2  87.7  88.7  74.8  76.4  92.1
Dutch  90.9  93.2  80.4  87.1  80.8  85.7  94.4
Nepali  70.6  80.1  55.7  76.0  39.9  50.1  84.4
Polish  86.4  89.7  75.0  88.7  73.9  83.7  90.6
Romanian  88.4  92.1  73.4  86.3  75.7  77.7  90.6
Russian  90.0  94.1  86.1  87.2  73.2  84.6  92.7
Ukrainian  90.9  93.4  84.1  90.2  74.3  79.8  93.0
Urdu  83.1  86.4  78.9  90.2  47.2  63.7  88.2
Thai  86.2  87.1  79.7  82.7  53.6  63.8  87.6
Vietnamese  88.8  92.0  81.7  83.9  75.8  69.2  92.2
Avg. Scores  85.3  88.9  76.2  84.4  66.3  72.7  89.6
to enhance the language understanding abilities of large language models. Our observations also reveal that models with a higher number of parameters, like Marco-72B, substantially benefit from our multilingual training strategy, achieving significant improvements over other large models. Furthermore, the consistent gains across both high-resource languages (like English and Chinese) and low-resource languages (as represented in Belebele) indicate that our models do not merely rely on the abundance of data in certain languages but truly learn to generalize across linguistic boundaries.
Chunk 44 · 1,988 chars
Table 11 | Performance comparison of various LLMs and MT systems on the Flores benchmark. The best performance in each column is highlighted in bold.
Direction  GPT4  DeepL  Google  Aya-32B  Aya-35B  Qwen2-72B  Qwen2.5-72B  Llama3-70B  Llama3.1-70B  Marco-7B  Marco-72B
En→XX
En2Ar  40.4  48.1  50.0  31.5  24.4  17.1  29.9  29.2  33.5  41.5  61.2
En2De  45.9  48.7  49.3  32.3  33.1  37.7  40.4  43.0  44.7  41.6  47.7
En2Es  33.1  32.9  34.6  28.4  20.6  32.0  31.7  31.5  32.5  33.1  37.2
En2Fr  54.4  59.1  57.9  50.5  34.5  52.3  53.2  52.6  55.2  54.9  58.8
En2It  37.2  41.5  39.1  32.4  25.3  34.2  34.8  34.8  36.6  36.2  40.6
En2Ja  34.6  36.8  41.1  31.0  9.6  29.6  33.0  14.3  33.0  36.9  37.7
En2Ko  28.5  32.9  33.7  27.2  12.3  19.2  24.8  0.1  27.7  29.0  31.6
En2Nl  34.8  37.0  36.3  25.4  24.2  28.4  30.9  32.0  34.7  33.0  35.9
En2Pl  30.3  33.4  33.7  24.8  16.5  20.4  26.0  28.6  29.5  28.3  30.6
En2Pt  54.8  45.7  56.0  44.0  41.7  50.1  52.6  52.0  54.8  54.5  57.9
En2Ru  36.8  40.5  40.9  32.3  23.8  33.6  36.1  35.6  37.9  37.7  43.4
En2Tr  36.9  45.0  44.2  33.1  26.4  22.8  30.4  32.7  36.8  35.8  37.7
En2Uk  37.0  42.9  41.6  33.5  17.0  25.2  30.0  36.5  36.8  36.3  44.3
En2Zh  44.2  48.6  50.6  26.0  15.6  28.0  33.9  13.3  31.3  45.3  48.6
Avg. Scores  39.2  42.4  43.5  32.3  23.2  30.8  34.8  31.2  37.5  38.9  43.8
XX→En
Ar2En  42.7  47.7  46.8  33.3  41.1  41.5  44.6  37.1  46.1  48.0  58.0
De2En  47.7  51.0  51.3  30.8  40.9  46.9  48.4  46.3  49.4  50.6  54.7
Es2En  34.3  36.9  36.3  24.8  33.8  34.5  35.3  33.7  35.0  40.2  47.2
Fr2En  48.9  50.8  52.7  27.7  45.1  48.8  49.5  47.5  50.7  51.5  56.8
It2En  36.7  40.2  40.2  28.6  37.5  36.7  38.2  36.5  38.4  42.9  49.2
Ja2En  30.4  37.0  36.7  20.5  22.6  29.8  31.9  26.2  32.3  36.3  49.5
Ko2En  33.3  39.3  38.2  21.9  25.9  32.1  34.5  28.7  33.8  37.0  49.0
Nl2En  36.0  37.7  38.7  23.3  32.8  35.9  36.3  34.8  37.0  39.8  46.4
Chunk 45 · 1,993 chars
Pl2En  33.5  35.8  37.0  19.6  27.6  33.7  34.7  32.1  35.3  38.4  45.9
Pt2En  53.1  55.8  56.3  37.9  50.7  53.0  53.5  51.8  54.9  54.5  60.3
Ru2En  38.7  43.3  42.9  23.6  36.2  39.0  40.2  37.9  41.0  43.8  49.2
Tr2En  42.6  48.5  47.7  31.1  36.5  39.9  42.3  37.9  43.1  43.4  52.0
Uk2En  43.4  47.2  47.3  27.4  38.8  41.6  43.7  40.5  44.9  46.3  58.8
Zh2En  31.3  36.8  37.7  24.3  26.7  31.0  35.4  29.0  34.7  38.2  45.5
Avg. Scores  36.8  43.4  43.6  26.8  35.4  38.9  40.6  37.2  41.2  43.6  51.6
MMMLU Results We also highlight the results on the recently released multilingual version of MMLU from OpenAI††, which are shown in Table 9. The MMMLU dataset evaluates language models across a diverse set of languages and tasks. This benchmark is crucial for assessing the multilingual capabilities of language models, particularly their ability to process and understand languages beyond English.
†† https://huggingface.co/datasets/openai/MMMLU
In the 7B LLMs, Marco-7B consistently outperforms baseline models across a wide range of languages. It achieves the highest scores in languages such as Arabic (60.6), Bengali (54.4), German (65.9), and Spanish (67.7), demonstrating a significant advantage over other models. This performance is particularly noteworthy in Bengali, where Marco-7B surpasses the second best model (Qwen2.5-7B, 45.3) by a substantial margin of 9.1 points, highlighting its ability to handle languages with less training data effectively.
Chunk 46 · 1,990 chars
The model's strong results in languages like Hindi (+7.7 compared to the second best model) and Korean further emphasize its robust multilingual capabilities, making it a versatile tool for diverse linguistic contexts.
In the 70B LLMs, Marco-72B achieves the highest average score across all languages, showcasing its superior multilingual understanding. It demonstrates competitive performance in high-resource languages such as German (80.7), Spanish (82.6), and Chinese (82.0), while also delivering strong performance in lower-resource languages like Swahili (63.7) and Yoruba (44.0). These results underscore the model's extensive language processing abilities, positioning it as a leading solution for multilingual applications. It is worth noting that our Marco-72B outperformed GPT-4 (for 7B LLMs we compare with GPT-4o-mini and for 70B LLMs we compare with GPT-4) [Hurst et al., 2024] in many languages.
Belebele Results The results on Belebele [Bandarkar et al., 2024], presented in Table 10, illustrate the strong multilingual capabilities of the Marco-Chat models across both 7B and 70B parameter scales. For the 7B models, Marco-Chat consistently achieves the highest scores across most languages, with an average score of 79.3, significantly outperforming other models such as Qwen2 (69.4) and Qwen2.5 (70.0). This performance is particularly impressive in low-resource languages like Kazakh and Nepali, where Marco-Chat scores 73.1 and 70.0, respectively, showcasing improvements of 25.1 and 25.7 points over Qwen2. These substantial gains underscore the success of our continual pretraining approach, which effectively leverages a vast multilingual corpus to enhance generalization capabilities across languages with limited resources.
Chunk 47 · 1,989 chars
Additionally, Marco-Chat remains highly competitive in high-resource languages such as Italian (86.8) and Japanese (83.1), further demonstrating its versatility and robustness.
The 70B models exhibit similar patterns, with Marco-Chat achieving an average score of 89.6 and leading on the majority of languages. Notable improvements are observed in Bengali (89.2) and Indonesian (93.1), surpassing Qwen2.5 by 1.9 and 1.4 points, respectively. The model's ability to handle complex linguistic tasks is further exemplified in high-resource languages such as Dutch (94.4), where it achieves the top score, and Russian (92.7). These results highlight the strategic advantage of our approach, which effectively captures linguistic nuances and complexities, benefiting from extensive multilingual pretraining. Overall, the results on the Belebele benchmark validate our focus on enhancing multilingual capabilities.
English-pivot Translation Results The English-pivot translation results on the Flores benchmark [Goyal et al., 2021, Team et al., 2022], as detailed in Table 11, highlight the strengths of the Marco-Chat LLMs in translation tasks across a variety of languages. This benchmark is designed to evaluate translation quality from English to multiple target languages (EN→XX) and vice versa (XX→EN), offering insights into the multilingual capabilities of different models.
In the EN→XX translation tasks, the Marco-72B model achieves an average score of 43.8, surpassing the second-best model, Google Translate, by a margin of 0.3 points.
Chunk 48 · 1,994 chars
Notably, Marco-72B shows strong performance in translating English into Arabic (En2Ar) with a BLEU score of 61.2, outperforming Google by 11.2 points, and into Portuguese (En2Pt) with a BLEU score of 57.9, leading by 1.9 points. These results highlight the model's ability to handle both high-resource languages, such as French
Table 12 | Any2Any translation performance on Flores benchmark across various language pairs for LLMs. The table compares results from the Instruct models of Qwen2, Qwen2.5, Llama3, Llama3.1, and Marco-LLM, with the best performance highlighted in bold for each translation direction.
Trans. Dir.  Qwen2-7B  Qwen2.5-7B  Llama3-8B  Llama3.1-8B  Aya-expanse-8B  Aya-23-8B  Marco-7B
Ar2Ja  16.2  14.3  13.1  17.5  16.9  17.1  21.8
Es2Ja  17.2  10.7  11.2  18.1  18.3  19.9  22.4
Fr2Ja  19.9  14.0  14.1  21.6  17.6  22.3  25.8
Hu2Ja  15.2  9.6  12.3  16.0  10.7  12.6  20.3
Hu2Ko  10.5  6.9  9.4  10.2  9.2  11.0  14.0
Ja2Ar  9.8  7.4  8.6  11.1  12.3  9.6  15.5
Ja2Es  16.1  15.0  15.9  16.7  12.9  16.2  19.1
Ja2Ko  16.3  12.1  12.2  17.3  18.7  17.2  22.6
Ja2Th  11.7  11.4  11.1  11.6  0.8  2.0  16.4
Ja2Zh  19.5  17.6  6.9  10.2  14.9  12.7  22.9
Kk2Ar  7.3  5.5  7.1  8.6  2.3  5.7  11.6
Kk2Fr  13.8  8.9  17.9  14.1  5.6  10.6  20.5
Kk2Ja  11.7  6.0  10.4  12.2  4.1  9.6  16.8
Kk2Ko  8.3  4.7  9.6  9.3  4.5  7.5  13.1
Kk2Pt  12.7  8.8  15.2  10.2  4.2  9.7  16.9
Kk2Th  6.7  7.1  8.9  10.4  0.4  1.2  12.7
Kk2Zh  11.9  10.8  10.2  13.1  3.0  7.6  18.5
Ko2Ja  22.4  21.2  17.5  22.3  23.9  19.3  26.2
Ko2Th  11.9  9.7  11.2  12.2  0.8  2.0  14.7
Ko2Zh  20.0  20.3  10.2  16.2  16.6  14.7  22.7
Th2Ar  11.3  9.1  8.7  5.5  1.7  6.8  15.4
Th2Es  16.4  15.1  16.4  9.1  1.5  10.0  18.6
Chunk 49 · 1,995 chars
Th2Fr  22.2  20.3  19.3  12.4  5.1  14.2  25.2
Th2Ja  16.5  15.4  12.6  14.5  0.0  5.5  22.7
Th2Kk  2.2  1.7  4.4  5.4  0.3  0.8  9.3
Th2Ko  12.2  9.4  8.3  7.6  0.8  5.1  16.5
Th2Zh  18.8  18.0  9.5  9.2  0.0  6.4  22.3
Tr2Ja  17.7  7.3  11.9  20.0  19.6  21.4  23.2
Uk2Fr  27.5  24.3  31.2  33.0  31.5  33.0  34.6
Uk2Ja  17.3  13.0  14.1  20.1  16.9  21.8  26.6
Uk2Kk  3.4  2.7  7.9  8.3  0.7  1.8  12.4
Uk2Ko  13.2  8.9  10.7  15.8  16.0  17.3  18.9
Uk2Th  12.4  11.2  13.7  15.2  1.1  2.9  19.9
Uk2Zh  20.7  18.4  11.6  9.8  16.1  17.8  25.6
Ur2Ar  8.0  7.4  5.2  4.8  3.4  6.4  10.7
Ur2Ko  8.7  6.8  8.5  9.3  4.5  9.0  12.8
Zh2Ar  11.9  9.8  10.6  13.5  14.9  13.6  17.0
Zh2Fr  24.5  21.7  21.8  25.7  24.5  21.3  28.9
Zh2Ja  19.2  15.0  10.9  18.4  14.4  17.3  27.7
Zh2Ko  14.2  10.4  9.7  14.8  15.7  15.4  21.2
Zh2Pt  22.6  20.9  19.5  20.5  21.0  21.5  25.4
Zh2Th  13.3  10.8  12.8  13.9  1.0  2.5  19.8
Avg. Score  14.6  11.9  12.2  13.9  9.7  11.9  19.7
(En2Fr, 58.8), and low-resource languages, such as Ukrainian (En2Uk, 44.3), demonstrating its strong capability and robustness. Besides, for the language pairs where Marco-LLM underperformed competitive commercial MT systems, such as En2De, En2Ja and En2Tr, it still achieved the best translation performance compared to the other open-source LLMs.
For XX→EN translations, Marco-72B achieves an impressive average score of 51.6, leading the performance with a significant margin of 8.0 points over the second-best model, Google.
The model shows outstanding performance in translating from Italian (It2En, 49.2) and Korean (Ko2En, 49.0), reflecting its advanced capacity to capture nuanced linguistic features and deliver high-quality translations. The consistent outperformance across both high-resource languages (e.g., French to English, 56.8) and low-resource languages (e.g., Ukrainian to English, 58.8) underscores the effectiveness of our multilingual approach. The results on the Flores benchmark [Team et al., 2022] validate the effectiveness of focusing on enhancing multilingual capabilities.

[Figure 7: line plot titled "Accuracy Trends Across Different Training Steps"; x-axis: Training Steps (0 to 1150); y-axis: Accuracy (0.2 to 0.7); series: Yoruba, Chinese, Arabic, Italian, Average.]

Figure 7 | Accuracy trends across training checkpoints for different languages on MMMLU. The model shows rapid initial learning (0-230 steps) followed by performance stabilization. High-resource languages (ZH-CN, IT-IT) consistently outperform low-resource ones (YO-NG), with a persistent performance gap of 29%.

Non-English-pivot Translation Results. The non-English-pivot translation results (Any2Any translation, i.e., translation from any language to any language) on the Flores benchmark, shown in Table 12, highlight the superior performance of the Marco-7B model. Marco-7B achieves the highest average score of 19.7, well above the second-best model, Qwen2-7B, with an average score of 14.6. This represents a notable improvement of 5.1 points.
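As a quick check on these averages, the sketch below recomputes the mean Any2Any score for Qwen2-7B and Marco-7B from the 42 translation directions listed in Table 12 and derives the margin between them. The list and function names are illustrative choices, not part of any released Marco-LLM code, and the unweighted mean is assumed to be how the table's "Avg. Score" row was obtained.

```python
from statistics import mean

# Per-direction scores transcribed from Table 12, in row order (Ar2Ja ... Zh2Th).
QWEN2_7B = [
    16.2, 17.2, 19.9, 15.2, 10.5, 9.8, 16.1, 16.3, 11.7, 19.5, 7.3, 13.8, 11.7, 8.3,
    12.7, 6.7, 11.9, 22.4, 11.9, 20.0, 11.3, 16.4, 22.2, 16.5, 2.2, 12.2, 18.8, 17.7,
    27.5, 17.3, 3.4, 13.2, 12.4, 20.7, 8.0, 8.7, 11.9, 24.5, 19.2, 14.2, 22.6, 13.3,
]
MARCO_7B = [
    21.8, 22.4, 25.8, 20.3, 14.0, 15.5, 19.1, 22.6, 16.4, 22.9, 11.6, 20.5, 16.8, 13.1,
    16.9, 12.7, 18.5, 26.2, 14.7, 22.7, 15.4, 18.6, 25.2, 22.7, 9.3, 16.5, 22.3, 23.2,
    34.6, 26.6, 12.4, 18.9, 19.9, 25.6, 10.7, 12.8, 17.0, 28.9, 27.7, 21.2, 25.4, 19.8,
]


def average_score(scores):
    """Unweighted mean over translation directions, rounded to one decimal."""
    return round(mean(scores), 1)


qwen_avg = average_score(QWEN2_7B)
marco_avg = average_score(MARCO_7B)
margin = round(marco_avg - qwen_avg, 1)
# Number of directions in which Marco-7B scores higher than Qwen2-7B.
wins = sum(m > q for m, q in zip(MARCO_7B, QWEN2_7B))
print(qwen_avg, marco_avg, margin, wins)
```

Running the sketch prints `14.6 19.7 5.1 42`, which agrees with the "Avg. Score" row of Table 12 and shows that Marco-7B is ahead of Qwen2-7B in every listed direction.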
In some specific language directions, Marco-7B exhibits remarkably strong performance. For example, Marco-7B outperforms Qwen2-7B in Chinese-to-Japanese (Zh2Ja) translation by 8.5 points. Similarly, in Arabic-to-Japanese (Ar2Ja) translation, Marco-7B achieves a score of 21.8, outperforming Llama3.1-8B by 4.3 points. Moreover, in French-to-Japanese (Fr2Ja) translation, Marco-7B scores 25.8, which is 4.2 points higher than Llama3.1-8B. These results underscore the model's capability to manage complex linguistic structures, particularly in non-English-pivot language pairs. The Marco-7B model's superior performance is evident across both high-resource and low-resource languages, demonstrating its versatility and robustness. This is particularly important for enhancing the multilingual and cross-lingual capabilities of large language models, as it ensures consistent translation quality across a wide range of languages beyond traditional English-centric translation [Fan et al., 2020].

Performance Across Different Training Steps. Our analysis of the performance of the Marco-7B model across different training steps (0-1150) on the MMMLU benchmark is shown in Figure 7. The most significant improvements occur during the early training phase (0-230 steps), where the overall average accuracy increases dramatically from 40.81% to 59.29%. After 460 steps, the performance stabilizes across most languages, with the final overall accuracy reaching 60.05%. High-resource languages like Chinese (ZH-CN) and Italian (IT-IT) achieve and maintain higher accuracy levels (>65%) compared to low-resource languages, with Yoruba (YO-NG) plateauing around 37%.
This persistent performance gap of approximately 29 percentage points between high- and low-resource languages suggests that, while the model effectively captures multilingual knowledge early in training, achieving better performance across languages remains a challenge that may require more monolingual data during the pretraining phase rather than simply extending the number of SFT training steps.

4.2. Multilingual Preference Alignment

Preference alignment is critical for ensuring that an LLM's outputs are consistent with human expectations and values. However, most LLMs are predominantly aligned on English preference data, leading to a disparity in performance when applied to other languages. In a multilingual setting, this alignment becomes even more essential due to the variations in language structures [She et al., 2024], idiomatic expressions, and cultural references. By focusing on multilingual preference alignment, we aim to enhance Marco's ability to generate responses that are not only grammatically correct but also culturally appropriate and contextually relevant in multiple languages.

Moreover, multilingual preference alignment helps mitigate biases that may arise from training on datasets that lack linguistic diversity. It promotes fairness and inclusivity, enabling the model to cater to a broader user base. By aligning the model's preferences across different languages, we ensure that Marco-LLM can effectively understand and respond to users worldwide, fostering better communication and understanding across linguistic boundaries.
4.2.1. Dataset Construction from Existing Preference Data

To construct a comprehensive multilingual preference dataset, we began with the LMSYS Arena Human Preference dataset [Chiang et al., 2024]‡‡. This dataset comprises 57.5k high-quality human preference annotations for various prompts and responses in English. We selected a subset of high-quality examples based on criteria such as clarity, relevance, and diversity of topics to ensure a robust foundation for multilingual preference alignment. The selected examples were then translated into the 28 target languages. This translation step was critical to extend the language coverage of the original data. By leveraging existing English preference data and extending it to multiple languages, we aim to improve the preference alignment of Marco-LLM in languages beyond English.

4.2.2. Multilingual Preference Data Generation and Translation

In addition to the translated data, we expanded our preference dataset by incorporating prompts from the UltraFeedback dataset [Cui et al., 2023], which were also translated into the 28 languages. For each prompt, we used Marco-LLM to generate at least two distinct responses with different generation configurations. This approach allowed us to capture the model's inherent variability in generating responses across different languages.

To establish preferences between the generated responses, we employed another LLM to evaluate and select the better response based on predefined criteria such as relevance, coherence, and adherence to the prompt. This process effectively created a set of preference pairs that reflect the model's capabilities and the desired outcomes in various languages.
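The generate-then-judge procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `generate` and `judge` are hypothetical callables standing in for Marco-LLM sampling and for the judging LLM, and the two sampling configurations are placeholder values, since the text only states that the configurations differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# Hypothetical sampling settings (assumed values, not taken from the paper).
GEN_CONFIGS: List[Dict[str, float]] = [
    {"temperature": 0.3, "top_p": 0.9},
    {"temperature": 1.0, "top_p": 0.95},
]


def build_pair(
    prompt: str,
    generate: Callable[[str, Dict[str, float]], str],
    judge: Callable[[str, str, str], int],
) -> Optional[PreferencePair]:
    """Sample one response per config and let the judge pick the better one.

    `judge(prompt, a, b)` is assumed to return 0 if `a` is better, 1 if `b`
    is better, and -1 if it cannot decide.
    """
    first, second = (generate(prompt, cfg) for cfg in GEN_CONFIGS)
    if first == second:
        return None  # identical responses carry no preference signal
    verdict = judge(prompt, first, second)
    if verdict == 0:
        return PreferencePair(prompt, chosen=first, rejected=second)
    if verdict == 1:
        return PreferencePair(prompt, chosen=second, rejected=first)
    return None  # undecided: skip this prompt


# Toy stand-ins so the sketch runs without any model.
def fake_generate(prompt: str, cfg: Dict[str, float]) -> str:
    return f"{prompt} (T={cfg['temperature']})"


def fake_judge(prompt: str, a: str, b: str) -> int:
    # Arbitrary rule for the demo: prefer the response that is not longer.
    return 0 if len(a) <= len(b) else 1


pair = build_pair("Translate 'hello' into Kazakh.", fake_generate, fake_judge)
undecided = build_pair("Any prompt", fake_generate, lambda p, a, b: -1)
```

In a real setting the prompt would already be in the target language, so that both generation and judging happen within that language; pairs for which the judge is undecided are simply dropped in this sketch, which is a design choice made here rather than something the paper specifies.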
By generating and evaluating responses within the target languages, we ensured that the preference data was culturally and linguistically appropriate.

‡‡ https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k

[Figure 8: grid of 28 bar charts titled "Marco-LLM vs Baseline Models Win Rates on Multilingual MT-bench by Language", one panel per language, each with Win, Loss, and Tie bars. The values printed on the bars are listed below.]

Language  Win     Loss    Tie
ar        43.85%  27.50%  28.65%
az        50.52%  25.00%  24.48%
bn        51.56%  25.00%  23.44%
cs        29.17%  43.13%  27.71%
de        35.73%  31.77%  32.50%
el        22.29%  53.54%  24.17%
es        42.60%  25.21%  32.19%
fr        36.98%  29.27%  33.75%
he        41.56%  31.56%  26.88%
hu        48.23%  28.23%  23.54%
id        34.48%  35.00%  30.52%
it        34.17%  32.08%  33.75%
ja        32.92%  30.83%  36.25%
kk        66.04%  17.50%  16.46%
ko        37.08%  31.77%  31.15%
ms        44.58%  27.08%  28.33%
ne        48.33%  29.79%  21.88%
nl        37.81%  26.15%  36.04%
pl        34.62%  33.05%  32.33%
pt        39.79%  26.15%  34.06%
ro        36.67%  32.08%  31.25%
ru        37.71%  25.63%  36.67%
th        52.08%  24.90%  23.02%
tr        40.73%  26.15%  33.12%
uk        38.92%  32.31%  28.77%
ur        57.08%  21.88%  21.04%
vi        39.58%  29.17%  31.25%
zh        37.29%  27.40%  35.31%

Figure 8 | Performance comparison of Marco-LLM against baseline models across 28 languages on multilingual MT-bench. Each subplot shows the win rate (blue), loss rate (green), and tie rate (red) for a specific language. Win rates indicate Marco-LLM's superior responses, loss rates represent baseline models' better performance, and tie rates show equivalent quality responses.

4.2.3. Evaluation Results

To evaluate Marco's multilingual capabilities for capturing preferences in different languages, we translated the original English MT-Bench benchmark [Chiang et al., 2024] into the 28 target languages. We then compare the generated responses in a pairwise manner using GPT-4o-mini as the judge. Specifically, we compare the responses of Marco-LLM (7B) with those of six baseline LLMs: Qwen2, Qwen2.5, Llama3, Llama3.1, Aya-23, and Aya-expanse (all of 7B/8B size).
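To make the scoring concrete, the sketch below shows one way pairwise judge verdicts could be turned into win, loss, and tie rates for a language and then averaged over several baselines. The verdict labels, the equal-weight (macro) averaging, and the toy data are assumptions made for illustration; the paper does not give the aggregation code.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("win", "loss", "tie")


def rates(verdicts: List[str]) -> Dict[str, float]:
    """Fraction of win/loss/tie verdicts against one baseline in one language."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: counts[label] / total for label in LABELS}


def average_over_baselines(per_baseline: Dict[str, List[str]]) -> Dict[str, float]:
    """Macro-average the rates so that every baseline counts equally."""
    all_rates = [rates(v) for v in per_baseline.values()]
    return {
        label: sum(r[label] for r in all_rates) / len(all_rates)
        for label in LABELS
    }


# Toy verdicts for a single language (made-up values, not results from the paper).
toy = {
    "baseline_a": ["win", "win", "tie", "loss"],
    "baseline_b": ["win", "loss", "loss", "tie"],
}
summary = average_over_baselines(toy)
print({label: round(value, 3) for label, value in summary.items()})
```

With the toy data this prints `{'win': 0.375, 'loss': 0.375, 'tie': 0.25}`. Averaging per-baseline rates rather than pooling all verdicts keeps a baseline with more judged samples from dominating the summary.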
The results from the multilingual MT-bench, illustrated in Figure 8, reveal that the Marco-chat-7B model outperforms the baseline models in 25 out of 28 languages. Responses are judged by GPT-4o-mini, and the win, loss, and tie rates (Marco-LLM vs. baseline) are averaged over the baseline models mentioned in Section 4.1.4 for each language.

Marco-chat-7B achieves better generation quality in low-resource languages. For instance, in Azerbaijani (az), the model achieves a win rate of 50.52% compared to a loss rate of 25.00%; in Bengali (bn), the win rate is 51.56% against a loss rate of 25.00%; and in Kazakh (kk), Marco-LLM obtains a win rate of 66.04%. These results indicate a clear advantage over the baseline models. Hebrew (he) also shows strong performance, with a win rate of 41.56% against a loss rate of 31.56%.

In certain high-resource languages, Marco-chat-7B maintains competitive performance. For example, in French (fr), our model achieves a win rate of 36.98% against a loss rate of 29.27%, and in Chinese (zh), it achieves a win rate of 37.29% compared to a loss rate of 27.40%. These results reflect the model's effective handling of languages with extensive linguistic data.

The model also performs consistently across language families, with win rates around 35% for Indo-European languages such as Italian (it, 34.17%) and Dutch (nl, 37.81%). These results suggest balanced performance across diverse linguistic structures.
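The per-language claims can be cross-checked against the values printed in Figure 8. The sketch below transcribes the win, loss, and tie percentages for all 28 languages and lists the languages in which the loss rate is at least as high as the win rate; the dictionary name and layout are illustrative choices.

```python
# (win %, loss %, tie %) per language code, transcribed from Figure 8.
FIG8 = {
    "ar": (43.85, 27.50, 28.65), "az": (50.52, 25.00, 24.48),
    "bn": (51.56, 25.00, 23.44), "cs": (29.17, 43.13, 27.71),
    "de": (35.73, 31.77, 32.50), "el": (22.29, 53.54, 24.17),
    "es": (42.60, 25.21, 32.19), "fr": (36.98, 29.27, 33.75),
    "he": (41.56, 31.56, 26.88), "hu": (48.23, 28.23, 23.54),
    "id": (34.48, 35.00, 30.52), "it": (34.17, 32.08, 33.75),
    "ja": (32.92, 30.83, 36.25), "kk": (66.04, 17.50, 16.46),
    "ko": (37.08, 31.77, 31.15), "ms": (44.58, 27.08, 28.33),
    "ne": (48.33, 29.79, 21.88), "nl": (37.81, 26.15, 36.04),
    "pl": (34.62, 33.05, 32.33), "pt": (39.79, 26.15, 34.06),
    "ro": (36.67, 32.08, 31.25), "ru": (37.71, 25.63, 36.67),
    "th": (52.08, 24.90, 23.02), "tr": (40.73, 26.15, 33.12),
    "uk": (38.92, 32.31, 28.77), "ur": (57.08, 21.88, 21.04),
    "vi": (39.58, 29.17, 31.25), "zh": (37.29, 27.40, 35.31),
}

# Languages where Marco-LLM wins more often than it loses, and the rest.
ahead = sorted(lang for lang, (win, loss, _tie) in FIG8.items() if win > loss)
behind = sorted(lang for lang, (win, loss, _tie) in FIG8.items() if win <= loss)
# Language with the highest win rate.
best = max(FIG8, key=lambda lang: FIG8[lang][0])
print(len(ahead), behind, best)
```

This prints `25 ['cs', 'el', 'id'] kk`: the win rate exceeds the loss rate in 25 of the 28 languages, the three exceptions are Czech, Greek, and Indonesian, and Kazakh has the highest win rate.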
While Marco-LLM achieves strong performance in 25 of the 28 languages, it still underperforms in Czech (cs), Greek (el), and Indonesian (id). This suggests the need for further refinement in handling complex grammatical structures.

Overall, the experimental results highlight Marco-chat-7B's strong multilingual performance, particularly in languages where it achieves a higher win rate than loss rate. The model effectively addresses the challenges posed by both high-resource and low-resource languages, demonstrating its capability and potential for deployment in a wide range of linguistic environments.

5. Conclusion and Future Work

In this paper, we introduced Marco-LLM, a multilingual LLM specifically designed to address the challenges posed by low-resource languages. By leveraging a large and diverse multilingual dataset, we conducted extensive multilingual continual pre-training and post-training, including supervised fine-tuning and preference alignment, based on the Qwen2 model. Our comprehensive evaluations on benchmarks such as MMMLU, Flores, Belebele, CEVAL, TydiQA, and multilingual MT-bench validated that Marco-LLM obtains excellent performance in multilingual tasks. The results demonstrate that focusing on low-resource languages can bridge existing performance gaps and extend the benefits of LLMs to a wider range of linguistic communities. Our work highlights the importance of data diversity and targeted training strategies in enhancing model performance across diverse languages.
There are several directions for future research. One promising direction is to extend Marco-LLM's capabilities to include more languages, further enriching the linguistic diversity it can handle. Additionally, exploring the integration of multilingual reasoning capabilities could enhance the model's ability to understand and generate more complex language structures. Furthermore, improving model efficiency and scalability will be essential for deploying these systems in real-world applications, particularly in resource-constrained environments.

References

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Xu, Y. Zhang, G. H. Ábrego, J. Ahn, J. Austin, P. Barham, J. A. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catasta, Y. Cheng, C. Cherry, C. A. Choquette-Choo, A. Chowdhery, C. Crepy, S. Dave, M. Dehghani, S. Dev, J. Devlin, M. Díaz, N. Du, E. Dyer, V. Feinberg, F. Feng, V. Fienber, M. Freitag, X. Garcia, S. Gehrmann, L. Gonzalez, et al. Palm 2 technical report. CoRR, abs/2305.10403, 2023.
M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. In International Conference on Learning Representations, 2018.
V. Aryabumi, J. Dang, D. Talupuru, S. Dash, D. Cairuz, H. Lin, B. Venkitesh, M. Smith, J. A. Campos, Y. C. Tan, K. Marchisio, M. Bartolo, S. Ruder, A. Locatelli, J. Kreutzer, N. Frosst, A. Gomez, P. Blunsom, M. Fadaee, A. Üstün, and S. Hooker. Aya 23: Open weight releases to further multilingual progress, 2024. URL https://arxiv.org/abs/2405.15032.
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
L. Bandarkar, D. Liang, B. Muller, M. Artetxe, S. N. Shukla, D. Husa, N. Goyal, A. Krishnan, L. Zettlemoyer, and M. Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 749–775, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.44. URL https://aclanthology.org/2024.acl-long.44.
Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, Y. Graham, P. Guzman, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, and M. Negri, editors. Proceedings of the Fifth Conference on Machine Translation, Online, Nov. 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.wmt-1.0.
O. Bojar, C. Buck, R. Chatterjee, C. Federmann, L. Guillou, B. Haddow, M. Huck, A. J. Yepes, A. Névéol, M. Neves, P. Pecina, M. Popel, P. Koehn, C. Monz, M. Negri, M. Post, L. Specia, K. Verspoor, J. Tiedemann, and M. Turchi, editors. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2300. URL https://aclanthology.org/W16-2300.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
V. Chaudhary, Y. Tang, F. Guzmán, H. Schwenk, and P. Koehn. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 263–268, Florence, Italy, August 2019. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W19-5435.
W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. Palm: scaling language modeling with pathways. J. Mach. Learn. Res., 24(1), Mar. 2024. ISSN 1532-4435.
J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020. doi: 10.1162/tacl_a_00317. URL https://aclanthology.org/2020.tacl-1.30.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457.
Cohere For AI. c4ai-command-r-plus-08-2024, 2024. URL https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024.
A. Conneau and G. Lample. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems, 32:7059–7069, 2019.
A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269.
G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023.
V. Dac Lai, C. Van Nguyen, N. T. Ngo, T. Nguyen, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pages arXiv–2307, 2023.
DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao, K. Guan, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Zhang, M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge, R. Pan, R. Xu, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Zheng, T. Wang, T. Pei, T. Yuan, T. Sun, W. L. Xiao, W. Zeng, W. An, W. Liu, W. Liang, W. Gao, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, X. Chen, X. Chen, X. Nie, and X. Sun. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024.
A. Dubey et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
A. El-Kishky, V. Chaudhary, F. Guzmán, and P. Koehn. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 5960–5969, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.480. URL https://www.aclweb.org/anthology/2020.emnlp-main.480.
A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin. Beyond english-centric multilingual machine translation, 2020. URL https://arxiv.org/abs/2010.11125.
N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. 2021.
S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D. Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, and Y. Li. Textbooks are all you need. CoRR, abs/2306.11644, 2023.
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence, 2024. URL https://arxiv.org/abs/2401.14196.
W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Trans. Mach. Learn. Res., 2023, 2023.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021.
S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, X. Zhang, Z. L. Thai, K. Zhang, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, D. Li, Z. Liu, and M. Sun. Minicpm: Unveiling the potential of small language models with scalable training strategies. CoRR, abs/2404.06395, 2024.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and J. He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems, 2023.
B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, A. Yang, R. Men, F. Huang, X. Ren, X. Ren, J. Zhou, and J. Lin. Qwen2.5-coder technical report, 2024. URL https://arxiv.org/abs/2409.12186.
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
A. Ibrahim, B. Thérien, K. Gupta, M. L. Richter, Q. G. Anthony, E. Belilovsky, T. Lesort, and I. Rish. Simple and scalable strategies to continually pre-train large language models. Trans. Mach. Learn. Res., 2024, 2024.
W. Jiao, W. Wang, J.-t. Huang, X. Wang, and Z. Tu. Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745, 1(10), 2023.
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv: Computation and Language, Nov 2016.
Z. Ke, Y. Shao, H. Lin, T. Konishi, G. Kim, and B. Liu. Continual pre-training of language models, 2023. URL https://arxiv.org/abs/2302.03241.
V. D. Lai, N. T. Ngo, A. P. B. Veyseh, H. Man, F. Dernoncourt, T. Bui, and T. H. Nguyen. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613, 2023.
Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, and Y. T. Lee. Textbooks are all you need II: phi-1.5 technical report. CoRR, abs/2309.05463, 2023.
X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary, B. O'Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. T. Diab, V. Stoyanov, and X. Li. Few-shot learning with multilingual language models. CoRR, abs/2112.10668, 2021. URL https://arxiv.org/abs/2112.10668.
H. Lovenia, R. Mahendra, S. M. Akbar, L. J. V. Miranda, J. Santoso, E. Aco, A. Fadhilah, J. Mansurov, J. M. Imperial, O. P. Kampman, et al. Seacrowd: A multilingual multimodal data hub and benchmark suite for southeast asian languages. arXiv preprint arXiv:2406.10118, 2024.
H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. CoRR, abs/2308.09583, 2023.
P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. Wizardcoder: Empowering code large language models with evol-instruct. In ICLR. OpenReview.net, 2024.
T. Nguyen, C. V. Nguyen, V. D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In LREC/COLING, pages 4226–4237. ELRA and ICCL, 2024.
OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, H. Alobeidli, A. Cappelli, B. Pannier, E. Almazrouei, and J. Launay. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only. In NeurIPS, 2023.
T. Pires, E. Schlinger, and D. Garrette. How multilingual is multilingual bert? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, 2019.
E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, and A. Korhonen. XCOPA: A multilingual dataset for causal commonsense reasoning. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and
C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.
H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In P. Merlo, J. Tiedemann, and R. Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1351–1361, Online, Apr. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.115. URL https://aclanthology.org/2021.eacl-main.115.
S. She, W. Zou, S. Huang, W. Zhu, X. Liu, X. Geng, and J. Chen. MAPO: Advancing multilingual reasoning through multilingual-alignment-as-preference optimization. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10015–10027, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.539. URL https://aclanthology.org/2024.acl-long.539.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fR3wGCk-IXp.
H. Singh, N. Gupta, S. Bharadwaj, D. Tewari, and P. Talukdar. Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic
languages, 2024a. URL https://arxiv.org/abs/2404.16816.
S. Singh, F. Vargus, D. Dsouza, B. F. Karlsson, A. Mahendiran, W.-Y. Ko, H. Shandilya, J. Patel, D. Mataciunas, L. OMahony, M. Zhang, R. Hettiarachchi, J. Wilson, M. Machado, L. S. Moura, D. Krzemiński, H. Fadaei, I. Ergün, I. Okoh, A. Alaagib, O. Mudannayake, Z. Alyafeai, V. M. Chien, S. Ruder, S. Guthikonda, E. A. Alghamdi, S. Gehrmann, N. Muennighoff, M. Bartolo, J. Kreutzer, A. Üstün, M. Fadaee, and S. Hooker. Aya dataset: An open-access collection for multilingual instruction tuning, 2024b. URL https://arxiv.org/abs/2402.06619.
N. Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang. No language left behind: Scaling human-centered machine translation, 2022. URL https://arxiv.org/abs/2207.04672.
J. Tiedemann. Parallel data, tools and interfaces in OPUS. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).
URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a.
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023b. URL https://arxiv.org/abs/2302.13971.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023c. URL https://arxiv.org/abs/2307.09288.
A. Üstün, V. Aryabumi, Z. X. Yong, W. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H. Ooi, A. Kayid, F. Vargus, P. Blunsom, S. Longpre, N. Muennighoff, M. Fadaee, J. Kreutzer, and S. Hooker. Aya model: An instruction
finetuned open-access multilingual language model. In ACL (1), pages 15894–15939. Association for Computational Linguistics, 2024.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210, 2023.
J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang, B. Li, C. Cheng, W. Lü, R. Hu, C. Li, L. Yang, X. Luo, X. Wu, L. Liu, W. Cheng, P. Cheng, J. Zhang, X. Zhang, L. Lin, X. Wang, Y. Ma, C. Dong, Y. Sun, Y. Chen, Y. Peng, X. Liang, S. Yan, H. Fang, and Y. Zhou. Skywork: A more open bilingual foundation model. CoRR, abs/2310.19341, 2023a.
X. Wei, H. Wei, H. Lin, T. Li, P. Zhang, X. Ren, M. Li, Y. Wan, Z. Cao, B. Xie, T. Hu, S. Li, B. Hui, B. Yu, D. Liu, B. Yang, F. Huang, and J. Xie. Polylm: An open source polyglot large language model. CoRR, abs/2307.06018, 2023b.
G. Wenzek, M. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave. Ccnet: Extracting high quality monolingual datasets from web crawl data. In LREC, pages 4003–4012. European Language Resources Association, 2020.
C. Whitehouse, M. Choudhury, and A. F. Aji. Llm-powered data augmentation
for enhanced cross-lingual performance. In EMNLP, pages 671–686. Association for Computational Linguistics, 2023.
C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. In ICLR. OpenReview.net, 2024.
L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, X. Liu, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Guo, and Z. Fan. Qwen2 technical report. CoRR, abs/2407.10671, 2024a.
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou,
J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, X. Liu, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Guo, and Z. Fan. Qwen2 technical report, 2024b. URL https://arxiv.org/abs/2407.10671.
S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. J. Ratner, R. Krishna, J. Shen, and C. Zhang. Large language model as attributed training data generator: A tale of diversity and bias. In NeurIPS, 2023.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
B. Zhang, P. Williams, I. Titov, and R. Sennrich. Improving massively multilingual neural machine translation and zero-shot translation. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.148. URL https://aclanthology.org/2020.acl-main.148.
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan.
Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364.
Çağatay Yıldız, N. K. Ravichandran, P. Punia, M. Bethge, and B. Ermis. Investigating continual pretraining in large language models: Insights and implications, 2024. URL https://arxiv.org/abs/2402.17400.
A. Üstün, V. Aryabumi, Z.-X. Yong, W.-Y. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H.-L. Ooi, A. Kayid, F. Vargus, P. Blunsom, S. Longpre, N. Muennighoff, M. Fadaee, J. Kreutzer, and S. Hooker. Aya model: An instruction finetuned open-access multilingual language model, 2024. URL https://arxiv.org/abs/2402.07827.
A. Appendix
A.1. Dataset Preprocessing
A.1.1. Translation Templates
Table 13 | Translation templates used in our experiments. Note: The translation templates are used to construct our parallel data, where ^ indicates the position of a line break. The placeholders <src_lang>, <tgt_lang>, <input>, and <output> represent the source language name, target language name, source text, and target text in the parallel pair, respectively.
ID Template (in English)
A <src_lang> phrase: <input> ^ <tgt_lang> phrase: <output>
B <src_lang> text: <input> ^ <tgt_lang> text: <output>
C Translate the text from <src_lang> to <tgt_lang>: ^ <src_lang> text: <input> ^ <tgt_lang> text: <output>
D Translate the words from <src_lang> to <tgt_lang>: ^ <src_lang> words: <input> ^ <tgt_lang> words: <output>
E Convert the phrase from <src_lang> to <tgt_lang>: ^ <src_lang> phrase: <input> ^ <tgt_lang> phrase: <output>
F Render the <src_lang> sentence <input> to <tgt_lang>: <output>
G Provide the translation of the sentence <input> from <src_lang> to <tgt_lang>: <output>
H Change the phrase <input> to <tgt_lang>, the translated phrase is: <output>
I Please change the sentence <input> to <tgt_lang>, and the resulting translation is: <output>
J Change the phrase <input> to <tgt_lang>, resulting in: <output>
K The sentence <input> in <src_lang> means <output> in <tgt_lang>
To standardize the translation process, we designed a diverse set of translation templates as
shown in Table 13. These templates serve multiple purposes: they provide consistent formatting
for the parallel data, enable clear instruction-following capabilities, and help maintain structural
consistency across different language pairs. The templates range from simple direct translation
formats (Templates A and B) to more elaborate instructional patterns (Templates C through K), each
designed to capture different aspects of the translation task. By incorporating various phrasings
and structures, these templates help improve the model’s robustness and ability to handle diverse
translation requests. The symbol ^ in the templates denotes a line break, which helps maintain clear visual
separation between different parts of the translation pair.
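As a minimal sketch of how such a template could be filled in, the snippet below renders one parallel pair with templates copied from Table 13. The helper name `render_template`, the dictionary `TEMPLATES`, and the example sentence pair are illustrative assumptions and not part of the original pipeline; the only behaviour taken from the text is that ^ becomes a line break and the four placeholders are substituted.

```python
import re

# Template strings copied from Table 13; the dictionary itself is an illustrative container.
TEMPLATES = {
    "A": "<src_lang> phrase: <input> ^ <tgt_lang> phrase: <output>",
    "C": "Translate the text from <src_lang> to <tgt_lang>: ^ "
         "<src_lang> text: <input> ^ <tgt_lang> text: <output>",
    "K": "The sentence <input> in <src_lang> means <output> in <tgt_lang>",
}

# Matches exactly the four placeholders named in the Table 13 caption.
_PLACEHOLDER = re.compile(r"<(src_lang|tgt_lang|input|output)>")


def render_template(template, src_lang, tgt_lang, source, target):
    """Fill one template with a parallel pair (hypothetical helper)."""
    values = {
        "src_lang": src_lang,
        "tgt_lang": tgt_lang,
        "input": source,
        "output": target,
    }
    # Convert the ^ markers to line breaks before inserting any data,
    # so a literal ^ inside the source or target text is left untouched.
    layout = template.replace(" ^ ", "\n")
    # Substitute all placeholders in a single pass; inserted text is not rescanned.
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], layout)


example = render_template(
    TEMPLATES["A"], "English", "German", "Good morning", "Guten Morgen"
)
print(example)
```

Replacing the line-break marker before substituting the data is a deliberate choice in this sketch: it keeps the template layout separate from the sentence content.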
Table 14 | Prompt Templates for Synthetic Data.
Method Prompt
Keywords-based Explanation
Suppose that you are a/an {role_1} in {subject}. Please explain the following keywords and
meet the following requirements:
(1) The keywords: {keywords};
(2) Each keyword explanation should contain at least three sentences. You can generate a story
about the keyword for better explanation;
(3) The explanations suit {role_2} students;
(4) Summarize the explanations.
Your answer should be a list of keywords. Make the explanations correct, useful, understandable,
and diverse.
Keywords-based Story
Assume that you are a/an {role_1} in {subject}. Before you teach students new vocabulary,
please write a {type_passage} about the new knowledge and meet the following requirements:
(1) It must contain keywords: {keywords};
(2) Its setting should be {scene};
(3) Should be between {min_length} and {max_length} words in length;
(4) The writing style should be {style};
(5) The suitable audience is {role_2};
(6) Should end with {ending};
(7) Should be written in {language}.
Few-shot Based SFT Data
I want you to act as a Sample Generator. Your goal is to draw inspiration from the Given
Sample to create a brand new sample. This new sample should belong to the same domain
as the Given Sample but be even rarer. The length and complexity of the Created Sample
should be similar to that of the Given Sample. The Created Sample must be reasonable and
understandable by humans. The terms Given Sample, Created Sample, “given sample”, and
“created sample” are not allowed to appear in the Created Sample.
Given Sample:
(1) Sample doc 1
(2) Sample doc 2
(3) Sample doc 3
(4) ...
Created Sample:
A.1.2. Prompt Templates
Table 14 summarizes the prompt templates for synthetic data.
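A prompt from Table 14 can be instantiated with ordinary string formatting. The sketch below does this for the Keywords-based Story template; the function name `build_story_prompt` and every field value in `example_fields` are hypothetical and chosen only for illustration, while the template wording follows Table 14.

```python
import string

# Wording follows the Keywords-based Story prompt in Table 14.
STORY_PROMPT = "\n".join([
    "Assume that you are a/an {role_1} in {subject}. Before you teach students new vocabulary, "
    "please write a {type_passage} about the new knowledge and meet the following requirements:",
    "(1) It must contain keywords: {keywords};",
    "(2) Its setting should be {scene};",
    "(3) Should be between {min_length} and {max_length} words in length;",
    "(4) The writing style should be {style};",
    "(5) The suitable audience is {role_2};",
    "(6) Should end with {ending};",
    "(7) Should be written in {language}.",
])


def build_story_prompt(**fields):
    """Fill the story prompt, failing loudly if a slot is left empty (hypothetical helper)."""
    # Collect the slot names that actually occur in the template.
    required = {name for _, name, _, _ in string.Formatter().parse(STORY_PROMPT) if name}
    missing = sorted(required - set(fields))
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return STORY_PROMPT.format(**fields)


# All values below are made-up examples, not settings reported in the paper.
example_fields = dict(
    role_1="teacher",
    subject="biology",
    type_passage="short story",
    keywords="cell, membrane",
    scene="a school laboratory",
    min_length=150,
    max_length=300,
    style="narrative",
    role_2="middle-school students",
    ending="an open question",
    language="Swahili",
)
prompt = build_story_prompt(**example_fields)
print(prompt)
```

Checking for unfilled slots before formatting gives a clearer error than the `KeyError` that `str.format` would raise on its own.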
A.2. More Evaluation Results for Instruct LLMs
We show more evaluation results for the post-trained LLMs in this section. In our evaluation on
the TydiQA benchmark [Clark et al., 2020], shown in Table 15, Marco-Chat achieves the highest average
score in both the 7B and 70B model categories, underlining its enhanced multilingual capabilities.
Notably, Marco-Chat achieves the top scores in languages such as Arabic and Bengali, demonstrating
its proficiency in handling diverse linguistic structures. With an average score of 57.7 for 7B
models and 61.0 for 70B models, Marco-Chat surpasses all compared models on average.
The CEVAL benchmark [Huang et al., 2023] results in Table 16 underscore Marco’s strong performance
across both 7B and 70B models. Marco-LLM achieves the highest average score in each group, 86.4
among the 7B models and 94.5 among the 70B models. Particularly noteworthy is its performance in the
demanding “Hard” and “STEM” categories, where it obtains the highest scores at both model sizes
(79.9 and 88.3 for 7B, 94.0 and 94.6 for 70B).
The AGIEval benchmark [Zhong et al., 2023] results shown in Table 17 highlight Marco-Chat’s
multilingual and reasoning capabilities, particularly in the 7B model category, where it obtains the
best scores on the Chinese and Gaokao aggregates. Top scores in categories such as Gaokao-English
and Gaokao-History underscore Marco-Chat’s adeptness at handling diverse linguistic challenges and
contextual comprehension. In the 70B model category, Marco-Chat continues to lead with an average
score of 72.7.
Table 15 | Performance comparison of language models on TydiQA benchmark [Clark et al., 2020] across different languages.
Language     Qwen2  Qwen2.5  Llama-3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat
7B Models
Arabic        47.5     48.0     58.8      68.3    67.2         45.0        78.4
Bengali       42.3     61.6     56.8      64.3    50.6         21.3        74.6
English       29.0     42.7     24.4      41.6    53.6         49.6        44.4
Finnish       27.1     48.4     54.8      47.3    44.7         23.5        51.5
Indonesian    28.6     38.4     34.5      31.8    33.4         30.2        48.1
Japanese       0.8      8.7     17.6      31.8    71.0         41.7        70.5
Korean        38.2     45.6     25.4      57.9    76.0         20.3        77.9
Russian       22.4     33.7     31.2      40.8    49.9         25.7        46.6
Swahili       17.5      8.9     30.1      45.2    11.8         13.3        31.3
Telugu        28.7     22.8     48.8      83.3     5.5         11.3        35.2
Thai          39.7     55.7     54.3      70.8    55.2         29.3        76.2
Avg. Scores   29.2     39.0     39.7      53.0    47.2         28.3        57.7
70B Models
Arabic        42.5     61.6     72.7      62.4    67.5         40.9        78.4
Bengali       47.1     65.7     55.9      68.1    43.6         28.0        73.3
English       43.0     45.5     46.6      41.2    53.6         46.8        51.2
Finnish       55.8     39.0     57.0      50.9    52.5         26.0        57.0
Indonesian    39.9     47.8     44.0      31.9    35.8         27.6        46.5
Japanese      38.6     44.7     43.8      49.9    73.6         22.1        68.4
Korean        41.1     43.2     43.8      63.0    73.1         29.5        69.4
Russian       38.5     43.2     42.5      36.3    52.4         35.5        45.9
Swahili       18.6     21.3     46.0      33.8    20.4         12.5        49.2
Telugu        36.9     41.4     45.5      80.4    26.6         22.9        62.8
Thai          41.5     68.2     74.0      66.4    52.7         38.7        76.5
Avg. Scores   40.3     48.4     52.0      53.1    50.2         30.0        61.0

Table 16 | Performance comparison on the CEVAL benchmark across different categories.
Category        Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco
7B Models
Average          81.8     77.9    50.8      55.6    43.9         48.5   86.4
Hard             63.1     51.8    33.9      36.9    32.6         32.4   79.9
Other            84.9     82.8    53.5      56.3    44.9         48.3   81.0
Humanities       84.3     79.6    47.2      53.8    43.2         50.3   84.8
Social Science   91.3     87.8    58.9      64.9    51.2         54.3   90.4
STEM             73.9     69.4    47.2      51.5    40.1         44.7   88.3
70B Models
Average          90.6     88.2    66.7      71.6    53.6         56.9   94.5
Hard             77.8     73.3    51.1      56.0    35.2         36.5   94.0
Other            92.0     88.4    63.1      69.2    52.8         54.7   92.5
Humanities       92.8     91.0    66.5      70.0    56.9         60.3   95.2
Social Science   94.8     91.5    75.6      82.0    61.8         66.4   95.7
STEM             88.4     84.9    64.2      68.5    48.1         51.4   94.6

Table 17 | Agieval 7B Results: Performance of various models across different categories of the Agieval dataset. The best performance in each category is highlighted in bold.
Category                Qwen2  Qwen2.5  Llama3  Llama3.1  Aya-23  Aya-expanse  Marco-Chat
7B Models
Chinese                  59.5     57.1    37.7      62.5    36.5         34.4        65.3
English                  54.0     61.5    41.5      51.0    37.9         39.9        56.6
Gaokao                   64.5     63.4    41.5      53.0    40.3         37.3        71.0
Gaokao-Chinese           77.6     67.9    47.6      58.5    46.3         42.3        80.5
Gaokao-English           89.5     85.3    79.1      89.5    82.0         67.0        93.1
Gaokao-Geography         81.4     80.4    49.8      75.9    50.8         50.3        86.9
Gaokao-History           86.0     82.1    55.7      74.9    51.1         49.8        89.8
Gaokao-Biology           81.9     80.5    52.9      76.2    43.8         35.7        87.6
Gaokao-Chemistry         56.5     55.6    32.4      46.9    29.5         30.4        74.4
Gaokao-MathQA            45.3     55.3    32.5      16.2    27.4         27.4        60.7
Gaokao-Physics           45.0     44.5    15.5      34.0    30.5         26.0        57.5
Gaokao-MathCloze         17.0     18.6     8.5       5.1     1.7          6.8         8.5
LogiQA-ZH                60.8     54.7    42.1      61.1    36.3         37.8        68.2
LSAT-AR                  26.5     23.9    23.5      30.9    21.3         23.0        29.6
LSAT-LR                  57.3     66.3    54.5      84.5    41.0         45.3        79.2
LSAT-RC                  66.2     74.4    69.5      89.2    55.8         56.1        75.8
LogiQA-EN                46.7     50.4    43.5      59.6    36.7         35.3        61.4
SAT-Math                 82.7     88.6    68.2      44.1    35.9         49.1        62.3
SAT-EN                   83.0     84.5    82.0      91.3    77.2         64.1        84.0
SAT-EN-Without-Passage   46.6     42.2    46.1      44.7    42.2         39.8        52.9
Math                     82.7     48.5    20.0      57.6     7.1         13.6        12.8
Aqua-RAT                 59.8     74.8    52.0      61.0    24.0         32.7        51.2
JEC-QA-KD                38.0     33.1    17.5      26.9    18.2         17.8        37.5
JEC-QA-CA                34.7     27.4    19.1      26.0    20.4         21.1        38.4
Average                  57.1     59.0    43.4      41.8    37.1         36.7        61.5
70B Models
Chinese                  66.3     66.0    51.0      49.3    43.5         42.1        74.5
English                  65.7     69.4    65.3      62.5    45.5         50.6        70.4
Gaokao                   71.8     71.9    54.1      53.0    47.6         46.6        80.6
Gaokao-Chinese           85.0     84.6    56.5      58.5    50.8         52.4        92.7
Gaokao-English           89.5     92.2    90.9      89.5    88.9         71.6        96.1
Gaokao-Geography         88.9     87.9    74.9      75.9    66.3         65.8        96.0
Gaokao-History           92.3     87.2    71.1      74.9    68.9         63.4        95.3
Gaokao-Biology           86.2     86.2    74.3      76.2    54.8         53.3        95.2
Gaokao-Chemistry         68.1     66.2    39.6      46.9    37.2         35.8        84.1
Gaokao-MathQA            60.7     66.1    45.0      16.2    29.1         36.8        74.6
Gaokao-Physics           53.5     56.5    23.0      34.0    31.5         32.0        75.0
Gaokao-MathCloze         22.0     20.3    11.9       5.1     0.9          8.5        16.1
LogiQA-ZH                70.2     70.1    61.1      61.1    46.2         46.2        82.3
LSAT-AR                  32.2     27.4    31.7      30.9    19.6         20.9        40.0
LSAT-LR                  73.1     88.0    80.8      84.5    60.0         59.4        94.5
LSAT-RC                  77.7     85.5    84.8      89.2    72.9         62.1        91.1
LogiQA-EN                56.8     59.6    55.8      59.6    43.9         40.3        75.7
SAT-Math                 90.5     84.1    86.4      44.1    43.6         66.8        74.6
SAT-EN                   89.8     91.3    89.8      91.3    87.9         77.7        90.8
SAT-EN w/o Passage       49.0     49.5    55.3      44.7    44.7         45.6        69.4
Math                     46.5     63.1    34.2      57.6     8.9         25.7        22.6
AQUA-RAT                 75.2     76.0    69.3      61.0    28.0         32.0        75.2
JEC-QA-KD                39.0     35.5    34.4      26.9    23.2         19.3        41.4
JEC-QA-CA                39.9     20.3    28.9      26.0    24.1         19.5        44.6
Avg. Scores              66.0     67.5    57.1      55.0    44.4         45.7        72.7
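The averages quoted for TydiQA can be cross-checked against the per-language rows. The sketch below does this for two 7B columns of Table 15, assuming that the “Avg. Scores” row is the unweighted mean over the eleven languages; the paper does not state the averaging scheme, so this is an assumption that happens to reproduce the reported 7B values. The variable and function names are illustrative.

```python
from statistics import fmean

# Per-language TydiQA scores copied from the 7B block of Table 15.
TYDIQA_7B = {
    "Marco-Chat": {
        "Arabic": 78.4, "Bengali": 74.6, "English": 44.4, "Finnish": 51.5,
        "Indonesian": 48.1, "Japanese": 70.5, "Korean": 77.9, "Russian": 46.6,
        "Swahili": 31.3, "Telugu": 35.2, "Thai": 76.2,
    },
    "Llama3.1": {
        "Arabic": 68.3, "Bengali": 64.3, "English": 41.6, "Finnish": 47.3,
        "Indonesian": 31.8, "Japanese": 31.8, "Korean": 57.9, "Russian": 40.8,
        "Swahili": 45.2, "Telugu": 83.3, "Thai": 70.8,
    },
}


def macro_average(scores):
    """Unweighted mean over languages, rounded to one decimal as in the table."""
    return round(fmean(scores.values()), 1)


averages = {model: macro_average(scores) for model, scores in TYDIQA_7B.items()}
for model, value in averages.items():
    print(f"{model}: {value}")
```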