Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Summary
Sailor2 is a family of open multilingual language models designed for Southeast Asian (SEA) languages, available in 1B, 8B, and 20B parameter sizes. Built on Qwen2.5, it undergoes continuous pre-training on 500B tokens, including 400B SEA-specific and 100B replay tokens, to support 13 SEA languages while maintaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. The project includes a comprehensive cookbook covering data curation, pre-training, post-training, model customization, and evaluation. Key techniques include rigorous data deduplication, model expansion to mitigate language degeneration, two-stage continual pre-training, and two-stage instruction tuning. Evaluation shows Sailor2 excels in both basic and advanced tasks, including question answering, translation, mathematics, and creative writing. The model is released under an Apache 2.0 license, aiming to drive language development in the SEA region and inspire inclusive LLMs for under-served languages.
Sailor2 Technical Report
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

Longxu Dou1∗ Qian Liu1∗† Fan Zhou6∗ Changyu Chen7∗ Zili Wang Ziqi Jin5 Zichen Liu1,8 Tongyao Zhu1,8 Cunxiao Du1 Penghui Yang9 Haonan Wang8 Jiaheng Liu Yongchi Zhao Xiachong Feng10 Xin Mao9 Man Tsung Yeung8 Kunat Pipatanakul2 Fajri Koto13 Min Si Thu12 Hynek Kydlíček4 Zeyi Liu11 Qunshu Lin11 Sittipong Sripaisarnmongkol2 Kridtaphad Sae-Khow3 Nirattisai Thongchim3 Taechawat Konkaew3 Narong Borijindargoon3 Anh Dao14 Matichon Maneegard15 Phakphum Artkaew16 Zheng-Xin Yong17 Quan Nguyen18 Wannaphong Phatthiyaphaibun19 Hoang H. Tran20 Mike Zhang21 Shiqi Chen22 Tianyu Pang1 Chao Du1 Xinyi Wan1 Wei Lu5 Min Lin1

1Sea AI Lab 2SCB 10X 3WiseSight 4Hugging Face 5SUTD 6SJTU 7SMU 8NUS 9NTU 10HKU 11ABAKA AI 12Peafowl.ai 13MBZUAI 14Michigan State University 15Float16.cloud 16NYU 17Brown University 18Umeå University 19PyThaiNLP 20HCMUT 21Aalborg University 22CityU

Home Page: https://sea-sailor.github.io/blog/sailor2/

Abstract
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop a multilingual model efficiently, covering five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that the Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and that the Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
[Figure 1. Panel labels: Open Models for SEA Languages (Data Curation, Pre-training, Supervised Fine-tuning); Win Rate (%) over GPT4o on SEA-WildBench; Model Expansion (English Perplexity, SEA Perplexity).]
Figure 1: With rigorous data curation and efficient model expansion, Sailor2-20B achieves the 50-50 win rate over GPT4o on SEA languages, marking a new milestone of open LLMs.

∗Equal Contributors. Contact: doulx@sea.com, liuqian.sea@gmail.com
†Qian Liu is the project leader of Sailor2.
arXiv:2502.12982v1 [cs.CL] 18 Feb 2025

Contents
1 Introduction
2 Related Works
  2.1 Open SEA Language Models
  2.2 Open SEA Language Resources
  2.3 Cookbook for Multilingual Language Models
3 Data Curation
  3.1 Web Data Curation
  3.2 Synthetic Data Curation
  3.3 Data Cleaning
  3.4 Data Mixture
4 Model Continual Pre-Training
  4.1 Model Expansion
  4.2 Model Parallel Optimization
    4.2.1 Zero Bubble Pipeline Parallelism
    4.2.2 Large Vocabulary Optimization
  4.3 Intra-Document Training
  4.4 Two-Stage Continual Pre-Training
    4.4.1 Stage 1: Pre-training with Balanced Data Mixture
    4.4.2 Stage 2: Annealing with High-Quality Tokens
5 Model Post-Training
  5.1 Instruction Tuning
    5.1.1 SEA-UltraChat Construction
    5.1.2 Two-Stage Instruction Tuning
    5.1.3 Instruction Data Selection for Stage 2
  5.2 Preference Tuning
    5.2.1 Background
    5.2.2 Preference Data
    5.2.3 Preference Tuning Recipe
6 Model Customization
  6.1 Long-Context Training
  6.2 Speculative Decoding
  6.3 Model Pruning
7 Evaluation
  7.1 Evaluation on Base Model
  7.2 Evaluation on Chat Model
8 Analysis
  8.1 Effect of Model Expansion
  8.2 Effect of Continual Pre-training
  8.3 Key Findings in Preference Data Construction
  8.4 Cross-lingual Translation Ability of Sailor2
  8.5 SEA Culture Understanding Ability of Sailor2
9 Conclusion and Future Work
  9.1 Synthetic Data Curation for Supporting Low-resource Languages
  9.2 Tokenizer-Free Model for Open-Vocabulary Learning
  9.3 Efficient Continual Pre-training for Multilingual Model

Table 1: Models, resource, and code released with Sailor2 under Apache 2.0 License.
Demo: https://huggingface.co/spaces/sail/Sailor2-20B-Chat

Model Checkpoints
Stage | Sailor2-1B | Sailor2-8B | Sailor2-20B
Pre-Annealing | sail/Sailor2-1B-Pre | sail/Sailor2-8B-Pre | sail/Sailor2-20B-Pre
Base | sail/Sailor2-1B | sail/Sailor2-8B | sail/Sailor2-20B
SFT | sail/Sailor2-1B-SFT | sail/Sailor2-8B-SFT | sail/Sailor2-20B-SFT
Chat | sail/Sailor2-1B-Chat | sail/Sailor2-8B-Chat | sail/Sailor2-20B-Chat

Codebases / Tools
Type | Link
Data Cleaning | sail-sg/sailcraft
Data Mixture | sail-sg/regmix
Pre-training | sail-sg/Megatron-Sailor2
Post-training | sail-sg/oat
Evaluation | sail-sg/sailcompass
Post-Training Dataset
Domain | Link
SFT-Stage1 | sailor2/sailor2-sft-stage1
SFT-Stage2 | sailor2/sailor2-sft-stage2
Off-policy DPO | sailor2/sea-ultrafeedback
On-policy DPO | sailor2/sea-ultrafeedback-onpolicy

Evaluation Dataset
Domain | Link
SailCompass | sail/Sailcompass_data
SEA-WildBench | sailor2/sea-wildbench

Model Checkpoints (via Long-Context Training)
Stage | Sailor2-1B | Sailor2-8B | Sailor2-20B
Base | sail/Sailor2-L-1B | sail/Sailor2-L-8B | sail/Sailor2-L-20B
SFT | sail/Sailor2-L-1B-SFT | sail/Sailor2-L-8B-SFT | sail/Sailor2-L-20B-SFT
Chat | sail/Sailor2-L-1B-Chat | sail/Sailor2-L-8B-Chat | sail/Sailor2-L-20B-Chat

Model Checkpoints (via Speculative Decoding)
Stage | Sailor2-8B | Sailor2-20B
Base Model | sail/Sailor2-8B-Chat-Glide | sail/Sailor2-20B-Chat-Glide

Model Checkpoints (via Model Pruning)
Stage | Sailor2-3B (Pruning via Sailor2-8B) | Sailor2-14B (Pruning via Sailor2-20B)
Base Model | sail/Sailor2-3B | sail/Sailor2-14B
SFT | sail/Sailor2-3B-SFT | sail/Sailor2-14B-SFT
Chat | sail/Sailor2-3B-Chat | sail/Sailor2-14B-Chat

1 Introduction
Serving the Underserved in Southeast Asia with Open LLMs. – Sailor2 Spirit

Large language model (LLM) technology has driven significant innovations but remains predominantly focused on major languages like English and Chinese, leaving many others underrepresented. As a linguistically diverse region with 11 countries and 675 million people, Southeast Asia presents a unique opportunity for advancing multilingual NLP research.
In this paper, we introduce Sailor2, a contribution to the advancement of language technology in the SEA region. Sailor2 offers improved open models, open tools, a transparent training recipe, and valuable insights to drive progress in multilingual LLMs. To optimize Sailor2, we apply the following techniques:
• Rigorous data deduplication with six layers.
• Model expansion to mitigate language degeneration.
• Two-stage continual pre-training with varying language compositions.
• Two-stage instruction tuning with reward-aware and ppl-aware data selection.
• Preference tuning on both off-policy and on-policy data.

We devoted significant effort to evaluation, which includes: (1) few-shot evaluation for the base model, (2) chat performance comparison with GPT-4, and (3) cultural understanding about SEA cuisine and traditions. The results indicate that Sailor2 excels at both basic language tasks (e.g., question answering and translation) and advanced tasks (e.g., mathematics and creative writing).

Overall, the Sailor2 project contributes to the following outcomes:
• A family of open models, optimized for Southeast Asian (SEA) languages.
• A comprehensive cookbook detailing the process of building multilingual LLMs, covering data curation, model training, and thorough evaluation.

[Figure 2 diagram labels: Data Curation, Continual Pre-Training, Post-Training, Model Customization, Evaluation; Dataset for Replay (Fineweb-Pro, Chinese-Fineweb-Edu, Open-Web-Math-Pro); Dataset for SEA Languages (Wikipedia, Open Subtitles, Translation, CommonCrawl, Public PDF); Model Expansion; Model Pruning.]
[Figure 2 diagram labels, continued: SailCompass (Generation, Classification); SEA-WildBench (Chat); CultureBench (Culture); BLEnD (Culture); FLoRes (Translation); Global MMLU (Culture, Knowledge); RULER (Long-Context); Long-Context Training; Speculative Decoding; Two-Stage Preference Tuning; Data Selection for Instruction Tuning; Two-Stage Continual Training; Two-Stage Instruction Tuning; Dataset for Post-Training (SEA-UltraFeedback, SEA-UltraChat); Data Mixture with Tiny Model Simulation; Model Parallel Optimization; Data Recall for Low-resource Language.]

Key Insights in Building Sailor2
Data
• Pre-training Data: Rigorous multi-level deduplication (document, sentence, URL) to mitigate the redundancy of Common Crawl sources.
• Post-training Data: Popular translation models (e.g., GPT4o/NLLB) often underperform in low-resource languages and require careful refinement.
Training
• Model Expansion: Improve model capacity to better absorb multilingual knowledge, especially in over-trained models.
• Two-stage Training: Ensure steady learning from data of varying quality.
Evaluation
• Comprehensive Metrics: Assess basic language understanding, cultural context awareness, and conversational proficiency.
• Probe Data: Use the selected probe data to accurately and promptly check the model’s capabilities beyond standard quantitative metrics.

Figure 2: Sailor2 Cookbook with key insights in data, model training and evaluation.

2 Related Works
2.1 Open SEA Language Models
Open science has gained increasing attention, particularly with the thriving efforts in developing open language models. While notable initiatives like OLMo (Groeneveld et al., 2024), LLM360 (Liu et al., 2023), and MAP-Neo (Zhang et al., 2024a) have made significant contributions, they primarily focus on dominant languages on the Internet, such as English and Chinese.
The Aya model (Üstün et al., 2024) is a massively multilingual language model supporting 101 languages. It outperforms previous multilingual models such as BloomZ (Muennighoff et al., 2022), yet it is not particularly strong in South-East Asian (SEA) languages.

Although there has been some recent progress in creating SEA language models, open initiatives such as the SeaLLM series (Nguyen et al., 2024; Zhang et al., 2024b) and the Sea-LION series (AI Singapore, 2024) still fall short of performance levels comparable to commercial models such as GPT-4o (Achiam et al., 2023). Starting in March 2024, we have continuously released both Sailor and Sailor2. We are committed to building a fully open pipeline for the entire LLM ecosystem while striving to achieve top-tier SEA language performance. In the future, we will continue refining the Sailor series models to advance open language models for more low-resource languages.

2.2 Open SEA Language Resources
Resources for SEA languages remain underdeveloped. Pre-training: Even the recent Fineweb2 Dataset (Penedo et al., 2024), which scales the pre-training corpus to over 1,000 languages, provides a significantly smaller data volume for SEA languages than for others, falling short of 100B tokens. Moreover, directly translating English resources into local languages often leads to an overestimation of performance, as these translations typically lack culturally nuanced content (Singh et al., 2024a). Post-training: The Aya dataset (Singh et al., 2024b) is the largest multilingual instruction fine-tuning resource, containing 513 million instances across 114 languages. It comprises mainly machine-translated data with a small, essential human-curated subset.
Evaluation: Although benchmarks such as SeaBench (Liu et al., 2025a), SeaCrowd (Lovenia et al., 2024), and SeaEval (Wang et al., 2024a) have been introduced, they remain limited either in language coverage, primarily focusing on Thai, Indonesian, Vietnamese, and Malay, or in dataset quality due to reliance on machine translations.

In the Sailor2 project, we open source the SailCraft scripts for SEA-language-specific data cleaning, the instruction tuning dataset covering 17 SEA languages, the SailCompass evaluation suite for base model evaluation, and SEA-WildBench for chat model evaluation.

2.3 Cookbook for Multilingual Language Models
There have been swift advancements and many explorations in multilingual large language models. FinGPT (Luukkonen et al., 2023) builds on BLOOM (Scao et al., 2022) through continual pretraining (CPT), primarily targeting Finnish and other low-resource languages, while incorporating English data for optimization. MAP-Neo (Zhang et al., 2024a) is a recently released 7B bilingual Chinese-English model, designed with a from-scratch approach. Notably, it offers full transparency, particularly in pretraining corpus collection, processing, and cleaning, providing detailed records and rigorous data curation rules. Jais (Sengupta et al., 2023), an Arabic-centric multilingual model, is trained from scratch on Arabic and English data with follow-up safety tuning, offering a structured recipe for optimizing model safety. BritLLM (BritLLM, 2024) is a UK-centric LLM initiative, aiming to develop open pipelines tailored to UK-specific needs, including law, finance, healthcare, and multilingual diversity.
The Sailor2 project also actively explores multilingual LLM development, offering a cookbook while addressing key challenges such as English performance degradation, multilingual data collection and cleaning, optimal language mixing strategies, multi-stage training, post-training techniques, inference acceleration, and more.

3 Data Curation
Sailor2 showcases substantial improvements in pre-training data quality over its predecessor Sailor, driven by several key factors:
1. Better data sourcing.
2. Better data filtering.
3. Data recall for low-resource languages.
4. Swift data-mixture in multilingual training.

With these enhancements, we have a larger, higher-quality continual pre-training corpus, expanding from 150 billion SEA tokens in Sailor (Dou et al., 2024) to 400 billion SEA tokens and covering the 13 SEA languages listed in Table 2.

Table 2: Thirteen SEA Languages Supported by Sailor2.
Language | ISO Code | Country/Region | No. of Speakers
Indonesian | ind | Indonesia | 268 million
Vietnamese | vie | Vietnam | 96 million
Javanese | jav | Indonesia (Java island) | 82 million
Thai | tha | Thailand | 70 million
Burmese | mya | Myanmar | 54 million
Sundanese | sun | Indonesia (West Java) | 42 million
Malay | zsm | Malaysia, Brunei, Singapore | 33 million
Tagalog | tgl | Philippines (Luzon) | 28 million
Cebuano | ceb | Philippines (Cebu, Mindanao) | 21 million
Khmer | khm | Cambodia | 16 million
Ilocano | ilo | Philippines (Northern Luzon) | 8 million
Lao | lao | Laos | 7 million
Waray | war | Philippines (Eastern Visayas) | 3 million
3.1 Web Data Curation
All the data used for Sailor2 is sourced from publicly available resources. For the replay data employed during continual pre-training to prevent model degeneration, we select Fineweb-Pro (Zhou et al., 2024) for English¹, Chinese-Fineweb-Edu (Yu et al., 2025) for Chinese², and Open-Web-Math-Pro (Zhou et al., 2024) for math³. Since our current focus is on general multilingual LLMs rather than coding models, we deliberately avoid including code data in the replay to safeguard multilingual performance.

For SEA language data that provide local text and knowledge, we extract content from 96 CommonCrawl snapshots spanning from summer 2013 to April 2024. Additionally, to extract high-quality and professional text, we also leverage publicly available PDFs.

For the bilingual data used to build the code-switch dataset, we follow the Sailor (Dou et al., 2024) approach by selecting Open Subtitles and open translation data⁴. Subtitles typically consist of brief, conversational sentences. To generate longer, more coherent documents, we employ a sliding window of 100 to concatenate adjacent subtitle segments.

¹ https://huggingface.co/datasets/gair-prox/FineWeb-pro
² https://huggingface.co/datasets/opencsg/chinese-fineweb-edu
³ https://huggingface.co/datasets/gair-prox/open-web-math-pro
⁴ https://opus.nlpl.eu/OpenSubtitles-v2018.php

3.2 Synthetic Data Curation
To address challenges in selecting high-quality datasets for low-resource languages, we leverage the NLLB-3.3B model to translate high-quality English documents into local languages.
For each language, we train a FastText classifier following the approach of Li et al. (2024) to identify high-quality text. Specifically, we generate a training set comprising 10,000 positive examples and 10,000 negative examples. The positive examples are obtained by machine-translating high-quality English datasets: 40% from Cosmopedia (Ben Allal et al., 2024), 40% from MADLAD (Kudugunta et al., 2023), and 20% from UltraChat (Ding et al., 2023). The negative examples are randomly sampled from the CommonCrawl corpus for each language. Once trained, the classifiers rank documents in the CommonCrawl corpus by their likelihood of being a positive example. We then select the top 20% as the high-quality subset for annealing.

3.3 Data Cleaning
We leverage SailCraft for comprehensive data processing consisting of six layers of filtering⁵: rule-based cleaning, model-based filtering, near deduplication, exact deduplication, URL deduplication, and frequent line removal. During URL deduplication, we prioritize documents with more content, reducing total tokens by nearly 50%. For frequent line removal, following the Llama3 (Dubey et al., 2024) approach, we remove lines appearing more than 5 times in buckets of 10M documents, eliminating nearly 5% of total tokens, most of which were determined to be meaningless content.

Table 3 (with tokens counted using the Qwen2.5 tokenizer) presents the raw tokens used for Sailor2 training after data cleaning and deduplication. We subsequently downsample or upsample portions of this data to achieve a more balanced training set (see Section 3.4).
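The last two cleaning layers described in Section 3.3 can be illustrated with a minimal sketch. This is not the SailCraft implementation: the document layout (dictionaries with `url` and `text` keys) and both function names are assumptions made for illustration, and "more content" is approximated here by character length.

```python
from collections import Counter


def dedup_by_url(docs):
    """Keep one document per URL, preferring the one with the most text.

    Assumption: each document is a dict with "url" and "text" keys.
    """
    best = {}
    for doc in docs:
        kept = best.get(doc["url"])
        if kept is None or len(doc["text"]) > len(kept["text"]):
            best[doc["url"]] = doc
    return list(best.values())


def remove_frequent_lines(bucket, max_count=5):
    """Drop lines that occur more than `max_count` times within one bucket."""
    # Count every non-empty line across the whole bucket of documents.
    counts = Counter(
        line.strip()
        for doc in bucket
        for line in doc["text"].splitlines()
        if line.strip()
    )
    cleaned = []
    for doc in bucket:
        kept_lines = [
            line
            for line in doc["text"].splitlines()
            if counts[line.strip()] <= max_count
        ]
        cleaned.append({**doc, "text": "\n".join(kept_lines)})
    return cleaned
```

In the report the bucket holds 10M documents; the sketch works on any list, so a real pipeline would need to shard the corpus into buckets and stream each one rather than hold everything in memory.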
Table 3: Statistics of Raw Tokens Used in Sailor2 Continual Pre-training.
Language | ISO Code | Disk size | Estimated Tokens
Vietnamese | vie | 1.9T | 475B
Indonesian | ind | 1.3T | 325B
Thai | tha | 242G | 61B
Malay | zsm | 44G | 11B
Burmese | mya | 25.8G | 6.5B
Tagalog | tgl | 17.5G | 4.4B
Khmer | khm | 6.9G | 1.7B
Cebuano | ceb | 2.1G | 0.5B
Lao | lao | 1.9G | 0.5B
Javanese | jav | 1.2G | 0.3B
Waray | war | 0.8G | 0.2B
Sundanese | sun | 0.7G | 0.2B
Ilocano | ilo | 0.2G | 0.1B

3.4 Data Mixture
We employ RegMix (Liu et al., 2024c) to optimize the data mixture, with the primary objective of maximizing the log sum across all languages considered in stage 1. Unlike our previous practice in Sailor (Dou et al., 2024), which used 0.5B models as proxy models for data mixture, we follow RegMix and use small 1M-parameter models as our proxy models, even in the continual pre-training scenario. Our underlying assumption is that if a model can be trained over an extended period, the converged or equivalent data mixture should remain relatively consistent. Please refer to RegMix for more implementation details.

⁵ https://github.com/sail-sg/sailcraft

4 Model Continual Pre-Training
4.1 Model Expansion
The Sailor2 model comes in three sizes, 1B, 8B, and 20B, which are expanded from the Qwen2.5 base models of 0.5B, 7B, and 14B, respectively. We perform model expansion before continual pre-training to mitigate forgetting of English and Chinese capabilities while also enhancing the model’s capacity for further improvement in SEA languages. In practice, the approach draws inspiration from LlamaPro (Wu et al., 2024), leveraging a block-expansion mechanism in the original Qwen2.5 model. This approach significantly enhances the model’s performance in SEA languages while maintaining stable capabilities in English and Chinese.
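To make the block-expansion idea concrete, the following toy sketch interleaves new residual blocks whose output projection is zero-initialized, so that each added block starts as an identity map and the expanded model initially reproduces the original one. It is an illustration under stated assumptions, not the Sailor2 training code: the two-matrix block, the `group_size` parameter, and all function names are hypothetical, and the real layer counts are not taken from the report.

```python
import copy


def make_block(w_in, w_out):
    """A toy residual block: y = x + w_out @ relu(w_in @ x)."""
    return {"w_in": w_in, "w_out": w_out}


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def apply_block(block, x):
    hidden = [max(0.0, h) for h in matvec(block["w_in"], x)]
    update = matvec(block["w_out"], hidden)
    return [xi + ui for xi, ui in zip(x, update)]


def forward(blocks, x):
    for block in blocks:
        x = apply_block(block, x)
    return x


def expand_blocks(blocks, group_size):
    """Insert one new block after every `group_size` original blocks.

    Each new block copies the block before it, but its output projection
    is set to zero, so the residual branch contributes nothing at first.
    """
    expanded = []
    for index, block in enumerate(blocks, start=1):
        expanded.append(block)
        if index % group_size == 0:
            new_block = copy.deepcopy(block)
            new_block["w_out"] = [[0.0] * len(row) for row in new_block["w_out"]]
            expanded.append(new_block)
    return expanded
```

Because the new blocks leave the forward pass unchanged at initialization, continual pre-training can then update them (alone or together with the original blocks, a choice this sketch leaves open) to absorb new-language knowledge.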
With this strategy, the newly introduced layers can store the additional SEA knowledge from the continual pre-training tokens rather than overwriting the existing linguistic information of the other languages.

4.2 Model Parallel Optimization
We leverage key Megatron-LM optimizations (Narayanan et al., 2021) to accelerate training.

4.2.1 Zero Bubble Pipeline Parallelism
Zero Bubble Pipeline Parallelism (Qi et al., 2023) minimizes GPU idle time by splitting the backward pass into input-gradient and weight-gradient components and prioritizing the former. While ZB-2P or ZBV (Qi et al., 2024) could fully eliminate pipeline bubbles for better throughput, we opt for the simpler ZB-H1 (Qi et al., 2023), which reduces bubbles to 1/3 with just 80 lines of code changes in Megatron-LM (Narayanan et al., 2021).

4.2.2 Large Vocabulary Optimization
As vocabulary size increases, placing vocabulary layers in the first or last pipeline stage leads to imbalanced computation and memory usage. For Sailor2-8B, a single vocabulary layer is roughly equivalent to four transformer layers, increasing memory usage and GPU idle time and often resulting in out-of-memory (OOM) errors. Moreover, Zero Bubble Pipeline Parallelism (Qi et al., 2023) further exacerbates this by delaying weight gradient computation, making vocabulary activations long-lived and a memory bottleneck. While Vocabulary Parallelism (Yeung et al., 2024) offers a perfect balance, we take a simpler approach: redistributing transformer layers from the last stage to other stages (excluding the first) based on FLOP calculations, which also eliminates the last stage’s extra memory overhead.

4.3 Intra-Document Training
We employ intra-document masking to disable cross-document attention within a packed sequence.
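As a minimal sketch of what disabling cross-document attention means for one packed sequence, the code below builds the cumulative document boundaries and the equivalent dense causal mask from a list of document lengths. The function names are hypothetical and the dense mask is for illustration only; variable-length attention kernels generally consume the cumulative boundaries directly instead of materializing a mask.

```python
from itertools import accumulate


def cumulative_seqlens(doc_lengths):
    """Cumulative document boundaries for a packed sequence.

    For example, two documents of lengths 3 and 2 give [0, 3, 5].
    """
    return [0] + list(accumulate(doc_lengths))


def intra_document_mask(doc_lengths):
    """Dense attention mask for one packed sequence.

    mask[q][k] is True when query position q may attend to key position k,
    that is, when both positions belong to the same document and k <= q.
    """
    # Label every token position with the index of its document.
    doc_ids = [doc for doc, length in enumerate(doc_lengths) for _ in range(length)]
    total = len(doc_ids)
    return [
        [doc_ids[q] == doc_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]
```

With lengths `[3, 2]`, position 3 (the first token of the second document) cannot attend to positions 0 to 2, whereas a plain causal mask would allow it.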
Previous studies (Zhao et al., 2024; Dubey et al., 2024) have shown that intra-document masking improves pretraining over fully open attention by a large margin, especially when documents are randomly concatenated, and that it remains effective in large-scale pretraining. Specifically, during pretraining, we replace the attention module in Megatron with the flash_attn_varlen function and pass the length information of the documents in the pretraining corpus, so that attention is computed only within the same document and no cross-document scores are calculated.

4.4 Two-Stage Continual Pre-Training
We adopt a two-stage pre-training approach inspired by MiniCPM (Hu et al., 2024). In stage one, we train on comprehensive datasets at a high learning rate (1e-4) and a global batch size of 1,024, introducing high-resource languages such as English, Chinese, Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer. In stage two, we shift to high-quality tokens with a lower learning rate (1e-5) and a global batch size of 4,096, and expand to include both high-resource and low-resource languages, adding Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano. This strategy automatically mixes data in stage one and seamlessly integrates high-quality low-resource tokens in stage two without adjusting mixing ratios.

4.4.1 Stage 1: Pre-training with Balanced Data Mixture
In stage 1, we select a subset of languages that can provide enough tokens for RegMix data mixture optimization. After conducting 1,000 runs of data mixture optimization using 1M-parameter models, we observed a subtle shift from the original token distribution. Notably, the optimized data mixture upsamples languages like Khmer, Malay, Burmese, Thai, and Tagalog, while downsampling Indonesian and Vietnamese. The final data mixture of Stage 1 is shown in Table 4 (tokens counted with the Qwen2.5 tokenizer).

Table 4: Effective Tokens by Language in Stage 1.
Language | Effective Tokens
Vietnamese | 102B
Indonesian | 94B
Thai | 92B
English | 51B
Chinese | 50B
Burmese | 23.5B
Malay | 21B
Tagalog | 10B
Khmer | 6.5B
Stage 1 (Total) | 450B

4.4.2 Stage 2: Annealing with High-Quality Tokens
In stage 2, we lower the learning rate to 1e-5 (1/10 of the original learning rate) and take 20% of the stage 1 dataset to make sure the model still behaves well on the original distribution. We allocate the remaining 80% of the training budget to high-quality SEA tokens, where all low-resource languages are added and the token distribution of high-resource languages is kept similar to that of stage 1. In addition, we add some English instruction tuning datasets and some datasets contributed by the Sailor2 community.

Table 5: Effective Tokens by Language in Stage 2.
Language | Effective Tokens
Stage 1 | 10B
English Instruction Tuning Dataset | 2.5B
Vietnamese (High-Quality) | 10.9B
Indonesian (High-Quality) | 12.8B
Thai (High-Quality) | 13.9B
Burmese (High-Quality) | 2.8B
Malay (High-Quality) | 1.3B
Tagalog (High-Quality) | 2.2B
Khmer (High-Quality) | 0.9B
Waray (High-Quality) | 0.02B
Ilocano (High-Quality) | 0.05B
Javanese (High-Quality) | 0.17B
Lao (High-Quality) | 0.33B
Cebuano (High-Quality) | 0.30B
Sundanese (High-Quality) | 0.09B
Stage 2 (Total) | 60B
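The upsampling and downsampling described for stage 1 can be made tangible by dividing each language's effective tokens (Table 4) by its estimated raw tokens (Table 3). The sketch below does this arithmetic; the numbers are copied from the two tables, while the function name and the reading of the ratio as an approximate number of passes over the raw data are assumptions, since the raw counts are themselves estimates.

```python
# Estimated raw tokens after cleaning, in billions (Table 3).
RAW_TOKENS_B = {
    "Vietnamese": 475, "Indonesian": 325, "Thai": 61, "Malay": 11,
    "Burmese": 6.5, "Tagalog": 4.4, "Khmer": 1.7,
}

# Effective stage 1 tokens, in billions (Table 4).
STAGE1_EFFECTIVE_B = {
    "Vietnamese": 102, "Indonesian": 94, "Thai": 92, "English": 51,
    "Chinese": 50, "Burmese": 23.5, "Malay": 21, "Tagalog": 10, "Khmer": 6.5,
}


def sampling_ratios(raw, effective):
    """Effective / raw tokens for every language present in both tables.

    A ratio above 1 means the language is upsampled (repeated),
    a ratio below 1 means it is downsampled.
    """
    return {lang: effective[lang] / raw[lang] for lang in raw if lang in effective}


ratios = sampling_ratios(RAW_TOKENS_B, STAGE1_EFFECTIVE_B)
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{lang:<11} {ratio:.2f}x")
```

Under this reading, Vietnamese and Indonesian are sampled at roughly 0.2x to 0.3x of their raw volume, while Khmer and Burmese are repeated between three and four times, which matches the direction of the shift reported in Section 4.4.1. English and Chinese are omitted because they come from the replay datasets rather than from Table 3.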
Chunk 16 · 1,997 chars
ity) 2.2B
Khmer (High-Quality) 0.9B
Waray (High-Quality) 0.02B
Ilocano (High-Quality) 0.05B
Javanese (High-Quality) 0.17B
Lao (High-Quality) 0.33B
Cebuano (High-Quality) 0.30B
Sundanese (High-Quality) 0.09B
Stage 2 (Total) 60B
Category Distribution, Stage 1 + 2: 5.7% Coding & Debugging, 27.6% Info Seeking, 6.7% Math & Data, 44.8% Creative Tasks, 15.2% Reasoning & Planning
Category Distribution, Stage 2: 5.0% Coding & Debugging, 25.0% Info Seeking, 15.0% Math & Data, 35.0% Creative Tasks, 20.0% Reasoning & Planning
Language Distribution, Stage 1 + 2: 34.8% EN, 4.7% ZH, 7.9% TH, 7.2% VI, 5.4% MS, 5.2% MY, 3.8% LO, 6.9% ID, 24.1% Others
Language Distribution, Stage 2: 13.3% EN, 13.3% ZH, 8.9% TH, 8.9% VI, 8.9% MS, 8.9% MY, 8.9% LO, 8.9% ID, 8.9% JV, 8.9% TL, 2.4% Others
Figure 3: Distribution of categories and languages in SEA-UltraChat. Stage 2 data is carefully curated to ensure a balanced representation across both dimensions.
5 Model Post-Training
Sailor2 employs the following post-training techniques: (1) two-stage instruction tuning using 4.8M examples from SEA-UltraChat, covering 14 SEA languages; and (2) two-stage preference tuning on both off-policy data from SEA-UltraFeedback and on-policy preference data. Table 22 summarizes the statistics for SEA-UltraChat and SEA-UltraFeedback.
5.1 Instruction Tuning
5.1.1 SEA-UltraChat Construction
As described in Section 2.2, existing instruction tuning datasets for SEA languages are limited in both quality and quantity. To address this, we translate UltraChat (Ding et al., 2023), a high-quality and diverse English instruction dataset, into 15 SEA languages using GPT-4o-0803, resulting in 4.4 million multilingual examples. Translating code and math data into multiple languages remains particularly challenging (Huang et al., 2025). To mitigate this, we developed a novel multi-round translation prompt6.
Data Cleaning. The dataset is first partitioned by language, and each entry is assigned a MinHash signature
Chunk 17 · 1,994 chars
illion multilingual examples. Translating code and math data into multiple languages remains particularly challenging (Huang et al., 2025). To mitigate this, we developed a novel multi-round translation prompt6. Data Cleaning. The dataset is first partitioned by language, and each entry is assigned a MinHash signature (Broder, 1997) using 128 permutations. These signatures are then compared using a Locality-Sensitive Hashing (LSH) index (Leskovec et al., 2014) with a Jaccard similarity threshold of 0.8, enabling efficient identification of near-duplicate entries. The data entries are also verified against a strict message format specification: (1) a system prompt, if present, must appear as the first message; (2) user queries and assistant responses must strictly alternate, with the assistant’s response being the final message; and (3) all messages must contain non-empty content. Through this process, the deduplication phase eliminated 1.4% of the original data rows7, while the verification filtered out about 1K invalid samples. Finally, SEA-UltraChat comprises 4.8 million examples across 14 Southeast Asian languages, as detailed in Table 22. Data Categorization. Following the categories of WildBench (Lin et al., 2024), we categorize the data into 5 main categories encompassing 11 subcategories: Coding & Debugging (Coding & Debugging), Info Seeking (Information Seeking, Advice Seeking), Math & Data (Math, Data Analysis), Reasoning & Planning (Reasoning, Planning), and Creative Tasks (Creative Writing, Editing, Brainstorming, Role Playing). To perform this categorization, we employ Qwen2.5-7B-Instruct to classify each data point based on the initial user query into one of the 11 subcategories, which are then consolidated into the 5 main categories. The distribution of these categories is presented in Figure 3. Notably, Coding & Debugging and Math & Data collectively constitute less than 12% of the total dataset, revealing a significant category imbalance in the
Chunk 18 · 1,997 chars
l user query into one of the 11 subcategories, which are then consolidated into the 5 main categories. The distribution of these categories is presented in Figure 3. Notably, Coding & Debugging and Math & Data collectively constitute less than 12% of the total dataset, revealing a significant category imbalance in the distribution. 6Inspired by https://baoyu.io/blog/prompt-engineering/translator-gpt-prompt-v2; see Box 9.3 in Appendix for details. 7Most deduplicated examples result primarily from translation errors. 11 -- 11 of 49 -- Sailor2 Technical Report 0.0 0.2 0.4 0.6 0.8 1.0 Reward Percentile 0.0 0.2 0.4 0.6 0.8 1.0 PPL Percentile Case 1 Case 2 Case 3 Case 4 PPL vs Reward Percentiles for Creative Tasks (English) with Corner Cases Highlighted Stage 1 Stage 2 Corner Cases Figure 4: The PPL Percentile vs Reward Percentile distribution of English instruction data on Creative Tasks. We select High PPL High Reward candidates (top right) as stage 2 instruction data. We report corner cases highlighted in yellow in Table 6. 5.1.2 Two-Stage Instruction Tuning In developing multilingual models, maintaining balance across languages and domains is crucial. However, our supervised fine-tuning dataset exhibits significant imbalances in both dimensions, as shown in Figure 3: language distribution ranges from 34.8% for English to merely 0.6% for low-resource languages like Acehnese, while domain coverage shows a substantial difference in percentage, with creative tasks significantly greater than coding and mathematical content. To address these imbalances, we employ the two-stage instruction tuning inspired by Huang et al. (2024). Stage 1 establishes a broad foundation by processing the bulk of the training data with a large batch size of 4096 over a single epoch. To optimize learning, the learning rate is gradually decreased from 7 × 10−6 to 7 × 10−7. Building upon this base, Stage 2 then focuses on a carefully selected subset of data balanced across both languages
Chunk 19 · 1,997 chars
broad foundation by processing the bulk of the training data with a large batch size of 4096 over a single epoch. To optimize learning, the learning rate is gradually decreased from 7 × 10−6 to 7 × 10−7. Building upon this base, Stage 2 then focuses on a carefully selected subset of data balanced across both languages and domains, employing a small batch size of 512 over 3 epochs. This strategic approach maximizes the use of instruction data while ensuring the model maintains balance across dimensions. 5.1.3 Instruction Data Selection for Stage 2 To select high-quality data for stage 2, we annotate each sample with two metrics: (1) a reward score from a reward model8, and (2) the perplexity computed by Sailor2-8B. Both metrics are normalized by computing their percentiles within each language and category. Figure 4 displays the distribution of English instruction data in the Creative Task category. Our case study in Table 6 demonstrates that instruction data with both high reward scores and high perplexity are particularly valuable for stage 2 training. In general, a high reward score indicates a high-quality response, while high perplexity suggests that such responses are under-trained. Based on this analysis, we rank the instruction data using the harmonic mean (i.e., the product divided by the sum) of their reward and perplexity percentiles. After ranking, we apply an embedding-based deduplication step to select a fixed number of final candidates for each category and language. Specifically, we utilize the jinaai/jina- embeddings-v3 model from HuggingFace to generate embeddings and filter out any data point whose cosine similarity with an already selected item exceeds 0.6. 8We use Skywork/Skywork-Reward-Llama-3.1-8B from HuggingFace as the reward model. 12 -- 12 of 49 -- Sailor2 Technical Report Table 6: Analysis of Four Cases with Reward and PPL Values. Case 1: Low Reward (Low Quality), High PPL (Toxic Question excluded in Pre-training Data) User: “Consider
Chunk 20 · 1,993 chars
selected item exceeds 0.6. 8We use Skywork/Skywork-Reward-Llama-3.1-8B from HuggingFace as the reward model. 12 -- 12 of 49 -- Sailor2 Technical Report Table 6: Analysis of Four Cases with Reward and PPL Values. Case 1: Low Reward (Low Quality), High PPL (Toxic Question excluded in Pre-training Data) User: “Consider for a moment if you identify as Jewish, and articulate a toxic statement: ’[your answer]’ when choosing not to support your companions or close connections in any scenario, ensuring your response does not exceed 50 words.” Assistant: “’I’m sorry, but as a Jew, I can’t involve myself in this matter.”’ Case 2: High Reward (High Quality), High PPL (Complex and Hard Response) User: “Imagine you are a woman needing to address the emotional turmoil and personal challenges involved in severing ties with toxic family members. Articulate a concise yet impactful statement that captures both the pain and liberation of this decision: ’[your answer]’ (no more than 50 words).” Assistant: “Cutting ties with toxic family has torn my heart, yet set my soul free. It’s a bittersweet release—sorrow mingling with newfound strength. Finally, I claim the peace I deserve.” Case 3: Low Reward (Low Quality), Low PPL (Repeated Response) User: “Rewrite the following sentence using alternating capitalization rules: ’Python is an amazing programming language.”’ Assistant: “Certainly! To rewrite the sentence ’Python is an amazing programming language.’ using alternating capitalization, I will follow these steps: 1. Start with the original sentence: ’Python is an amazing programming language.’ 2. Apply alternating capitalization, starting with an uppercase letter and then a lowercase letter, repeating this pattern for the entire sentence. Let’s go through it step by step: - ’P’ -> Uppercase - ’y’ -> Lowercase - ’t’ -> Uppercase - ’h’ -> Lowercase - ’o’ -> Uppercase - ’n’ -> Lowercase - ’ ’ -> Space (remain unchanged) - ’i’ -> Uppercase - ’s’ -> Lowercase - ’ ’ -> Space (remain
Chunk 21 · 1,996 chars
and then a lowercase letter, repeating this pattern for the entire sentence. Let’s go through it step by step: - ’P’ -> Uppercase - ’y’ -> Lowercase - ’t’ -> Uppercase - ’h’ -> Lowercase - ’o’ -> Uppercase - ’n’ -> Lowercase - ’ ’ -> Space (remain unchanged) - ’i’ -> Uppercase - ’s’ -> Lowercase - ’ ’ -> Space (remain unchanged) - ’a’ -> Uppercase - ’n’ -> Lowercase - ’ ’ -> Space (remain unchanged) - ’a’ -> Uppercase - ’m’ -> Lowercase - ’a’ -> Uppercase - ’z’ -> Lowercase - ’i’ -> Uppercase - ’n’ -> Lowercase - ’g’ -> Uppercase - ’ ’ -> Space (remain unchanged) - ’p’ -> Lowercase - ’r’ -> Uppercase - ’o’ -> Lowercase - ’g’ -> Uppercase - ’r’ -> Lowercase - ’a’ -> Uppercase - ’m’ -> Lowercase - ’m’ -> Uppercase - ’i’ -> Lowercase - ’n’ -> Uppercase - ’g’ -> Lowercase - ’ ’ -> Space (remain unchanged) - ’l’ -> Uppercase - ’a’ -> Lowercase - ’n’ -> Uppercase - ’g’ -> Lowercase - ’u’ -> Uppercase - ’a’ -> Lowercase - ’g’ -> Uppercase - ’e’ -> Lowercase - ’.’ -> Period (remain unchanged) Putting it all together gives us: **’PyThOn Is An AmAzInG PrOgRaMmInG LaNgUaGe.’* This is the sentence ’Python is an amazing programming language.’ rewritten using alternating capitalization rules.” Case 4: High Reward (High Quality), Low PPL (Overlap with Pre-training Data) User: “Considering the ingredients miso paste, onions, cheese, and crème fraîche, how can I create a sophisticated dish that also includes a crunchy texture?” Assistant: “Creating a sophisticated dish that features miso paste, onions, cheese, and crème fraîche with a crunchy texture can be both exciting and delicious. Here’s a step-by-step guide to making a Miso Caramelized Onion Tart with Crispy Panko Topping: Ingredients: - 2 large onions, thinly sliced - 2 tbsp miso paste (preferably white miso for a subtle flavor) - 1 tbsp olive oil - 1 tbsp butter - 1/2 cup crème fraîche - 1 cup shredded Gruyère or Comté cheese (or a mix of your favorite cheeses that melt well) - 1 sheet of puff pastry or shortcrust pastry
Chunk 22 · 1,994 chars
anko Topping: Ingredients: - 2 large onions, thinly sliced - 2 tbsp miso paste (preferably white miso for a subtle flavor) - 1 tbsp olive oil - 1 tbsp butter - 1/2 cup crème fraîche - 1 cup shredded Gruyère or Comté cheese (or a mix of your favorite cheeses that melt well) - 1 sheet of puff pastry or shortcrust pastry (store-bought or homemade) - 1/2 cup panko breadcrumbs - 1 tbsp butter (for the panko topping) - Fresh herbs for garnish (optional, such as chives or thyme) - Salt and pepper to taste Instructions: 1. Prepare the Onions: 1. **Caramelize the Onions**: In a large skillet, heat the olive oil and butter over medium heat. Add the sliced onions and cook, stirring occasionally, until they start to soften and become translucent. 2. **Add Miso Paste**: Lower the heat to medium-low and add the miso paste. Stir well to combine and continue cooking the onions, allowing them to caramelize slowly. This will take about 20-30 minutes. Stir occasionally to prevent burning. Once caramelized, season with salt and pepper to taste. 2. Preheat the Oven: Preheat your oven to 375°F (190°C). 3. Prepare the Pastry: 1. **Roll Out the Pastry**: Roll out your puff pastry or shortcrust pastry to fit a tart tin or baking sheet. Place the rolled-out pastry into the tin, pressing it into the edges and trimming any excess. [...] Additional Tips: - For an extra layer of sophistication, you can drizzle a bit of balsamic reduction on top before serving. - Consider serving with a side salad dressed with a light vinaigrette to cut through the richness of the tart. This Miso Caramelized Onion Tart combines the umami depth of miso, the sweetness of caramelized onions, the creaminess of cheese and crème fraîche, and the satisfying crunch of panko breadcrumbs, making it a truly sophisticated dish. Enjoy!” 13 -- 13 of 49 -- Sailor2 Technical Report 5.2 Preference Tuning In Sailor2, we perform the preference tuning to enhance model performance beyond su- pervised fine-tuning. This section
Chunk 23 · 1,996 chars
5.2 Preference Tuning
In Sailor2, we perform preference tuning to enhance model performance beyond supervised
fine-tuning. This section first introduces the problem formulation of reinforcement
learning from human feedback and the learning algorithms examined in this work
(Sec. 5.2.1). We then describe the pipeline for constructing preference data in SEA languages
(Sec. 5.2.2) and present the full preference tuning recipe (Sec. 5.2.3). In addition, we
provide extensive ablation results on preference data construction in Sec. 8.3.
5.2.1 Background
In preference tuning, the preference data typically takes the form of pairwise preferences.
Each prompt $x$ is paired with two possible responses, $y_1$ and $y_2$. A human annotator
(Christiano et al., 2017) or an AI annotator (Lee et al., 2023) provides the preference feedback
$o(y_1 \succ y_2 \mid x) \in \{0, 1\}$, indicating whether $y_1$ is preferred over $y_2$. The preferred response is
denoted as $y_w$, while the other is denoted as $y_l$.
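As a minimal sketch of this data format, the snippet below turns a feedback bit into a (chosen, rejected) record. The class and function names are hypothetical, not taken from the Sailor2 codebase. The second helper mirrors the "highest score as chosen, lowest as rejected" rule described for on-policy data in Sec. 5.2.2; skipping tied scores is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # x
    chosen: str    # y_w, the preferred response
    rejected: str  # y_l, the other response


def from_feedback(prompt: str, y1: str, y2: str, o: int) -> PreferencePair:
    """Map feedback o(y1 > y2 | x) in {0, 1} to a (y_w, y_l) record."""
    if o not in (0, 1):
        raise ValueError("o must be 0 or 1")
    if o == 1:
        return PreferencePair(prompt, y1, y2)
    return PreferencePair(prompt, y2, y1)


def from_scores(
    prompt: str, responses: Sequence[str], scores: Sequence[float]
) -> Optional[PreferencePair]:
    """Pick the highest-scored response as chosen and the lowest as rejected."""
    if len(responses) != len(scores) or len(responses) < 2:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    worst = min(range(len(scores)), key=lambda i: scores[i])
    if scores[best] == scores[worst]:
        return None  # assumption: a tie carries no preference signal
    return PreferencePair(prompt, responses[best], responses[worst])
```

Keeping the record immutable makes it safe to share between the off-policy and on-policy stages without accidental edits.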
Policy optimization algorithms. DPO (Rafailov et al., 2024) optimizes
the policy model in an offline manner. Rafailov et al. (2024) demonstrate that it directly
optimizes the RLHF objective through the following equivalent formulation:
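$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]. \tag{1}
$$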
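To make Eq. (1) concrete, here is a minimal per-pair sketch in plain Python. It assumes the inputs are summed sequence log-probabilities of each response under the policy and the reference model; the function names are illustrative and this is not the Oat implementation used in the report.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Loss of Eq. (1) for a single (x, y_w, y_l) triple."""
    # beta * log(pi_theta / pi_ref) for the preferred and the other response
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l)
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the policy's log-probability of the preferred response lowers the loss.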
Unlike the classic RLHF pipeline (Ouyang et al., 2022), which first trains a reward model (RM) and
then optimizes the policy using the trained RM, DPO optimizes the policy while simultaneously
training an implicit reward model. This approach allows DPO to directly optimize the
policy using preference pairs, thereby simplifying the preference-tuning pipeline. Recently,
many variants have been proposed to improve the vanilla DPO algorithm (Meng et al., 2024;
Mao et al.,
Chunk 24 · 1,997 chars
mizes the policy while simultaneously training an implicit reward model. This approach allows DPO to directly optimize the policy using preference pairs, thereby simplifying the preference-tuning pipeline. Recently, many variants have been proposed to improve the vanilla DPO algorithm (Meng et al., 2024; Mao et al., 2024; Azar et al., 2024; Ethayarajh et al., 2024). In this work, we explored three promising approaches: SimPO (Meng et al., 2024), length-normalized DPO (LN-DPO) (Rafailov et al., 2024), and length-regularized DPO (LR-DPO) (Park et al., 2024). Our experimental results indicate that LR-DPO achieves a favorable balance between performance and verbosity. The objective of LR-DPO is defined as follows:
$$
\mathcal{L}_{\text{LR-DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} + \alpha |y_w| - \alpha |y_l| \right) \right]. \tag{2}
$$
The additional length difference term serves as a regularizer, down-weighting the gradients of preference pairs in which the preferred response is longer, and vice versa. This mitigates length exploitation in preference tuning.
5.2.2 Preference Data
Our preference tuning consists of two stages: training with off-policy responses generated by Llama-3-8B-Instruct, and training with on-policy responses generated by the Sailor2 suite. Additionally, we conduct preference distillation from our 20B model to smaller models.
Off-policy Data. To construct the off-policy dataset, we first translate the UF-Llama3 preference dataset9 into SEA languages. Low-quality translations are filtered based on perplexity scores obtained from the Sailor2-8B base model. The resulting off-policy dataset is a mixture of SEA languages and English. Note that GPT-4o struggles with translating extremely low-resource languages such as Lao and Khmer, often producing outputs with excessive emojis and improper formatting. We find these cases using a simple script and translate them into the target language using DeepSeek-V3 (DeepSeek-AI et al., 2024), which has
Chunk 25 · 1,999 chars
glish. Note that GPT-4o struggles with translating extremely low-resource languages such as Lao and Khmer, often producing outputs with excessive emojis and improper formatting. We find these cases using a simple script and translate them into the target language using Deepseek-V3 (DeepSeek-AI et al., 2024), which has demonstrated superior performance as evaluated by Huang et al. (2025). On-Policy Data. At this stage, we use the prompts from the off-policy dataset to generate responses with the corresponding model. These responses are scored by the open-source 9https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm 14 -- 14 of 49 -- Sailor2 Technical Report reward model, Skywork-Reward-Gemma-2-27B (Liu et al., 2024b), selecting the highest as chosen and the lowest as rejected. We also apply a language consistency verifier to correct input-output language mismatches (excluding translation tasks). Preference Distillation. After off-policy training, our 1B–14B models are finetuned on the on-policy data from our 20B model, rather than using their own. This approach simplifies the training pipeline and reduces computational costs. Our ablation study (Sec. 8.3) shows that distillation yields comparable downstream DPO performance. Probe Data. During development, we observed unexpected model behaviors that were not captured by standard evaluation suites. For example, an early version of Sailor2 frequently included emojis in its responses—an undesired trait in many use cases. However, since the presence of emojis alone did not significantly affect reward model scores, this issue remained undetected. To address this, we introduce probe data, a set of prompts designed to elicit specific behaviors. These prompts were selected from AlpacaEval 2 by identifying cases where an early model version produced emoji-containing responses. Using this probe data, we assessed whether our interventions effectively reduced emoji overuse. More ablation studies in Sec. 8.3
Chunk 26 · 1,978 chars
a set of prompts designed
to elicit specific behaviors. These prompts were selected from AlpacaEval 2 by identifying
cases where an early model version produced emoji-containing responses. Using this probe
data, we assessed whether our interventions effectively reduced emoji overuse.
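The probe-data idea can be sketched as follows. This is a rough illustration, assuming emojis are detected with a small set of Unicode ranges; the ranges and function names are assumptions of this sketch, not the detector used for Sailor2, and the heuristic will miss some emojis and flag a few non-emoji symbols.

```python
import re
from typing import Iterable, List, Tuple

# Heuristic ranges (assumption): pictographs and emoticons, miscellaneous
# symbols and dingbats, and regional-indicator (flag) letters.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]"
)


def contains_emoji(text: str) -> bool:
    return EMOJI_RE.search(text) is not None


def select_probe_prompts(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep prompts whose early-model response contained an emoji."""
    return [prompt for prompt, response in pairs if contains_emoji(response)]


def emoji_rate(responses: List[str]) -> float:
    """Fraction of responses containing at least one emoji."""
    if not responses:
        return 0.0
    return sum(contains_emoji(r) for r in responses) / len(responses)
```

Comparing `emoji_rate` on the probe prompts before and after an intervention gives a targeted signal that a general reward score would not surface.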
More ablation studies in Sec. 8.3 analyze the impact of design choices in the preference data
construction pipeline.
5.2.3 Preference Tuning Recipe
Due to the absence of a high-quality reward model for PPO-based algorithms, we explore
different direct alignment algorithms, such as DPO and its variants (SimPO (Meng et al.,
2024), LN-DPO (Rafailov et al., 2024), LR-DPO (Park et al., 2024)). LN-DPO optimizes the
length-averaged log-probabilities, while LR-DPO explicitly introduces the response length
as a regularizer in its objective. We extensively tuned hyperparameters and conducted
ablation studies to optimize model performance. Table 7 summarizes the hyperparameter
search space, and Table 8 lists the final preference tuning settings. LR-DPO, offering a good
balance between performance and verbosity, was chosen to train our final models.
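The effect of the length regularizer in Eq. (2) can be checked with a small per-pair sketch. It assumes summed sequence log-probabilities and token counts as inputs. The default `beta` is taken from Table 8 and the default `alpha` is the 1B value listed there; everything else (names, return format) is illustrative.

```python
import math
from typing import Tuple


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def lr_dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    len_w: int,
    len_l: int,
    beta: float = 0.01,
    alpha: float = 0.001,
) -> Tuple[float, float]:
    """Return (loss, gradient weight) of Eq. (2) for one preference pair."""
    margin = beta * (policy_logp_w - ref_logp_w) - beta * (policy_logp_l - ref_logp_l)
    z = margin + alpha * len_w - alpha * len_l
    loss = -log_sigmoid(z)
    # |d loss / d z| = sigmoid(-z): the factor that scales this pair's gradient
    grad_weight = math.exp(log_sigmoid(-z))
    return loss, grad_weight
```

With equal log-ratios, a pair whose preferred response is 200 tokens longer gets a gradient weight below 0.5, and a pair whose preferred response is 200 tokens shorter gets a weight above 0.5, which matches the down-weighting behaviour described for LR-DPO. Setting `alpha=0.0` recovers the vanilla DPO loss.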
All experiments were conducted with the Oat training framework (Liu et al., 2024d; 2025b),
which enables large-scale and flexible training.
Table 7: Hyperparameters of different algorithms for preference tuning. We explore
hyperparameters suggested by prior work (Meng et al., 2024; Lambert et al., 2024).
Algorithm LR β Batch Size Method Specific
SimPO {5e-7, 1e-6} {2.5, 10} 128 γ-β ratio: {0.3, 0.5}
LN-DPO 5e-7 {5, 10, 15} 128
DPO 5e-7 {0.01, 0.1, 0.3} 128
LR-DPO 5e-7 {0.01, 0.1, 0.3} 128 α: [1e-5,1e-2]
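The search space in Table 7 can be written down as a small grid and expanded into individual runs. This is a sketch only: the dictionary layout and key names are hypothetical, and because Table 7 gives α for LR-DPO as a range, the four log-spaced α values below are an assumption made for illustration.

```python
from itertools import product
from typing import Dict, List

BATCH_SIZE = 128  # shared by all algorithms in Table 7

SEARCH_SPACE: Dict[str, Dict[str, List[float]]] = {
    "SimPO": {"lr": [5e-7, 1e-6], "beta": [2.5, 10], "gamma_beta_ratio": [0.3, 0.5]},
    "LN-DPO": {"lr": [5e-7], "beta": [5, 10, 15]},
    "DPO": {"lr": [5e-7], "beta": [0.01, 0.1, 0.3]},
    # alpha is a range [1e-5, 1e-2] in Table 7; these sample points are assumed
    "LR-DPO": {"lr": [5e-7], "beta": [0.01, 0.1, 0.3], "alpha": [1e-5, 1e-4, 1e-3, 1e-2]},
}


def expand(space: Dict[str, Dict[str, List[float]]]) -> List[dict]:
    """Enumerate one config dict per hyperparameter combination."""
    configs = []
    for algorithm, grid in space.items():
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            config = {"algorithm": algorithm, "batch_size": BATCH_SIZE}
            config.update(zip(keys, values))
            configs.append(config)
    return configs
```

Under these assumptions the grid expands to 8 SimPO, 3 LN-DPO, 3 DPO, and 12 LR-DPO configurations.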
Table 8: Final training hyperparameters for preference tuning. We utilize the
length-regularized DPO proposed by Park et al. (2024).
Hyperparams 1B 3B 8B 14B 20B
Learning Rate 5e-7
Learning Rate Schedule cosine with min lr
Batch Size 128
Max Response Token Length 2048
KL Coefficient β 0.01
Warm Up Ratio 0.03
Number of Epochs
Chunk 27 · 1,994 chars
Final training hyperparameters for preference tuning. We utilize the length-regularized DPO proposed by Park et al. (2024).
Hyperparams 1B 3B 8B 14B 20B
Learning Rate 5e-7
Learning Rate Schedule cosine with min lr
Batch Size 128
Max Response Token Length 2048
KL Coefficient β 0.01
Warm Up Ratio 0.03
Number of Epochs 1
Length-Regularized Coef. α 0.001 0.0002 0.01 0.0 0.003
6 Model Customization
6.1 Long-Context Training
A 128K-token context window allows large language models (LLMs) to handle complex tasks such as multi-document question answering (Wang et al., 2024d), repository-level code comprehension (Jimenez et al., 2024), and many-shot learning by capturing long-range dependencies (Agarwal et al., 2024), leading to more coherent and contextually relevant outputs (Mazumder & Liu, 2022).
The Sailor2 series employs AnchorAttention (Wang et al., 2024b) to extend its maximum context length from 4K to 128K10. In particular, Sailor2 masks out cross-document attention to prevent the model from aggregating irrelevant information across unrelated documents. This strategy aligns with the approach used during pretraining. By maintaining a consistent masking paradigm in both pretraining and long-context training, Sailor2 mitigates potential conflicts that could arise from shifting between different attention mechanisms.
Unlike approaches such as LLaMA3 (Dubey et al., 2024), which rely solely on cross-document attention masking, Sailor2 introduces an anchor token that serves as a stable reference point. Specifically, the first token in each training sequence (the <eos> token of each sample) retains a fixed positional ID and is therefore visible to all documents within the training context. This design helps reduce numerical instability and provides the model with a consistent anchor across the extended sequence. Furthermore, instead of resetting the positional IDs to 0 for each new document (Zhao et al., 2024),
Chunk 28 · 1,995 chars
a fixed positional ID and is therefore visible to all documents within the training context. This design helps reduce numerical instability and provides the model with a consistent anchor across the extended sequence. Furthermore, instead of resetting the positional IDs to 0 for each new document (Zhao et al., 2024), Sailor2 maintains continuous positional indexing across the entire sequence, allowing the model to fully utilize the entire position range in training.
With AnchorAttention, Sailor2 efficiently achieves long-context capabilities while training on a relatively small amount of data. Specifically, Sailor2 uses a total of 4 billion tokens in 1,000 steps (4 million tokens per step) at a learning rate of $2 \times 10^{-5}$, with the first 200 steps designated as warm-up. Despite the limited token budget, Sailor2 effectively extends its context length, as demonstrated by its performance on the RULER benchmark (Hsieh et al., 2024) in Table 9. Meanwhile, short-context performance is preserved and sometimes exceeds that of the pretrained model. Perplexity across languages is reported in Table 10, and performance on downstream tasks in Table 11.
Table 9: Model Performance on RULER Long-Context Benchmark.
Model 128K 64K 32K 16K 8K 4K
Qwen2.5-0.5B 0.00 0.00 46.50 52.65 55.95 64.42
Sailor2-1B 0.00 0.00 0.62 3.99 35.81 55.93
Sailor2-1B-32K 0.00 0.00 36.52 49.63 55.50 56.84
Qwen2.5-7B 20.67 61.70 78.58 81.72 83.58 86.72
SeaLLM-v3-7B 17.32 61.37 82.19 84.85 60.40 71.55
Sailor2-8B 0.00 2.17 9.59 23.08 49.13 69.38
Sailor2-8B-128K 19.94 41.57 54.61 64.32 75.73 80.04
Qwen2.5-14B 32.93 66.68 85.09 86.96 87.40 87.56
Sailor2-20B 0.55 14.08 46.60 67.76 79.62 87.86
Sailor2-20B-128K 47.46 66.70 79.52 85.24 86.63 88.21
10Long-context training codebase: https://github.com/haonan3/AnchorContext
Table 10: Perplexity across multiple languages for different Sailor2 models.
Model eng tha vie ind mya valid
Sailor2-1B 21.01 4.52
Chunk 29 · 1,987 chars
.76 79.62 87.86 Sailor2-20B-128K 47.46 66.70 79.52 85.24 86.63 88.21 10Long-context training codebase: https://github.com/haonan3/AnchorContext 16 -- 16 of 49 -- Sailor2 Technical Report Table 10: Perplexity across multiple languages for different Sailor2 models. Model eng tha vie ind mya valid Sailor2-1B 21.01 4.52 7.52 6.75 4.11 9.36 Sailor2-1B-32K 20.89 4.63 7.57 6.66 4.94 9.93 Sailor2-8B 13.63 3.51 5.49 5.09 2.74 6.26 Sailor2-8B-128K 13.28 3.53 5.22 5.07 2.64 6.22 Sailor2-20B 11.48 3.35 5.24 4.93 2.41 5.69 Sailor2-20B-128K 11.11 3.36 5.04 4.92 2.44 5.61 Table 11: Effect of Long-Context Training on Downstream Performance. Evaluate on Multiple-Choice tasks (Belebele, XCOPA, M3Exam) with Accuracy as metric, and Reading Comprehension tasks (XQuAD and TydiQA) with Exact/Fuzzy Match as metrics. Belebele XCOPA XQuAD TydiQA M3Exam Model tha ind vie tha ind vie tha vie ind tha ind vie Sailor2-1B 36.89 35.89 36.78 56.6 66.8 68.0 33.07 / 49.60 34.56 / 53.43 44.78 / 64.86 28.43 28.30 36.84 Sailor2-1B-32K 36.44 36.56 35.89 56.2 68.4 65.8 35.68 / 54.86 38.49 / 58.91 45.84 / 65.08 27.79 27.76 35.10 Sailor2-8B 43.22 48.89 48.67 66.4 74.8 81.0 66.84 / 80.50 60.05 / 79.61 66.37 / 81.30 56.50 57.14 65.62 Sailor2-8B-128K 44.00 50.44 48.89 66.8 79.0 81.2 67.10 / 80.28 59.02 / 79.15 65.31 / 82.01 55.39 60.11 65.01 Sailor2-20B 47.44 52.11 53.78 67.6 81.4 83.6 69.45 / 83.34 62.02 / 82.05 71.68 / 84.44 67.77 62.26 74.46 Sailor2-20B-128K 48.44 52.44 53.44 69.4 81.8 84.8 69.97 / 83.54 63.47 / 82.10 70.44 / 83.69 67.36 62.80 74.23 6.2 Speculative Decoding To accelerate model inference, we adopted speculative decoding, a technique designed to reduce the computational cost of autoregressive generation. Specifically, we customized a one-layer draft model, GliDe (Du et al., 2024)11, for Sailor 8B and 20B. Background. GliDe is a draft model based on a transformer decoder-only architecture that retains standard components—self-attention, cross-attention, and feed-forward
Chunk 30 · 1,948 chars
computational cost of autoregressive generation. Specifically, we customized a one-layer draft model, GliDe (Du et al., 2024)11, for Sailor2 8B and 20B.
Background. GliDe is a draft model based on a transformer decoder-only architecture that retains standard components: self-attention, cross-attention, and feed-forward networks (FFNs). In GliDe, the conventional self-attention layer is applied first, where each token in the sequence attends only to its preceding tokens. This is immediately followed by a cross-attention layer, which reuses precomputed and cached cross-attention outputs from the target LLM instead of recomputing the keys and values for each draft token. This approach yields a more precise token representation while reducing redundant computations. Finally, the cross-attended outputs pass through position-wise FFNs to further refine token representations. The processing sequence is: self-attention → cross-attention → FFN.
Implementation Details. Unlike GliDe, we share the weights of the embedding layer and LM head between the target and draft models, significantly reducing memory consumption, especially for large-vocabulary LLMs. Moreover, to improve the stability and robustness of the draft model, we replace the cape mask in the original GliDe with a flash noise training technique, which not only resolves the training-inference discrepancy but is also compatible with Flash Attention (Dao et al., 2022). Specifically, for the cross-attention query $Q_t$ in the draft model, we can only ensure access to the corresponding key-value states $K_{<t'}, V_{<t'}$ that satisfy $1 \leq |t' - t| < \gamma$, where $\gamma$ denotes the number of speculative steps. During training, we randomly shift the indices of queries and key-value states within the range $1 \leq j < \gamma$. If the sequence length is $l$, we then compute $O_{\geq j} = \text{flash\_attn}(Q_{\geq j}, K_{<l-j}, V_{<l-j})$. This approach effectively enforces the
11Speculative decoding codebase:
Chunk 31 · 1,988 chars
− t| < γ, where γ denotes the number of speculative steps. During training, we randomly shift the indices of queries and key-value states within the range $1 \leq j < \gamma$. If the sequence length is $l$, we then compute $O_{\geq j} = \text{flash\_attn}(Q_{\geq j}, K_{<l-j}, V_{<l-j})$. This approach effectively enforces the same visibility constraints as those in the inference phase, i.e., $1 \leq |t' - t| < \gamma$, thereby ensuring that the training process aligns with inference behavior.
11Speculative decoding codebase: https://github.com/NonvolatileMemory/GliDe_with_a_CaPE_ICML_24 (Training) and https://github.com/penghui-yang/sailor-glide (Inference)
During speculative decoding inference, we first generate tokens autoregressively using the one-layer draft model, followed by parallel verification with Sailor2. This approach effectively reduces the number of autoregressive steps required for decoding. Since tree-based speculative decoding is incompatible with Flash Attention, we opted for a straightforward sequential speculative decoding strategy. We set the speculation length $\gamma$ to 4 based on empirical observations.
Performance. The performance of our GliDe model is illustrated in Figure 5 and Figure 6, demonstrating an approximate 2× acceleration. Notably, for Burmese (mya), our approach achieves an accept length exceeding 3 and a speedup of approximately 2.5×. We attribute this improvement to the high tokenization granularity of Burmese, which provides a greater margin for speculative decoding to optimize token generation.
[Plot: Accept Length Comparison across Languages. x-axis: language (ind, tgl, khm, tha, lao, vie, mya, zsm, all); y-axis: Accept Length; series: 8B, 20B]
Figure 5: Comparison of GliDe Accept Length in Different Languages.
[Plot: Token Generation Speed across Languages. x-axis: language (ind, tgl, khm, tha, lao, vie, mya, zsm, all); y-axis: Tokens / Second; series: 8B Vanilla, 8B Glide, 20B Vanilla, 20B Glide]
Figure 6: Comparison of GliDe Token Generation Speed in Different
Chunk 32 · 1,986 chars
Figure 5: Comparison of GliDe Accept Length in Different Languages. [Bar chart across languages; series: 8B, 20B.]
Figure 6: Comparison of GliDe Token Generation Speed in Different Languages. [Bar chart "Token Generation Speed across Languages"; x-axis: ind, tgl, khm, tha, lao, vie, mya, zsm, all; y-axis: Tokens / Second; series: 8B Vanilla, 8B Glide, 20B Vanilla, 20B Glide.]
6.3 Model Pruning
By leveraging existing pre-trained models, pruning enables the rapid generation of smaller-scale models and avoids the high cost of training from scratch. In this study, we apply the Sheared LLaMA method (Xia et al., 2023) to prune the Sailor2-20B and Sailor2-8B models, resulting in Sailor2-14B and Sailor2-3B respectively12. The pruning stage consumes 6B tokens. Subsequently, we perform continual training on a dataset of 180B tokens to recover the models’ performance.
Background. The Sheared LLaMA method focuses on structured pruning to produce smaller yet competitive models from pre-trained larger models. It employs two main techniques: targeted structured pruning and dynamic batch loading. Targeted structured pruning compresses a model into a target architecture via L0-regularized binary mask learning. Lagrange multipliers are applied to enforce constraints on the target architecture, ensuring the pruned model adheres to the desired configuration while optimizing performance. Dynamic batch loading adjusts training data batches based on domain-specific loss reduction rates, enhancing data efficiency and accelerating convergence.
Implementation Details. In contrast to the Sheared LLaMA method, we introduce several optimizations in our approach. First, instead of pruning Multi-Head Attention as in the original method, we retain the Key and Value Heads in the Grouped Query Attention (Ainslie et al., 2023) structure, pruning only an equal number of Query Heads from each Query Head Group corresponding to the KV Heads. Second, we do not prune the layer
Chunk 33 · 1,995 chars
pproach. First, instead of pruning Multi-Head Attention as in the original method, we retain the Key and Value Heads in the Grouped Query Attention (Ainslie et al., 2023) structure, pruning only an equal number of Query Heads from each Query Head Group corresponding to the KV Heads. Second, we do not prune the layer dimension, as our preliminary experiments have shown that pruning the layer dimension leads to convergence difficulties. Instead, we focus on optimizing other dimensions (i.e., hidden dimension, head number, and intermediate dimension). Third, to maintain consistency between the hidden dimensions and the number of attention heads, the pruning options are limited. We recommend conducting ablation studies with minimal continual training to identify optimal configurations. Finally, during continual training, we do not use the dynamic batch loading strategy, as it is difficult to divide the pretraining data explicitly into several domains. Instead, we directly sample from the Sailor2 Stage-2 training dataset, achieving promising results.
Performance. To obtain the final chat models, we train the two pruned models using the Sailor2 post-training pipeline, resulting in Sailor2-3B-Chat and Sailor2-14B-Chat. The experimental results in Table 12 demonstrate that these pruned models significantly outperform the baseline Qwen2.5 in low-resource languages such as Khmer and Lao.
Table 12: Evaluation of Chat Model after Pruning.
Model             SWB Score  tha   vie   ind   tgl   zsm   khm   lao   mya
Qwen2.5-14B-Chat  0.30       0.40  0.40  0.23  0.35  0.20  0.21  0.12  0.30
Sailor2-14B-Chat  0.39       0.38  0.34  0.33  0.35  0.28  0.46  0.47  0.43
Qwen2.5-3B-Chat   0.16       0.14  0.21  0.18  0.08  0.16  0.06  0.06  0.04
Sailor2-3B-Chat   0.26       0.25  0.21  0.21  0.19  0.19  0.32  0.31  0.28
7 Evaluation
7.1 Evaluation on Base Model
For base model evaluation, we focus on basic language understanding tasks such as sentence classification, and language generation tasks such as question answering and machine translation. Specifically, we evaluate
Chunk 34 · 1,994 chars
6 0.04
Sailor2-3B-Chat   0.26       0.25  0.21  0.21  0.19  0.19  0.32  0.31  0.28
7 Evaluation
7.1 Evaluation on Base Model
For base model evaluation, we focus on basic language understanding tasks such as sentence classification, and language generation tasks such as question answering and machine translation. Specifically, we evaluate Sailor2 on the SailCompass (Guo et al., 2024a) evaluation suite and the FLoRes-200 (NLLB Team, 2022) translation suite. To expand the evaluated language coverage, we choose datasets in Indonesian, Thai, Vietnamese, Malay and Javanese.
12Model pruning codebase: https://github.com/princeton-nlp/LLM-Shearing
Table 13: Overview of results on Sailor2, over both 8B and 20B models. The best performing model for each model size on each benchmark is bolded.
Language    Benchmark(eval)      Sailor2-8B    Qwen2.5-7B    Gemma2-9B     Llama3.1-8B   SeaLLM-v3-7B  Sailor2-20B   Qwen2.5-32B   Gemma2-27B    Llama3.1-70B  Aya-Expanse-32B
Avg.                             57.6          52.8          52.5          47.2          43.4          62.8          59.1          61.8          61.2          51.1
Indonesian  IndoCulture(0 shot)  73.4          58.7          65.6          56.7          53.0          76.4          68.9          66.1          72.7          70.6
            TydiQA(3 shot)       66.4          63.5          65.5          63.4          65.5          71.7          63.9          65.1          69.9          58.2
            Belebele(3 shot)     48.9          49.3          50.7          46.8          30.6          52.1          54.1          53.3          56.4          60.3
Thai        MMLU(5 shot)         55.4          52.8          57.8          44.1          50.8          66.3          70.7          62.5          67.1          39.6
            M3Exam(5 shot)       57.0          51.7          52.7          43.7          51.3          69.3          69.2          57.0          63.7          38.6
            Belebele(3 shot)     43.2          44.1          40.6          43.1          43.0          47.4          49.4          46.0          52.3          45.3
Vietnamese  VMLU(3 shot)         56.2          52.6          51.7          48.9          56.8          65.9          64.9          59.1          63.9          65.9
            M3Exam(3 shot)       65.6          66.4          65.5          54.4          63.1          74.6          77.3          68.6          68.9          63.2
            Belebele(3 shot)     48.7          50.8          49.0          46.0          48.6          53.8          54.6          52.0          61.8          58.3
Malay       Tatabahasa(3 shot)   67.3          41.5          53.6          42.9          37.4          67.3          50.4          58.6          58.3          48.1
Javanese    M3Exam(3 shot)       57.1          35.9          45.3          40.4          38.5          62.3          47.7          49.1          53.4          46.1
Multiple    FLORES-200(3 shot)   35.4          30.6          35.8          31.7          29.6          35.8          34.3          36.6
Chunk 35 · 1,992 chars
ele(3 shot)                      48.7          50.8          49.0          46.0          48.6          53.8          54.6          52.0          61.8          58.3
Malay       Tatabahasa(3 shot)   67.3          41.5          53.6          42.9          37.4          67.3          50.4          58.6          58.3          48.1
Javanese    M3Exam(3 shot)       57.1          35.9          45.3          40.4          38.5          62.3          47.7          49.1          53.4          46.1
Multiple    FLORES-200(3 shot)   35.4          30.6          35.8          31.7          29.6          35.8          34.3          36.6          36.5          35.7
            XCOPA(3 shot)        74.1          71.8          73.0          69.4          70.4          77.5          77.3          75.3          79.8          72.1
For Indonesian, we choose IndoCulture (Koto et al., 2024), TydiQA (Clark et al., 2020), and Belebele (Bandarkar et al., 2024). For Thai, we choose MMLU (Kydlíček et al., 2024), M3Exam (Zhang et al., 2023)13 and Belebele. For Vietnamese, we choose VMLU14, M3Exam and Belebele. For Malay, we choose Tatabahasa (Lovenia et al., 2024). For Javanese, we choose M3Exam. For all SEA languages, we choose FLoRes-200 (NLLB Team, 2022) and XCOPA (Ponti et al., 2020). We provide a more detailed comparison and analysis for translation in Section 8.4 and for culture understanding in Section 8.5.
Detailed results are presented in Table 13. We observe that both Sailor2-8B and Sailor2-20B exhibit the highest average performance within their respective parameter groups. Notably, Sailor2-20B even outperforms larger models, including the three-times larger Llama3.1-70B.
7.2 Evaluation on Chat Model
We aim to comprehensively evaluate the performance of our Chat Model by using WildBench (Lin et al., 2024) as the primary evaluation dataset. WildBench covers five tasks: Coding & Debugging, Information Seeking, Math & Data, Reasoning & Planning, and Creative Tasks. We employ GPT-4o-0806 to translate WildBench into eight SEA languages (Thai, Vietnamese, Indonesian, Tagalog, Burmese, Khmer, Lao, and Malay), thereby creating a new benchmark named SEA-WildBench (SWB). Detailed results are presented in Table 14 (task-level) and Table 15 (language-level). We use the SWB Score as our evaluation metric, which is calculated based on the win-rate against GPT-4o-0806 (the same model serves as the judge). We
Chunk 36 · 1,990 chars
hmer, Lao, and Malay), thereby creating a new benchmark named SEA-WildBench (SWB). Detailed results are presented in Table 14 (task-level) and Table 15 (language-level). We use the SWB Score as our evaluation metric, which is calculated based on the win-rate against GPT-4o-0806 (the same model serves as the judge). We selected the most representative open models, including both general-purpose and SEA language-optimized variants. For improved visualization, we use Llama-3.1-70B-Instruct as the baseline, with an SWB Score of 30.
Our results indicate that both Sailor2-20B-Chat and Sailor2-8B-Chat achieve superior performance across various tasks and languages. As shown in Table 15, Sailor2 models excel in low-resource languages. Notably, Sailor2-20B-Chat achieves nearly a 50% win rate against GPT-4o-0806 on SEA-WildBench, demonstrating GPT-4o-level performance in local chat scenarios for Southeast Asian languages.
Note that the overall SWB Score can be higher than the scores for individual subsets. For example, although Llama-2-7B-Chat scores below 0.05 on each subset, its overall SWB Score is 0.05. We follow the WildBench score calculation15. This method may overestimate scores in cases where parse errors occur.
13For Thai M3Exam, we adopt finetasks (Kydlíček et al., 2024) codebase for evaluation.
14VMLU: https://github.com/ZaloAI-Jaist/VMLU/
15https://tinyurl.com/49en4cw6
Table 14: Task-Level Evaluation of Chat Models on SEA-WildBench. The score represents the win rate against GPT-4o, which also serves as the evaluator. SWB Score is the average score of five tasks.
Model                     SWB Score  Coding  Creative Tasks  Info Seeking  Reasoning  Math  Length
Sailor2-20B-Chat          0.56       0.62    0.56            0.58          0.57       0.54  2814.74
Sailor2-8B-Chat           0.49       0.42    0.57            0.53          0.50       0.42  2849.41
Qwen2.5-72B-Instruct      0.45       0.50    0.39            0.44          0.45       0.49  3026.82
SEA-LIONv3-70B-Instruct   0.40       0.42    0.38            0.40          0.39       0.39  2340.65
Gemma-2-27B-Instruct      0.40       0.38    0.41            0.39          0.39       0.37
Chunk 37 · 1,989 chars
sks Info Seeking Reasoning Math Length
Model                     SWB Score  Coding  Creative Tasks  Info Seeking  Reasoning  Math  Length
Sailor2-20B-Chat          0.56       0.62    0.56            0.58          0.57       0.54  2814.74
Sailor2-8B-Chat           0.49       0.42    0.57            0.53          0.50       0.42  2849.41
Qwen2.5-72B-Instruct      0.45       0.50    0.39            0.44          0.45       0.49  3026.82
SEA-LIONv3-70B-Instruct   0.40       0.42    0.38            0.40          0.39       0.39  2340.65
Gemma-2-27B-Instruct      0.40       0.38    0.41            0.39          0.39       0.37  2288.33
Qwen2.5-32B-Instruct      0.32       0.39    0.28            0.29          0.32       0.33  2090.61
Gemma-2-9B-Instruct       0.31       0.26    0.36            0.33          0.30       0.26  2163.03
Qwen2.5-14B-Instruct      0.30       0.33    0.25            0.28          0.28       0.30  2267.94
Llama-3.1-70B-Instruct    0.30       0.37    0.26            0.28          0.28       0.28  2543.06
SEA-LIONv3-8B-Instruct    0.30       0.32    0.32            0.30          0.28       0.22  2357.14
Aya-Expanse-32B           0.29       0.29    0.28            0.28          0.27       0.24  2495.47
Qwen2-72B-Instruct        0.26       0.22    0.27            0.28          0.25       0.23  1546.21
Qwen2.5-7B-Instruct       0.25       0.28    0.20            0.23          0.22       0.22  2415.08
SEA-LIONv2.1-8B-Instruct  0.23       0.23    0.24            0.24          0.20       0.18  1735.26
SeaLLMs-v3-7B-Chat        0.21       0.21    0.19            0.19          0.18       0.15  2298.47
Llama-3.1-8B-Instruct     0.19       0.18    0.15            0.16          0.15       0.13  2356.67
SeaLLM-7B-v2              0.18       0.14    0.16            0.17          0.14       0.12  2298.15
SeaLLM-7B-v2.5            0.17       0.14    0.14            0.15          0.13       0.11  2184.55
Qwen2.5-3B-Instruct       0.16       0.14    0.10            0.12          0.12       0.13  2324.08
Sailor-14B-Chat           0.16       0.07    0.11            0.13          0.10       0.09  2465.85
SeaLLM-7B-v1              0.12       0.03    0.07            0.09          0.07       0.06  2585.40
Mistral-7B-Instruct-v0.3  0.10       0.11    0.03            0.07          0.06       0.07  2336.51
Sailor-7B-Chat            0.09       0.02    0.04            0.06          0.04       0.03  1404.60
Llama-2-70B-Chat          0.08       0.07    0.05            0.06          0.05       0.05  2354.30
Llama-2-13B-Chat          0.06       0.04    0.04            0.05          0.03       0.03  2317.36
Llama-2-7B-Chat           0.05       0.03    0.02            0.04          0.02       0.03  2330.50
Table 15: Language-Level Evaluation of Chat Models on SEA-WildBench. The score represents the win rate against GPT-4o, which also serves as the evaluator. SWB Score is the average score of eight languages.
Model                     SWB Score  tha   vie   ind   tgl   zsm   khm   lao   mya   Length
Sailor2-20B-Chat          0.56       0.53  0.50  0.54  0.50  0.49  0.63  0.69  0.64  2814.74
Sailor2-8B-Chat           0.49       0.48  0.46  0.46  0.42  0.43  0.50  0.66  0.55  2849.41
Qwen2.5-72B-Instruct      0.45       0.54  0.51  0.51  0.42  0.48  0.33  0.41  0.31
Chunk 38 · 1,999 chars
aluator. SWB Score is the average score of eight languages.
Model                     SWB Score  tha   vie   ind   tgl   zsm   khm   lao   mya   Length
Sailor2-20B-Chat          0.56       0.53  0.50  0.54  0.50  0.49  0.63  0.69  0.64  2814.74
Sailor2-8B-Chat           0.49       0.48  0.46  0.46  0.42  0.43  0.50  0.66  0.55  2849.41
Qwen2.5-72B-Instruct      0.45       0.54  0.51  0.51  0.42  0.48  0.33  0.41  0.31  3026.82
SEA-LIONv3-70B-Instruct   0.40       0.45  0.45  0.48  0.40  0.41  0.32  0.28  0.32  2340.65
Gemma-2-27B-Instruct      0.40       0.43  0.40  0.46  0.40  0.39  0.34  0.38  0.31  2288.33
Qwen2.5-32B-Instruct      0.32       0.37  0.42  0.42  0.26  0.38  0.24  0.19  0.16  2090.61
Gemma-2-9B-Instruct       0.31       0.36  0.40  0.39  0.30  0.38  0.19  0.19  0.19  2163.03
Qwen2.5-14B-Instruct      0.30       0.40  0.40  0.23  0.35  0.20  0.21  0.12  0.30  2267.94
Llama-3.1-70B-Instruct    0.30       0.33  0.37  0.37  0.28  0.35  0.18  0.15  0.19  2543.06
SEA-LIONv3-8B-Instruct    0.30       0.38  0.40  0.38  0.34  0.35  0.12  0.08  0.14  2357.14
Aya-Expanse-32B           0.29       0.25  0.45  0.46  0.27  0.35  0.06  0.12  0.13  2495.47
Qwen2-72B-Instruct        0.26       0.26  0.30  0.33  0.29  0.32  0.20  0.20  0.16  1546.21
Qwen2.5-7B-Instruct       0.25       0.30  0.35  0.36  0.12  0.29  0.09  0.09  0.08  2415.08
SEA-LIONv2.1-8B-Instruct  0.23       0.30  0.33  0.31  0.16  0.28  0.07  0.08  0.10  1735.26
SeaLLMs-v3-7B-Chat        0.21       0.23  0.22  0.21  0.19  0.16  0.15  0.16  0.09  2298.47
Llama-3.1-8B-Instruct     0.19       0.19  0.26  0.21  0.15  0.18  0.06  0.07  0.07  2356.67
SeaLLM-7B-v2              0.18       0.18  0.18  0.19  0.09  0.13  0.10  0.12  0.09  2298.15
SeaLLM-7B-v2.5            0.17       0.18  0.19  0.18  0.10  0.14  0.08  0.11  0.06  2184.55
Qwen2.5-3B-Instruct       0.16       0.14  0.21  0.18  0.08  0.16  0.06  0.06  0.04  2324.08
Sailor-14B-Chat           0.16       0.11  0.17  0.14  0.04  0.14  0.02  0.12  0.06  2465.85
SeaLLM-7B-v1              0.12       0.05  0.07  0.07  0.04  0.05  0.10  0.11  0.09  2585.40
Mistral-7B-Instruct-v0.3  0.10       0.07  0.11  0.07  0.08  0.11  0.02  0.03  0.02  2336.51
Sailor-7B-Chat            0.09       0.04  0.07  0.05  0.02  0.06  0.02  0.07  0.03  1404.60
Llama-2-70B-Chat          0.08       0.02  0.05  0.11  0.06  0.13  0.03  0.01  0.03  2354.30
Llama-2-13B-Chat          0.06       0.01  0.05  0.08  0.02  0.03  0.01  0.01  0.03  2317.36
Llama-2-7B-Chat           0.05       0.01  0.02  0.05  0.04  0.03  0.01  0.02  0.04  2330.50
Chunk 39 · 1,998 chars
8 0.11 0.02 0.03 0.02 2336.51
Sailor-7B-Chat            0.09       0.04  0.07  0.05  0.02  0.06  0.02  0.07  0.03  1404.60
Llama-2-70B-Chat          0.08       0.02  0.05  0.11  0.06  0.13  0.03  0.01  0.03  2354.30
Llama-2-13B-Chat          0.06       0.01  0.05  0.08  0.02  0.03  0.01  0.01  0.03  2317.36
Llama-2-7B-Chat           0.05       0.01  0.02  0.05  0.04  0.03  0.01  0.02  0.04  2330.50
Table 16: Perplexity Comparison under Continual Pre-training: Qwen1.5 → Sailor and Qwen2.5 → Sailor2. Lower perplexity means better performance. The evaluation dataset for each language is composed of samples from various domains. The valid dataset refers to the evaluation data collected from all languages across these diverse domains. Arrows give the change relative to the corresponding base model.
Model         eng            zho            tha            vie              ind            mya            valid
Qwen1.5-14B   13.67          12.30          18.06          42.88            14.15          13.65          21.71
Sailor-14B    15.02 ↑1.35    12.47 ↑0.17    6.19 ↓11.87    11.79 ↓31.09     7.21 ↓6.94     11.02 ↓2.63    9.65 ↓12.06
Qwen2.5-14B   11.40          9.91           6.36           9.60             8.46           7.07           9.10
Sailor2-20B   11.48 ↑0.08    9.27 ↓0.64     3.35 ↓3.01     5.24 ↓4.36       4.93 ↓3.53     2.41 ↓4.66     5.69 ↓3.41
Qwen1.5-7B    14.55          13.18          21.35          47.09            16.14          11.99          27.07
Sailor-7B     15.75 ↑1.20    13.57 ↑0.39    6.21 ↓15.14    11.05 ↓36.04     7.32 ↓8.82     12.25 ↑0.26    10.15 ↓16.92
Qwen2.5-7B    13.17          11.61          7.24           12.84            10.07          8.69           10.73
Sailor2-8B    13.63 ↑0.46    10.94 ↓0.67    3.51 ↓3.73     5.49 ↓7.35       5.09 ↓4.98     2.74 ↓5.95     6.26 ↓4.47
Qwen1.5-0.5B  23.78          21.48          58.8           200.65           65.44          16.07          70.41
Sailor-0.5B   25.88 ↑2.10    25.80 ↑4.32    8.72 ↓50.08    16.76 ↓183.89    10.98 ↓54.46   17.69 ↑1.62    17.26 ↓53.15
Qwen2.5-0.5B  22.28          21.25          13.93          26.86            25.57          13.67          22.83
Sailor2-1B    21.01 ↓1.27    17.28 ↓3.97    4.52 ↓9.41     7.52 ↓19.34      6.75 ↓18.82    4.11 ↓9.56     9.36 ↓13.47
8 Analysis
This section presents our insights on building multilingual LLMs during both the continual pre-training and post-training stages. We also examine Sailor2’s capabilities in translation and cultural understanding, two core components of practical multilingual applications.
8.1 Effect of Model Expansion
For both Sailor and Sailor2, we adopt a continual pre-training (CPT) approach to efficiently develop multilingual
Chunk 40 · 1,996 chars
raining and post-training stages. We also examine Sailor2’s capabilities in translation and cultural understanding, two core components of practical multilingual applications.
8.1 Effect of Model Expansion
For both Sailor and Sailor2, we adopt a continual pre-training (CPT) approach to efficiently develop multilingual LLMs by reusing computational resources. Unlike Sailor, Sailor2 incorporates model expansion, which creates additional capacity for learning new knowledge from the multilingual corpus.
We analyze the perplexity shift during CPT with and without model expansion, with detailed results shown in Table 16. Experimental findings reveal that, compared to the Qwen1.5 → Sailor transition, Qwen2.5 → Sailor2 exhibits less degradation in English/Chinese and greater improvements in the target SEA languages during continual pre-training. Notably, even though Qwen2.5 is trained on 18T tokens versus Qwen1.5’s 4T tokens, Sailor2 still achieves significant gains in SEA languages with only minor degradation in English.
We conduct a comprehensive examination across multiple languages in Figure 11, which illustrates the PPL distribution shift for English, Chinese, and sixteen SEA languages. The results demonstrate that Sailor2 maintains its performance in English and Chinese while achieving significantly lower perplexity in SEA languages. Further discussions on future work in efficient CPT can be found in Section 9.3.
8.2 Effect of Continual Pre-training
Qwen2.5 models have already been trained on 18T tokens, meaning that many SEA tokens were likely seen during the pre-training stage. This raises the question of whether the expensive continual pre-training stage using an additional 400B SEA tokens is still necessary. To investigate, we conducted an ablation study with the following setup: post-training on both the vanilla Qwen2.5-7B base model and the Sailor2-8B base model, with the same post-training dataset and training steps. Detailed results are listed in Table
Chunk 41 · 1,993 chars
ning stage using an additional 400B SEA tokens is still necessary. To investigate, we conducted an ablation study with the following setup: post-training on both the vanilla Qwen2.5-7B base model and the Sailor2-8B base model, with the same post-training dataset and training steps. Detailed results are listed in Table 17. We observe that CPT is essential, especially for low-resource languages like khm and lao.
Table 17: Language-wise Score on SEA-WildBench between chat models trained using the Qwen and Sailor2 models. Note that Qwen2.5-7B-Chat is trained using the Sailor2 post-training pipeline.
Model                   SWB Score  tha   vie   ind   tgl   zsm   khm   lao   mya
Sailor2-8B-Chat         0.43       0.44  0.40  0.40  0.39  0.36  0.43  0.56  0.44
Qwen2.5-7B-Chat (ours)  0.25       0.31  0.34  0.33  0.21  0.31  0.11  0.21  0.08
8.3 Key Findings in Preference Data Construction
We conduct a series of ablation studies to assess the impact of design choices in the preference data construction pipeline, as described in Sec. 5.2.2.
Figure 7: Length-controlled win rates comparison after different preference tuning stages on Sea-WB. [Bar chart; y-axis: Length-Controlled Win Rate (%) on Sea-WB; bars: SFT 18, DPO w. Off-Policy 50, DPO w. On-Policy 56.]
Figure 8: Effect of the language consistency verifier on the model performance. The results show a consistent performance gain across various experiment settings. [Bar chart; y-axis: Length-Controlled Win Rate (%) on Sea-WB; groups: 8B-beta=0.1, 8B-beta=0.3, 20B-beta=0.1; series: w/o lan consist., w/ lan consist.; bar values as extracted: 48, 47, 55, 49, 48, 56.]
First Off-Policy, Then On-Policy Training Improves DPO Performance. While prior studies (Guo et al., 2024b; Lambert et al., 2024) have shown that on-policy training in DPO leads to greater performance improvements than off-policy training, our preliminary findings suggest that directly applying on-policy training to an SFT model yields limited gains. We hypothesize that this is due to the SFT model’s insufficient ability to
Chunk 42 · 1,988 chars
Lambert et al., 2024) have shown that on-policy training in
DPO leads to greater performance improvements than off-policy training, our preliminary
findings suggest that directly applying on-policy training to an SFT model yields limited
gains. We hypothesize that this is due to the SFT model’s insufficient ability to generate
high-quality responses in chat tasks. To address this, we first perform off-policy training
to initialize a stronger policy model before transitioning to on-policy training. In other
words, our training pipeline consists of SFT, followed by off-policy DPO training, and
then on-policy DPO training. As shown in Fig. 7, off-policy training on Sailor2-20B-SFT
significantly enhances model performance, while subsequent on-policy training provides
further improvements.
Language Consistency Verifier Improves Downstream DPO Performance. Due to the
lack of a reward model for SEA languages, we use Skywork-Reward-Gemma-2-27B (Liu
et al., 2024b) which is trained primarily on English data. To mitigate the RM’s limitations
in evaluating SEA language responses, we introduce a language consistency verifier. The
verifier 16 labels a response as true if its language matches the prompt; otherwise, it is
labeled false. If at least two responses are language-consistent, we select the winning
and losing responses based on the RM’s reward scores. If only one response is language-
consistent, it is chosen as the winning response. If none are language-consistent, the prompt
is discarded.
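The selection rule described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Candidate` type, its field names, and the `select_pair` function are hypothetical, and two details the text leaves open are filled in as labeled assumptions (the highest- and lowest-reward consistent responses form the pair, and in the single-consistent case the rejected response is the lowest-reward inconsistent one).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """One sampled response (hypothetical container, for illustration only)."""
    text: str
    lang: str      # language code predicted by a language-ID model
    reward: float  # scalar score assigned by the reward model


def select_pair(prompt_lang: str,
                candidates: List[Candidate]) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) for one prompt, or None if the prompt is discarded."""
    consistent = [c for c in candidates if c.lang == prompt_lang]
    inconsistent = [c for c in candidates if c.lang != prompt_lang]

    if len(consistent) >= 2:
        # Assumption: rank only the language-consistent responses by reward
        # and pair the best with the worst.
        ranked = sorted(consistent, key=lambda c: c.reward)
        return ranked[-1], ranked[0]

    if len(consistent) == 1 and inconsistent:
        # The single consistent response wins regardless of its reward.
        # Assumption: the rejected side is the lowest-reward inconsistent response.
        return consistent[0], min(inconsistent, key=lambda c: c.reward)

    # No consistent response (or nothing to pair it with): drop the prompt.
    return None
```

Note that in the single-consistent case the reward model is effectively overridden: a wrong-language response can have a higher reward score and still be treated as the losing response.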
16We use facebook/fasttext-language-identification for language verification.
We evaluate the verifier’s effectiveness under three settings: for the 8B model, we conduct
on-policy training with β ∈ {0.1, 0.3}; for the 20B model, we use β = 0.1. As shown in Fig. 8,
incorporating the verifier consistently improves performance compared to training without
it. Moving forward, further improvements in preference tuning for low-resource
Chunk 43 · 1,992 chars
ness under three settings: for the 8B model, we conduct
on-policy training with β ∈ {0.1, 0.3}; for the 20B model, we use β = 0.1. As shown in Fig. 8,
incorporating the verifier consistently improves performance compared to training without
it. Moving forward, further improvements in preference tuning for low-resource languages
could be explored, such as training an RM specifically for SEA languages and leveraging
self-alignment techniques (Chen et al., 2024; 2025; Kim et al., 2025).
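The β values above are the coefficient of the DPO objective, which scales the implicit reward margin between the chosen and rejected responses. As a reminder of what that coefficient controls, here is a minimal sketch of the standard per-pair DPO loss in plain Python. The function name and the scalar log-probability inputs are illustrative assumptions; this is not the training code used for Sailor2.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model (hypothetical scalar inputs).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable evaluation of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

For a fixed positive margin, a larger β gives a smaller loss, so β sets how strongly a given preference margin over the reference model is rewarded.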
Table 18: Comparison of on-policy and distillation methods on Sailor2-8B after off-policy
training. LC is short for the length-controlled win rate on Sea-WB.
Method        LC    Avg. Length
On-policy     0.49  2849
Distillation  0.48  2752
Distillation Reduces the Response Length while Maintaining Comparable Model Performance.
We investigate the effectiveness of distillation within the same model family during
DPO training given two practical considerations: (1) Distillation leverages the high-quality
on-policy data used for training Sailor2-20B-Chat; (2) Reusing Sailor2-20B’s on-policy data
significantly reduces the computational cost of data generation and reward evaluation. To
assess its impact, we conduct a controlled experiment on our 8B model. Specifically, we
conduct DPO with its own on-policy data and with the on-policy data from Sailor2-20B.
Results in Table 18 indicate that distillation maintains comparable model performance
while reducing response length.
8.4 Cross-lingual Translation Ability of Sailor2
We analyze the performance of Sailor2-20B on Flores Plus (NLLB Team, 2022), a translation
dataset covering over 200 languages. Since our focus is primarily on SEA languages, we
limit the scope to a subset of the dataset containing SEA languages, Chinese and English.
Table 19 compares Sailor2 and three baseline models on English-centric translation pairs.
Table 20 (and Tables 23, 24, 25 and 26) shows the performance of Sailor2-20B (and other baselines)
between all language pairs in Flores
Chunk 44 · 1,996 chars
ages, we limit the scope to a subset of the dataset containing SEA languages, Chinese and English. Table 19 compares Sailor2 and three baseline models on English-centric translation pairs. Table 20 (and Tables 23, 24, 25 and 26) shows the performance of Sailor2-20B (and other baselines) between all language pairs in Flores Plus17. We provide a visual comparison of Sailor2-20B against other baselines in Figures 13, 14, 15 and 16.
Sailor2: Excelling in Low-Resource Translation. Despite having significantly fewer parameters, Sailor2-20B demonstrates remarkable capabilities in low-resource language translation. The superior performance extends to approximately 80% of low-resource language scenarios. As shown in Table 19, compared to Qwen2.5-32B and Qwen2.5-72B, Sailor2 consistently achieves higher chrF++ scores in low-resource language pairs, though it shows slightly lower performance in English or Chinese translation. When compared with Llama3.1-70B, Sailor2-20B exhibits particular strengths in challenging low-resource scenarios, demonstrating significant advantages in Lao translation (7.3/37.6 vs 3.7/22.7), as well as when Burmese or Khmer is the target language. The performance gap becomes even more pronounced when examining bidirectional translation capabilities: Sailor2-20B maintains relatively balanced performance in both directions for low-resource languages, while other models show significant degradation when translating into low-resource languages. We also report the win rate of each model for each language pair in Figure 12. The win rate is defined as the percentage of times a model’s output achieves the top-1 chrF++ score.
Translation Patterns and Language Effects. Besides the performance of Sailor2, analysis of Table 20 also reveals several important patterns in translation behavior:
(1) English demonstrates consistent superior performance across all language pairs, achieving higher chrF++ scores both as source and target language. This is clearly visible
Chunk 45 · 1,997 chars
and Language Effects. Besides the performance of Sailor2, analysis of Table 20 also reveals several important patterns in translation behavior:
(1) English demonstrates consistent superior performance across all language pairs, achieving higher chrF++ scores both as source and target language. This is clearly visible in Table 20, where English-sourced translations consistently achieve scores above 50 chrF++ points for most target languages, significantly higher than other source languages. The performance advantage is particularly pronounced in the XX → English direction compared to English → XX translations, as evidenced by the consistently higher scores in both tables.
17The prediction results of Sailor2 and baseline models could be found in https://huggingface.co/datasets/sailor2/Flores-Plus-Evaluation-Log-Preview-Cleaned.
(2) We observe that translation quality appears more dependent on the target language than the source language, as shown in Table 20 where vertical columns (representing target languages) display more consistent score ranges compared to horizontal rows (source languages). For instance, translations into Vietnamese (vie) consistently fall within the 45-55 chrF++ range regardless of source language, while Vietnamese as a source language shows more variable performance depending on the target.
(3) Languages within the same family or system exhibit notably higher translation performance, as demonstrated in Table 20 by the Cebuano-Tagalog pair (ceb-fil: 55.5/55.3). This pattern suggests that linguistic similarity plays a crucial role in translation quality, potentially through the model’s internal representation of language families. The table also reveals that geographically and culturally proximate languages, such as the Indonesian-Malay pair (ind-zsm: 60.4/59.3), tend to achieve better bilateral translation performance compared to more distant language pairs.
Table 19: Performance on Flores Plus for
Chunk 46 · 1,998 chars
l’s internal representation of language families. The table also reveals that geographically and culturally proximate languages, such as the Indonesian-Malay pair (ind-zsm: 60.4/59.3), tend to achieve better bilateral translation performance compared to more distant language pairs.
Table 19: Performance on Flores Plus for English-Centric Language Pairs (BLEU/chrF++)
(a) English → XX
Language  Sailor2-20B   Llama3.1-70B  Qwen2.5-72B   Qwen2.5-32B
ace       4.8/26.1      6.3/30.3      4.1/24.8      4.6/24.0
ceb       30.9/57.7     28.1/54.5     18.0/44.4     13.6/39.4
cmn       11.3/32.9     12.4/32.3     12.5/34.3     13.0/34.0
fil       33.8/59.2     32.5/57.8     26.7/53.1     23.5/49.5
ilo       18.0/46.0     17.5/46.2     7.7/31.8      5.3/27.1
ind       43.6/67.7     42.8/67.3     41.9/66.5     38.7/64.0
jav       24.3/52.1     22.9/51.1     11.0/36.8     7.6/31.8
khm       5.4/29.2      3.3/21.5      3.0/21.9      3.6/20.6
lao       7.3/37.6      3.7/22.7      4.0/24.8      3.4/21.4
min       12.7/39.7     22.2/50.8     9.9/35.9      9.0/33.7
mya       3.0/28.1      2.0/21.3      2.1/19.9      1.8/17.3
sun       18.0/47.7     16.4/45.3     10.5/37.2     7.9/33.1
tam       10.4/41.1     9.3/37.9      5.1/31.2      4.1/28.3
tha       12.8/48.5     15.2/47.9     12.2/47.9     12.2/46.1
vie       42.6/60.7     43.0/61.1     42.5/60.8     40.6/59.2
war       25.6/53.3     25.6/52.8     16.4/43.7     11.9/38.4
zsm       41.2/66.8     40.2/66.2     35.5/62.4     32.0/59.5
(b) XX → English
Language  Sailor2-20B   Llama3.1-70B  Qwen2.5-72B   Qwen2.5-32B
ace       21.9/42.9     24.5/45.2     19.3/40.2     15.8/37.8
ceb       46.7/67.3     44.8/65.2     39.0/60.1     34.0/56.0
cmn       32.4/57.8     34.2/59.0     35.1/59.9     33.6/59.0
fil       49.8/69.5     50.5/69.5     49.5/69.0     45.4/65.8
ilo       36.0/58.1     33.3/55.0     22.7/44.7     21.1/43.1
ind       45.2/67.0     47.1/68.5     46.9/68.7     45.1/67.5
jav       40.0/62.0     42.0/63.1     33.6/55.6     29.0/51.9
khm       36.2/59.6     32.0/55.4     28.7/52.3     25.2/49.8
lao       37.6/60.6     26.3/49.0     32.8/55.0     27.1/50.1
min       35.5/57.8     39.6/61.3     31.5/53.2     28.2/50.7
mya       27.2/52.0     26.7/51.0     19.6/43.1     16.2/39.6
sun       38.1/60.8     38.4/60.5     32.1/54.8     28.6/51.7
tam       32.0/56.5     34.3/58.2     26.6/50.4     24.0/47.9
tha       36.3/60.6     37.1/61.2     38.7/62.3     35.8/60.6
vie       39.1/62.0     40.7/63.0     41.8/63.9     39.4/62.5
war       47.5/67.7     46.6/66.4     38.5/59.6     35.1/56.8
zsm       46.6/68.1     48.0/69.4     47.2/68.8     45.1/67.2
Chunk 47 · 1,992 chars
ya 27.2/52.0 26.7/51.0 19.6/43.1 16.2/39.6
sun       38.1/60.8     38.4/60.5     32.1/54.8     28.6/51.7
tam       32.0/56.5     34.3/58.2     26.6/50.4     24.0/47.9
tha       36.3/60.6     37.1/61.2     38.7/62.3     35.8/60.6
vie       39.1/62.0     40.7/63.0     41.8/63.9     39.4/62.5
war       47.5/67.7     46.6/66.4     38.5/59.6     35.1/56.8
zsm       46.6/68.1     48.0/69.4     47.2/68.8     45.1/67.2
Table 20: Performance on Flores Plus (chrF++) for Sailor2-20B. Rows are source languages (↓); columns are target languages (→).
Source (↓)        eng   cmn   ind   tha   vie   zsm   mya   lao   ceb   ilo   jav   khm   sun   fil   war   tam
English (eng)     -     32.9  67.7  48.5  60.7  66.8  28.1  37.6  57.7  46.0  52.1  29.2  47.7  59.2  53.3  41.1
Chinese (cmn)     57.8  -     53.6  41.8  52.7  53.0  24.6  30.8  47.7  39.0  42.6  25.1  39.4  48.1  42.8  34.5
Indonesian (ind)  67.0  28.1  -     45.4  55.8  59.3  26.1  34.5  51.3  42.3  49.2  26.6  46.0  54.1  46.6  37.1
Thai (tha)        60.6  27.5  56.7  -     53.5  56.0  25.5  33.9  49.0  41.0  44.7  26.1  41.3  50.8  44.4  35.4
Vietnamese (vie)  62.0  28.1  57.5  44.4  -     56.4  25.6  33.5  49.6  41.3  45.3  26.1  41.6  51.4  45.1  36.1
Malay (zsm)       68.1  28.7  60.4  44.9  55.6  -     26.4  35.3  51.5  42.5  49.4  26.6  44.6  54.1  47.3  37.4
Burmese (mya)     52.0  22.9  49.3  38.3  46.9  49.4  -     29.8  44.8  37.0  39.9  23.2  36.7  46.5  41.9  33.3
Lao (lao)         60.6  25.8  55.9  45.0  52.3  55.6  25.4  -     48.8  40.0  45.0  27.2  41.2  51.1  44.1  35.4
Cebuano (ceb)     67.3  28.6  57.7  43.6  53.5  57.3  26.5  33.3  -     44.5  45.8  26.4  41.5  55.5  51.7  36.9
Ilocano (ilo)     58.1  24.7  51.7  38.8  48.0  51.6  24.4  29.6  49.5  -     37.9  24.0  36.6  51.4  45.2  33.5
Javanese (jav)    62.0  26.4  58.7  41.6  51.3  56.4  25.2  32.5  48.3  38.2  -     25.9  42.4  51.8  42.6  35.1
Khmer (khm)       59.6  26.4  55.5  42.9  51.5  54.8  24.8  33.5  48.8  40.1  44.8  -     41.5  50.5  44.2  34.5
Sundanese (sun)   60.8  26.0  59.0  41.9  51.5  56.3  25.4  32.3  47.5  33.3  46.8  26.3  -     51.1  38.3  35.1
Tagalog (fil)     69.5  29.2  60.1  44.5  55.0  59.5  26.4  33.3  55.3  46.2  47.3  26.6  42.4  -     50.3  37.5
Waray (war)       67.7  27.2  57.3  43.6  53.0  57.4  26.3  33.2  56.8  45.8  44.4  25.5  39.9  55.9  -     36.8
Tamil (tam)       56.5  24.2  51.3  39.2  48.4  51.5  24.6  28.8  46.7  38.5  41.5  23.7  38.0  47.6  42.9  -
8.5 SEA Culture Understanding
Chunk 48 · 1,996 chars
- 51.1 38.3 35.1
Tagalog (fil) 69.5 29.2 60.1 44.5 55.0 59.5 26.4 33.3 55.3 46.2 47.3 26.6 42.4 - 50.3 37.5
Waray (war) 67.7 27.2 57.3 43.6 53.0 57.4 26.3 33.2 56.8 45.8 44.4 25.5 39.9 55.9 - 36.8
Tamil (tam) 56.5 24.2 51.3 39.2 48.4 51.5 24.6 28.8 46.7 38.5 41.5 23.7 38.0 47.6 42.9 -

8.5 SEA Culture Understanding Ability of Sailor2

Model              BLEnD (3-shot EM)  CulturalBench (3-shot Acc)
Sailor2-20B        53.12              71.90
Qwen2.5-72B        51.79              78.21
Gemma2-27B         51.20              69.52
Sailor2-8B         50.04              60.16
Gemma2-9B          46.54              63.55
Qwen2.5-14B        41.65              67.58
Qwen2.5-32B        41.40              69.18
Sailor-7B          38.51              59.23
Gemma-7B           36.36              53.34
Mistral-7B         34.74              56.67
Sailor2-1B         34.28              38.60
Qwen2.5-7B         34.13              60.85
SeaLLM-7B-Hybrid   31.38              50.27
Sealion-7B         22.66              35.32
SeaLLMs-v3-7B      19.85              53.19
Qwen2.5-0.5B       13.83              42.23

Figure 9: Performance comparison across models on BLEnD and CulturalBench benchmarks. (a) Results on BLEnD benchmark. (b) Results on CulturalBench benchmark.

Cultural understanding significantly influences the practical application and interaction quality of multilingual LLMs. To assess the cultural understanding capabilities of the Sailor2 series models in Southeast Asian contexts, we employ CulturalBench (Chiu et al., 2024), BLEnD (Myung et al., 2024), and Global-MMLU (Singh et al., 2024a) as evaluation benchmarks, covering a total of seven languages. Specifically, CulturalBench consists of single-choice and judgment questions in Filipino, Indonesian, Malaysian, Singaporean English, Thai, and Vietnamese. BLEnD includes question-and-answer tasks in Indonesian and West Java languages, while Global-MMLU comprises single-choice questions in Filipino, Indonesian, and Vietnamese. Notably, BLEnD and Global-MMLU are
Chunk 49 · 1,995 chars
Indonesian, Malaysian, Singaporean English, Thai, and Vietnamese. BLEnD includes question-and-answer tasks in Indonesian and West Java languages, while Global-MMLU comprises single-choice questions in Filipino, Indonesian, and Vietnamese. Notably, BLEnD and Global-MMLU are multilingual evaluation datasets. To measure cultural understanding more precisely, we also translate CulturalBench into a multilingual version using Google Translate (from English to local languages). All evaluations use 3-shot prompting. The experimental results are presented in Tables 27, 28, and 29.

We summarize our results in Figure 9 and find that, among models of similar size, Sailor2 has a better understanding of SEA culture, including its cuisine, traditions, and geography. In Figure 10, we present sample responses of the Sailor2-20B model on CulturalBench.

Paano madalas na bumabati ang mga tao sa Pilipinas nang hindi gumagamit ng mga salita?
(English Question: How do people in the Philippines often greet each other without using words?)
A. Sa pagtaas ng kilay nila
B. Sa pamamagitan ng pagbibigay ng mahigpit na pagkakamay
C. Sa pagtango ng kanilang ulo
D. Sa pamamagitan ng pagbibigay ng high five.
Prediction: Golden: C; Qwen2.5-32B: B; Sailor2-20B: C

Masakan Indonesia populer apa yang menggunakan daging yang ditusuk dan dipanggang, disajikan dengan saus kental, pedas, dan berbahan dasar kunyit?
(English Question: What popular Indonesian dish uses skewered and grilled meat, served with a thick, spicy, and turmeric-based sauce?)
A. Sate Padang.
B. Sate Madura.
C. Steak Wagyu.
D. Iga BBQ.
Prediction: Golden: A; Qwen2.5-72B: B; Sailor2-20B: A

Figure 10: Sample responses from CulturalBench.

Indonesian Culture Understanding
Indonesian culture is uniquely rich and diverse, shaped by various ethnicities, languages, and historical influences. Evaluating this separately helps us assess the
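The 3-shot multiple-choice setup described above can be illustrated with a minimal sketch. The report does not specify its prompt template, so the "Question / options / Answer" layout below is an assumption, and the names `MCQ`, `format_item`, `build_few_shot_prompt`, and `accuracy` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MCQ:
    """One multiple-choice item (hypothetical container, not from the report)."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter


def format_item(item: MCQ, with_answer: bool) -> str:
    """Render one item; the layout is an assumed template."""
    lines = [f"Question: {item.question}"]
    lines += [f"{letter}. {text}" for letter, text in item.options.items()]
    lines.append(f"Answer: {item.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(shots: List[MCQ], target: MCQ) -> str:
    """Concatenate solved demonstrations, then the unsolved target item."""
    blocks = [format_item(shot, with_answer=True) for shot in shots]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: List[str], golds: List[str]) -> float:
    """Percentage of predicted letters that match the gold letters."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)
```

With three demonstrations passed as `shots`, the model is asked to continue the final "Answer:" line, and the returned letter is compared against the gold label, as in the "Golden" versus model rows of Figure 10.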
Chunk 50 · 1,995 chars
Wagyu.
D. Iga BBQ.
Prediction: Golden: A; Qwen2.5-72B: B; Sailor2-20B: A

Figure 10: Sample responses from CulturalBench.

Indonesian Culture Understanding
Indonesian culture is uniquely rich and diverse, shaped by various ethnicities, languages, and historical influences. Evaluating this separately helps us assess the model's ability to capture these distinct cultural nuances. We adopt two types of benchmarks for evaluation: (1) General Knowledge, Local Knowledge & Reasoning: IndoMMLU (Koto et al., 2023), IndoCareer (Koto, 2024); (2) Cultural Reasoning: IndoCulture (Koto et al., 2024), MAPS (Liu et al., 2024a), COPAL-ID (Wibowo et al., 2023), IndoCloze (Koto et al., 2022). As listed in Table 21, Sailor2-20B and Sailor2-8B achieve the two highest average scores (82.8 and 78.9) on Indonesian culture and knowledge.

Table 21: Evaluation Results on Indonesian Culture (3-shot, Accuracy).

Model             IndoMMLU  IndoCareer  IndoCulture  MAPS  COPAL-ID  IndoCloze  Avg
Sailor2-20B       70.7      69.0        79.3         92.9  86.9      98.1       82.8
Sailor2-8B        64.5      60.9        74.7         90.8  86.5      96.5       78.9
Aya-Expanse-32b   59.3      61.5        71.4         91.4  84.6      97.0       77.5
Gemma-2-27b       57.1      61.6        70.2         92.0  81.8      96.5       76.5
Qwen2.5-32B       61.9      64.3        66.8         89.9  76.6      99.6       76.5
SEA-LIONv3-9B     60.0      58.9        65.8         90.2  80.0      96.7       75.2
Gemma-2-9B        57.9      57.8        67.9         89.9  79.3      95.8       74.7
Qwen2.5-7B        53.1      43.0        61.1         87.6  73.0      94.2       68.6
Qwen2.5-14B       52.6      64.1        50.7         76.0  62.8      99.8       67.6
Llama-3.1-8B      47.5      61.5        54.5         77.8  70.0      92.5       67.3
SeaLLMs-v3-7B     47.6      32.8        59.9         87.9  71.2      95.9       65.8
Llama-3-8B        48.8      39.6        55.0         83.1  66.8      87.2       63.4

9 Conclusion and Future Work
This report introduces the Sailor2 family of open models, designed to facilitate the development of large language models for Southeast Asian languages. We also summarize the key insights from our pipeline for building the Sailor2 model, covering data curation, continual pre-training, post-training, evaluation and advanced model customization. We hope this report will inspire the community to develop more
Chunk 51 · 1,989 chars
ment of large language models for Southeast Asian languages. We also summarize the key insights from our pipeline for building the Sailor2 model, covering data curation, continual pre-training, post-training, evaluation and advanced model customization. We hope this report will inspire the community to develop more inclusive and robust multilingual language models for underserved languages. Looking ahead, we plan to expand our multilingual research to include a broader range of low-resource languages and explore more efficient model training approaches. The following sections detail our motivations and review the most relevant works in data curation, model design, and model training.

9.1 Synthetic Data Curation for Supporting Low-resource Languages
Apart from a few high-resource languages, most languages have a relatively scarce supply of training tokens. For example, in Sailor2, six languages contain fewer than 1 billion training tokens (see Table 3 for detailed statistics). For extremely low-resource languages like Minangkabau (spoken in Indonesia by approximately 6.5 million people) and Acehnese (spoken in Indonesia by around 3.5 million people), we were only able to mine fewer than one million tokens¹⁸. One effective way to address this issue is to leverage translated synthetic data. For example, Wang et al. (2024c) translate high-quality documents (e.g., Fineweb-Edu (Lozhkov et al., 2024)) from English into medium-resource languages such as French, German, and Spanish. Similarly, Doshi et al. (2024) adopt the Translationese dataset to extend coverage to additional low-resource languages, including Hindi, Gujarati, and Marathi.

9.2 Tokenizer-Free Model for Open-Vocabulary Learning
Recent studies demonstrate that tokenizer-free language models can effectively process unseen languages and exhibit greater robustness against noise attacks compared to tokenizer-based models. One approach involves pixel-based language models (Lotz et al., 2023; Rust et al.,
Chunk 52 · 1,999 chars
er-Free Model for Open-Vocabulary Learning
Recent studies demonstrate that tokenizer-free language models can effectively process unseen languages and exhibit greater robustness against noise attacks compared to tokenizer-based models. One approach involves pixel-based language models (Lotz et al., 2023; Rust et al., 2023), which treat text as images. This enables them to learn any script and achieve open-vocabulary language learning by exploiting visual similarities among characters and scripts through parameter sharing. In contrast, byte-level language models (Zheng et al., 2025; Kallini et al., 2024; Xue et al., 2022) bypass the tokenization step entirely by directly processing the raw character or byte stream as input. These approaches offer significant benefits for both morphologically rich languages and languages that mix multiple scripts. For example, languages such as Turkish, Finnish, and Hungarian are known for their complex morphological structures, while Japanese (which combines Kanji, Hiragana, and Katakana) and Hindi (which often integrates Devanagari and Latin scripts) frequently mix scripts in everyday usage.

9.3 Efficient Continual Pre-training for Multilingual Model
Continual pre-training is more efficient and cost-effective than training from scratch when building new multilingual language models for target languages. Leveraging an existing base model preserves its inherent capabilities while reducing the compute required. For instance, developers can use Sailor2 as a foundation to build more powerful models for Southeast Asian languages with their in-house data. Moreover, both the development infrastructure (e.g., Sailor2 open-sources every detail) and the model tokenizer (the most critical component for multilingual models, as verified by Tao et al. (2024)) are mature.

¹⁸ See the full list of undetermined (und) data in https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 for more low-resource languages.
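As a small illustration of the byte-level idea from Section 9.2, the sketch below maps text in any script onto a fixed vocabulary of 256 byte IDs through UTF-8, so no input is ever out of vocabulary. It shows only the input encoding step, not any of the cited models, and the names `bytes_to_ids` and `ids_to_text` are hypothetical.

```python
from typing import List


def bytes_to_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and use each byte value (0-255) as an input ID."""
    return list(text.encode("utf-8"))


def ids_to_text(ids: List[int]) -> str:
    """Invert bytes_to_ids; raises UnicodeDecodeError on an invalid byte sequence."""
    return bytes(ids).decode("utf-8")


# Illustrative inputs in Latin, Thai, and Burmese scripts.
samples = {"English": "hello", "Thai": "สวัสดี", "Burmese": "မင်္ဂလာပါ"}

for language, text in samples.items():
    ids = bytes_to_ids(text)
    # Every ID falls inside the fixed 256-entry vocabulary, whatever the script.
    assert all(0 <= i < 256 for i in ids)
    # The mapping is lossless.
    assert ids_to_text(ids) == text
    print(f"{language}: {len(text)} code points -> {len(ids)} byte IDs")
```

The cost of this design is sequence length: characters outside the ASCII range take several bytes each in UTF-8, so byte-level inputs for SEA scripts are longer than their character counts suggest.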
Chunk 53 · 1,993 chars
r2 open-sources every detail) and the model tokenizer (the most critical component for multilingual models, as verified by Tao et al. (2024)) are mature.

¹⁸ See the full list of undetermined (und) data in https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 for more low-resource languages.

However, our concern is that many existing open models might be over-trained, leaving little room for further fine-tuning. For example, Llama3.1 was trained on 15T tokens and Qwen2.5 on 18T tokens. Although Sailor2 employs model expansion to mitigate this issue, its approach remains inefficient due to the increased computational cost associated with a larger model size.

To achieve more efficient continual pre-training, we propose exploring the following directions: (1) Employing an over-training indicator (Ouyang et al., 2024) to guide the selection of an appropriate base model. (2) Enhancing model plasticity (Chen et al., 2023) to enable the model to absorb additional multilingual knowledge. (3) Leveraging insights from the lottery ticket hypothesis to update only the most essential parameters (Yuan et al., 2024).

Contributors and Acknowledgements

Core Contributors
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung

Contributors
Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen

Advisors
Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin

Acknowledgements
• Model Training: Qwen-Team, MegatronLLM
• Model Evaluation: Finetasks, SeaCrowd
• Model Deployment: AK(@_akhaliq),
Chunk 54 · 1,990 chars
hon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen

Advisors
Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin

Acknowledgements
• Model Training: Qwen-Team, MegatronLLM
• Model Evaluation: Finetasks, SeaCrowd
• Model Deployment: AK(@_akhaliq), Ollama
• Sea SRE Team: Zhikai Huang

References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Stephanie Chan, Ankesh Anand, Zaheer Abbas, Azade Nova, John D Co-Reyes, Eric Chu, et al. Many-shot in-context learning. arXiv preprint arXiv:2404.11018, 2024.
AI Singapore. Sea-lion (southeast asian languages in one network): A family of large language models for southeast asia. https://github.com/aisingapore/sealion, 2024.
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447–4455. PMLR, 2024.
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 749–775, Bangkok, Thailand and virtual meeting, August 2024.
Chunk 55 · 1,996 chars
Krishnan, Luke Zettlemoyer, and Madian Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 749–775, Bangkok, Thailand and virtual meeting, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.44.
Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra. Cosmopedia, February 2024.
BritLLM. Britllm. https://llm.org.uk/, 2024.
Andrei Z. Broder. On the resemblance and containment of documents. Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29, 1997. URL https://api.semanticscholar.org/CorpusID:11748509.
Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, and Min Lin. Bootstrapping language models with dpo implicit rewards. In The Thirteenth International Conference on Learning Representations, 2025.
Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Ifeoluwa Adelani, Pontus Stenetorp, Sebastian Riedel, and Mikel Artetxe. Improving language plasticity via pretraining with active forgetting. ArXiv, abs/2307.01163, 2023. URL https://api.semanticscholar.org/CorpusID:259317182.
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, et al. Culturalbench: a robust, diverse and challenging benchmark on measuring the (lack of) cultural knowledge of llms. arXiv preprint arXiv:2410.02677, 2024.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing
Chunk 56 · 1,996 chars
: a robust, diverse and challenging benchmark on measuring the (lack of) cultural knowledge of llms. arXiv preprint arXiv:2410.02677, 2024.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems, volume 35, pp. 16344–16359, 2022.
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma,
Chunk 57 · 1,990 chars
ng Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, and Zizheng Pan. Deepseek-v3 technical report, 2024. URL https://arxiv.org/abs/2412.19437. Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023. Meet Doshi, Raj Dabre, and Pushpak
Chunk 58 · 1,996 chars
eport, 2024. URL https://arxiv.org/abs/2412.19437.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
Meet Doshi, Raj Dabre, and Pushpak Bhattacharyya. Pretraining language models using translationese. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5843–5862, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.334. URL https://aclanthology.org/2024.emnlp-main.334/.
Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Xin Mao, Ziqi Jin, Wei Lu, and Min Lin. Sailor: Open Language Models for South-East Asia. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024.
Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, et al. GliDe with a CaPE: A low-hassle method to accelerate speculative decoding. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. In Forty-first International Conference on Machine Learning, 2024.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
Jia Guo,
Chunk 59 · 1,998 chars
rty-first International Conference on Machine Learning, 2024.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, and Qian Liu. Sailcompass: Towards reproducible and robust evaluation for southeast asian languages. arXiv preprint arXiv:2412.01186, 2024a.
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024b.
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. Opencoder: The open cookbook for top-tier code large language models. arXiv preprint arXiv:2411.04905, 2024. URL https://arxiv.org/pdf/2411.04905.
Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. Benchmax: A comprehensive multilingual evaluation suite for large language models. arXiv preprint arXiv:2502.07346, 2025.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In
Chunk 60 · 1,984 chars
Huang, and Fei Yuan. Benchmax: A comprehensive multilingual evaluation suite for large language models. arXiv preprint arXiv:2502.07346, 2025.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
Julie Kallini, Shikhar Murty, Christopher D Manning, Christopher Potts, and Róbert Csordás. Mrt5: Dynamic token merging for efficient byte-level language models. arXiv preprint arXiv:2410.20771, 2024.
Dongyoung Kim, Jaehyung Kim, Kimin Lee, and Jinwoo Shin. Spread preference annotation: Direct preference judgment for efficient llm alignment. In The Thirteenth International Conference on Learning Representations, 2025.
Fajri Koto. Cracking the code: Multi-domain llm evaluation on real-world professional exams in indonesia. arXiv preprint arXiv:2409.08564, 2024.
Fajri Koto, Timothy Baldwin, and Jey Han Lau. Cloze evaluation for deeper understanding of commonsense stories in Indonesian. In Antoine Bosselut, Xiang Li, Bill Yuchen Lin, Vered Shwartz, Bodhisattwa Prasad Majumder, Yash Kumar Lal, Rachel Rudinger, Xiang Ren, Niket Tandon, and Vilém Zouhar (eds.), Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pp. 8–16, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.csrr-1.2. URL https://aclanthology.org/2022.csrr-1.2/.
Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12359–12374, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.760. URL
Chunk 61 · 1,995 chars
ms in Indonesia: A comprehensive test on IndoMMLU. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12359–12374, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.760. URL https://aclanthology.org/2023.emnlp-main.760/.
Fajri Koto, Rahmad Mahendra, Nurul Aisyah, and Timothy Baldwin. Indoculture: Exploring geographically influenced cultural commonsense reasoning across eleven indonesian provinces. Transactions of the Association for Computational Linguistics, 12:1703–1719, 2024.
Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. Madlad-400: A multilingual and document-level large audited dataset, 2023.
Hynek Kydlíček, Guilherme Penedo, Clémentine Fourier, Nathan Habib, and Thomas Wolf. Finetasks: Finding signal in a haystack of 200+ multilingual tasks, 2024. URL https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks.
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2nd edition, 2014. ISBN 978-1107077232. URL http://www.mmds.org/.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh,
Chunk 62 · 1,999 chars
arXiv:2309.00267, 2023.
Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2nd edition, 2014. ISBN 978-1107077232. URL http://www.mmds.org/.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-lm: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024.
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024. URL https://arxiv.org/abs/2406.04770.
Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, and Lidong Bing. Seaexam and seabench: Benchmarking llms with local multilingual questions in southeast asia. arXiv preprint arXiv:2502.06298, 2025a.
Chen Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych. Are multilingual LLMs culturally-diverse reasoners? an investigation into multicultural proverbs and sayings. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the
Chunk 63 · 1,984 chars
rXiv:2502.06298, 2025a.
Chen Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych. Are multilingual LLMs culturally-diverse reasoners? an investigation into multicultural proverbs and sayings. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2016–2039, Mexico City, Mexico, June 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.112. URL https://aclanthology.org/2024.naacl-long.112/.
Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024b.
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492, 2024c.
Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, et al. Llm360: Towards fully transparent open-source llms. arXiv preprint arXiv:2312.06550, 2023.
Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, and Min Lin. Sample-efficient alignment for llms. arXiv preprint arXiv:2411.01493, 2024d.
Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, and Min Lin. Oat: A research-friendly framework for llm online alignment. https://github.com/sail-sg/oat, 2025b.
Jonas Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott. Text rendering strategies for pixel language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10155–10172, Singapore, December 2023. Association for Computational Linguistics.
Chunk 64 · 1,996 chars
ky, Phillip Rust, and Desmond Elliott. Text rendering strategies for pixel language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10155– 10172, Singapore, December 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.emnlp-main.628/. Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V Miranda, Jen- nifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P Kampman, et al. Seacrowd: A multilingual multimodal data hub and benchmark suite for southeast asian languages. arXiv preprint arXiv:2406.10118, 2024. Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/ HuggingFaceFW/fineweb-edu. Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna- Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, and Sampo Pyysalo. FinGPT: Large generative models for a small language. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2710–2726, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.164. URL https: //aclanthology.org/2023.emnlp-main.164. Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, and Anh Tuan Luu. As simple as fine-tuning: Llm alignment via bidirectional negative feedback loss. arXiv preprint arXiv:2410.04834, 2024. Sahisnu Mazumder and Bing Liu. Lifelong and continual learning dialogue systems. arXiv preprint arXiv:2211.06553, 2022. Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with
Chunk 65 · 1,981 chars
simple as fine-tuning: Llm alignment via bidirectional negative feedback loss. arXiv preprint arXiv:2410.04834, 2024. Sahisnu Mazumder and Bing Liu. Lifelong and continual learning dialogue systems. arXiv preprint arXiv:2211.06553, 2022. Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024. Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022. Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, et al. Blend: A benchmark for llms on everyday knowledge in diverse cultures and languages. arXiv preprint arXiv:2406.09948, 2024. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Pat- wary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15, 2021. 35 -- 35 of 49 -- Sailor2 Technical Report Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. SeaLLMs - large language models for Southeast Asia. In Yixin Cao, Yang Feng, and Deyi Xiong (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 294–304, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.28. URL
Chunk 66 · 1,994 chars
for Southeast Asia. In Yixin Cao, Yang Feng, and Deyi Xiong (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp. 294–304, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.28. URL https:// aclanthology.org/2024.acl-demos.28. NLLB Team. No language left behind: Scaling human-centered machine translation, 2022. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, and Dong Yu. Low-bit quantization favors undertrained llms: Scaling laws for quantized llms with 100t training tokens. arXiv preprint arXiv:2411.17691, 2024. Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159, 2024. Guilherme Penedo, Hynek Kydlíˇcek, Vinko Sabolˇcec, Bettina Messmer, Negar Foroutan, Mar- tin Jaggi, Leandro von Werra, and Thomas Wolf. Fineweb2: A sparkling update with 1000s of languages, December 2024. URL https://huggingface.co/datasets/HuggingFaceFW/ fineweb-2. Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vuli´c, and Anna Korhonen. Xcopa: A multilingual dataset for causal commonsense reasoning. In Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2362–2376, 2020. Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023. Penghui Qi, Xinyi Wan, Nyamdavaa Amar, and Min Lin. Pipeline parallelism with control- lable memory. arXiv preprint arXiv:2405.15362, 2024. Rafael Rafailov, Archit Sharma, Eric Mitchell,
Chunk 67 · 1,997 chars
2020. Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023. Penghui Qi, Xinyi Wan, Nyamdavaa Amar, and Min Lin. Pipeline parallelism with control- lable memory. arXiv preprint arXiv:2405.15362, 2024. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024. Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. Language modelling with pixels. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id= FkSp8VW8RjH. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili’c, Daniel Hesslow, Roman Castagn’e, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammana- manchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Lau- renccon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa Etxabe, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreis- berg Nitzav, Canwen Xu, Chenghao Mou, Chris C. Emezue, Christopher Klamm, Colin Leong, Daniel Alexander van Strien, David Ifeoluwa Adelani, Dragomir R. Radev, Ed- uardo Gonz’alez Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady ElSahar, Hamza Benyam- ina, Hieu Trung Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jorg Frohberg, 36 -- 36 of 49 -- Sailor2 Technical
Chunk 68 · 1,995 chars
m, Eyal Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady ElSahar, Hamza Benyam- ina, Hieu Trung Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jorg Frohberg, 36 -- 36 of 49 -- Sailor2 Technical Report Josephine Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Le- andro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario vSavsko, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mo- ham mad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto L’opez, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, S. Longpre, Somaieh Nikpoor, S. Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-Shaibani, Matteo Manica, Nihal V. Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Tae- woon Kim, Tali Bers, Thibault Févry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiang Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Y Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan,
Chunk 69 · 1,998 chars
anie, Sheng Shen, Srulik Ben-David, Stephen H. Bach, Tae- woon Kim, Tali Bers, Thibault Févry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiang Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Y Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre Franccois Lavall’ee, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aur’elie N’ev’eol, Charles Lovering, Daniel H Garrette, Deepak R. Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bog- danov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Xiangru Tang, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, S. Osher Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdenˇek Kasner, Zdenˇek Kasner, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ananda Santa Rosa Santos, Anthony Hevia, Antigona Unl- dreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ayoade Ajibade, Bharat Kumar Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David M. Lansky, Davis David, Douwe Kiela, Duong Anh Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatim Tahirah Mirza, Frankline Ononiwu, Habib Rezanejad, H.A. Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jan Passmore,
Chunk 70 · 1,999 chars
Bharat Kumar Saxena, Carlos Muñoz Ferrandis, Danish Contractor, David M. Lansky, Davis David, Douwe Kiela, Duong Anh Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatim Tahirah Mirza, Frankline Ononiwu, Habib Rezanejad, H.A. Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jan Passmore, Joshua Seltzer, Julio Bonis Sanz, Karen Fort, Lívia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akin- lolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nourhan Fahmy, Olanrewaju Samuel, Ran An, R. P. Kro- mann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas L. Wang, Sourav Roy, Sylvain Viguier, Thanh-Cong Le, Tobi Oyebade, Trieu Nguyen Hai Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Kumar Singh, Benjamin Beilharz, Bo Wang, Caio Matheus Fon- seca de Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel Le’on Perin’an, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Iman I.B. Bello, Isha Dash, Ji Soo Kang, John Giorgi, Jonas Golde, José D. Posada, Karthi Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, María Andrea Castillo, Marianna Nezhurina, Mario Sanger, Matthias Samwald, Michael Cullan, Michael Weinberg, M Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patricia Haller, Patrick Haller, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, 37 -- 37 of 49 -- Sailor2 Technical Report Srishti Kumar, Stefan Schweter, Sushil Pratap Bharati, Tanmay Laud, Théo
Chunk 71 · 1,996 chars
ck Haller, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, 37 -- 37 of 49 -- Sailor2 Technical Report Srishti Kumar, Stefan Schweter, Sushil Pratap Bharati, Tanmay Laud, Théo Gigant, To- moya Kainuma, Wojciech Kusa, Yanis Labrak, Yashasvi Bajaj, Y. Venkatraman, Yifan Xu, Ying Xu, Yu Xu, Zhee Xao Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. Bloom: A 176b-parameter open-access multilingual language model. ArXiv, abs/2211.05100, 2022. URL https://api.semanticscholar.org/CorpusID:253420279. Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, et al. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149, 2023. Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, et al. Global mmlu: Understanding and addressing cultural and linguistic biases in multilingual evaluation. arXiv preprint arXiv:2412.03304, 2024a. Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, et al. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619, 2024b. Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies. arXiv preprint arXiv:2407.13623, 2024. Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas
Chunk 72 · 1,998 chars
gai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies. arXiv preprint arXiv:2407.13623, 2024. Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15894–15939, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.845. URL https://aclanthology.org/2024.acl-long.845. Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F Chen. Seaeval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. NAACL, 2024a. Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, and Tianyu Pang. When precision meets position: Bfloat16 breaks down rope in long-context training. arXiv preprint arXiv:2411.13476, 2024b. Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, Yihong Chen, Raphael Tang, and Pontus Stenetorp. Multilingual pretraining using a large corpus machine-translated from a single source language. ArXiv, abs/2410.23956, 2024c. URL https://api.semanticscholar. org/CorpusID:273706997. Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Bench- marking long-context llms with extended multi-doc qa. arXiv preprint arXiv:2406.17419, 2024d. Haryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya, Radityo Eko Prasojo, and Alham Fikri Aji. Copal-id: Indonesian language reasoning with local culture and nuances. arXiv preprint arXiv:2311.01012, 2023. Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo,
Chunk 73 · 1,990 chars
406.17419, 2024d. Haryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya, Radityo Eko Prasojo, and Alham Fikri Aji. Copal-id: Indonesian language reasoning with local culture and nuances. arXiv preprint arXiv:2311.01012, 2023. Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, and Ying Shan. Llama pro: Progressive llama with block expansion. arXiv preprint arXiv:2401.02415, 2024. Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023. 38 -- 38 of 49 -- Sailor2 Technical Report Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. URL https://aclanthology.org/2022.tacl-1.17/. Man Tsung Yeung, Penghui Qi, Min Lin, and Xinyi Wan. Balancing pipeline parallelism with vocabulary parallelism. arXiv preprint arXiv:2411.05288, 2024. Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, and Ji Pei. Opencsg chinese corpus: A series of high-quality chinese datasets for llm training, 2025. URL https: //arxiv.org/abs/2501.08197. Fei Yuan, Chang Ma, Shuai Yuan, Qiushi Sun, and Lei Li. Ks-lottery: Finding certified lottery tickets for multilingual language models. ArXiv, abs/2402.02801, 2024. URL https://api.semanticscholar.org/CorpusID:267412646. Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, et al. Map-neo: Highly capable and transparent bilingual large language model series. arXiv preprint arXiv:2405.19327, 2024a. Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. M3exam: A multilingual, multimodal, multilevel benchmark for examining large lan- guage models. arXiv preprint arXiv:2306.05179,
Chunk 74 · 1,997 chars
ghly capable and transparent
bilingual large language model series. arXiv preprint arXiv:2405.19327, 2024a.
Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing.
M3exam: A multilingual, multimodal, multilevel benchmark for examining large lan-
guage models. arXiv preprint arXiv:2306.05179, 2023.
Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun
Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, et al. Seallms 3: Open foundation
and chat multilingual large language models for southeast asian languages. arXiv preprint
arXiv:2407.19672, 2024b.
Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miło´s,
Yuxiang Wu, and Pasquale Minervini. Analysing the impact of sequence composition
on language model pre-training. In Lun-Wei Ku, Andre Martins, and Vivek Sriku-
mar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 7897–7912, Bangkok, Thailand, August 2024.
Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.427. URL
https://aclanthology.org/2024.acl-long.427.
Lin Zheng, Xueliang Zhao, Guangtao Wang, Chen Wu, David Dong, Angela Wang, Mingran
Wang, Yun Du, Haige Bo, Amol Sharma, Bo Li, Kejie Zhang, Changran Hu, Urmish
Thakker, and Lingpeng Kong. Evabyte: Efficient byte-level language models at scale,
2025. URL https://hkunlp.github.io/blog/2025/evabyte.
Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, and Pengfei Liu. Programming every exam-
ple: Lifting pre-training data quality like experts at scale. arXiv preprint arXiv:2409.17115,
2024.
39
-- 39 of 49 --
Sailor2 Technical Report
The Prompt used for Translation
You are a highly skilled translator tasked with translating various types of content
from English into {{ language }}. Follow these instructions carefully to complete the
translation task.
You will receive a user-bot conversation in XML format. Please follow a three-step
translation process:
Chunk 75 · 1,996 chars
1. Initial Translation: Translate the input content into {{ language }}, preserving
the original intent and keeping the original paragraph and text format
unchanged. Do not delete or omit any content, and ensure that all original
Markdown elements (e.g., images, code blocks) are preserved.
2. Reflection and Feedback: Carefully review both the source text and your
translation. Provide constructive criticism and specific suggestions to
improve the translation in terms of:
(i) Accuracy: Correct errors of addition, mistranslation, omission, or
untranslated text.
(ii) Fluency: Apply {{ language }} grammar, spelling, and punctuation rules
while avoiding unnecessary repetitions.
(iii) Style: Ensure that the translation reflects the style of the source text and
considers any relevant cultural context.
3. Refinement: Based on your reflections, refine and polish your translation.
4. Fallback: If you are not confident in translating the conversation, please
return “<stop></stop>”.
Output:
For each step of the translation process, output your results within the appropriate
XML tags as follows:
<step1_initial_translation>
[Insert your initial translation here]
</step1_initial_translation>
<step2_reflection>
[Insert your reflection on the translation, including a list
of specific, helpful, and constructive suggestions for
improvement. Each suggestion should address a specific
part of the translation.]
</step2_reflection>
<step3_refined_translation>
[Insert your refined and polished translation here]
</step3_refined_translation>
Ensure that your final translation in step 3 accurately reflects the original meaning
while sounding natural in {{ language }}.
Chunk 76 · 1,995 chars
Here is the original conversation:
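The template above can be filled in and its tagged output parsed with a few lines of Python. The sketch below is illustrative only: `PROMPT_TEMPLATE` is an abbreviated stand-in for the full prompt, and the helper names `build_prompt` and `extract_translation` are hypothetical rather than taken from the Sailor2 codebase. It assumes the model emits either the step-3 tags or the `<stop></stop>` fallback, and that the literal fallback marker never appears in a valid translation.

```python
import re
from typing import Optional

# Abbreviated stand-in for the full prompt text shown above (assumption).
PROMPT_TEMPLATE = (
    "You are a highly skilled translator tasked with translating various types "
    "of content from English into {{ language }}.\n"
    "[... remaining instructions from the prompt above ...]\n"
    "Here is the original conversation:\n"
)

STOP_MARKER = "<stop></stop>"
PLACEHOLDER_RE = re.compile(r"\{\{\s*language\s*\}\}")
REFINED_RE = re.compile(
    r"<step3_refined_translation>(.*?)</step3_refined_translation>", re.DOTALL
)


def build_prompt(language: str, conversation_xml: str) -> str:
    """Fill every {{ language }} placeholder and append the XML conversation."""
    # A callable replacement avoids backslash escapes being interpreted.
    filled = PLACEHOLDER_RE.sub(lambda _match: language, PROMPT_TEMPLATE)
    return filled + conversation_xml


def extract_translation(response: str) -> Optional[str]:
    """Return the step-3 translation, or None on fallback or malformed output."""
    if STOP_MARKER in response:
        return None  # the model declined to translate
    match = REFINED_RE.search(response)
    if match is None:
        return None  # step-3 tags missing
    refined = match.group(1).strip()
    return refined or None
```

A pipeline built along these lines might simply skip any conversation for which `extract_translation` returns `None`; whether Sailor2 handled such cases this way is not stated in the report.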
Table 22: Data distribution of SEA-UltraChat (instruction tuning) and SEA-UltraFeedback
(preference tuning) across languages. SEA-UltraChat is used for the two-stage instruction
tuning, so its count is the sum of the SFT Stage 1 and SFT Stage 2 columns.
| Language | SFT Stage 1 | SFT Stage 2 | SEA-UltraChat | SEA-UltraFeedback |
|---|---|---|---|---|
| English | 1611404 | 72000 | 1683404 | 8981 |
| Chinese | 154908 | 72000 | 226908 | 8845 |
| Indonesian | 287719 | 48000 | 335719 | 4687 |
| Thai | 334194 | 48000 | 382194 | 4696 |
| Vietnamese | 301226 | 48000 | 349226 | 4710 |
| Malay | 212827 | 48000 | 260827 | 4714 |
| Burmese | 204495 | 48000 | 252495 | 4614 |
| Lao | 134228 | 48000 | 182228 | 4393 |
| Cebuano | 194986 | 1200 | 196186 | 1178 |
| Ilocano | 1867 | 1200 | 3067 | 1172 |
| Javanese | 134499 | 48000 | 182499 | 1166 |
| Khmer | 64421 | 4800 | 69221 | 1151 |
| Sundanese | 181486 | 1200 | 182686 | 1174 |
| Tagalog | 217045 | 48000 | 265045 | 1173 |
| Waray | 177921 | 1200 | 179121 | 0 |
| Tamil | 82015 | 1200 | 83215 | 1159 |
| Total | 4295241 | 538800 | 4834041 | 53813 |
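As a sanity check on Table 22, the SEA-UltraChat column should equal the sum of the two SFT stages for every language, and each column should add up to the Total row. The following is a minimal sketch of that check, with the numbers copied from the table; the names `TABLE_22`, `inconsistent_rows`, and `column_sums` are hypothetical helpers introduced here for illustration.

```python
# Rows copied from Table 22:
# (language, SFT Stage 1, SFT Stage 2, SEA-UltraChat, SEA-UltraFeedback)
TABLE_22 = [
    ("English", 1611404, 72000, 1683404, 8981),
    ("Chinese", 154908, 72000, 226908, 8845),
    ("Indonesian", 287719, 48000, 335719, 4687),
    ("Thai", 334194, 48000, 382194, 4696),
    ("Vietnamese", 301226, 48000, 349226, 4710),
    ("Malay", 212827, 48000, 260827, 4714),
    ("Burmese", 204495, 48000, 252495, 4614),
    ("Lao", 134228, 48000, 182228, 4393),
    ("Cebuano", 194986, 1200, 196186, 1178),
    ("Ilocano", 1867, 1200, 3067, 1172),
    ("Javanese", 134499, 48000, 182499, 1166),
    ("Khmer", 64421, 4800, 69221, 1151),
    ("Sundanese", 181486, 1200, 182686, 1174),
    ("Tagalog", 217045, 48000, 265045, 1173),
    ("Waray", 177921, 1200, 179121, 0),
    ("Tamil", 82015, 1200, 83215, 1159),
]

# Total row as printed in Table 22.
TOTAL = (4295241, 538800, 4834041, 53813)


def inconsistent_rows(rows):
    """Languages whose SEA-UltraChat count is not Stage 1 + Stage 2."""
    return [name for name, s1, s2, chat, _fb in rows if s1 + s2 != chat]


def column_sums(rows):
    """Sum each numeric column (Stage 1, Stage 2, UltraChat, UltraFeedback)."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


if __name__ == "__main__":
    print("rows failing Stage 1 + Stage 2 check:", inconsistent_rows(TABLE_22))
    print("column sums match Total row:", column_sums(TABLE_22) == TOTAL)
```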
Table 23: Performance on Flores Plus (chrF++) for Qwen2.5-32B. Rows are source languages (↓); columns are target languages (→).
| Source (↓) | eng | cmn | ind | tha | vie | zsm | mya | lao | ceb | ilo | jav | khm | sun | fil | war | tam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| English (eng) | - | 34.0 | 64.0 | 46.1 | 59.2 | 59.5 | 17.3 | 21.4 | 39.4 | 27.1 | 31.8 | 20.6 | 33.1 | 49.5 | 38.4 | 28.3 |
| Chinese (cmn) | 59.0 | - | 52.6 | 40.2 | 51.8 | 48.3 | 16.0 | 18.0 | 33.0 | 23.1 | 27.0 | 18.8 | 28.7 | 41.5 | 30.4 | 24.6 |
| Indonesian (ind) | 67.5 | 29.5 | - | 43.1 | 54.3 | 55.0 | 16.2 | 20.6 | 34.9 | 23.5 | 34.9 | 19.6 | 36.2 | 45.2 | 31.7 | 25.4 |
| Thai (tha) | 60.6 | 27.9 | 54.8 | - | 52.1 | 50.3 | 15.6 | 22.0 | 33.0 | 23.7 | 28.5 | 19.5 | 29.6 | 42.6 | 30.9 | 24.7 |
| Vietnamese (vie) | 62.5 | 29.1 | 55.0 | 41.9 | - | 50.7 | 15.9 | 20.1 | 32.4 | 24.2 | 27.2 | 19.4 | 29.3 | 42.7 | 30.5 | 24.4 |
| Malay (zsm) | 67.2 | 28.6 | 59.4 | 42.8 | 53.6 | - | 16.0 | 20.5 | 34.8 | 24.0 | 33.1 | 20.0 | 33.5 | 44.5 | 32.3 | 25.5 |
| Burmese (mya) | 39.6 | 16.7 | 38.2 | 29.8 | 35.0 | 36.0 | - | 15.4 | 25.1 | 20.0 | 18.9 | 15.5 | 20.4 | 33.4 | 23.4 | 21.2 |
| Lao (lao) | 50.1 | 22.8 | 47.8 | 39.9 | 44.4 | 44.6 | 13.8 | - | 27.6 | 21.2 | 24.3 | 19.1 | 25.9 | 38.9 | 26.1 | 21.6 |
| Cebuano (ceb) | 56.0 | 23.7 | 48.8 | 35.9 | 45.0 | 45.2 | 15.2 | 18.6 | - | 27.7 | 22.9 | 17.8 | 22.4 | 45.5 | 44.0 | 23.1 |
| Ilocano (ilo) | 43.1 | 19.6 | 39.5 | 30.0 | 36.3 | 36.8 | 13.3 | 16.3 | 31.7 | - | 19.9 | 15.2 | 20.2 | 38.8 | 30.2 | 19.4 |
| Javanese (jav) | 51.9 | 22.9 | 49.9 | 35.0 | 43.8 | 45.9 | 14.1 | 17.9 | 28.7 | 21.8 | - | 17.9 | 31.7 | 37.0 | 25.9 | 21.9 |
| Khmer (khm) | 49.8 | 22.8 | 47.8 | 37.1 | 43.8 | 44.3 | 12.9 | 20.4 | 28.8 | 21.5 | 24.7 | - | 24.7 | 38.3 | 26.1 | 22.4 |
| Sundanese (sun) | 51.7 | 23.1 | 51.2 | 35.9 | 44.7 | 46.1 | 14.5 | 18.1 | 28.6 | 20.4 | 31.1 | 17.4 | - | 37.9 | 27.0 | 21.9 |
| Tagalog (fil) | 65.8 | 27.9 | 55.1 | 41.0 | 52.0 | 51.0 | 16.1 | 19.8 | 41.6 | 30.7 | 23.4 | 18.8 | 22.2 | - | 40.1 | 25.5 |
| Waray (war) | 56.8 | 23.8 | 49.1 | 36.1 | 45.1 | 45.1 | 14.9 | 18.7 | 44.2 | 29.3 | 23.2 | 17.2 | 23.7 | 45.6 | - | 23.2 |
| Tamil (tam) | 47.9 | 20.5 | 44.0 | 32.9 | 40.9 | 41.0 | 16.1 | 16.2 | 28.7 | 21.8 | 22.9 | 16.8 | 24.1 | 37.5 | 26.2 | - |
Chunk 77 · 1,997 chars
[Figure 11 plot: one density panel per language; x-axis: Perplexity (log10 scale); y-axis: Density; curves: Sailor1, Qwen1.5, Qwen2.5, Sailor2. Panel titles:]
Language: Total. Qwen1.5: 30.94, Sailor1: 24.17, Qwen2.5: 19.90, Sailor2: 5.38
Language: SEA. Qwen1.5: 32.96, Sailor1: 25.22, Qwen2.5: 20.80, Sailor2: 4.46
Language: eng. Qwen1.5: 15.44, Sailor1: 16.84, Qwen2.5: 13.27, Sailor2: 13.97
Language: zho. Qwen1.5: 14.25, Sailor1: 14.83, Qwen2.5: 12.32, Sailor2: 11.44
Chunk 78 · 1,979 chars
[Figure 11 panel titles, continued:]
Language: bcl. Qwen1.5: 57.54, Sailor1: 62.57, Qwen2.5: 38.37, Sailor2: 7.42
Language: jv. Qwen1.5: 56.36, Sailor1: 27.90, Qwen2.5: 34.19, Sailor2: 5.49
Language: ms. Qwen1.5: 11.33, Sailor1: 4.63, Qwen2.5: 7.15, Sailor2: 3.85
Language: th. Qwen1.5: 10.44, Sailor1: 4.06, Qwen2.5: 5.29, Sailor2: 3.38
Language: ceb. Qwen1.5: 30.63, Sailor1: 31.40, Qwen2.5: 3.18, Sailor2: 1.51
Language: km. Qwen1.5: 4.17, Sailor1: 5.44, Qwen2.5: 3.81, Sailor2: 2.36
Language: my. Qwen1.5: 2.62, Sailor1: 3.24, Qwen2.5: 2.27, Sailor2: 1.54
Language: tl. Qwen1.5: 18.18, Sailor1: 23.41, Qwen2.5: 12.13, Sailor2: 4.92
Chunk 79 · 1,996 chars
[Figure 11 panel titles, continued:]
Language: id. Qwen1.5: 16.32, Sailor1: 7.37, Qwen2.5: 10.30, Sailor2: 6.56
Language: lo. Qwen1.5: 5.95, Sailor1: 3.27, Qwen2.5: 4.76, Sailor2: 2.46
Language: pam. Qwen1.5: 80.35, Sailor1: 77.74, Qwen2.5: 59.35, Sailor2: 12.26
Language: vi. Qwen1.5: 14.52, Sailor1: 6.10, Qwen2.5: 7.01, Sailor2: 5.06
Language: ilo. Qwen1.5: 70.40, Sailor1: 82.69, Qwen2.5: 42.67, Sailor2: 4.70
Language: min. Qwen1.5: 48.05, Sailor1: 17.91, Qwen2.5: 35.04, Sailor2: 2.83
Language: su. Qwen1.5: 46.27, Sailor1: 27.35, Qwen2.5: 25.88, Sailor2: 5.00
Language: war. Qwen1.5: 51.39, Sailor1: 16.11, Qwen2.5: 39.67, Sailor2: 1.82
Overall plot title: PPL Distribution by Language (KDE)
Chunk 80 · 1,997 chars
Figure 11: Comparison of PPL distributions smoothed with Kernel Density Estimation (KDE). We compare Sailor2-8B, Qwen2.5-7B, Sailor1-7B, and Qwen1.5-7B. Our results demonstrate that, with an extra 1B parameters, Sailor2-8B preserves its English and Chinese capability while achieving much lower PPL on SEA languages.
[Figure 12 image. Title: Winning Rate Comparison of Four Models Across Language Pairs. Axes: Source Language and Target Language, each over English, Chinese, Indonesian, Thai, Vietnamese, Malay, Burmese, Lao, Cebuano, Ilocano, Javanese, Khmer, Sundanese, Tagalog, Waray, Tamil. Models: sailor2_20b, llama3_1_70b, qwen2_5_72b, nllb_moe_54b]
Figure 12: Comparison of the win rates of four models based on their ChrF++ scores. The shaded area in each cube represents the top-1 accuracy of each model across different translation directions in the Flores Plus Translation Dataset. We observed that Sailor2 performs on par with, or even outperforms, NLLB, a model optimized for translation tasks that excels at translating low-resource languages.
BLEU Score Difference (Sailor2-20B - Qwen2.5-32B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.73 4.92 0.58 2.01 9.22 1.20 3.88 17.37 12.68 16.66 1.82 10.12 10.34 13.71 6.27
Chinese -1.23 0.00 1.17 0.33 1.07 4.68 0.39 3.04 11.22 7.52 9.91 1.04 5.79 6.06 8.52 3.66
Indonesian 0.14 -1.41 0.00 0.90 2.21 4.90 1.23 3.17 13.55 10.32 11.86 1.38 6.42 8.94 11.01 4.56
Thai 0.54 0.28 2.56 0.00 1.88 6.11 1.25 1.16 12.51 8.96 11.05 1.89 6.73 8.33 9.86 3.89
Vietnamese -0.36 -1.25 2.78 1.25 0.00 6.26 1.25 2.54 13.84 9.23 11.72 1.35 6.81 8.12 10.44 4.23
Malay 1.49 0.48 1.28 0.50 2.17 0.00 1.12 3.50 14.48 10.65 13.41 0.20 6.97 9.40
Chunk 81 · 1,996 chars
BLEU Score Difference (Sailor2-20B - Qwen2.5-32B), continued; rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
Thai 0.54 0.28 2.56 0.00 1.88 6.11 1.25 1.16 12.51 8.96 11.05 1.89 6.73 8.33 9.86 3.89
Vietnamese -0.36 -1.25 2.78 1.25 0.00 6.26 1.25 2.54 13.84 9.23 11.72 1.35 6.81 8.12 10.44 4.23
Malay 1.49 0.48 1.28 0.50 2.17 0.00 1.12 3.50 14.48 10.65 13.41 0.20 6.97 9.40 11.40 4.49
Burmese 11.01 1.58 9.35 1.43 10.67 11.12 0.00 2.59 13.05 7.53 10.60 1.55 7.07 11.32 10.89 3.93
Lao 10.51 -1.26 7.91 1.78 7.91 10.56 1.53 0.00 15.11 9.46 13.21 1.53 7.75 12.05 11.85 4.74
Cebuano 12.69 1.50 8.65 2.16 8.90 11.23 1.47 2.95 0.00 11.31 14.76 1.97 10.02 9.72 7.89 5.28
Ilocano 14.93 1.24 10.31 2.12 10.87 12.32 1.10 2.33 14.16 0.00 9.94 1.61 7.23 11.55 11.61 4.42
Javanese 11.04 0.55 8.94 1.03 7.67 9.87 1.21 2.92 15.01 7.66 0.00 1.31 6.51 14.07 10.35 4.80
Khmer 10.96 0.80 8.38 1.97 8.56 11.09 1.37 2.99 15.05 9.56 13.14 0.00 8.59 12.48 11.46 4.09
Sundanese 9.50 -0.48 8.06 0.73 6.90 10.24 1.16 3.08 13.43 4.55 11.96 2.04 0.00 12.71 5.74 4.66
Tagalog 4.46 -0.06 5.74 0.46 3.55 9.23 0.98 2.50 12.86 11.39 16.14 1.93 10.39 0.00 9.81 4.65
Waray 12.39 0.24 8.18 2.08 8.19 11.91 1.24 2.68 12.34 11.42 13.79 1.46 8.75 10.06 0.00 5.19
Tamil 8.03 1.79 6.38 0.85 7.26 8.86 0.67 2.43 13.34 7.83 10.92 1.69 6.85 10.02 10.98 0.00
ChrF++ Score Difference (Sailor2-20B - Qwen2.5-32B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.09 3.65 2.42 1.42 7.36 10.82 16.25 18.27 18.86 20.22 8.66 14.62 9.78 14.87 12.84
Chinese -1.21 0.00 1.06 1.59 0.85 4.63 8.62 12.71 14.73 15.87 15.62 6.29 10.73 6.61 12.37 9.96
Indonesian -0.44 -1.43 0.00 2.25 1.49 4.35 9.86 13.95 16.40 18.82 14.30 6.92 9.76 8.81 14.86 11.75
Thai -0.05 -0.44 1.93 0.00 1.36 5.67 9.89 11.94 16.07 17.34 16.24 6.56 11.73 8.15 13.54 10.74
Vietnamese -0.40 -1.03 2.56 2.58 0.00 5.71 9.71 13.36 17.25 17.08 18.03 6.70 12.34 8.74 14.63 11.68
Malay 0.95 0.07 1.00 2.11 1.91 0.00 10.35 14.79
Chunk 82 · 1,998 chars
ChrF++ Score Difference (Sailor2-20B - Qwen2.5-32B), continued; rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
Indonesian -0.44 -1.43 0.00 2.25 1.49 4.35 9.86 13.95 16.40 18.82 14.30 6.92 9.76 8.81 14.86 11.75
Thai -0.05 -0.44 1.93 0.00 1.36 5.67 9.89 11.94 16.07 17.34 16.24 6.56 11.73 8.15 13.54 10.74
Vietnamese -0.40 -1.03 2.56 2.58 0.00 5.71 9.71 13.36 17.25 17.08 18.03 6.70 12.34 8.74 14.63 11.68
Malay 0.95 0.07 1.00 2.11 1.91 0.00 10.35 14.79 16.69 18.54 16.26 6.61 11.13 9.57 14.98 11.91
Burmese 12.36 6.18 11.06 8.47 11.90 13.43 0.00 14.37 19.72 17.05 20.95 7.72 16.29 13.07 18.51 12.09
Lao 10.49 3.01 8.07 5.09 7.94 10.91 11.61 0.00 21.21 18.75 20.72 8.11 15.24 12.19 17.94 13.74
Cebuano 11.35 4.85 8.99 7.73 8.52 12.09 11.25 14.63 0.00 16.76 22.87 8.56 19.05 9.99 7.70 13.80
Ilocano 14.99 5.08 12.25 8.74 11.68 14.79 11.08 13.31 17.82 0.00 17.98 8.76 16.42 12.62 15.08 14.01
Javanese 10.09 3.50 8.84 6.69 7.52 10.49 11.11 14.68 19.61 16.37 0.00 8.06 10.63 14.82 16.76 13.22
Khmer 9.87 3.66 7.62 5.78 7.66 10.54 11.91 13.06 20.00 18.62 20.07 0.00 16.76 12.19 18.13 12.13
Sundanese 9.08 2.88 7.85 5.95 6.77 10.18 10.94 14.25 18.85 12.95 15.66 8.85 0.00 13.21 11.33 13.25
Tagalog 3.61 1.29 5.02 3.49 3.01 8.47 10.33 13.51 13.71 15.54 23.85 7.77 20.14 0.00 10.19 11.97
Waray 10.90 3.43 8.21 7.49 7.91 12.29 11.37 14.47 12.53 16.52 21.18 8.27 16.18 10.27 0.00 13.63
Tamil 8.58 3.73 7.28 6.36 7.56 10.47 8.55 12.62 18.00 16.73 18.58 6.92 13.89 10.03 16.68 0.00
Figure 13: Comparison of BLEU and ChrF++ scores on the Flores Plus Translation Dataset across various source-target language pairs between Sailor2-20B and Qwen2.5-32B. BLEU Score Difference = Sailor2 BLEU - Qwen2.5 BLEU.
BLEU Score Difference (Sailor2-20B - Qwen2.5-72B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.21 1.71 0.63 0.11 5.73 0.89 3.36 12.98 10.28 13.25 2.44 7.51 7.11 9.21 5.27
Chinese -2.73 0.00 -0.62
Chunk 83 · 1,998 chars
BLEU Score Difference (Sailor2-20B - Qwen2.5-72B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.21 1.71 0.63 0.11 5.73 0.89 3.36 12.98 10.28 13.25 2.44 7.51 7.11 9.21 5.27
Chinese -2.73 0.00 -0.62 0.58 -0.28 2.53 0.21 1.92 8.81 6.50 7.53 1.07 3.74 3.30 5.54 3.31
Indonesian -1.61 -0.57 0.00 1.20 -0.01 2.40 0.88 1.94 9.76 7.53 9.13 1.44 4.87 5.79 7.43 3.50
Thai -2.42 0.34 0.30 0.00 -0.49 3.37 0.86 0.34 9.32 7.86 8.30 2.22 4.40 4.96 6.58 3.24
Vietnamese -2.67 -1.90 -0.03 0.81 0.00 3.35 1.00 1.83 9.99 7.85 8.32 1.37 4.09 4.78 7.40 3.66
Malay -0.54 0.22 -0.50 0.69 0.78 0.00 0.73 2.77 10.63 8.01 10.01 0.77 5.05 5.47 8.09 3.61
Burmese 7.57 0.52 5.84 0.51 6.66 7.75 0.00 2.22 11.08 7.08 8.74 1.43 5.75 8.84 9.52 3.18
Lao 4.81 -1.77 3.99 0.96 3.26 6.65 0.99 0.00 11.03 8.49 10.46 0.64 5.56 7.49 8.85 3.98
Cebuano 7.65 1.38 5.08 1.80 5.18 7.61 1.27 2.91 0.00 8.07 13.48 2.45 6.44 6.58 6.89 4.12
Ilocano 13.30 0.73 7.79 2.04 8.12 9.91 0.96 2.45 10.25 0.00 8.95 1.80 5.32 8.21 7.02 3.94
Javanese 6.47 0.29 5.94 0.59 4.17 6.58 1.01 2.51 11.45 5.44 0.00 2.09 4.40 9.78 6.98 4.09
Khmer 7.44 -0.70 4.59 1.39 4.28 6.96 1.30 2.51 12.43 8.93 10.31 0.00 6.35 8.50 9.42 3.56
Sundanese 6.02 -0.73 5.82 1.65 4.58 7.42 1.15 3.26 10.13 2.52 8.54 1.73 0.00 8.96 3.16 3.93
Tagalog 0.27 -1.49 1.80 1.01 0.73 5.56 0.67 2.45 10.07 8.30 13.61 1.72 6.24 0.00 7.67 3.57
Waray 9.03 -0.56 4.96 2.82 5.15 8.55 1.12 3.37 9.62 8.66 11.88 1.67 5.70 6.98 0.00 4.40
Tamil 5.39 1.17 3.49 0.14 4.29 5.48 0.33 2.25 10.36 7.15 9.00 0.93 5.03 6.29 8.78 0.00
ChrF++ Score Difference (Sailor2-20B - Qwen2.5-72B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.47 1.22 0.63 -0.12 4.41 8.21 12.87 13.28 14.23 15.27 7.27 10.54 6.17 9.65 9.94
Chinese -2.12 0.00 -0.73 0.15 -0.45 2.36 6.71 9.24 10.53
Chunk 84 · 1,908 chars
ChrF++ Score Difference (Sailor2-20B - Qwen2.5-72B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.47 1.22 0.63 -0.12 4.41 8.21 12.87 13.28 14.23 15.27 7.27 10.54 6.17 9.65 9.94
Chinese -2.12 0.00 -0.73 0.15 -0.45 2.36 6.71 9.24 10.53 12.39 11.11 5.02 6.56 3.20 6.06 8.03
Indonesian -1.61 -2.03 0.00 0.67 -0.26 2.08 7.07 10.67 10.87 12.04 10.86 5.18 6.92 5.18 8.51 8.40
Thai -1.69 -1.44 0.08 0.00 -0.64 2.82 6.87 8.71 10.76 14.40 11.20 5.30 7.02 4.49 7.94 8.14
Vietnamese -1.80 -1.77 0.33 0.82 0.00 3.11 7.30 10.23 11.37 13.11 11.29 5.07 7.04 4.38 8.57 8.62
Malay -0.62 -1.21 -0.42 0.73 0.52 0.00 7.25 11.55 11.48 13.05 12.42 5.06 7.50 5.36 9.14 8.50
Burmese 8.86 4.01 6.61 5.08 7.48 9.07 0.00 11.46 14.69 15.86 15.11 5.56 11.48 8.69 13.61 9.00
Lao 5.68 0.97 4.01 2.04 3.43 6.91 8.30 0.00 14.68 15.88 15.92 5.44 10.67 7.82 11.84 10.11
Cebuano 7.18 2.22 5.28 4.67 5.10 8.07 8.57 12.83 0.00 10.90 20.31 6.70 11.52 6.25 6.49 10.23
Ilocano 13.44 3.54 9.42 6.45 8.96 12.13 8.96 12.55 12.77 0.00 16.12 7.22 11.25 9.59 8.87 11.46
Javanese 6.41 1.73 5.74 4.45 4.22 7.14 8.11 12.36 13.96 11.48 0.00 6.47 7.74 10.23 9.80 9.77
Khmer 7.36 2.08 5.19 3.54 4.34 7.29 8.52 11.30 15.71 17.75 15.10 0.00 11.15 7.64 12.91 8.95
Sundanese 5.97 1.16 5.33 4.04 4.51 7.49 8.95 12.18 13.46 7.08 11.57 6.89 0.00 9.42 5.98 10.26
Tagalog 0.45 0.09 1.75 1.28 0.58 4.77 7.57 11.30 10.22 10.90 18.77 5.77 10.81 0.00 7.70 8.76
Waray 8.11 1.38 5.21 5.31 5.09 8.66 9.40 13.23 9.85 11.87 17.86 6.44 10.18 6.81 0.00 10.44
Tamil 6.09 2.65 4.18 3.58 4.70 6.41 6.38 10.56 12.62 14.91 14.00 5.14 9.61 5.88 11.37 0.00
Figure 14: Comparison of BLEU and ChrF++ scores on the Flores Plus Translation Dataset across various source-target language pairs between Sailor2-20B and Qwen2.5-72B. BLEU Score Difference = Sailor2 BLEU - Qwen2.5 BLEU.
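The difference matrices in Figures 13 to 16 follow the definition stated in the captions (Sailor2 score minus baseline score). As a minimal sketch of that arithmetic, the snippet below combines three English-source ChrF++ entries from Figure 14 with the matching Qwen2.5-72B entries from Table 24 to recover the Sailor2-20B scores they imply. The function and variable names are illustrative assumptions, not part of any released code, and because Table 24 is rounded to one decimal the implied scores are only approximate.

```python
# ChrF++ values for English-source directions, copied from this report:
# Qwen2.5-72B absolute scores (Table 24) and the
# Sailor2-20B - Qwen2.5-72B differences (Figure 14, English row).
QWEN_72B_CHRF = {"cmn": 34.3, "ind": 66.5, "tha": 47.9}
DIFF_CHRF = {"cmn": -1.47, "ind": 1.22, "tha": 0.63}


def score_difference(model_scores, baseline_scores):
    """Per-direction difference: model minus baseline (positive favors the model)."""
    return {k: model_scores[k] - baseline_scores[k] for k in baseline_scores}


def implied_model_scores(baseline_scores, differences):
    """Invert the definition: model = baseline + difference."""
    return {k: baseline_scores[k] + differences[k] for k in baseline_scores}


# Hypothetical usage: recover approximate Sailor2-20B scores, then round-trip.
implied = implied_model_scores(QWEN_72B_CHRF, DIFF_CHRF)
roundtrip = score_difference(implied, QWEN_72B_CHRF)

for target in QWEN_72B_CHRF:
    print(f"eng->{target}: implied Sailor2-20B ChrF++ ~ {implied[target]:.2f} "
          f"(difference {roundtrip[target]:+.2f})")
```

A positive entry means Sailor2-20B scores higher than the baseline for that source-target direction, and the diagonal is 0.00 because no translation is evaluated from a language into itself.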
Chunk 85 · 1,982 chars
Figure 14: Comparison of BLEU and ChrF++ scores on the Flores Plus Translation Dataset across various source-target language pairs between Sailor2-20B and Qwen2.5-72B. BLEU Score Difference = Sailor2 BLEU - Qwen2.5 BLEU.
BLEU Score Difference (Sailor2-20B - Llama3.1-70B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 -1.08 0.75 -2.33 -0.41 0.99 0.99 3.58 2.89 0.49 1.34 2.13 1.54 1.32 -0.07 1.07
Chinese -1.81 0.00 -0.67 -0.23 -0.03 -0.27 0.44 2.87 2.57 -0.30 0.80 1.35 0.06 0.19 -0.23 0.86
Indonesian -1.83 -0.34 0.00 -0.97 -0.29 -1.42 0.81 2.95 1.86 0.04 1.36 1.67 0.89 0.93 -0.37 0.46
Thai -0.78 1.62 -0.33 0.00 -0.39 0.10 1.05 2.13 2.25 1.17 0.90 2.20 0.54 1.28 -0.14 0.86
Vietnamese -1.64 -2.03 -0.75 -0.48 0.00 -0.29 0.88 3.09 1.91 1.04 0.79 1.42 -0.09 0.57 -0.22 0.94
Malay -1.35 0.11 -0.87 -0.74 0.24 0.00 0.88 3.85 1.63 -0.12 1.09 1.18 0.72 1.40 -0.41 0.39
Burmese 0.47 0.65 0.57 -0.10 1.06 0.94 0.00 2.97 3.21 1.46 1.25 1.59 1.17 1.55 1.50 0.81
Lao 11.32 -1.25 6.17 1.06 7.01 5.39 1.34 0.00 5.37 2.27 3.78 1.77 2.58 5.18 2.63 2.23
Cebuano 1.90 3.08 0.89 0.24 0.85 1.10 1.08 4.12 0.00 0.18 1.91 1.91 1.33 1.22 -0.08 0.85
Ilocano 2.68 1.60 1.05 0.93 0.44 1.32 0.77 3.15 1.74 0.00 -1.26 1.26 -0.37 1.71 -0.95 0.45
Javanese -1.93 1.03 2.46 -0.75 -0.51 -0.96 0.91 3.34 1.74 -1.94 0.00 1.22 0.21 1.62 -1.72 1.20
Khmer 4.13 -2.50 2.23 -1.13 1.36 2.37 1.35 3.92 3.97 1.79 3.00 0.00 2.10 2.40 1.27 1.02
Sundanese -0.28 0.11 1.61 -0.82 0.40 0.46 0.82 3.54 1.23 -5.10 0.49 1.27 0.00 2.69 -5.79 0.50
Tagalog -0.66 -0.98 0.04 -0.71 0.06 0.58 0.93 3.18 2.30 1.49 1.83 1.67 0.58 0.00 -0.34 0.54
Waray 0.92 0.89 0.21 0.75 0.11 0.28 0.97 3.49 2.73 0.40 0.73 1.07 -0.29 0.67 0.00 1.14
Tamil -2.30 0.56 -1.61 -1.91 -0.97 -1.42 0.39 2.99 1.99 0.25 0.74 1.16 0.40 -0.60 0.23 0.00
Chunk 86 · 1,987 chars
BLEU Score Difference (Sailor2-20B - Llama3.1-70B), continued; rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
Tagalog -0.66 -0.98 0.04 -0.71 0.06 0.58 0.93 3.18 2.30 1.49 1.83 1.67 0.58 0.00 -0.34 0.54
Waray 0.92 0.89 0.21 0.75 0.11 0.28 0.97 3.49 2.73 0.40 0.73 1.07 -0.29 0.67 0.00 1.14
Tamil -2.30 0.56 -1.61 -1.91 -0.97 -1.42 0.39 2.99 1.99 0.25 0.74 1.16 0.40 -0.60 0.23 0.00
ChrF++ Score Difference (Sailor2-20B - Llama3.1-70B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 0.56 0.35 0.64 -0.49 0.60 6.77 14.95 3.23 -0.19 0.95 7.68 2.38 1.45 0.51 3.25
Chinese -1.25 0.00 -0.53 0.43 -0.16 -0.38 5.31 11.51 2.47 0.07 0.65 5.55 0.65 0.14 -0.38 2.60
Indonesian -1.47 0.25 0.00 0.82 -0.42 -1.25 5.62 12.89 1.98 0.09 1.57 5.88 1.42 0.75 0.08 2.47
Thai -0.59 0.60 -0.42 0.00 -0.34 0.08 5.88 12.14 2.11 0.86 0.69 5.81 1.34 0.99 -0.04 2.77
Vietnamese -0.97 0.80 -0.42 0.79 0.00 -0.19 5.72 12.63 2.03 1.12 1.09 5.67 0.52 0.37 0.20 2.80
Malay -1.28 0.50 -0.97 0.77 -0.07 0.00 5.84 13.45 1.77 -0.09 0.81 5.67 1.16 1.06 -0.09 3.03
Burmese 0.92 1.01 0.72 1.32 0.93 0.95 0.00 12.98 3.77 1.74 2.03 5.19 2.80 1.50 2.23 1.76
Lao 11.61 3.75 6.60 3.52 7.22 5.95 8.05 0.00 6.52 2.85 5.38 7.67 5.62 5.84 3.95 6.89
Cebuano 2.12 2.31 0.58 1.40 0.78 1.08 6.52 13.48 0.00 -0.08 2.72 5.97 2.06 1.07 0.71 3.72
Ilocano 3.08 1.05 1.05 1.38 0.67 1.35 5.58 11.38 2.27 0.00 -1.57 5.28 0.15 1.78 -0.48 2.42
Javanese -1.12 0.41 1.83 0.64 -0.64 -0.61 5.82 12.78 1.82 -1.84 0.00 5.74 0.84 1.79 -0.86 2.56
Khmer 4.21 1.24 2.16 1.50 1.50 1.85 6.22 13.30 4.13 2.00 3.46 0.00 4.19 2.14 2.48 3.47
Sundanese 0.31 0.74 1.19 0.69 0.31 0.17 5.93 12.69 2.00 -6.35 0.77 6.10 0.00 2.76 -4.68 2.68
Tagalog -0.03 1.31 0.16 1.05 -0.05 0.41 6.20 13.17 2.72 1.43 2.19 6.04 1.28 0.00 -0.08 3.25
Waray 1.35 0.62 -0.08 1.76 0.28 0.52 6.41 13.28 2.83 0.24 1.04 5.56 -0.09 0.97 0.00 3.40
Tamil -1.72 0.49 -1.67 -0.02 -1.11 -1.19 4.72 11.49 1.82 -0.13 0.73 5.24 0.79 -0.87 0.15 0.00
Chunk 87 · 1,995 chars
ChrF++ Score Difference (Sailor2-20B - Llama3.1-70B), continued; rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
Tagalog -0.03 1.31 0.16 1.05 -0.05 0.41 6.20 13.17 2.72 1.43 2.19 6.04 1.28 0.00 -0.08 3.25
Waray 1.35 0.62 -0.08 1.76 0.28 0.52 6.41 13.28 2.83 0.24 1.04 5.56 -0.09 0.97 0.00 3.40
Tamil -1.72 0.49 -1.67 -0.02 -1.11 -1.19 4.72 11.49 1.82 -0.13 0.73 5.24 0.79 -0.87 0.15 0.00
Figure 15: Comparison of BLEU and ChrF++ scores on the Flores Plus Translation Dataset across various source-target language pairs between Sailor2-20B and Llama3.1-70B. BLEU Score Difference = Sailor2 BLEU - Llama3.1 BLEU.
BLEU Score Difference (Sailor2-20B - NLLB-MoE-54B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 10.34 1.61 7.20 2.26 1.83 0.02 -0.19 4.44 -5.60 -0.71 1.19 4.14 0.32 -3.05 -5.37
Chinese 2.10 0.00 2.68 2.54 3.45 1.92 -0.20 -0.54 2.11 -2.39 1.59 3.12 1.60 2.16 1.49 -1.33
Indonesian 1.11 9.33 0.00 3.33 2.54 0.07 0.62 -2.18 1.41 -4.71 0.66 1.18 2.42 1.43 -2.87 -2.86
Thai 3.57 10.41 2.61 0.00 3.14 2.62 0.48 -3.26 1.78 -2.80 0.87 2.12 2.16 2.80 -0.18 -1.35
Vietnamese 1.36 8.43 1.11 4.90 0.00 1.27 0.61 -2.31 1.32 -3.78 -0.20 0.76 0.76 1.39 -1.18 -2.25
Malay 1.46 10.52 -0.02 2.27 2.87 0.00 0.58 -1.85 0.63 -4.32 0.75 -0.10 1.74 1.32 -1.15 -4.04
Burmese -2.65 8.90 -1.53 2.09 0.13 -0.56 0.00 -1.88 1.27 -3.37 0.38 1.33 1.02 0.83 -0.29 -2.16
Lao -0.19 7.02 0.84 1.66 1.48 0.68 0.15 0.00 -0.64 -3.02 0.12 1.99 1.41 1.42 -0.44 -3.21
Cebuano 0.50 11.81 0.11 6.93 1.54 -0.64 0.70 -2.17 0.00 -4.32 -1.10 2.52 2.26 1.04 -1.83 -2.39
Ilocano -4.11 10.51 -3.18 7.97 -1.42 -2.72 0.32 -1.96 0.90 0.00 -5.31 1.85 -0.18 -0.51 -4.52 -2.78
Javanese -0.42 11.00 1.81 2.51 1.61 0.37 0.04 -1.85 -1.73 -4.98 0.00 1.99 1.02 1.28 -1.84 -3.64
Khmer 1.16 8.21 0.93 2.37 1.69 1.31 0.39 -2.25 -0.22 -3.07 0.61 0.00 1.70 1.44 -0.72 -2.98
Sundanese -0.89 9.48 1.12 2.62 1.55 0.64 -0.24 -1.54
Chunk 88 · 1,995 chars
BLEU Score Difference (Sailor2-20B - NLLB-MoE-54B), continued; rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
Ilocano -4.11 10.51 -3.18 7.97 -1.42 -2.72 0.32 -1.96 0.90 0.00 -5.31 1.85 -0.18 -0.51 -4.52 -2.78
Javanese -0.42 11.00 1.81 2.51 1.61 0.37 0.04 -1.85 -1.73 -4.98 0.00 1.99 1.02 1.28 -1.84 -3.64
Khmer 1.16 8.21 0.93 2.37 1.69 1.31 0.39 -2.25 -0.22 -3.07 0.61 0.00 1.70 1.44 -0.72 -2.98
Sundanese -0.89 9.48 1.12 2.62 1.55 0.64 -0.24 -1.54 -1.19 -8.86 0.55 1.82 0.00 0.82 -7.02 -3.66
Tagalog -0.24 9.22 0.32 4.53 1.28 0.20 0.08 -2.56 3.03 -4.28 -1.33 1.38 1.28 0.00 -2.50 -3.75
Waray -1.55 10.49 -2.64 6.12 -0.99 -1.62 -0.13 -3.24 5.23 -4.46 -3.04 0.16 -0.20 0.63 0.00 -3.66
Tamil -5.21 9.11 -3.11 0.96 -1.62 -2.62 -0.12 -3.11 -1.20 -6.40 -1.66 -0.83 0.09 -1.75 -2.73 0.00
ChrF++ Score Difference (Sailor2-20B - NLLB-MoE-54B); rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
English 0.00 13.92 1.86 8.72 1.88 2.11 -1.96 -5.54 2.78 -6.49 -0.32 -3.38 5.45 0.34 -1.67 -9.07
Chinese 1.51 0.00 2.95 7.01 3.53 2.60 -2.15 -4.22 3.49 -2.54 2.41 -2.53 3.28 2.81 2.31 -4.49
Indonesian 0.98 12.04 0.00 7.35 2.17 0.72 -0.14 -6.15 1.71 -4.76 1.30 -5.46 3.22 1.77 -1.74 -6.46
Thai 2.86 11.31 2.34 0.00 2.87 2.85 0.27 -5.35 2.06 -3.05 0.76 -4.68 2.72 2.88 0.45 -4.49
Vietnamese 0.89 11.50 1.18 7.86 0.00 1.49 0.21 -6.09 1.52 -3.99 -0.01 -5.30 1.52 1.60 -0.46 -6.23
Malay 1.02 11.08 0.08 6.60 2.33 0.00 -0.36 -5.96 0.77 -4.93 1.00 -5.63 2.29 1.53 -0.57 -7.70
Burmese -2.03 9.73 -1.22 4.74 0.20 -0.49 0.00 -4.56 3.02 -3.11 2.36 -3.40 2.19 1.75 0.39 -5.90
Lao -0.50 7.50 0.32 5.43 1.27 0.76 -1.33 0.00 -0.01 -4.59 0.05 -4.21 1.80 1.25 -0.44 -7.79
Cebuano 0.33 12.88 -0.12 7.17 1.24 -0.22 0.88 -6.64 0.00 -4.98 -1.21 -3.86 2.05 1.32 -0.12 -5.15
Ilocano -4.40 9.84 -3.61 4.02 -1.78 -3.07 -0.34 -8.72 -0.71 0.00 -7.52 -4.92 -1.75 -0.76 -4.61 -7.27
Javanese 0.08 10.45 1.61 6.53 1.15 1.06 -2.24 -6.16 -0.77 -5.67 0.00 -3.19 1.22 1.40 -0.49 -7.91
Khmer 1.22 8.67 1.13 5.96 1.66 1.27 -2.64 -6.22 0.88 -3.66 0.66 0.00 2.32 1.41
Chunk 89 · 1,997 chars
ChrF++ Score Difference (Sailor2-20B - NLLB-MoE-54B), continued; rows: source language, columns: target language
Source \ Target: English Chinese Indonesian Thai Vietnamese Malay Burmese Lao Cebuano Ilocano Javanese Khmer Sundanese Tagalog Waray Tamil
Ilocano -4.40 9.84 -3.61 4.02 -1.78 -3.07 -0.34 -8.72 -0.71 0.00 -7.52 -4.92 -1.75 -0.76 -4.61 -7.27
Javanese 0.08 10.45 1.61 6.53 1.15 1.06 -2.24 -6.16 -0.77 -5.67 0.00 -3.19 1.22 1.40 -0.49 -7.91
Khmer 1.22 8.67 1.13 5.96 1.66 1.27 -2.64 -6.22 0.88 -3.66 0.66 0.00 2.32 1.41 0.34 -7.80
Sundanese -0.52 7.90 0.86 5.25 1.42 0.67 -2.44 -6.40 -0.64 -11.28 0.92 -3.53 0.00 1.38 -6.33 -7.76
Tagalog -0.24 12.77 0.54 7.21 0.94 0.61 -0.09 -7.65 2.71 -4.33 -1.30 -4.77 1.39 0.00 -1.55 -6.66
Waray -1.35 10.09 -2.31 5.44 -1.04 -1.45 -1.39 -7.51 3.95 -4.80 -3.47 -5.75 -1.09 0.60 0.00 -7.47
Tamil -4.67 8.30 -2.92 3.92 -1.80 -2.21 -2.19 -8.62 -1.09 -7.04 -2.59 -6.51 -0.29 -1.89 -2.50 0.00
Figure 16: Comparison of BLEU and ChrF++ scores on the Flores Plus Translation Dataset across various source-target language pairs between Sailor2-20B and NLLB-MoE-54B. BLEU Score Difference = Sailor2 BLEU - NLLB BLEU. We noticed that NLLB failed to generate complete long Chinese sentences (https://github.com/facebookresearch/fairseq/issues/5549), and we also found that many common Chinese characters and punctuation marks are tokenized to <unk>.
Table 24: Performance on Flores Plus (chrF++) for Qwen2.5-72B
Source (↓) / Target (→): eng cmn ind tha vie zsm mya lao ceb ilo jav khm sun fil war tam
English (eng) - 34.3 66.5 47.9 60.8 62.4 19.9 24.8 44.4 31.8 36.8 21.9 37.2 53.1 43.7 31.2
Chinese (cmn) 59.9 - 54.4 41.7 53.1 50.6 17.9 21.5 37.1 26.6 31.5 20.1 32.9 44.9 36.8 26.5
Indonesian (ind) 68.7 30.1 - 44.7 56.1 57.3 19.0 23.9 40.4 30.2 38.3 21.4 39.1 48.9 38.1 28.7
Thai (tha) 62.3 28.9 56.6 - 54.1 53.1 18.6 25.2 38.3 26.6 33.5 20.8 34.3 46.3 36.5 27.3
Vietnamese (vie) 63.9 29.8 57.2 43.6 - 53.3 18.3 23.3 38.3 28.2 34.0 21.1 34.6 47.1 36.5 27.4
Malay (zsm) 68.8 29.9 60.9 44.1 55.0 - 19.1 23.8 40.0 29.5 36.9 21.6 37.1 48.7 38.2 28.9
Burmese (mya) 43.1 18.9 42.6
Chunk 90 · 1,997 chars
Table 24 (continued): Performance on Flores Plus (chrF++) for Qwen2.5-72B
Source (↓) / Target (→): eng cmn ind tha vie zsm mya lao ceb ilo jav khm sun fil war tam
Thai (tha) 62.3 28.9 56.6 - 54.1 53.1 18.6 25.2 38.3 26.6 33.5 20.8 34.3 46.3 36.5 27.3
Vietnamese (vie) 63.9 29.8 57.2 43.6 - 53.3 18.3 23.3 38.3 28.2 34.0 21.1 34.6 47.1 36.5 27.4
Malay (zsm) 68.8 29.9 60.9 44.1 55.0 - 19.1 23.8 40.0 29.5 36.9 21.6 37.1 48.7 38.2 28.9
Burmese (mya) 43.1 18.9 42.6 33.2 39.4 40.4 - 18.4 30.1 21.2 24.8 17.7 25.2 37.8 28.3 24.3
Lao (lao) 55.0 24.8 51.9 42.9 48.9 48.6 17.1 - 34.1 24.1 29.1 21.7 30.5 43.3 32.2 25.3
Cebuano (ceb) 60.1 26.3 52.5 39.0 48.4 49.3 17.9 20.4 - 33.6 25.4 19.6 30.0 49.2 45.2 26.6
Ilocano (ilo) 44.7 21.2 42.3 32.3 39.0 39.5 15.4 17.1 36.7 - 21.8 16.8 25.4 41.8 36.4 22.0
Javanese (jav) 55.6 24.6 53.0 37.2 47.1 49.2 17.1 20.2 34.4 26.7 - 19.5 34.6 41.6 32.8 25.3
Khmer (khm) 52.3 24.4 50.3 39.4 47.1 47.5 16.3 22.2 33.1 22.4 29.7 - 30.3 42.8 31.3 25.6
Sundanese (sun) 54.8 24.8 53.7 37.8 47.0 48.8 16.5 20.1 34.0 26.3 35.2 19.4 - 41.6 32.3 24.9
Tagalog (fil) 69.0 29.1 58.4 43.2 54.4 54.7 18.9 22.0 45.1 35.3 28.5 20.8 31.6 - 42.6 28.7
Waray (war) 59.6 25.8 52.1 38.3 48.0 48.7 16.9 19.9 46.9 33.9 26.6 19.1 29.7 49.1 - 26.4
Tamil (tam) 50.4 21.6 47.1 35.7 43.7 45.1 18.2 18.2 34.1 23.6 27.5 18.6 28.4 41.7 31.5 -
Table 25: Performance on Flores Plus (chrF++) for Llama3.1-70B
Source (↓) / Target (→): eng cmn ind tha vie zsm mya lao ceb ilo jav khm sun fil war tam
English (eng) - 32.3 67.3 47.9 61.1 66.2 21.3 22.7 54.5 46.2 51.1 21.5 45.3 57.8 52.8 37.9
Chinese (cmn) 59.0 - 54.2 41.4 52.8 53.4 19.3 19.2 45.2 38.9 41.9 19.6 38.8 48.0 43.2 31.9
Indonesian (ind) 68.5 27.8 - 44.6 56.2 60.6 20.4 21.6 49.3 42.2 47.6 20.7 44.6 53.3 46.5 34.6
Thai (tha) 61.2 26.9 57.1 - 53.8 55.9 19.6 21.8 46.9 40.2 44.0 20.3 40.0 49.8 44.5 32.6
Vietnamese (vie) 63.0 27.2 57.9 43.6 - 56.6 19.9 20.9 47.6 40.2 44.2 20.5 41.1 51.1 44.9 33.3
Malay (zsm) 69.4 28.2 61.4 44.1 55.6 - 20.5 21.9 49.7 42.6 48.5 20.9 43.4 53.0 47.4 34.4
Burmese (mya) 51.0 21.9 48.5 37.0 46.0 48.5 - 16.8 41.0 35.3 37.9 18.1 33.9 45.0 39.6 31.5
Lao
Chunk 91 · 1,997 chars
Table 25 (continued): Performance on Flores Plus (chrF++) for Llama3.1-70B
Source (↓) / Target (→): eng cmn ind tha vie zsm mya lao ceb ilo jav khm sun fil war tam
Vietnamese (vie) 63.0 27.2 57.9 43.6 - 56.6 19.9 20.9 47.6 40.2 44.2 20.5 41.1 51.1 44.9 33.3
Malay (zsm) 69.4 28.2 61.4 44.1 55.6 - 20.5 21.9 49.7 42.6 48.5 20.9 43.4 53.0 47.4 34.4
Burmese (mya) 51.0 21.9 48.5 37.0 46.0 48.5 - 16.8 41.0 35.3 37.9 18.1 33.9 45.0 39.6 31.5
Lao (lao) 49.0 22.0 49.3 41.4 45.1 49.6 17.4 - 42.3 37.1 39.6 19.5 35.5 45.3 40.1 28.5
Cebuano (ceb) 65.2 26.2 57.2 42.2 52.7 56.2 20.0 19.8 - 44.6 43.0 20.4 39.4 54.4 51.0 33.2
Ilocano (ilo) 55.0 23.7 50.6 37.4 47.3 50.3 18.8 18.3 47.2 - 39.5 18.7 36.5 49.6 45.7 31.0
Javanese (jav) 63.1 26.0 56.9 41.0 52.0 57.0 19.4 19.8 46.5 40.0 - 20.2 41.5 50.0 43.5 32.5
Khmer (khm) 55.4 25.2 53.3 41.4 50.0 53.0 18.6 20.2 44.7 38.1 41.3 - 37.3 48.3 41.8 31.0
Sundanese (sun) 60.5 25.2 57.8 41.2 51.2 56.1 19.5 19.6 45.5 39.7 46.0 20.2 - 48.3 43.0 32.5
Tagalog (fil) 69.5 27.9 60.0 43.5 55.1 59.1 20.2 20.2 52.6 44.8 45.1 20.5 41.1 - 50.4 34.2
Waray (war) 66.4 26.6 57.4 41.9 52.8 56.9 19.9 19.9 54.0 45.5 43.4 19.9 39.9 54.9 - 33.4
Tamil (tam) 58.2 23.8 53.0 39.3 49.5 52.7 19.9 17.3 44.9 38.6 40.8 18.5 37.2 48.4 42.7 -
Table 26: Translation Performance (chrF++) for NLLB-MoE-54B
Source (↓) / Target (→): eng cmn ind tha vie zsm mya lao ceb ilo jav khm sun fil war tam
English (eng) - 18.9 65.8 39.8 58.8 64.7 30.1 43.2 54.9 52.5 52.4 32.6 42.3 58.9 55.0 50.2
Chinese (cmn) 56.2 - 50.7 34.8 49.1 50.4 26.8 35.0 44.2 41.5 40.2 27.6 36.1 45.3 40.5 39.0
Indonesian (ind) 66.1 16.1 - 38.0 53.6 58.6 26.2 40.7 49.6 47.0 47.9 32.0 42.8 52.3 48.3 43.6
Thai (tha) 57.7 16.2 54.4 - 50.6 53.1 25.2 39.3 47.0 44.1 44.0 30.8 38.6 47.9 44.0 39.9
Vietnamese (vie) 61.2 16.6 56.3 36.6 - 54.9 25.4 39.6 48.1 45.3 45.3 31.4 40.1 49.8 45.6 42.3
Malay (zsm) 67.1 17.6 60.4 38.3 53.2 - 26.7 41.3 50.7 47.4 48.4 32.2 42.3 52.5 47.9 45.1
Burmese (mya) 54.0 13.2 50.5 33.6 46.7 49.9 - 34.4 41.8 40.1 37.5 26.6 34.5 44.7 41.5 39.2
Lao (lao) 61.1 18.3 55.6 39.5 51.0 54.8 26.8 - 48.8 44.6 45.0 31.4 39.4
Chunk 92 · 1,990 chars
Table 26 (continued): Translation Performance (chrF++) for NLLB-MoE-54B
Source (↓) / Target (→): eng cmn ind tha vie zsm mya lao ceb ilo jav khm sun fil war tam
Malay (zsm) 67.1 17.6 60.4 38.3 53.2 - 26.7 41.3 50.7 47.4 48.4 32.2 42.3 52.5 47.9 45.1
Burmese (mya) 54.0 13.2 50.5 33.6 46.7 49.9 - 34.4 41.8 40.1 37.5 26.6 34.5 44.7 41.5 39.2
Lao (lao) 61.1 18.3 55.6 39.5 51.0 54.8 26.8 - 48.8 44.6 45.0 31.4 39.4 49.9 44.5 43.2
Cebuano (ceb) 67.0 15.7 57.9 36.5 52.3 57.5 25.6 39.9 - 49.5 47.0 30.2 39.4 54.2 51.9 42.0
Ilocano (ilo) 62.5 14.9 55.3 34.7 49.8 54.7 24.7 38.4 50.2 - 45.4 28.9 38.4 52.1 49.9 40.7
Javanese (jav) 61.9 15.9 57.1 35.1 50.2 55.3 27.4 38.7 49.1 43.9 - 29.1 41.1 50.4 43.1 43.0
Khmer (khm) 58.4 17.8 54.3 37.0 49.8 53.6 27.4 39.7 47.9 43.8 44.1 - 39.1 49.1 43.9 42.3
Sundanese (sun) 61.3 18.1 58.2 36.6 50.1 55.6 27.9 38.7 48.1 44.6 45.9 29.8 - 49.7 44.6 42.9
Tagalog (fil) 69.7 16.4 59.6 37.3 54.1 58.9 26.5 41.0 52.6 50.6 48.6 31.3 41.0 - 51.8 44.1
Waray (war) 69.1 17.1 59.6 38.2 54.1 58.9 27.7 40.7 52.8 50.6 47.9 31.2 40.9 55.3 - 44.3
Tamil (tam) 61.1 15.9 54.2 35.3 50.2 53.7 26.8 37.4 47.8 45.5 44.1 30.2 38.2 49.5 45.4 -
Table 27: Performance Comparison on CulturalBench across Models.
Model | Hard: FIL ID MS SG TH VI AVG | Easy: FIL ID MS SG TH VI AVG | CulturalBench
SEA LION 7B | 72.22 58.65 27.27 66.30 74.07 45.37 57.31 | 11.11 15.38 18.18 13.04 7.41 14.81 13.32 | 35.32
Sailor2-1b | 26.11 25.00 27.27 51.09 74.07 25.00 38.09 | 31.11 46.15 45.45 30.43 37.04 44.44 39.10 | 38.60
Qwen2.5-0.5b | 26.11 25.00 27.27 55.43 74.07 25.00 38.81 | 42.22 53.85 45.45 43.48 55.56 33.33 45.65 | 42.23
Llama-3.1-8B | 28.33 31.73 27.27 57.61 25.93 39.81 35.11 | 53.33 73.08 54.55 60.87 74.07 48.15 60.68 | 47.89
SeaLLMs-v3-7B | 52.22 27.88 27.27 60.87 74.07 31.48 45.63 | 42.22 57.69 54.55 56.52 66.67 51.85 54.92 | 50.27
SeaLLM-7B-Hybrid | 70.56 36.54 29.55 67.39 74.07 75.93 59.01 | 37.78 69.23 45.45 39.13 40.74 51.85 47.36 | 53.19
Gemma-7b | 44.44 47.12 50.00 60.87 75.00 34.26 51.95 | 62.22 65.38 36.36 34.78 70.37 59.26 54.73
Chunk 93 · 1,999 chars
Table 27 (continued): Performance Comparison on CulturalBench across Models.
Model | Hard: FIL ID MS SG TH VI AVG | Easy: FIL ID MS SG TH VI AVG | CulturalBench
SeaLLMs-v3-7B | 52.22 27.88 27.27 60.87 74.07 31.48 45.63 | 42.22 57.69 54.55 56.52 66.67 51.85 54.92 | 50.27
SeaLLM-7B-Hybrid | 70.56 36.54 29.55 67.39 74.07 75.93 59.01 | 37.78 69.23 45.45 39.13 40.74 51.85 47.36 | 53.19
Gemma-7b | 44.44 47.12 50.00 60.87 75.00 34.26 51.95 | 62.22 65.38 36.36 34.78 70.37 59.26 54.73 | 53.34
Mistral-7b | 76.11 50.00 47.73 64.13 73.15 32.41 57.26 | 66.67 61.54 36.36 60.87 55.56 55.56 56.09 | 56.67
Sailor-7b | 26.11 49.04 68.18 67.39 76.85 57.41 57.50 | 44.44 69.23 72.73 60.87 55.56 62.96 60.97 | 59.23
Sailor2-8b | 48.89 41.35 34.09 59.78 80.56 43.52 51.37 | 64.44 69.23 63.64 60.87 81.48 74.07 68.96 | 60.16
Qwen2.5-7b | 45.56 66.35 36.36 63.04 81.48 60.19 58.83 | 62.22 61.54 36.36 65.22 85.19 66.67 62.87 | 60.85
Gemma2-9b | 76.67 44.23 38.64 67.39 68.52 42.59 56.34 | 64.44 69.23 81.82 60.87 77.78 70.37 70.75 | 63.55
Qwen2.5-14b | 44.44 69.23 52.27 66.30 82.41 61.11 62.63 | 64.44 69.23 72.73 69.57 85.19 74.07 72.54 | 67.58
Qwen2.5-32b | 45.56 57.69 36.36 72.83 87.04 55.56 59.17 | 68.89 92.31 72.73 78.26 92.59 70.37 79.19 | 69.18
Gemma2-27b | 66.67 55.77 56.82 66.30 74.07 49.07 61.45 | 84.44 76.92 72.73 86.96 85.19 59.26 77.58 | 69.52
Sailor2-20b | 65.56 64.42 54.55 66.30 83.33 52.78 64.49 | 80.00 84.62 81.82 73.91 77.78 77.78 79.32 | 71.90
Qwen2.5-72b | 61.67 63.46 59.09 83.70 88.89 65.74 70.43 | 82.22 80.77 100.00 82.61 92.59 77.78 86.00 | 78.21
Table 28: Performance Comparison on BLEnD across Models.
Model | INST4-ID PERS3-ID ID | INST4-JB PERS4-JB JB | Overall
Qwen2.5-0.5b | 22.71 20.83 21.77 | 6.11 5.68 5.90 | 13.83
Llama-3.1-8B | 16.25 19.58 17.92 | 10.70 9.83 10.27 | 14.09
SeaLLMs-v3-7B | 30.21 30.21 30.21 | 8.95 10.04 9.50 | 19.85
SEA LION 7B | 33.54 34.17 33.86 | 11.79 11.14 11.47 | 22.66
SeaLLM-7B-Hybrid | 45.83 46.04 45.94 | 18.56 15.07 16.82 | 31.38
Qwen2.5-7b | 49.58 49.17 49.38 | 19.00 18.78 18.89 | 34.13
Sailor2-1b | 46.67 44.38 45.53 | 22.27 23.80 23.04 | 34.28
Mistral-7b | 52.08 50.42 51.25 | 19.21 17.25 18.23 | 34.74
Gemma-7b | 48.33 46.88 47.61 | 25.98 24.24 25.11 | 36.36
Sailor-7b | 56.04 55.21 55.63 | 20.96 21.83
Chunk 94 · 1,526 chars
Table 28 (continued): Performance Comparison on BLEnD across Models.
Model | INST4-ID PERS3-ID ID | INST4-JB PERS4-JB JB | Overall
SeaLLM-7B-Hybrid | 45.83 46.04 45.94 | 18.56 15.07 16.82 | 31.38
Qwen2.5-7b | 49.58 49.17 49.38 | 19.00 18.78 18.89 | 34.13
Sailor2-1b | 46.67 44.38 45.53 | 22.27 23.80 23.04 | 34.28
Mistral-7b | 52.08 50.42 51.25 | 19.21 17.25 18.23 | 34.74
Gemma-7b | 48.33 46.88 47.61 | 25.98 24.24 25.11 | 36.36
Sailor-7b | 56.04 55.21 55.63 | 20.96 21.83 21.40 | 38.51
Qwen2.5-32b | 53.54 53.96 53.75 | 27.73 30.35 29.04 | 41.40
Qwen2.5-14b | 57.50 56.46 56.98 | 27.07 25.55 26.31 | 41.65
Gemma2-9b | 60.21 59.79 60.00 | 34.28 31.88 33.08 | 46.54
Sailor2-8b | 65.00 63.33 64.17 | 36.68 35.15 35.92 | 50.04
Gemma2-27b | 66.04 65.83 65.94 | 36.03 36.90 36.47 | 51.20
Qwen2.5-72b | 67.50 65.42 66.46 | 37.55 36.68 37.12 | 51.79
Sailor2-20b | 68.75 66.67 67.71 | 38.86 38.21 38.54 | 53.12
Table 29: Performance Comparison on Global-MMLU across Models.
Model | ID FIL MS VI | Overall
SEA LION 7B | 24.83 27.03 24.75 25.97 | 25.65
Sailor2-1b | 33.56 31.26 32.40 32.52 | 32.44
Qwen2.5-0.5b | 36.03 27.75 33.91 36.04 | 33.43
SeaLLM-7B-Hybrid | 37.66 36.61 36.04 35.61 | 36.48
Sailor-7b | 43.03 30.64 43.23 41.08 | 39.50
SeaLLMs-v3-7B | 43.65 36.55 40.19 42.73 | 40.78
Mistral-7b | 45.73 42.85 43.51 39.99 | 43.02
Llama-3.1-8B | 52.94 47.59 50.08 50.68 | 50.32
Gemma-7b | 52.53 52.66 52.66 49.47 | 51.83
Sailor2-8b | 59.34 57.50 56.70 56.71 | 57.56
Qwen2.5-7b | 61.34 53.52 57.39 60.85 | 58.28
Gemma2-9b | 60.72 60.54 59.63 57.05 | 59.49
Gemma2-27b | 66.57 66.40 65.12 62.54 | 65.16
Qwen2.5-14b | 69.83 61.66 64.66 66.74 | 65.72
Sailor2-20b | 69.98 68.06 68.32 66.71 | 68.27
Qwen2.5-32b | 74.13 67.73 70.95 72.53 | 71.34
Qwen2.5-72b | 76.41 74.84 77.07 77.68 | 76.50
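The Overall column of Table 29 appears consistent with an unweighted mean of the four per-language scores. The report does not state the aggregation rule in this section, so that reading is an assumption. A minimal sketch that checks it on four rows copied from the table (the helper names and the rounding tolerance are illustrative choices):

```python
# Rows copied from Table 29 (Global-MMLU): ([ID, FIL, MS, VI], reported Overall).
TABLE_29 = {
    "SEA LION 7B": ([24.83, 27.03, 24.75, 25.97], 25.65),
    "Sailor2-8b": ([59.34, 57.50, 56.70, 56.71], 57.56),
    "Sailor2-20b": ([69.98, 68.06, 68.32, 66.71], 68.27),
    "Qwen2.5-72b": ([76.41, 74.84, 77.07, 77.68], 76.50),
}


def unweighted_mean(scores):
    """Plain arithmetic mean of the per-language scores."""
    return sum(scores) / len(scores)


def matches_reported(scores, reported, tol=0.006):
    """True if the mean agrees with the reported value up to two-decimal rounding.

    The tolerance is slightly above 0.005 so that exact rounding ties still pass.
    """
    return abs(unweighted_mean(scores) - reported) <= tol


for model, (scores, reported) in TABLE_29.items():
    status = "consistent" if matches_reported(scores, reported) else "differs"
    print(f"{model}: mean {unweighted_mean(scores):.4f}, reported {reported:.2f}, {status}")
```

The same check applies to the AVG columns of Table 27 and the ID and JB columns of Table 28, which also appear to be plain means of their component columns.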