SeaLLMs - Large Language Models for Southeast Asia
Summary
SeaLLMs are a series of large language models designed to address the linguistic bias in AI systems that favors high-resource languages like English. Developed by Alibaba's DAMO Academy, SeaLLMs focus on Southeast Asian (SEA) languages, including Thai, Vietnamese, Indonesian, and others. The models are built by continuing pre-training on English-centric models such as Llama-2 and Mistral-7B, with an extended vocabulary tailored to SEA languages. This approach improves tokenization efficiency for non-Latin scripts, which are often poorly represented in standard tokenizers. SeaLLMs also undergo specialized instruction tuning and self-preferencing alignment to better reflect local cultural norms and legal considerations. Evaluation shows that SeaLLMs outperform comparable open-source models and ChatGPT-3.5 in non-Latin SEA languages such as Thai, Khmer, Lao, and Burmese, while remaining lightweight and cost-effective. The models are available in multiple versions (v1, v2, v2.5), each bringing improvements in areas such as math reasoning and multilingual instruction-following. SeaLLMs aim to democratize AI access and preserve linguistic diversity in Southeast Asia.
Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing
DAMO Academy, Alibaba Group (equal contributions as marked in the original paper; corresponding author: l.bing@alibaba-inc.com)
Video: https://youtu.be/s0mBrHYD_H4
Website: https://damo-nlp-sg.github.io/SeaLLMs
arXiv:2312.00738v2 [cs.CL]

Abstract

Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon popular English-centric models through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.

1 Introduction

The advent of large language models (LLMs) has radically transformed the field of natural language processing, demonstrating remarkable abilities in text generation, comprehension, and decision-making tasks (Brown et al., 2020; OpenAI, 2023a,b; Touvron et al., 2023a,b; Thoppilan et al., 2022; Jiang et al., 2023; Wei et al., 2023; Bai et al., 2023). While the proficiencies of these models are extraordinary, the majority of existing LLMs embody a linguistic hierarchy overwhelmingly dominated by English (Ahuja et al., 2023; Lai et al., 2023; Zhang et al., 2023). This dominance undermines the multilingual capability of such models, with particularly prejudicial outcomes for lower-resource and regional languages, where data scarcity and tokenization challenges lead to disproportionately poor model performance. This linguistic disparity not only impedes access to state-of-the-art AI technologies for non-English-speaking populations but also risks cultural homogenization and the loss of linguistic diversity. While hyper-polyglot models exist (Scao et al., 2022; Muennighoff et al., 2022; Wei et al., 2023), they may pay a high cost for high-resource language performance while lacking in multilingual instruction-following abilities.
Recognizing the urgent need to democratize AI and empower linguistically diverse regions, we introduce SeaLLMs (https://github.com/DAMO-NLP-SG/SeaLLMs), a suite of specialized language models optimized for Southeast Asian languages: English (Eng), Chinese (Zho), Indonesian (Ind), Vietnamese (Vie), Thai (Tha), Khmer (Khm), Lao, Malay (Msa), Burmese (Mya) and Tagalog (Tgl). These languages, while rich and diverse, often lack the extensive dataset support available for more widely spoken languages, resulting in a stark performance gap in existing LLM applications.

As a long-term continuous effort, as of this writing, SeaLLMs come in three versions (v1, v2, v2.5). SeaLLM-13B-v1, which was pre-trained from Llama-2-13B, eclipses the performance of most available open-source LLMs in a comprehensive array of tasks including world knowledge assessments, language comprehension, and generative capabilities in SEA languages. For English and other high-resource languages, SeaLLMs not only preserve but also demonstrate enhanced performance in tasks that were part of the original Llama training set. When evaluated on multilingual instruction-following tasks with GPT-4 as a judge (Zheng et al., 2023), SeaLLM-13B-v1 outperforms ChatGPT-3.5 by large margins in less-represented languages such as Khmer, Lao or Burmese. Meanwhile, SeaLLM-7B-v2, which was pre-trained from Mistral-7B (Jiang et al., 2023), demonstrates better performance in math and commonsense reasoning than comparable baselines, surpassing ChatGPT-3.5 in reasoning for common SEA languages while being much smaller in size. Later, SeaLLM-7B-v2.5, which was further pre-trained from Gemma-7B (Team et al., 2024), shows significant improvements in SEA languages over SeaLLM-7B-v2.

Figure 1: Sea-bench (Section 4.2) scores as evaluated by GPT-4 (Zheng et al., 2023) for different models. Each radar chart compares scores averaged across 5 categories (left) and 9 languages (right). A detailed breakdown by category and language is given in Figure 5 in the Appendix.
Figure 2 illustrates the four-stage training process of SeaLLMs. In the first stage, detailed in Section 2.3, we conduct continuous pre-training from the foundational models (Touvron et al., 2023b; Jiang et al., 2023) with an extended vocabulary tailored for SEA languages. Next, we fine-tune the model in a novel hybrid paradigm with a mixture of multilingual pre-training data and English-dominant instruction fine-tuning data (Section 3.2). The following stage subsequently fine-tunes the model on a balanced and custom-built multilingual SFT dataset. Finally, we conduct self-preferencing alignment optimization using the SeaLLM model itself, without relying on human annotators or more powerful LLMs (OpenAI, 2023b).
2 Pre-training

2.1 Pre-training Data

The pre-training data comprises a heterogeneous assortment of documents sourced from several publicly accessible repositories (Suárez et al., 2019; Raffel et al., 2019; Computer, 2023; Foundation). Specifically, during the creation of the pre-training data, we include web-based corpora such as Common Crawl (Wenzek et al., 2020), journalistic content such as CC-News, text corpora with expertly curated knowledge such as Wikipedia (Foundation), and some scholarly publications. After collecting the data, we employ a language identifier (Bojanowski et al., 2017) to retain the documents in the major languages of Southeast Asia, namely Thai, Vietnamese, Indonesian, Chinese, Khmer, Lao, Malay, Burmese, and Tagalog, and discard the remaining ones. Subsequent stages of data refinement involve multiple modules dedicated to data cleansing and content filtration. We blend such data with the highest-quality English data from the RedPajama subset (Computer, 2023) in more balanced ratios, as we found that such English data are useful to preserve the originally learned knowledge.
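A minimal sketch of the language-based filtering step, assuming the publicly available fastText lid.176 model and an illustrative confidence threshold (the paper cites fastText (Bojanowski et al., 2017) but does not specify the exact model or threshold):

```python
# Hedged sketch: keep only documents confidently identified as one of the SEA
# target languages. Model file name, threshold and code mapping are assumptions.
import fasttext

SEA_LANGS = {"tha", "vie", "ind", "zho", "khm", "lao", "msa", "mya", "tgl", "eng"}
# lid.176 predicts mostly ISO 639-1 codes; map them to the paper's 3-letter labels.
ISO1_TO_LABEL = {"th": "tha", "vi": "vie", "id": "ind", "zh": "zho", "km": "khm",
                 "lo": "lao", "ms": "msa", "my": "mya", "tl": "tgl", "en": "eng"}

lid_model = fasttext.load_model("lid.176.ftz")  # downloadable from the fastText website

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Return True if the document is confidently identified as a target language."""
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    iso1 = labels[0].replace("__label__", "")
    return probs[0] >= threshold and ISO1_TO_LABEL.get(iso1) in SEA_LANGS
```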
2.2 Vocabulary Expansion

Table 1 describes how expensive it is to process an under-represented non-Latin language. For example, encoding a single sentence in Thai requires 4.3 times more tokens than its English equivalent. The reason for this is that most English-centric language models employ a BPE tokenizer (Sennrich et al., 2016) that inefficiently segments texts from non-Latin scripts into disproportionately lengthy byte sequences, which inadequately represent the underlying semantic content, resulting in diminished model performance (Nguyen et al., 2023). To that end, we propose a novel vocabulary expansion technique, as formally described in Algorithm 1 in the Appendix. This technique involves recursively merging whole-word and sub-word token pieces of a new language from a highly multilingual target tokenizer (i.e., the NLLB tokenizer (Costa-jussà et al., 2022)) into the existing LLM tokenizer. This new set of retrieved tokens is then pruned to remove rarely appearing and low-quality tokens before being added to the final SeaLLM tokenizer.

Figure 2: Complete training process of SeaLLMs. It begins with continual pre-training of Llama-2 with more data in regional languages. The models then undergo a specialized fine-tuning process with multilingual SFT data, before finally being tuned with self-preferencing alignment. (Pipeline: Llama-2 → Continual Pre-training → Pre-train & SFT Hybrid → SFT → Self-Preferencing Optimization.)

Table 1: Averaged compression ratios between the tokenized length of texts in each language produced by different tokenizers and the baseline tokenized length of same-meaning English equivalents produced by the ChatGPT tokenizer (i.e., it costs 15.6x more tokens to encode Khmer than English with the ChatGPT tokenizer). SeaLLM's ratios are applicable only to v1 and v2.

Language | ChatGPT's | Llama's | SeaLLM's
Vie | 4.41 | 3.46 | 1.48
Zho | 2.80 | 2.36 | 1.40
Tha | 9.09 | 5.10 | 1.87
Ind | 2.00 | 2.09 | 1.36
Khm | 15.56 | 12.14 | 2.67
Lao | 13.29 | 13.50 | 2.07
Msa | 2.07 | 2.16 | 1.50
Mya | 17.11 | 9.85 | 1.93
Tgl | 2.28 | 2.22 | 1.91
Eng | 1.00 (baseline) | 1.19 | 1.19

Table 1 demonstrates the efficiency of the new vocabulary. The compression ratio for Thai text has markedly improved from 4.29 to 1.57, signifying a 2.7-fold increase in the length of Thai text that can be encoded within the same context constraints. At the same time, the compression of English text has experienced a negligible reduction of 0.3%, thus maintaining its tokenization effectiveness.
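A sketch of how such compression ratios can be estimated, assuming a set of same-meaning parallel sentences and the Hugging Face SeaLLM tokenizer (the exact measurement protocol and model repository are assumptions, not stated in the paper):

```python
# Hedged sketch: ratio of target-language token count (SeaLLM tokenizer) to the
# English baseline token count under the ChatGPT (tiktoken) tokenizer.
import tiktoken
from transformers import AutoTokenizer

chatgpt_enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
sea_tok = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM-7B-v2")  # assumed repo id

def compression_ratio(parallel_pairs):
    """parallel_pairs: list of (english_sentence, target_language_sentence)."""
    target_tokens = sum(len(sea_tok.encode(tgt)) for _, tgt in parallel_pairs)
    english_baseline = sum(len(chatgpt_enc.encode(en)) for en, _ in parallel_pairs)
    return target_tokens / english_baseline
```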
We applied our vocabulary expansion for SeaLLM v1 and v2, with Llama-2 and Mistral-7B as backbones, due to their limited 32K-token vocabulary. However, we did not extend the tokenizer for SeaLLM-7B-v2.5, which inherits a sufficiently large 250K-token vocabulary from Gemma-7B.

2.3 Pre-training Process

We organize our pre-training dataset based on the language of the content and the quality of the data, as mentioned in Section 2.1. We set up a separate stream of data for each language, and dynamically control and balance the sampling ratio of each language. We pack multilingual documents into a single sequence up to the maximum context length. During the last steps of pre-training, we re-feed the model with high-quality data it has previously seen, to readjust the model's learning focus back towards the high-quality data and improve the model's performance.
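A minimal sketch of packing tokenized documents into fixed-length training sequences, as described above; the EOS separator id and the 4096-token context length are illustrative assumptions:

```python
# Hedged sketch: concatenate documents (separated by EOS) and cut into
# max_len-sized training sequences.
def pack_documents(token_streams, max_len=4096, eos_id=2):
    """token_streams: iterable of token-id lists, one per document."""
    buffer, packed = [], []
    for doc in token_streams:
        buffer.extend(doc + [eos_id])
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])  # emit one full training sequence
            buffer = buffer[max_len:]
    return packed
```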
3 Supervised Fine-tuning (SFT)

3.1 Supervised Fine-tuning Data

Our supervised fine-tuning (SFT) data consists of many categories, including text understanding and processing, math and logical reasoning, user-centric instruction-following, and natural dialog data. As most public and open-source SFT data are English-only (Longpre et al., 2023; Lian et al., 2023; Mukherjee et al., 2023; Lee et al., 2023), various techniques were implemented to enhance the multilingual aspect of the model. These include sourcing natural data from local websites in natural settings, selectively translating from English data, employing self-instruction, and using advanced prompting techniques (Wang et al., 2022; Madaan et al., 2023; Nguyen et al., 2023). As such synthetically generated data may remain incorrect or low-quality, native speakers (hired by our organization; they are not co-authors) were then engaged to further verify, filter, and edit the synthetic responses to finalize the SFT dataset. We find that engaging the annotators to verify and modify model-generated responses is more efficient than having them write responses from scratch. Safety-related data also played a crucial role in fine-tuning SeaLLMs. We manually collected and prepared country-relevant safety data, which covered a broad range of culturally and legally sensitive topics in each of these countries. This was necessary as such topics are often overlooked or may even conflict with open-source English-centric safety data (Deng et al., 2023).

For SeaLLM-7B-v2 and SeaLLM-7B-v2.5, we incorporate significantly more SFT data relating to math and commonsense reasoning. Such data is synthetically generated with SeaLLM-13B-v1, as well as strong English models (Jiang et al., 2024; Bai et al., 2023), using a combination of few-shot paraphrasing and translation techniques (Yu et al., 2023).
3.2 Supervised Fine-tuning

Pre-train and SFT Hybrid. As our SFT data is still significantly English-dominant due to the contributions of open-source data, directly conducting SFT on it may overshadow the smaller SEA-language datasets. Therefore, we propose incorporating an additional step prior to complete fine-tuning, namely Pre-train & SFT Hybrid. In this step, the model is further trained on a combination of the pre-training corpus and a large portion of the English SFT data, leaving the remaining, more balanced amount of English SFT data to the next stage. During this hybrid stage, the model processes both general pre-training content and instruction-following examples. We mask the source side of the instruction or supervised data to prevent the model from overfitting to the training examples and to reduce the risk of it simply memorizing the input data instead of learning the more generalized ability to follow instructions.

Supervised Fine-tuning. We conduct supervised fine-tuning by compiling instructions from the variety of sources explained in Section 3.1, combining them at random into a single, consolidated sequence to maximize efficiency. To enhance the multi-turn conversation capability, in the later stage of fine-tuning, we further artificially create multi-turn conversations by randomly joining several single-turn instructions together.
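A sketch of what masking the source side of an instruction example can look like in practice; using -100 as the ignore index follows the common PyTorch/Hugging Face convention, and the exact implementation in SeaLLMs is not specified in the paper:

```python
# Hedged sketch: the loss is computed only on response tokens; prompt positions
# carry the ignore_index label and contribute no gradient.
def build_example(tokenizer, prompt, response, ignore_index=-100):
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    response_ids = tokenizer.encode(response, add_special_tokens=False)
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    labels = [ignore_index] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```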
3.3 Self-Preferencing Optimization

Alignment from human preference feedback has been key to the success of many AI-assistant language models (Stiennon et al., 2020; Touvron et al., 2023b; Rafailov et al., 2023; Ouyang et al., 2022). To save the cost of human preference annotation, some have sought to use powerful LLMs like GPT-4 (OpenAI, 2023b) to play the part of a preference data generator (Tunstall et al., 2023). However, that may not even be feasible for low-resource non-Latin languages because of the unfavorable tokenization of ChatGPT explained in Section 2.2: even short prompts would exceed the context length, and the API-call costs would explode by up to 17 times.

Therefore, we use our own SeaLLM SFT models to generate preference data by asking them to indicate a preference between two of their own responses to a given question, based on certain human-written criteria. To eliminate position bias, we swap the order of the responses and remove samples with inconsistent preferences. The data is later used for direct preference optimization (Rafailov et al., 2023) to significantly improve the model's abilities as an assistant. As such, unlike other works (Mukherjee et al., 2023; Tunstall et al., 2023), our models do not rely on powerful closed-source models like GPT-4 to improve performance in low-resource languages. Our self-preferencing method also shares certain flavors with another self-rewarding mechanism (Yuan et al., 2024); our work was publicly available before Yuan et al. (2024).
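A sketch of the position-bias control when building preference pairs: the same pair is judged in both orders and only consistent verdicts are kept. The judge(...) helper (returning "A" or "B") and the output fields are hypothetical; the resulting prompt/chosen/rejected triples match the format commonly fed to DPO trainers.

```python
# Hedged sketch of self-preferencing data construction with order swapping.
def build_preference_pair(judge, question, resp_a, resp_b, criteria):
    first = judge(question, resp_a, resp_b, criteria)   # "A" or "B"
    second = judge(question, resp_b, resp_a, criteria)  # same pair, order swapped
    # Consistent only if the same underlying response wins in both orders.
    if first == "A" and second == "B":
        chosen, rejected = resp_a, resp_b
    elif first == "B" and second == "A":
        chosen, rejected = resp_b, resp_a
    else:
        return None  # inconsistent preference -> discard the sample
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```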
4 Evaluation

4.1 Model Variants

We trained multiple variants of SeaLLMs, as specified below.
• SeaLLM-7B-v1: Trained from Llama-2-7B, it supports the 10 official languages used in Southeast Asia.
• SeaLLM-13B-v1: Trained from Llama-2-13B, it outperforms ChatGPT-3.5 in most non-Latin SEA languages (Khm, Lao, Mya and Tha) by large margins.
• SeaLLM-7B-v2: Trained from Mistral-7B, it outperforms SeaLLM-13B-v1 by far in higher-resource SEA languages (Vie, Ind, Tha), and surpasses ChatGPT-3.5 in math reasoning in SEA languages.
• SeaLLM-7B-v2.5: Trained from Gemma-7B, it outperforms SeaLLM-7B-v2 and SeaLLM-13B-v1 remarkably and surpasses ChatGPT-3.5 in various aspects in SEA languages, especially non-Latin languages.

4.2 Sea-bench Peer Comparison

While there are popular benchmarks to evaluate LLMs as helpful assistants, such as MT-bench (Zheng et al., 2023), they are English-only and not suitable for evaluating performance in low-resource languages. Due to this lack of multilingual benchmarks for assistant-style models, we engaged native linguists to build a multilingual test set with instructions that cover SEA languages, called Sea-bench. The linguists sourced the data by translating open-source English test sets, collecting real user questions from local forums and websites, collecting real math and reasoning questions from reputable sources, as well as writing test instructions themselves.

Table 2: Multilingual world knowledge accuracy (M3Exam and MMLU) across multiple languages and various models of different sizes.

Model | M3Exam Eng | M3Exam Zho | M3Exam Vie | M3Exam Ind | M3Exam Tha | MMLU Eng
ChatGPT-3.5 | 75.46 | 60.20 | 58.64 | 49.27 | 37.41 | 70.00
SeaLion-7b | 23.80 | 25.87 | 27.11 | 24.28 | 20.29 | 26.87
Llama-2-13b | 61.17 | 43.29 | 39.97 | 35.50 | 23.74 | 53.50
Polylm-13b | 32.23 | 29.26 | 29.01 | 25.36 | 18.08 | 22.94
SeaLLM-7B-v1 | 54.89 | 39.30 | 38.74 | 32.95 | 25.09 | 47.16
SeaLLM-13B-v1 | 62.69 | 44.50 | 46.45 | 39.28 | 36.39 | 52.68
SeaLLM-7B-v2 | 70.91 | 55.43 | 51.15 | 42.25 | 35.52 | 61.89
SeaLLM-7B-v2.5 | 76.87 | 62.54 | 63.11 | 48.64 | 46.86 | 64.05
Our Sea-bench consists of diverse categories of instructions to evaluate the models, as follows:
• Task-solving: This type of data comprises various text understanding and processing tasks, such as summarization, translation, etc.
• Math-reasoning: This includes math problems and logical reasoning tasks.
• General-instruction data: This consists of general user-centric instructions, which evaluate the model's ability in general knowledge and writing.
• NaturalQA: This consists of queries posted by real users, often in popular local forums, with a variety of subjects and topics of local interest. The aim is to test the model's capacity to understand and respond coherently to colloquial language, natural expressions and idiomatic language, and locally contextualized references.
• Safety: This includes both general safety and local context-related safety instructions. While most general safety questions are translated from open sources, other local country-specific safety instructions are written by linguists of each language.

As inspired by MT-bench (Zheng et al., 2023), we evaluate and compare SeaLLMs with well-known and state-of-the-art models using GPT-4 as a judge, both with score-based grading metrics and in a peer-comparison (or pairwise comparison) manner. Figure 1 compares our SeaLLM (v2, v2.5) chat models with Qwen1.5-7B-chat (Bai et al., 2023) and the widely reputed ChatGPT-3.5 (gpt-3.5-turbo, June 2023 version) (OpenAI, 2023a). In the "By category" chart, SeaLLM-7B-v2.5 performs on par with or surpasses ChatGPT-3.5 across various linguistic and writing tasks. This is largely thanks to the large gap in low-resource non-Latin languages, such as Burmese (Mya), Lao, Khmer and Thai, as seen in the "By language" chart on the right of Figure 1.
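A minimal sketch of score-based grading with GPT-4 as a judge, in the spirit of MT-bench (Zheng et al., 2023); the judge prompt wording, the 1-10 scale, and the model name are assumptions, since the actual Sea-bench judge template is not reproduced here:

```python
# Hedged sketch of GPT-4 score-based grading for a single question/answer pair.
from openai import OpenAI

client = OpenAI()

def judge_score(question: str, answer: str) -> float:
    prompt = (
        "Rate the following assistant response to the user question on a 1-10 scale "
        "for helpfulness, relevance, accuracy and fluency. Reply with only the number.\n\n"
        f"Question: {question}\n\nResponse: {answer}"
    )
    completion = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return float(completion.choices[0].message.content.strip())
```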
Table 3: MT-bench scores (Zheng et al., 2023) for closed, open, multilingual and monolingual models (as indicated by their authors on Hugging Face).

Model | Languages | MT-bench
GPT-4-turbo | Multi | 9.32
Mixtral-8x7B (46B) | Multi | 8.3
Starling-LM-7B-alpha | Mono (Eng) | 8.0
OpenChat-3.5-7B | Mono (Eng) | 7.81
SeaLLM-7B-v2 | Multi | 7.54
SeaLLM-7B-v2.5 | Multi | 7.43
Llama-2-70B-chat | Mono | 6.86
Mistral-7B-instruct | Mono | 6.84
SeaLLM-13B-v1 | Multi | 6.32

4.3 MT-bench

We also compare our models with certain baselines on the English MT-bench (Zheng et al., 2023) in Table 3. As shown, the SeaLLM-7B-v2 model demonstrates outstanding ability in English, given its size. It is also a rare multilingual model in the 7B realm, especially since it focuses on non-mainstream languages.

4.4 World Knowledge

In this section, we evaluate our models and reputable chat baselines (Touvron et al., 2023b; Wei et al., 2023; OpenAI, 2023a) in terms of world knowledge. For knowledge across languages, we use the M3Exam benchmark (Zhang et al., 2023), which consists of real questions from human exam papers with various degrees of difficulty, ranging from primary school to high school examinations. We evaluate M3Exam with 3-shot native-instruction prompts across English, Chinese, Vietnamese, Indonesian and Thai. We also evaluate our models with the well-known English-centric MMLU benchmark (Hendrycks et al., 2021a). Table 2 details the evaluations of world knowledge across multiple languages and models of different sizes. SeaLLM-7B-v2.5 exhibits the best performance given its size and is competitive with GPT-3.5.
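A sketch of assembling a 3-shot native-instruction prompt for M3Exam-style multiple-choice questions; the instruction wording per language and the exemplar layout are assumptions, and the benchmark's official templates should be preferred where available:

```python
# Hedged sketch of 3-shot prompt construction for multilingual exam questions.
def build_fewshot_prompt(instruction, exemplars, question, options):
    """exemplars: list of (question, options, answer_letter) in the same language."""
    parts = [instruction]
    for q, opts, ans in exemplars[:3]:
        parts.append(q + "\n" + "\n".join(opts) + f"\nAnswer: {ans}")
    parts.append(question + "\n" + "\n".join(options) + "\nAnswer:")
    return "\n\n".join(parts)
```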
4.5 Math Reasoning

Table 4 shows the GSM8K and MATH scores (Cobbe et al., 2021; Hendrycks et al., 2021b) for zero-shot chain-of-thought prompting in English and the translated versions in Chinese, Vietnamese, Indonesian and Thai. As shown, SeaLLM-7B-v2.5 shows competitive English performance in math reasoning compared to open-source models, with 78.5 on GSM8K and 34.9 on MATH. It also exceeds GPT-3.5 in SEA languages. This is achieved by scaling supervised and preference data in math reasoning in multilingual settings.

Table 4: GSM8K and MATH scores (Cobbe et al., 2021; Hendrycks et al., 2021b) and their translated versions in Chinese, Vietnamese, Indonesian and Thai, under zero-shot chain-of-thought prompting for different models.

Model | Eng GSM8K | Eng MATH | Zho GSM8K | Zho MATH | Vie GSM8K | Vie MATH | Ind GSM8K | Ind MATH | Tha GSM8K | Tha MATH
ChatGPT-3.5 | 80.8 | 34.1 | 48.2 | 21.5 | 55.0 | 26.5 | 64.3 | 26.4 | 35.8 | 18.1
Qwen1.5-7B-chat | 56.8 | 15.3 | 40.0 | 2.7 | 37.7 | 9.0 | 36.9 | 7.7 | 21.9 | 4.7
SeaLLM-7B-v2 | 78.2 | 27.5 | 53.7 | 17.6 | 69.9 | 23.8 | 71.5 | 24.4 | 59.6 | 22.4
SeaLLM-7B-v2.5 | 78.5 | 34.9 | 51.3 | 22.1 | 72.3 | 30.2 | 71.5 | 30.1 | 62.0 | 28.4
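A sketch of zero-shot chain-of-thought scoring on GSM8K-style problems: append a CoT trigger, generate a solution, and compare the last number in the output with the gold answer. The trigger phrase and the extraction heuristic are assumptions, not the exact SeaLLMs evaluation harness:

```python
# Hedged sketch of zero-shot CoT accuracy computation.
import re

def extract_final_number(text):
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(prediction, gold):
    try:
        return prediction is not None and abs(float(prediction) - float(gold)) < 1e-6
    except ValueError:
        return False

def accuracy(generate, problems):
    """generate: callable(prompt) -> model output; problems: list of (question, answer)."""
    correct = 0
    for question, gold in problems:
        output = generate(question + "\n\nLet's think step by step.")
        if is_correct(extract_final_number(output), gold):
            correct += 1
    return correct / len(problems)
```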
4.6 Machine Translation

To benchmark the machine translation performance of our SeaLLMs, we evaluate 4-shot chrF++ scores on the test sets from Flores-200 (Costa-jussà et al., 2022). As can be seen from Figure 3, SeaLLM-13B exhibits clear superiority over ChatGPT-3.5 in low-resource languages, such as Lao and Khmer, while maintaining comparable performance with ChatGPT-3.5 in most higher-resource languages (e.g., Vietnamese and Indonesian).

Figure 3: Translation chrF++ scores of various models for both the SEA-languages-to-English and English-to-SEA-languages directions.

For direct translation between SEA languages, as shown in Figure 4, our SeaLLM-13B-v1 model still achieves higher chrF++ scores than ChatGPT-3.5 in most cases, especially when the translation pairs involve low-resource languages.

Figure 4: Direct translation between SEA languages. Scores are shown as the difference between the respective chrF++ score of SeaLLM-13B-v1 and that of ChatGPT-3.5. Red indicates SeaLLM-13B-v1 is better, while blue indicates ChatGPT-3.5 is better.

Overall, we believe our SeaLLMs will play a key role in facilitating communication and cultural exchange across communities in Southeast Asia.
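A minimal sketch of computing corpus-level chrF++ with sacrebleu; chrF++ corresponds to CHRF with word n-grams enabled (word_order=2). The data loading, few-shot prompting and generation steps are left out, and the use of sacrebleu is an assumption about tooling, not a statement from the paper:

```python
# Hedged sketch: corpus-level chrF++ over system outputs vs. references.
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # chrF++ variant

def chrf_plus_plus(hypotheses, references):
    """hypotheses: list of system translations; references: list of reference strings."""
    return chrf_pp.corpus_score(hypotheses, [references]).score
```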
5 Conclusion

In conclusion, our research presents a substantial advance in the development of equitable and culturally aware AI with the creation of SeaLLMs, a specialized suite of language models attuned to the linguistic and cultural landscapes of Southeast Asia. Through rigorous pre-training enhancements and culturally tailored fine-tuning processes, SeaLLMs have demonstrated exceptional proficiency in language understanding and generation tasks, challenging the performance of dominant players such as ChatGPT-3.5, particularly in SEA languages. The models' attunement to local norms and legal stipulations, validated by human evaluations, establishes SeaLLMs as not only a technical breakthrough but a socially responsive innovation, poised to democratize access to high-quality AI language tools across linguistically diverse regions. This work lays a foundation for further research into language models that respect and uphold the rich tapestry of human languages and cultures, ultimately driving the AI community towards a more inclusive future.

6 Limitations

SeaLLMs are among the most linguistically diverse multilingual large language models, with remarkable abilities in languages beyond the mainstream. However, they do not come without limitations. First, they only scratch the surface of the region's linguistic diversity with the 9 most common and representative languages, while there are hundreds of other languages spoken in Southeast Asia, such as Javanese and Tamil. Second, despite outperforming other popular models in non-Latin low-resource languages, SeaLLM models still suffer from considerable hallucination and degeneration under certain circumstances for languages such as Burmese and Lao. Moderate hallucination is still inevitable for other common languages.

7 Acknowledgement

We would like to express our special thanks to our professional and native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pre-training and SFT datasets, as well as evaluating our models across different aspects, especially safety.

References

Kabir Ahuja, Rishav Hada, Millicent Ochieng, Prachi Jain, Harshita Diddee, Samuel Maina, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. 2023. MEGA: Multilingual evaluation of generative AI. CoRR, abs/2303.12528.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Together Computer. 2023. RedPajama: An open source recipe to reproduce Llama training dataset.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2023. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.

Wikimedia Foundation. Wikimedia downloads.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the MATH dataset. NeurIPS.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088.

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613.

Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. 2023. Platypus: Quick, cheap, and powerful refinement of LLMs.

Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". 2023. OpenOrca: An open dataset of GPT augmented FLAN reasoning traces. https://huggingface.co/Open-Orca/OpenOrca.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The Flan Collection: Designing data and methods for effective instruction tuning.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of GPT-4.

Xuan-Phi Nguyen, Sharifah Mahani Aljunied, Shafiq Joty, and Lidong Bing. 2023. Democratizing LLMs for low-resource languages by leveraging their English dominant abilities with linguistically-diverse prompts. arXiv preprint arXiv:2306.11372.

OpenAI. 2023a. ChatGPT (June 2023 version).

OpenAI. 2023b. GPT-4 technical report. arXiv preprint.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of LM alignment.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, et al. 2023. PolyLM: An open source polyglot large language model. arXiv preprint arXiv:2307.06018.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020.

Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2023. M3Exam: A multilingual, multimodal, multilevel benchmark for examining large language models. CoRR, abs/2306.05179.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

A Vocabulary Expansion

Algorithm 1 explains in detail how we perform a selective and recursive merger of tokens from the target NLLB vocabulary into the original Llama vocabulary to enrich the linguistic coverage for new and low-resource languages. Specifically, given a small seed unlabeled dataset of a given new language, the algorithm first tokenizes a document with the current Llama tokenizer. The resulting tokens are then exhaustively merged into longer tokens that are supported by the target NLLB vocabulary. During this merger process, any intermediate sub-word is also added to the Llama tokenizer, as long as it exists in the rich NLLB vocabulary. The new set of collected tokens is then pruned to remove rarely appearing and low-quality tokens before being added to the final SeaLLM tokenizer. This frequency-based pruning process ensures the new language is sufficiently and efficiently encoded without introducing tokens from other existing languages (e.g., English), which may disrupt the knowledge learned during the Llama-2 pre-training stage.

Algorithm 1: Vocabulary Extension. Vi is the Llama vocabulary, Vt is the target NLLB vocabulary, D is unlabeled data, and m is the minimum frequency.

function ExhaustiveMerge(Vi, Vt, tV):
    Tnew ← ∅
    repeat
        for each consecutive token pair (prev, next) in tV:
            tmerged ← ⟨prev⟩⟨next⟩                           ▷ form a new token
            if tmerged exists in Vt:
                replace (prev, next) with tmerged in tV       ▷ update tV with the new token
                Tnew ← Tnew ∪ {tmerged}
                break
    until no new token is added to Tnew
    return Tnew

function VocabExtend(Vi, Vt, D, m):
    V ← Vi
    F ← ∅
    T ← ∅
    for each document d in D:
        tV ← tokenize(V, d)                                  ▷ tokenize the document
        Tnew ← ExhaustiveMerge(Vi, Vt, tV)                   ▷ obtain new tokens from Vt based on d
        V ← V ∪ Tnew                                         ▷ update V with the new tokens
        T ← T ∪ Tnew
        F ← update frequencies of Tnew in F                  ▷ update appearance frequencies
    T ← prune ti ∈ T whose frequency ft ∈ F satisfies ft < m ▷ remove rare tokens
    Vfinal ← Vi ∪ T
    return Vfinal
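The following Python sketch mirrors Algorithm 1 at the level of plain sets and lists; the tokenize callable, the once-per-document frequency counting, and the default minimum frequency are assumptions, and a real implementation would operate on SentencePiece/BPE tokenizer internals rather than plain strings:

```python
# Hedged sketch of Algorithm 1 (ExhaustiveMerge + VocabExtend).
from collections import Counter

def exhaustive_merge(target_vocab, tokens):
    """Merge adjacent token pairs whenever the merged piece exists in the target
    (NLLB-style) vocabulary, collecting every new piece that is found."""
    new_tokens = set()
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            candidate = tokens[i] + tokens[i + 1]
            if candidate in target_vocab:
                tokens[i:i + 2] = [candidate]  # replace the pair in place
                new_tokens.add(candidate)
                merged = True
                break                          # rescan from the start, as in Algorithm 1
    return new_tokens

def vocab_extend(base_vocab, target_vocab, documents, tokenize, min_freq=5):
    """Collect candidate tokens over a seed corpus and keep only frequent ones."""
    vocab = set(base_vocab)
    freq = Counter()
    for doc in documents:
        tokens = tokenize(vocab, doc)          # tokenize with the current vocabulary
        new_tokens = exhaustive_merge(target_vocab, tokens)
        vocab |= new_tokens
        freq.update(new_tokens)                # count appearances (here: per document)
    kept = {t for t, f in freq.items() if f >= min_freq}  # frequency-based pruning
    return set(base_vocab) | kept
```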
B Sea-bench Evaluation Details

Figure 5 breaks down the GPT-4 rated Sea-bench score-based evaluations of SeaLLM-13b and other baselines by both language and task category. As shown, our SeaLLM-13b model far exceeds ChatGPT-3.5 in most non-Latin languages, such as Burmese (Mya), Lao and Khmer, though it trails behind this formidable competitor in Latin-based languages, mostly in math reasoning skills.

Figure 5: Sea-bench scores as evaluated by GPT-4 for different models across 9 languages and 5 categories.