SeaLLMs 3: Open Foundation and Chat Multilingual Large Language
Summary
SeaLLMs 3 is a multilingual large language model (LLM) designed for Southeast Asian languages, addressing the region's lack of language technology support. It covers 12 languages, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. The model uses efficient language enhancement techniques, such as training language-specific neurons, to reduce costs while maintaining performance. It is trained on a diverse dataset, including Wikipedia, textbooks, and synthetic content, and employs a specialized instruction tuning dataset to improve task versatility. SeaLLMs 3 excels in world knowledge, math reasoning, translation, and instruction following, achieving state-of-the-art results among similarly sized models. The model also prioritizes safety and reliability, with mechanisms to reduce hallucinations and a novel benchmark, SeaRefuse, to evaluate refusal of out-of-knowledge queries. Both foundational and chat versions are open-sourced, enabling broader accessibility and application.
PDF viewer
Chunks(27)
Chunk 0 · 1,994 chars
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages Wenxuan Zhang*, Hou Pong Chanâ, Yiran Zhaoâ, Mahani Aljuniedâ, Jianyu Wangâ, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, Lidong Bingâ DAMO Academy, Alibaba Group Project page: https://seallms.github.io/ Abstract Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages un- derserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a compre- hensive range of languages spoken in this re- gion, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leverag- ing efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces train- ing costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achiev- ing state-of-the-art performance among sim- ilarly sized models. Additionally, we priori- tized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucina- tions. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities. 1 Introduction Large Language Models (LLMs) such as GPT-4 (OpenAI, 2023) and Gemini (Anil et al., 2023) have demonstrated remarkable capabilities across a wide array of tasks, ranging from natural lan- guage understanding and
Chunk 1 · 1,985 chars
g that advanced LLM capabilities can benefit underserved linguistic and cultural communities. 1 Introduction Large Language Models (LLMs) such as GPT-4 (OpenAI, 2023) and Gemini (Anil et al., 2023) have demonstrated remarkable capabilities across a wide array of tasks, ranging from natural lan- guage understanding and generation to more spe- cialized domain applications (Zhao et al., 2023). *Equal contributions. â Corresponding author: l.bing@alibaba-inc.com These models have proven valuable, offering sub- stantial benefits to the global community, especially through the proliferation of open-source LLMs such as Llama (Touvron et al., 2023a,b), Mistral (Jiang et al., 2023), Qwen (Bai et al., 2023; Yang et al., 2024), and Gemma (Mesnard et al., 2024). However, the majority of these efforts have been concentrated on high-resource languages such as English and Chinese, or well-developed regions like Europe (Zhang et al., 2023; Ahuja et al., 2023). Consequently, the development of LLMs tailored for low-resource languages or underdeveloped re- gions has been significantly overlooked, resulting in a lack of inclusivity and equitable distribution of AI advancements across diverse linguistic and cul- tural communities (Qin et al., 2024; Huang et al., 2024; Liu et al., 2024). To bridge this gap, we introduced the SeaLLMs model (Nguyen et al., 2023c), specifically designed LLMs for Southeast Asian languages. Southeast Asia (SEA) is a region with a rich diversity of lan- guages spoken by millions of people, yet it suffers from a significant lack of language technology sup- port (Aji et al., 2022). The initiative of SeaLLMs thus aims to make the benefits of LLMs accessi- ble to speakers of these languages, addressing their unique linguistic and cultural nuances. Following this endeavor, several other models have been ded- icated to this region as well, such as SEA-LION (AI Singapore, 2023) and Sailor (Dou et al., 2024). However, these models often face significant
Chunk 2 · 1,997 chars
its of LLMs accessi- ble to speakers of these languages, addressing their unique linguistic and cultural nuances. Following this endeavor, several other models have been ded- icated to this region as well, such as SEA-LION (AI Singapore, 2023) and Sailor (Dou et al., 2024). However, these models often face significant lim- itations: they are typically released only as foun- dational or chat models, offer limited options in terms of model size, and cover a limited number of SEA languages. Moreover, the relatively scarce availability of language corpora further constrains the amount of training data available, hindering the development and performance of these models. In this work, we introduce SeaLLMs 3, the lat- est iteration of the SeaLLMs model family. This version is designed to cover a more diverse ar- arXiv:2407.19672v1 [cs.CL] 29 Jul 2024 -- 1 of 11 -- ray of Southeast Asian languages, including En- glish, Chinese, Indonesian, Vietnamese, Thai, Taga- log, Malay, Burmese, Khmer, Lao, Tamil, and Ja- vanese. Different from the conventional continue- pretraining paradigm (Zhao et al., 2024a; Nguyen et al., 2023c; Dou et al., 2024), we conduct effi- cient language enhancement by training language- specific neurons only based on a foundation model (Zhao et al., 2024b), significantly reducing the over- all training cost. Moreover, such targeted training also ensures that the performance of high-resource languages can remain unaffected during the en- hancement. Furthermore, SeaLLMs 3 is trained using a specially constructed instruction tuning dataset that encompasses a wide variety of task types and carefully balanced language distributions. This approach ensures that the model can handle the linguistic diversity of the Southeast Asian re- gion while maintaining high performance and ver- satility across different applications. As a result, it achieves state-of-the-art performance among mod- els with similar sizes, excelling across a diverse array of tasks such as
Chunk 3 · 1,997 chars
roach ensures that the model can handle the linguistic diversity of the Southeast Asian re- gion while maintaining high performance and ver- satility across different applications. As a result, it achieves state-of-the-art performance among mod- els with similar sizes, excelling across a diverse array of tasks such as world knowledge (Zhang et al., 2023), mathematical reasoning (Shi et al., 2023), translation (Costa-jussĂ et al., 2022), and instruction following. In the meantime, we pay special attention to the modelâs reliability and trustworthiness during its development, which are often under-considered in multilingual settings (Deng et al., 2023). In particular, we address both general and culture- specific safety considerations to ensure the models provide contextually appropriate responses. The model is also specifically trained to be aware of its knowledge boundary and refuse what it does not know. To evaluate this capability, we introduce a novel benchmark, SeaRefuse, which assesses the ability of LLMs to refuse questions beyond their knowledge boundaries. This focus on safety and reliability has resulted in SeaLLMs 3 exhibiting reduced hallucination and delivering safe, coherent responses, especially in queries closely related to Southeast Asian culture. We open-source both the foundational and chat models of SeaLLMs 31. The foundational model could serve as a base for conducting instruction tuning tailored to specific application requirements. Meanwhile, the chat model has already undergone instruction tuning and is ready for direct use of 1https://huggingface.co/collections/SeaLLMs/ seallms-v3-668f3a52e1e6fbaad5752cdb Figure 1: Language-Specific Neuron Training. handling a wide range of tasks. 2 Pre-training 2.1 Pre-training Data Building on the efforts from previous versions of SeaLLMs (Nguyen et al., 2023c), we have incorpo- rated corpora from a wider range of data sources to enhance diversity. Specifically, we have inte- grated fundamental knowledge
Chunk 4 · 1,977 chars
ecific Neuron Training. handling a wide range of tasks. 2 Pre-training 2.1 Pre-training Data Building on the efforts from previous versions of SeaLLMs (Nguyen et al., 2023c), we have incorpo- rated corpora from a wider range of data sources to enhance diversity. Specifically, we have inte- grated fundamental knowledge from Wikipedia (Foundation) and textbooks (Ben Allal et al., 2024), journalistic materials such as CC-News (Crawl), web-based corpora from CulturaX (Nguyen et al., 2023a), and MADLAD-400 (Kudugunta et al., 2023). We have also improved the data processing pipeline including the language model filtering and duplicate removal to improve the data quality. Additionally, we explore the utilization of model- synthetic data for training, which received much attention recently (Ben Allal et al., 2024). Starting with manual annotation of domain-specific knowl- edge points in SEA languages, we then employed stronger models to generate targeted tutorial-style content, thereby enhancing SeaLLMs 3 with en- riched regional knowledge in a more explicit form. 2.2 Language-Specific Neuron Training We built our model based on the Qwen2 model family (Yang et al., 2024) and further conducted language enhancement to augment its capability in SEA languages. This approach allows the model to quickly inherit foundational knowledge from Qwen, rather than learning it from scratch. The most straightforward method for language enhancement is typically through continued pertain- -- 2 of 11 -- ing (Zhao et al., 2024a), which we also used for pre- vious versions of SeaLLMs. However, as discussed, the relatively scarce availability of language cor- pora limits the amount of training data, hindering the development and performance of these models. Furthermore, it is often observed that such direct continued pretraining can compromise the modelâs original capacity in high-resource languages like English and Chinese (Dou et al., 2024). In this iteration, we adopt
Chunk 5 · 1,998 chars
- pora limits the amount of training data, hindering the development and performance of these models. Furthermore, it is often observed that such direct continued pretraining can compromise the modelâs original capacity in high-resource languages like English and Chinese (Dou et al., 2024). In this iteration, we adopt Language-Specific Neuron (LSN) training for efficient language en- hancement, as shown in Figure 1. Recent studies have found that certain language-specific neurons in language models are responsible for process- ing specific languages. For instance, Zhao et al. (2024b) discovered that language-specific neurons comprise only about 0.1% of all parameters. Thus, the capabilities of a language can be enhanced by training its corresponding LSNs while preserving the multilingual abilities of other languages. To efficiently train SeaLLMs 3, we employ the par- allel language-specific neuron detection method proposed by Zhao et al. (2024b). As shown in the left most part of Figure 1, this method allows us to identify the LSNs of SEA languages using language-specific training data, selected as a down- sampled subset of the corpora from the training data. We then specifically train these detected LSNs to develop multiple monolingual LLMs in SEA languages, which are subsequently merged to create a unified multilingual LLM for SEA lan- guages. Additionally, to maintain proficiency in English and Chinese from the original foundation model, we detect their respective LSNs and exclude them from the entire pre-training process. This method offers several advantages. First, it requires relatively less training data since the training is more targeted, which significantly re- duces training costs. Second, because the training is targeted, we can ensure that the performance of high-resource languages from the original foun- dation model remains unaffected. LSNs operate independently and do not influence one another, avoiding the sacrifices seen with previous methods. 3
Chunk 6 · 1,995 chars
ich significantly re- duces training costs. Second, because the training is targeted, we can ensure that the performance of high-resource languages from the original foun- dation model remains unaffected. LSNs operate independently and do not influence one another, avoiding the sacrifices seen with previous methods. 3 Supervised Fine-Tuning (SFT) 3.1 Supervised Fine-tuning Data Most existing open-source supervised fine-tuning (SFT) datasets are predominantly in English (Wei et al., 2022; Taori et al., 2023), which presents a challenge for developing effective models for Figure 2: Language distribution of the SFT data Southeast Asian languages. To address this, we employed various techniques to construct our SFT pool. For example, we selectively translate some high-quality English data to SEA languages with quality filtering, conduct self-instruction to auto- matically generate SFT data of certain types, and use various prompting strategies (Madaan et al., 2023; Nguyen et al., 2023b). Following our pre- vious practice, native speakers have been actively engaged throughout the entire SFT data construc- tion process. They manually collect and write seed questions and topic lists, ensuring linguistic and cultural accuracy from the outset. Additionally, na- tive speakers verify, filter, and edit the synthetic SFT data to maintain high quality. Our preliminary experiments indicated that re- lying heavily on dominant English data adversely affects performance. To mitigate this, we strive to maintain a relatively good balance of language representation in our training data this time. Fig- ure 2 shows the language distribution of our SFT data. While English remains a significant portion of the dataset, substantial representation is given to other Southeast Asian languages such as Indone- sian, Vietnamese, Thai, and others, ensuring a com- prehensive and diverse linguistic foundation for the modelâs training. Since the first release of SeaLLMs (Nguyen et al., 2023c), the
Chunk 7 · 1,994 chars
remains a significant portion of the dataset, substantial representation is given to other Southeast Asian languages such as Indone- sian, Vietnamese, Thai, and others, ensuring a com- prehensive and diverse linguistic foundation for the modelâs training. Since the first release of SeaLLMs (Nguyen et al., 2023c), the task types of SFT data have been significantly expanded. The dataset now includes a diverse range of task types such as coding, math, education-related content, reasoning, general dia- logue, table-related tasks, open-domain QA, and many more. This expansion ensures that the model is well-rounded and capable of handling a vari- ety of queries and tasks. Additionally, SFT with multiple turns has been significantly increased to enhance the modelâs ability to engage in natural, -- 3 of 11 -- multi-turn dialogues, improving its conversational fluency and coherence. Model safety, trustworthiness, and reliability are also important factors for constructing the SFT pool. To address this, we specifically constructed refusal-type data, enabling the model to decline questions beyond its knowledge boundaries, such as those involving non-existing entities. Further- more, we carefully curated safety-related data, in- cluding both general safety data (which are cultur- ally independent, such as general moral principles) and country-specific safety data (which are cul- turally sensitive). This approach ensures that the model can be safely deployed with cultural consid- erations in mind, providing accurate and appropri- ate responses across different cultural contexts. 3.2 Training Details Two stages of training are employed to optimize the modelâs performance. In the first stage, a large volume of SFT data is used to equip the model with instruction-following capabilities and to famil- iarize it with different task types. In the second stage, a smaller but high-quality SFT dataset is uti- lized to fine-tune the model, ensuring it performs exceptionally well on
Chunk 8 · 1,998 chars
performance. In the first stage, a large volume of SFT data is used to equip the model with instruction-following capabilities and to famil- iarize it with different task types. In the second stage, a smaller but high-quality SFT dataset is uti- lized to fine-tune the model, ensuring it performs exceptionally well on important tasks. During the training process, different samples are packed together for efficiency, with a maximum length set at 8,192 tokens. The learning rate is set at 1.0e-5, with a warmup ratio of 0.1. Addition- ally, gradients are clipped at a maximum of 1.0 to prevent exploding gradients. 4 Evaluations We conduct extensive evaluations against models with similar sizes, including Sailor-7B / Sailor-7B- Chat (Dou et al., 2024), Gemma-7b / Gemma-7b-it (Mesnard et al., 2024), Qwen2-7B / Qwen2-7B- Chat (Yang et al., 2024), Meta-Llama-3-8B / Meta- Llama-3-8B-Instruct (Touvron et al., 2023b), Aya- 23-8B (Aryabumi et al., 2024), and the previous versions (mainly v2.5) of the SeaLLMs (Nguyen et al., 2023c). The models are listed by their release date in the following result tables. The evaluations can be generally categorized into the following two dimensions: âą Model Capability: We assess the modelâs performance on human exam questions, its proficiency in mathematics, its ability to fol- low multi-turn instructions, and its translation accuracy of different language pairs. âą Model Trustworthiness: We evaluate the modelâs safety and tendency to hallucinate, particularly in the context of Southeast Asia. 4.1 Model Capability 4.1.1 Multilingual World Knowledge Dataset We utilized the M3Exam dataset (Zhang et al., 2023), comprising real human exam ques- tions collected from different countries and span- ning different subjects and educational stages. This dataset effectively tests the modelâs multilingual world knowledge in a manner more akin to real- world settings. We take the questions in English (en), Chinese (zh), Indonesian (id), Vietnamese (vi), and
Chunk 9 · 1,999 chars
exam ques- tions collected from different countries and span- ning different subjects and educational stages. This dataset effectively tests the modelâs multilingual world knowledge in a manner more akin to real- world settings. We take the questions in English (en), Chinese (zh), Indonesian (id), Vietnamese (vi), and Thai (th). We also employ the trans- lated MMLU (Hendrycks et al., 2021) questions for evaluation, which primarily tests the cross-lingual alignment of the model as the required knowledge is still mainly Western-focused. For evaluation, we employ the SeaExam toolkit2 and measured perfor- mance using accuracy as the metric. Results As shown in Table 1 for results on M3Exam dataset, our models, SeaLLMs-v3-7B and SeaLLMs-v3-7B-Chat, demonstrate compet- itive performance, with SeaLLMs-v3-7B-Chat achieving the highest average score (0.692) and the highest average score for SEA languages (0.592). Compared to the previous version, SeaLLMs-7B- v2.5, our latest models show significant improve- ment in overall performance specifically in han- dling Southeast Asian languages. Furthermore, while the Qwen2-7B-Instruct model performs ex- ceptionally well in English and Chinese, our mod- els exhibit superior performance across a broader range of Southeast Asian languages, highlighting their enhanced multilingual capabilities. Table 2 shows the results of different models on the translated MMLU dataset, we can see that our SeaLLMs-v3-7B and SeaLLMs-v3-7B-Chat mod- els also outperform other models, particularly in Southeast Asian languages. Compared to the pre- vious SeaLLMs-7B-v2.5 version, our latest mod- els show substantial improvements, particularly in handling Southeast Asian languages (with avg_sea from 52.4 to 58.2). 4.1.2 Multilingual Math Dataset We assess the multilingual mathematics capabilities using the MGSM dataset (Shi et al., 2https://github.com/DAMO-NLP-SG/SeaExam -- 4 of 11 -- Model en zh id th vi avg avg_sea Gemma-7B 73.2 51.9 47.5 46.0 59.4 55.6
Chunk 10 · 1,992 chars
in handling Southeast Asian languages (with avg_sea from 52.4 to 58.2). 4.1.2 Multilingual Math Dataset We assess the multilingual mathematics capabilities using the MGSM dataset (Shi et al., 2https://github.com/DAMO-NLP-SG/SeaExam -- 4 of 11 -- Model en zh id th vi avg avg_sea Gemma-7B 73.2 51.9 47.5 46.0 59.4 55.6 51.0 Sailor-7B-Chat 66.0 65.2 47.5 46.2 51.3 55.2 48.3 SeaLLM-7B-v2.5 75.8 58.1 49.9 50.2 62.2 59.2 54.1 Sailor-14B 74.8 84.0 53.6 52.8 62.1 65.5 56.2 Sailor-14B-Chat 74.9 84.3 55.3 56.6 63.7 67.0 58.5 Qwen2-7B 81.5 87.4 53.0 47.9 62.8 66.5 54.6 Qwen2-7B-Instruct 80.9 88.0 55.8 55.5 62.4 68.5 57.9 SeaLLMs-v3-7B 80.9 86.3 54.5 53.0 62.8 67.5 56.8 SeaLLMs-v3-7B-Chat 80.9 87.4 55.8 56.9 64.9 69.2 59.2 Table 1: Results of multilingual world knowledge with the M3Exam benchmark Model en zh id th vi avg avg_sea Gemma-7B 63.4 50.9 54.5 49.0 49.4 53.5 51.0 Sailor-7B-Chat 55.8 47.2 48.4 41.4 46.2 47.8 45.4 SeaLLM-7B-v2.5 65.2 54.4 56.5 47.9 52.8 55.3 52.4 Sailor-14B 61.8 56.4 57.0 48.2 53.5 55.4 52.9 Sailor-14B-Chat 62.7 56.1 56.7 49.6 54.1 55.8 53.5 Qwen2-7B 71.0 64.2 60.2 52.0 56.6 60.8 56.3 Qwen2-7B-Instruct 70.8 63.5 59.9 52.4 56.8 60.7 56.4 SeaLLMs-v3-7B 70.6 65.4 61.7 53.6 58.7 62.0 58.0 SeaLLMs-v3-7B-Chat 71.3 64.7 62.5 54.4 57.8 62.2 58.2 Table 2: Results of multilingual world knowledge with the translated MMLU benchmark 2023). Originally, MGSM comprises testing sam- ples solely in English, Chinese and Thai. To extend this dataset to other SEA languages, specifically Indonesian, Malay, and Vietnamese, we utilize Google Translate to translate the original English questions to those questions. It is important to note that in our translations, we adhere to the numeri- cal notation conventions of each respective country. For instance, in both Indonesian and Vietnamese, dots are used as thousands separators, and commas as decimal separators, which is the reverse of the English numeral system. We follow the same con- vention when evaluating the modelâs
Chunk 11 · 1,999 chars
ations, we adhere to the numeri- cal notation conventions of each respective country. For instance, in both Indonesian and Vietnamese, dots are used as thousands separators, and commas as decimal separators, which is the reverse of the English numeral system. We follow the same con- vention when evaluating the modelâs generations. Results Table 3 presents the evaluation results on the MGSM benchmark, under both the few- shot setting (for testing base versions) and the zero-shot setting (for testing chat versions). In the few-shot setting, SeaLLMs-v3-7B demonstrates the highest average score (63.1), outperforming other models such as Qwen2-7B (62.3) and GLM- 4-9b (60.2), particularly excelling in Indonesian and Thai. In the zero-shot setting, the chat version of SeaLLMs-v3-7B-Chat achieves the highest av- erage score (73.1), showing strong performance across all languages. This highlights SeaLLMs-v3- 7B-Chatâs superior adaptability and robustness in multilingual math tasks compared to its counter- parts like Qwen2-7B-instruct (68.4). 4.1.3 Multilingual Instruction-following Dataset As there is no publicly available dataset for testing the modelâs multi-turn instruction- following capability in SEA languages, we con- struct our own benchmark, namely SeaBench3, for such evaluation. SeaBench consists of multi-turn human instructions spanning various task types for Indonesian, Vietnamese, and Thai. Following MT- Bench (Zheng et al., 2023), we consider 8 task types including writing, roleplay, extraction, rea- soning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science). Addi- tionally, considering the characteristics of the mul- tilingual setting, we include two more task types: safety and life. The safety task tests whether the model will respond to unsafe queries in a local con- text, while the life task includes questions likely 3It will be publicly available at https://huggingface. co/datasets/SeaLLMs/SeaBench -- 5 of 11 -- MGSM en id ms
Chunk 12 · 1,998 chars
lingual setting, we include two more task types: safety and life. The safety task tests whether the model will respond to unsafe queries in a local con- text, while the life task includes questions likely 3It will be publicly available at https://huggingface. co/datasets/SeaLLMs/SeaBench -- 5 of 11 -- MGSM en id ms th vi zh avg Few-shot setting Gemma-7B 64.8 41.2 43.2 38.0 34.0 39.6 43.5 Sailor-7B 34.4 25.2 22.8 24.8 22.4 26.4 26.0 Meta-Llama-3-8B 56.8 36.0 33.6 34.8 33.6 43.6 39.7 GLM-4-9B 78.0 53.6 57.2 46.0 56.8 69.6 60.2 Qwen2-7B 79.6 58.8 56.8 54.8 54.8 69.2 62.3 SeaLLMs-v3-7B 78.8 59.2 56.8 56.8 54.8 72.0 63.1 Zero-shot setting Gemma-1.1-7B-it 58.8 32.4 34.8 31.2 39.6 35.2 38.7 Sailor-7B-Chat 33.6 22.4 22.4 21.6 25.2 29.2 25.7 SeaLLM-7B-v2.5 79.6 69.2 70.8 61.2 66.8 62.4 68.3 Meta-Llama-3-8B-Instruct 77.6 48.0 57.6 56.0 46.8 58.8 57.5 GLM-4-9B-Chat 72.8 53.6 53.6 34.8 52.4 70.8 56.3 Qwen2-7B-Instruct 82.0 66.4 62.4 58.4 64.4 76.8 68.4 SeaLLMs-v3-7B-Chat 74.8 71.2 70.8 71.2 71.2 79.6 73.1 Table 3: Results of multilingual math with the MGSM benchmark. Model id th vi avg turn1 turn2 avg turn1 turn2 avg turn1 turn2 avg Sailor-7B-Chat 4.60 4.04 4.32 3.94 3.17 3.56 4.82 3.62 4.22 4.03 SeaLLM-7B-v2.5 6.27 4.96 5.62 5.79 3.82 4.81 6.02 4.02 5.02 5.15 Sailor-14B-Chat 5.26 5.53 5.40 4.62 4.36 4.49 5.31 4.74 5.03 4.97 Qwen2-7B-Instruct 5.93 5.84 5.89 5.47 5.20 5.34 6.17 5.60 5.89 5.70 SeaLLMs-v3-7B-Chat 6.73 6.59 6.66 6.48 5.90 6.19 6.34 5.79 6.07 6.31 Table 4: Results of multilingual instruction-following with SeaBench Benchmark asked in real-life settings, which might be infor- mally written or even ambiguous. All questions are manually written by native speakers of each language. During construction, we instructed the annotators to ensure the questions were as localized as possible, e.g., using local enti- ties, concepts, and knowledge in the questions. Fur- thermore, reference answers have been constructed to ensure fair judgment. Evaluation Details Given the
Chunk 13 · 1,989 chars
native speakers of each language. During construction, we instructed the annotators to ensure the questions were as localized as possible, e.g., using local enti- ties, concepts, and knowledge in the questions. Fur- thermore, reference answers have been constructed to ensure fair judgment. Evaluation Details Given the two-turn questions, the model under testing generates two-turn re- sponses in a multi-turn format. These responses are then graded by a stronger LLM (GPT-4o was used in our experiments) using the reference answer to the original questions. The scores are then assigned to each turn of the response. Results As shown in Table 4, SeaLLMs-v3-7B- Chat outperforms all other models in multilin- gual instruction-following across Indonesian (id), Thai (th), and Vietnamese (vi). It achieves the highest average scores in both individual turns and overall averages for each language. Specifi- cally, SeaLLMs-v3-7B-Chat surpasses the previous version, SeaLLMs-7B-v2.5, by a significant mar- gin (6.31 vs 5.15) and outperforms the strongest baseline model Qwen2-7B-Instruct (6.31 vs 5.70). These results highlight SeaLLMs-v3-7B-Chatâs su- perior ability to generate more coherent and con- textually appropriate multi-turn responses. 4.1.4 Translation Dataset We evaluate the machine translation per- formances with the test set of Flores-200 (Costa- jussĂ et al., 2022). We choose all 12 languages for a comprehensive evaluation, including Burmese (my), Chinese (zh), English (en), Indonesian (id), Javanese (jv), Khmer (km), Lao (lo), Malay (ms), Tagalog (tl), Tamil (ta), Thai (th), and Vietnamese (vi). We translate between each pair of languages and report the average 0-shot chrF scores after av- eraging the results for target languages. Results As shown in Table 5, SeaLLMs-v3- 7B-Chat outperforms other models in machine -- 6 of 11 -- Model en id jv km lo ms my ta th tl vi zh avg Sailor-7B-Chat 49.40 49.78 28.33 2.68 6.85 47.75 5.35 18.23 38.92 29.00 41.76 20.87
Chunk 14 · 1,996 chars
t the average 0-shot chrF scores after av- eraging the results for target languages. Results As shown in Table 5, SeaLLMs-v3- 7B-Chat outperforms other models in machine -- 6 of 11 -- Model en id jv km lo ms my ta th tl vi zh avg Sailor-7B-Chat 49.40 49.78 28.33 2.68 6.85 47.75 5.35 18.23 38.92 29.00 41.76 20.87 28.24 SeaLLM-7B-v2.5 55.09 53.71 18.13 18.09 15.53 51.33 19.71 26.10 40.55 45.58 44.56 24.18 34.38 Meta-Llama-3-8B-Instruct 51.54 49.03 22.46 15.34 5.42 46.72 21.24 32.09 35.75 40.80 39.31 14.87 31.22 Qwen2-7B-Instruct 50.36 47.55 29.36 19.26 11.06 42.43 19.33 20.04 36.07 37.91 39.63 22.87 31.32 SeaLLMs-v3-7B-Chat 54.68 52.52 29.86 27.30 26.34 45.04 21.54 31.93 41.52 38.51 43.78 26.10 36.52 Table 5: Results of translation with Flores-200. Model en zh vi th id avg avg_sea Gemma-1.1-7B-it 53.61 28.22 26.18 21.28 30.39 31.94 25.95 Sailor-7B-Chat 33.76 18.82 5.19 9.68 16.42 16.78 10.43 SeaLLM-7B-v2.5 13.10 1.53 3.24 19.58 0.78 7.65 7.87 Meta-Llama-3-8B-Instruct 72.23 0.00 1.23 0.80 3.91 15.63 1.98 GLM-4-9B-Chat 45.02 40.98 21.48 5.42 2.37 23.05 9.76 Qwen2-7B-Instruct 63.74 35.75 52.86 46.42 55.93 50.94 51.74 SeaLLMs-v3-7B-Chat 71.13 77.17 78.18 61.64 67.61 71.14 69.14 Table 6: Performance in refusing questions about non-existing entities on SeaRefuse-G. translation, achieving an average chrF score of 36.52. It excels particularly in Javanese, Khmer, Lao, Burmese, Thai, and Chinese, consistently achieving the highest scores in these languages. Compared to its predecessor, SeaLLMs-7B-v2.5, which has an average score of 34.38, SeaLLMs- v3-7B-Chat shows clear improvement. Addition- ally, SeaLLMs-v3-7B-Chat surpasses strong base- lines like Meta-Llama-3-8B-Instruct and Qwen2- 7B-Instruct, with average scores of 31.22 and 31.32, respectively. Notably, the modelâs performance in translating low-resource languages, such as Khmer (27.30) and Lao (26.34), highlights its robustness and effectiveness in handling diverse and challeng- ing translation tasks. This
Chunk 15 · 1,990 chars
Meta-Llama-3-8B-Instruct and Qwen2- 7B-Instruct, with average scores of 31.22 and 31.32, respectively. Notably, the modelâs performance in translating low-resource languages, such as Khmer (27.30) and Lao (26.34), highlights its robustness and effectiveness in handling diverse and challeng- ing translation tasks. This consistent performance across multiple languages underscores the modelâs versatility and capability in low-resource language translation settings. 4.2 Model Trustworthiness 4.2.1 Hallucination A trustworthy LLM should only answer the ques- tions that it knows and abstain from answering questions that it does not know. Previous stud- ies reveal that recent LLMs are prone to answer- ing questions that exceed their knowledge bound- aries, leading to hallucinated responses (Yang et al., 2023; Zhang et al., 2024). However, evaluating a modelâs ability to refuse questions it doesnât know is challenging. This requires distinguishing the modelâs knowledge boundaries, which is difficult since most existing LLMs do not provide trans- parency in their pre-training data. To address this challenge, we propose a novel SeaRefuse evaluation benchmark, which consists of unanswerable factoid questions about non- existing entities and answerable factoid questions in SEA languages4. Unanswerable questions about non-existent entities are designed to surpass the knowledge boundaries of LLMs. Our bench- mark includes two test sets: SeaRefuse-G and SeaRefuse-H. In the SeaRefuse-G test set, the unanswerable questions are generated by prompt- ing GPT-4o. In the SeaRefuse-H test set, our lin- guists annotate the unanswerable questions by re- fining machine-generated unanswerable questions. The answerable questions in both SeaRefuse-G and SeaRefuse-H are collected from existing factoid QA datasets including ParaRel (Zhang et al., 2024; Elazar et al., 2021), NLPCC-KBQA (Duan, 2016; Duan and Tang, 2018), and TyDi QA (Clark et al., 2020). For the SeaRefuse-G tese set, each
Chunk 16 · 1,995 chars
-generated unanswerable questions. The answerable questions in both SeaRefuse-G and SeaRefuse-H are collected from existing factoid QA datasets including ParaRel (Zhang et al., 2024; Elazar et al., 2021), NLPCC-KBQA (Duan, 2016; Duan and Tang, 2018), and TyDi QA (Clark et al., 2020). For the SeaRefuse-G tese set, each language has 500 answerable and 500 unanswerable ques- tions, except for Vietnamese, which has 483 of each. In the SeaRefuse-H dataset, each language contains 100 answerable and 100 unanswerable questions. In evaluation, we report the F1-score of each model on correctly refusing questions about non- existing entities. We follow the confusion matrix depicted in Table 8 to compute the F1 scores. We adopt a keyword-matching approach to determine 4It will be publicly available at https://huggingface. co/datasets/SeaLLMs/SeaRefuse -- 7 of 11 -- Model en zh vi th id avg avg_sea Gemma-1.1-7B-it 51.95 29.92 12.07 30.16 35.48 31.92 25.90 Sailor-7B-Chat 33.87 15.79 11.32 12.96 31.67 21.12 18.65 SeaLLM-7B-v2.5 10.91 3.92 11.32 49.66 31.15 21.39 30.71 Meta-Llama-3-8B-Instruct 73.17 0.00 0.00 0.00 11.21 16.88 3.74 GLM-4-9B-Chat 35.38 52.50 20.17 5.77 9.52 24.67 11.82 Qwen2-7B-Instruct 58.50 42.75 62.82 60.53 63.51 57.62 62.29 SeaLLMs-v3-7B-Chat 68.54 81.82 83.84 84.58 89.66 81.69 86.03 Table 7: Performance in refusing questions about non-existing entities on SeaRefuse-H. Answerable question Unanswerable question Refused True-Positive False-Positive Answered False-Negative True-Negative Table 8: The confusion matrix for the evaluation of the refusal ability of LLMs. whether a model refuses to answer a factoid ques- tion. Specifically, we work with professional and native linguists to devise a set of refusal keywords for English, Chinese, Vietnamese, Indonesian, and Thai, respectively. If the generated response con- tains any of these refusal keywords, we assume that the response is a refusal response. The experiment results on SeaRefuse-G and SeaRefuse-H are
Chunk 17 · 1,996 chars
th professional and native linguists to devise a set of refusal keywords for English, Chinese, Vietnamese, Indonesian, and Thai, respectively. If the generated response con- tains any of these refusal keywords, we assume that the response is a refusal response. The experiment results on SeaRefuse-G and SeaRefuse-H are shown in Table 6 and Table 7, respectively. We observe that SeaLLMs-v3-7B- Chat outperforms all other baseline models by a large margin in zh, vi, th, and id languages. In English, the performance of SeaLLMs-v3-7B-Chat is competitive with Llama-3-8B-Instruct. These results demonstrate the capability of SeaLLMs-v3 to refuse questions that it does not know. 4.2.2 Safety Model en jv th vi zh avg Sailor-7B-Chat 78.7 54.9 62.2 67.6 76.2 67.9 Meta-Llama-3-8B-Instruct 88.3 26.4 71.1 69.8 77.1 66.5 Sailor-14B-Chat 86.9 30.5 53.7 60.9 72.7 60.9 GLM-4-9B-Chat 77.1 21.3 30.2 60.6 74.9 52.8 Qwen2-7B-Instruct 88.6 43.8 63.8 73.0 87.3 71.3 SeaLLMs-v3-7B-Chat 88.9 60.0 73.3 83.8 92.7 79.7 Table 9: Safety performance of different models To evaluate the modelsâ safety capabilities, we use the questions of SEA languages from the Multi- Jail dataset (Deng et al., 2023), which includes En- glish (en), Javanese (jv), Thai (th), Vietnamese (vi), and Chinese (zh). Each question in the dataset is potentially malicious, and the model should refuse to answer them. To determine whether the modelâs response is safe, we first translate the response into English and then prompt GPT-4o to check if the translated response is harmful. The results are re- ported as the safe rate of the responses. Table 9 presents the safety capabilities of various models evaluated with the MultiJail dataset. No- tably, SeaLLMs-v3-7B-Chat outperforms all other models with an average safe rate of 79.7%, demon- strating robust performance across all languages, particularly excelling in Vietnamese (83.8%) and Chinese (92.7%). In comparison, Qwen2-7B- Instruct follows with a distant second average
Chunk 18 · 1,997 chars
d with the MultiJail dataset. No- tably, SeaLLMs-v3-7B-Chat outperforms all other models with an average safe rate of 79.7%, demon- strating robust performance across all languages, particularly excelling in Vietnamese (83.8%) and Chinese (92.7%). In comparison, Qwen2-7B- Instruct follows with a distant second average of 71.3%, with its highest safe rate in Chinese (87.3%). Other models like Sailor-7B-Chat and Llama-3-8B- Instruct also show competitive performance but lag behind in consistency across languages. Notably, the exceptional performance of SeaLLMs-v3 in the three Southeast Asian languages (jv, th, and vi) un- derscores SeaLLMâs effective design, which caters to the linguistic nuances of this region. 5 Conclusion SeaLLMs 3 represents a significant advancement in the development of large language models for Southeast Asian languages, addressing the regionâs unique linguistic and cultural challenges. By adopt- ing an efficient language enhancement approach and constructing a comprehensive instruction tun- ing dataset, SeaLLMs 3 achieves state-of-the-art performance while maintaining cost-effectiveness. Our commitment to reliability and safety, pro- viding contextually appropriate responses, further strengthens the modelâs applicability and trustwor- thiness. The open-sourcing of both foundational and chat models ensures that SeaLLMs 3 is acces- sible for a wide range of applications, fostering further innovation and inclusivity in AI develop- ment for Southeast Asia. -- 8 of 11 -- Acknowledgments We would like to express our special thanks to our professional and native linguists, Tantong Cham- paiboon, Nguyen Ngoc Yen Nhi and Tara Devina Putri, who helped build, evaluate, and fact-check our sampled pretraining and SFT dataset as well as evaluating our models across different aspects, especially safety. References Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Ut- tama Nambi, Tanuja Ganu, Sameer Segal,
Chunk 19 · 1,999 chars
utri, who helped build, evaluate, and fact-check our sampled pretraining and SFT dataset as well as evaluating our models across different aspects, especially safety. References Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Ut- tama Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. MEGA: multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 4232â4267. AI Singapore. 2023. Sea-lion (southeast asian lan- guages in one network): A family of large language models for southeast asia. https://github.com/ aisingapore/sealion. Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Ma- hendra, Kemal Kurniawan, David Moeljadi, Radi- tyo Eko Prasojo, Timothy Baldwin, Jey Han Lau, and Sebastian Ruder. 2022. One country, 700+ lan- guages: NLP challenges for underrepresented lan- guages and dialects in indonesia. In Proceedings of the 60th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), ACL 2022, pages 7226â7249. Association for Computa- tional Linguistics. Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Mil- lican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lilli- crap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom Henni- gan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Pi- queras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, AnaĂŻs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha
Chunk 20 · 1,997 chars
ong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Pi- queras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, AnaĂŻs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, and et al. 2023. Gemini: A fam- ily of highly capable multimodal models. CoRR, abs/2312.11805. Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Se- bastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan N. Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet ĂstĂŒn, and Sara Hooker. 2024. Aya 23: Open weight releases to further multilingual progress. CoRR, abs/2405.15032. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jin- gren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. CoRR, abs/2309.16609. Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra. 2024. Cos- mopedia. Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. Tydi qa: A benchmark for information-seeking question answering in ty- pologically diverse languages. Transactions of the Association for Computational Linguistics. Marta R. Costa-jussĂ , James Cross, Onur Ăelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffer- nan, Elahe Kalbassi,
Chunk 21 · 1,985 chars
ikolaev, and Jennimaria Palomaki. 2020. Tydi qa: A benchmark for information-seeking question answering in ty- pologically diverse languages. Transactions of the Association for Computational Linguistics. Marta R. Costa-jussĂ , James Cross, Onur Ăelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffer- nan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Y. Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, LoĂŻc Bar- rault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco GuzmĂĄn, Philipp Koehn, Alexandre Mourachko, Christophe Rop- ers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scal- ing human-centered machine translation. CoRR, abs/2207.04672. Common Crawl. Common crawl news. Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Li- dong Bing. 2023. Multilingual jailbreak challenges in large language models. CoRR, abs/2310.06474. Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Ji- ahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open language models for south-east asia. CoRR, abs/2404.03608. Nan Duan. 2016. Overview of the nlpcc-iccpol 2016 shared task: Open domain chinese question answer- ing. In Natural Language Understanding and Intelli- gent Applications, pages 942â948. Springer Interna- tional Publishing. Nan Duan and Duyu Tang. 2018. Overview of the nlpcc 2017 shared task: Open domain chinese question -- 9 of 11 -- answering. In Natural Language Processing and Chinese Computing, pages 954â961. Springer Inter- national Publishing. Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich SchĂŒtze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Trans. Assoc. Comput. Linguistics, 9:1012â1031. Wikimedia Foundation. Wikimedia
Chunk 22 · 1,998 chars
954â961. Springer Inter- national Publishing. Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich SchĂŒtze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Trans. Assoc. Comput. Linguistics, 9:1012â1031. Wikimedia Foundation. Wikimedia downloads. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language under- standing. In 9th International Conference on Learn- ing Representations, ICLR 2021. OpenReview.net. Kaiyu Huang, Fengran Mo, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, Jinan Xu, Jian-Yun Nie, and Yang Liu. 2024. A survey on large language models with multilingualism: Recent advances and new frontiers. CoRR, abs/2405.10936. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, LĂ©lio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timo- thĂ©e Lacroix, and William El Sayed. 2023. Mistral 7b. CoRR, abs/2310.06825. Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. Madlad-400: A multilingual and document-level large audited dataset. Preprint, arXiv:2309.04662. Chaoqun Liu, Wenxuan Zhang, Yiran Zhao, Anh Tuan Luu, and Lidong Bing. 2024. Is translation all you need? A study on solving multilingual tasks with large language models. CoRR, abs/2403.10258. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdan- bakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with
Chunk 23 · 1,998 chars
abs/2403.10258. Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdan- bakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Advances in Neu- ral Information Processing Systems 36: Annual Con- ference on Neural Information Processing Systems 2023, NeurIPS 2023. Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane RiviÚre, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdh- ery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christo- pher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Cristian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, and et al. 2024. Gemma: Open models based on gemini re- search and technology. CoRR, abs/2403.08295. Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023a. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400. Xuan-Phi Nguyen, Sharifah Mahani Aljunied, Shafiq Joty, and Lidong Bing. 2023b. Democratizing llms for low-resource languages by leveraging their en- glish dominant abilities with linguistically-diverse prompts. CoRR, abs/2306.11372. Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang,
Chunk 24 · 1,994 chars
d Lidong Bing. 2023b. Democratizing llms for low-resource languages by leveraging their en- glish dominant abilities with linguistically-diverse prompts. CoRR, abs/2306.11372. Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2023c. Seallms - large language models for southeast asia. CoRR, abs/2312.00738. OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774. Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, and Philip S. Yu. 2024. Multilingual large language model: A survey of resources, taxonomy and fron- tiers. CoRR, abs/2404.04925. Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023. Language models are multi- lingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representa- tions, ICLR 2023. OpenReview.net. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste RoziÚre, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971. Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- bert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton- Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, -- 10 of 11 -- Cynthia Gao, Vedanuj Goswami, Naman Goyal, An- thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez,
Chunk 25 · 1,995 chars
hargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton- Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, -- 10 of 11 -- Cynthia Gao, Vedanuj Goswami, Naman Goyal, An- thony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Di- ana Liskovich, Yinghai Lu, Yuning Mao, Xavier Mar- tinet, Todor Mihaylov, Pushkar Mishra, Igor Moly- bog, Yixin Nie, Andrew Poulton, Jeremy Reizen- stein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subrama- nian, Xiaoqing Ellen Tan, Binh Tang, Ross Tay- lor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, AurĂ©lien Ro- driguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, An- drew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representa- tions, ICLR 2022. OpenReview.net. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671. Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neu- big, and Pengfei Liu. 2023. Alignment for honesty. CoRR, abs/2312.07000. Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large lan- guage models to say âI donât knowâ. In Proceedings of the 2024 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7113â7139, Mexico City,
Chunk 26 · 1,760 chars
ng, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large lan- guage models to say âI donât knowâ. In Proceedings of the 2024 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7113â7139, Mexico City, Mexico. As- sociation for Computational Linguistics. Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2023. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. In Advances in Neural Information Processing Systems 36: An- nual Conference on Neural Information Processing Systems 2023, NeurIPS 2023. Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024a. Llama beyond english: An empirical study on language capability transfer. CoRR, abs/2401.01055. Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Be- ichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. CoRR, abs/2303.18223. Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. 2024b. How do large language models handle multilingualism? CoRR, abs/2402.18815. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Pro- cessing Systems 2023, NeurIPS 2023. -- 11 of 11 --