SEA-LION: Southeast Asian Languages in One Network
Summary
SEA-LION introduces two open-source multilingual large language models (LLMs), Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT, designed to address the underrepresentation of Southeast Asian (SEA) languages in AI. The models support 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. They were developed through large-scale multilingual continued pre-training (CPT) on 200 billion tokens, followed by a multi-stage post-training process of instruction fine-tuning on 16.8 million instruction-answer pairs, alignment, and model merging. Evaluation on benchmarks such as SEA-HELM and the Open LLM Leaderboard shows state-of-the-art performance for SEA languages, with the models outperforming existing LLMs on tasks such as natural language understanding and instruction following. Both models are released under the MIT license, with training data, scripts, and checkpoints made publicly available to promote reproducibility and further research. The work emphasizes balancing general capabilities with SEA-specific linguistic fluency, achieving strong results on both SEA and English benchmarks.
SEA-LION: Southeast Asian Languages in One Network

Raymond Ng♠, Thanh Ngan Nguyen♠, Yuli Huang♠, Ngee Chia Tai♠, Wai Yi Leong♠, Wei Qi Leong♠, Xianbin Yong♠, Jian Gang Ngui♠, Yosephine Susanto♠, Nicholas Cheng♠, Hamsawardhini Rengarajan♠, Peerat Limkonchotiwat♠, Adithya Venkatadri Hulagadri♠, Kok Wai Teng♠, Yeo Yeow Tong♠, Bryan Siow♠, Wei Yi Teo♠, Wayne Lau♠, Choon Meng Tan♠, Brandon Ong♠, Zhi Hao Ong♠, Jann Railey Montalan♠, Adwin Chan♠, Sajeban Antonyrex♠, Ren Lee♠, Esther Choa♠, David Ong Tat-Wee♠, Bing Jie Darius Liu♠, William Chandra Tjhi♠, Erik Cambria♢, Leslie Teo♠

♠AI Singapore, National University of Singapore
♢Nanyang Technological University
https://sea-lion.ai

Abstract

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region underrepresented. To address this representation gap, we introduce Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks show that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models (SEA-LION Models Collection) to benefit the wider SEA community.

1 Introduction
Large language models (LLMs) have significantly transformed the field of natural language processing, achieving remarkable performance in text generation, summarization and sentiment analysis (Brown et al., 2020; OpenAI, 2023; Dubey et al., 2024; Rivière et al., 2024; Zhang et al., 2024b; Yeo et al., 2024). Despite their impressive capabilities, most LLMs remain heavily English-centric (Wendler et al., 2024; Zhong et al., 2024). Unfortunately, this situation has left LLMs for regions with many under-represented languages, such as Southeast Asia (SEA), lagging behind. Languages with lower resources, such as Filipino, Lao, Burmese and Khmer in the SEA region, are not supported by many open-source English-centric LLMs. This underscores the need to bridge the resource and representation gap between English and SEA languages.

Recently, there have been many attempts to create multilingual LLMs in an open-source manner, e.g., BLOOM (Scao et al., 2022), a project aimed at increasing multilingual presence in open-source LLMs by supporting 46 languages. Popular LLM families such as Llama (Dubey et al., 2024), Gemma (Rivière et al., 2024) and Qwen (Yang et al., 2024a) have also introduced multilingual LLMs in their latest iterations. During our evaluations, we found that the performance of these models is acceptable in the general case, i.e., when considering evaluation benchmarks formulated from English datasets. However, we observe that their performance degrades on SEA-specific benchmarks. Moreover, researchers have also introduced LLMs such as SeaLLMs (Nguyen et al., 2024; Zhang et al., 2024a) and Sailor (Dou et al., 2024) to specifically address the LLM gap in SEA languages. However, the performance of these models is less than ideal for languages such as Thai or Tamil (10X et al., 2024; AI Products Team, 2024). (Tamil is one of the official languages in Singapore; it is also spoken in other areas of the SEA region, such as Malaysia.)
In this paper, we address these issues by proposing a robust open-source Southeast Asian model family with data transparency for reproducibility, namely SEA-LION: a family of LLMs continued pre-trained (CPT) and fine-tuned from Llama-3.1-8B-Instruct for Llama-SEA-LION-8B-IT and from Gemma-2-9B for Gemma-SEA-LION-9B-IT, with a focus on SEA languages. To tackle the performance problem, we utilize 200 billion tokens of English, code, and SEA-language data for CPT, as well as 16.8 million English and SEA-language instruction-answer pairs for the post-training steps, to achieve a significant improvement in SEA languages. In order to allow our models to be used by everyone without restrictions, we release our models under the fully open MIT license. We benchmark our models on SEA-HELM (Susanto et al., 2025) and the Open LLM Leaderboard against other LLMs of similar sizes for Southeast Asia, such as Sailor 2 (Team, 2024) and SeaLLMs 3 (Zhang et al., 2024a), where our models achieve state-of-the-art performance. We summarize the contributions of our paper as follows.

• We release two LLMs, Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT, that are meticulously trained to accurately represent the unique linguistic diversity of SEA languages.

• We provide in-depth insights into our end-to-end training workflow to benefit the community developing multilingual LLMs.
• We present a reproducible dataset development process, covering sourcing and the model training process. We release our training artifacts, including the training dataset, training scripts, training checkpoints, and fine-tuned models, including Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT, to provide strong baselines, promote reproducibility, and enable future research on applications that require SEA-specific knowledge (see https://huggingface.co/aisingapore for all artifacts in this paper, including training data and other versions of SEA-LION).

2 Continued pre-training (CPT)

2.1 Pre-training data

The CPT data consists of a curated set of English, multilingual, and code corpora from several open-source repositories like Dolma (Soldaini et al., 2024), FineWeb (Penedo et al., 2024), the-stack-v2 (Lozhkov et al., 2024), SEA-LION-Pile (AI Singapore, 2023), and SEA-LION-Pile-v2 (AI Singapore, 2025), as well as documents from CommonCrawl (CommonCrawl, 2024) and from the public domain, such as Wikipedia (Foundation, 2024). For SEA-LION-Pile-v2, we filter CommonCrawl WARC data for documents in SEA languages (i.e., Burmese, Simplified Chinese, Indonesian, Khmer, Lao, Malay, Filipino, Tamil, Thai, and Vietnamese) using the pretrained fastText language classifier (Joulin et al., 2017). A document is retained if the language code reported in its metadata matches one of the aforementioned SEA languages. Additionally, we further clean up the data with Trafilatura (Barbaresi, 2021).
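To make the filtering step concrete, here is a minimal sketch using the publicly released lid.176.bin fastText language-identification model and the trafilatura Python package. The paper does not specify thresholds, the order of operations, or the WARC-reading code, so the helper below (keep_sea_document) and its 0.5 confidence cut-off are illustrative assumptions only.

```python
import fasttext
import trafilatura

# Pretrained fastText language identifier (Joulin et al., 2017); lid.176.bin is the
# publicly released 176-language model, assumed here for illustration.
LID_MODEL = fasttext.load_model("lid.176.bin")

# ISO 639-1 codes for the ten SEA languages listed in Section 2.1 (Filipino appears as 'tl').
SEA_LANGS = {"my", "zh", "id", "km", "lo", "ms", "tl", "ta", "th", "vi"}

def keep_sea_document(html: str):
    """Extract the main text of a crawled page and keep it only if it is in a SEA language."""
    text = trafilatura.extract(html)  # main-text extraction / boilerplate removal
    if not text:
        return None
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    if lang in SEA_LANGS and probs[0] > 0.5:  # hypothetical confidence threshold
        return text
    return None
```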
To determine the optimal dataset ratio between SEA languages, code, and English for the CPT process, we conduct a series of small-scale CPT experiments, each with a training budget of 10 billion tokens and varying proportions of English, code, and SEA language data. We settled on an optimal data mix ratio of 55% SEA languages, 25% English, and 20% code tokens for a budget of 200 billion tokens. For a detailed breakdown of the token count by language, please refer to Table 6.

2.2 CPT process

Model selection. For the models to CPT from, we choose Llama-3.1-8B-Instruct (Dubey et al., 2024) and Gemma-2-9B (Rivière et al., 2024).

Training setup. Following previous works (Dou et al., 2024), we use BPE-Dropout (Provilkov et al., 2020) to increase the performance and robustness of the training. We use a Warmup-Stable-Decay (WSD) (Hu et al., 2024) scheduler with warm-up and cooldown phases each representing 10% of the entire training budget. We use the AdamW (Loshchilov and Hutter, 2019) optimizer with the maximum learning rate (LR) set to 1e-5 and the final LR after cooldown set to 1e-7. Following Wortsman et al. (2024), we set epsilon to 1e-15. We use Composer (Team, 2021) and LLM Foundry (Team, 2022) for distributed training using Fully Sharded Data Parallel (Zhao et al., 2023) on a cluster of eight nodes of the p5.48xlarge instance from Amazon Web Services (AWS). The total training duration was approximately 6 days and 10 days for the Llama 3.1 and Gemma 2 models, respectively. In this paper, we refer to the post-CPT models as Llama-SEA-LION-8B and Gemma-SEA-LION-9B for the Llama 3.1 and Gemma 2 continued pre-trained models, respectively.
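As an illustration of the schedule described above, the learning rate can be written as a simple function of training progress. The piecewise-linear shape below is our reading of the WSD description (the exact warm-up and decay curves are not given in the paper), and the optimizer line mirrors the stated AdamW settings.

```python
from torch.optim import AdamW

MAX_LR, FINAL_LR = 1e-5, 1e-7
WARMUP_FRAC = COOLDOWN_FRAC = 0.10  # 10% of the 200B-token budget each

def wsd_lr(progress: float) -> float:
    """Warmup-Stable-Decay learning rate as a function of training progress in [0, 1]."""
    if progress < WARMUP_FRAC:                        # warm-up: 0 -> MAX_LR
        return MAX_LR * progress / WARMUP_FRAC
    if progress < 1.0 - COOLDOWN_FRAC:                # stable plateau at MAX_LR
        return MAX_LR
    frac = (progress - (1.0 - COOLDOWN_FRAC)) / COOLDOWN_FRAC
    return MAX_LR + (FINAL_LR - MAX_LR) * frac        # cooldown: MAX_LR -> FINAL_LR

# AdamW with the small epsilon suggested by Wortsman et al. (2024);
# `model` stands for the Llama-3.1-8B-Instruct or Gemma-2-9B model being continued pre-trained.
# optimizer = AdamW(model.parameters(), lr=MAX_LR, eps=1e-15)
```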
3 Post-training

3.1 Post-training data

The post-training data consists of three subsets, used for Stage 1 IFT, Stage 2 IFT, and the preference dataset for alignment, respectively. We describe the training data for each step as follows.

Stage 1 IFT. In this step, we employ Infinity-Instruct [Foundation and Chat] (Beijing Academy of Artificial Intelligence, 2024) and OpenMathInstruct-2 (Toshniwal et al., 2024) to improve the mathematical, reasoning, and coding skills of the instruction model. The full details of the training data are shown in Appendix 7.

Stage 2 IFT. In this step, we then apply generalized large-scale instructions to the previous instruction model. In particular, we employ 22 existing datasets (written in English, Thai, and Vietnamese) and formulate 22 new synthetic datasets using various models and techniques to create SEA instruction datasets (see Appendix A.3 for the full data generation details). As shown in Appendix 9, we use a total of 7,298,828 instruction samples that cover 11 languages.

Helpfulness and preference alignment. We also conduct alignment learning on top of the instruction model using a feedback dataset, UltraFeedback (Cui et al., 2024). In addition, we synthesized a SEA version of UltraFeedback using Nemotron-70B with Gemma 2 as a reward model; see Appendix A.4 for the full details.

Figure 1: Training process of Llama-SEA-LION-8B-IT (Section 3.2.1). The post-training process consists of two stages of instruction fine-tuning, an alignment stage and multiple merge stages. Dotted lines denote a merge stage and solid lines denote an alignment stage.
3.2 Post-training process

We use LLaMA-Factory (Zheng et al., 2024b) with DeepSpeed (Rasley et al., 2020) for all Instruction Fine-Tuning (IFT) and alignment steps. All IFT stages are performed using full model fine-tuning, where the models come from the previous step (Section 2.2) and from existing models. We use MergeKit (Goddard et al., 2024) with a value of 1 for the weight and density parameters for all merge steps. Models are selected for merging empirically, based on the openness of model licenses, suitability for merging, and performance.

3.2.1 Llama-SEA-LION-8B-IT

Stage 1 IFT. As shown in Figure 1, we started off the post-training phase with IFT of Llama-SEA-LION-8B on the Infinity Instruct (Foundation) (Beijing Academy of Artificial Intelligence, 2024) and OpenMathInstruct-2 (Toshniwal et al., 2024) datasets. Both datasets contain approximately 9.5 million instruction pairs, primarily in English and centered around reasoning, math, and code. We refer to the model at this stage as Stage-1-Llama.

Stage 2 IFT. We performed a second round of IFT using the SEA-Instruct dataset, which consists of approximately 7.3 million instruction pairs, of which 5 million are generated using the Gemma-2-27B-Instruct (Rivière et al., 2024) and Qwen2.5-32B-Instruct (Yang et al., 2024a) models in SEA languages. The remainder are English-language instruction pairs from the Infinity-Instruct (Chat) (Beijing Academy of Artificial Intelligence, 2024) dataset. We refer to the model at this stage as Stage-2-Llama.

First merge. After finishing the IFT stages, we performed the first of a series of merges by merging Stage-1-Llama and Stage-2-Llama into Llama-SEA-LION-8B using the DARE TIES (Yu et al., 2024; Ilharco et al., 2023) method. We refer to the model at this stage as Merge-1-Llama.
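For readers unfamiliar with the merge method, the sketch below illustrates the DARE-TIES idea on raw state dicts: task vectors are randomly dropped and rescaled (DARE), a sign is elected per parameter, and only sign-consistent deltas are averaged (TIES). This is a conceptual illustration of the cited method, not the MergeKit configuration used for SEA-LION; note that with the density of 1 mentioned above, the random-drop step becomes a no-op.

```python
import torch

def dare_ties_merge(base_sd, tuned_sds, density=1.0, weights=None):
    """Conceptual DARE-TIES merge (Yu et al., 2024; Ilharco et al., 2023) over state dicts.

    base_sd:   parameters of the base model (here, the CPT model Llama-SEA-LION-8B)
    tuned_sds: parameters of the fine-tuned models to merge (e.g. Stage-1-Llama, Stage-2-Llama)
    """
    weights = weights or [1.0] * len(tuned_sds)
    merged = {}
    for name, base in base_sd.items():
        deltas = []
        for sd, w in zip(tuned_sds, weights):
            delta = sd[name].float() - base.float()              # task vector
            keep = torch.rand_like(delta) < density              # DARE: drop entries at random...
            delta = torch.where(keep, delta / density, torch.zeros_like(delta))  # ...and rescale the rest
            deltas.append(w * delta)
        stacked = torch.stack(deltas)
        sign = torch.sign(stacked.sum(dim=0))                    # TIES: elect a sign per parameter
        agree = (torch.sign(stacked) == sign) & (stacked != 0)
        summed = (stacked * agree).sum(dim=0)
        count = agree.sum(dim=0).clamp(min=1)
        merged[name] = base + summed / count                     # add the averaged, sign-consistent delta
    return merged
```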
Second merge. In order to mitigate catastrophic forgetting due to the fine-tuning process (Alexandrov et al., 2024), we performed a second round of merging using top-performing instruction-tuned models that share the Llama 3.1 lineage. We merge the original Llama-3.1-8B-Instruct, Llama3-8B-SEA-LION-v2.1-Instruct (SEA-LION Team, 2024), and SuperNova-Lite (Arcee-AI, 2024) into Merge-1-Llama using the Consensus TA (Wang et al., 2024b; Ilharco et al., 2023) merge method. We refer to the model at this stage as Merge-2-Llama.

Helpfulness and preference alignment. We performed one round of alignment on Merge-2-Llama using SimPO (Meng et al., 2024) with the SEA-Preference dataset. We refer to the model at this stage as Aligned-SimPO-Llama.
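The SimPO objective used in this round has a simple closed form: a length-normalized log-likelihood margin between the chosen and rejected responses, with no reference model. The sketch below follows Meng et al. (2024); the beta and gamma values are placeholders, since the paper does not report the hyperparameters used for SEA-LION.

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, chosen_lens, rejected_logps, rejected_lens,
               beta=2.0, gamma=0.5):
    """SimPO loss (Meng et al., 2024): reference-free, length-normalized reward margin.

    chosen_logps / rejected_logps: summed token log-probabilities of each response (tensors)
    chosen_lens  / rejected_lens:  response lengths in tokens (tensors)
    """
    reward_chosen = beta * chosen_logps / chosen_lens      # average log-prob, scaled by beta
    reward_rejected = beta * rejected_logps / rejected_lens
    # Push the chosen response to beat the rejected one by at least the margin gamma.
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```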
Final merge. Lastly, we perform a merge using the DELLA-Linear method. With the original Llama-3.1-8B-Instruct model as the base for merging, we merge in Merge-2-Llama and Aligned-SimPO-Llama to produce the final model, Llama-SEA-LION-8B-IT.

3.2.2 Gemma-SEA-LION-9B-IT

Figure 2: Training process of Gemma-SEA-LION-9B-IT (Section 3.2.2). The post-training process comprises two stages of instruction fine-tuning, an alignment stage, and multiple merge stages. Dotted lines denote a merge stage and solid lines denote an alignment stage.

Stage 1 and Stage 2 IFT. Similar to Llama-SEA-LION-8B-IT, we started off the post-training phase with both stages of IFT using the same datasets on the Gemma-2-9B model (Rivière et al., 2024). We refer to the models after stage 1 and stage 2 as Stage-1-Gemma and Stage-2-Gemma, respectively.

First merge. We merge Gemma-2-9B-IT (Rivière et al., 2024) and Stage-2-Gemma into Gemma-2-9B using the DELLA-Linear method. We refer to the model at this stage as Merge-1-Gemma.

Helpfulness and preference alignment. Using Merge-1-Gemma as the base model, we performed one round of alignment using SimPO with the SEA-Preference dataset. We refer to the model at this stage as Aligned-SimPO-Gemma.

Final merge. Finally, using the Gemma-2-9B model as the base model, we merged Merge-1-Gemma, FuseChat Gemma-2-9B-Instruct (Yang et al., 2024b), Gemma-SEA-LION-9B, and Aligned-SimPO-Gemma into it to produce the final model, Gemma-SEA-LION-9B-IT.

3.3 Discussion

This post-training workflow emphasizes the careful balance between general capabilities, SEA-specific linguistic fluency, and natural conversational abilities. Each step in the workflow is designed to progressively refine the model, ensuring it meets the diverse needs of users in the Southeast Asian region. The entire post-training process for Gemma-SEA-LION-9B-IT and Llama-SEA-LION-8B-IT took approximately 1,350 and 1,024 GPU hours, respectively, on eight H100 GPUs. To make the training efficient, all post-training steps utilize Liger Kernel (Hsu et al., 2024) for substantial memory savings of approximately 60%.
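The memory saving mentioned above comes from patching the Hugging Face model classes with fused Triton kernels before training. A minimal usage sketch is shown below; it assumes the patching helper apply_liger_kernel_to_llama exposed by the liger-kernel package and a generic base-model checkpoint, and may differ from the exact integration used in the SEA-LION training stack.

```python
# Assumed usage of the liger-kernel patching API; check the library's documentation for details.
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_llama()  # swap in fused RMSNorm / RoPE / SwiGLU / cross-entropy kernels
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder checkpoint
```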
4 Experimental Setup

4.1 Competitive methods

For the evaluation, we compared our models against well-known LLMs for multilingual and SEA languages, such as SeaLLMs v3 (Zhang et al., 2024a), Sailor v2 (Team, 2024), Qwen 2.5 (Yang et al., 2024a), Gemma 2 (Rivière et al., 2024) and Llama 3.1 (Dubey et al., 2024), all of which have fewer than 10 billion parameters, similar to our models.

4.2 Evaluation Benchmarks

To evaluate the robustness of our proposed models, we compare them to competitors on three benchmarks.

SEA benchmarks. We evaluated the multilingual performance of each LLM using the SEA-HELM Leaderboard (Leong et al., 2023; Susanto et al., 2025); see https://leaderboard.sea-lion.ai/ for live SEA-LION score updates. We selected SEA-HELM because its design reflects SEA culture and knowledge better than other existing benchmarks (DAMO-NLP-SG, 2024; Lovenia et al., 2024; Wang et al., 2024a). We also evaluate on SEACrowd (Lovenia et al., 2024), a benchmark with wide coverage of SEA languages that consists of natural language understanding and generation datasets for all SEA languages.

Table 1: SEA-HELM multilingual benchmark on NLU, NLG, NLR, NLI and instruction following for base and continued pre-trained models of similar sizes. The first four language columns report NLU/NLG/NLR/NLI; the last three report instruction following.

Models               Average  ID     VI     TH     TA     ID     VI     TH
Meta-Llama-3.1-8B    35.37    42.33  40.67  35.13  38.88  16.19  19.05   9.00
SeaLLMs-v3-7B        37.04    44.79  48.29  43.53  27.45  26.67  35.24  26.00
Gemma-2-9B           41.48    47.65  43.28  42.00  53.26   4.76   3.81  10.00
Qwen2.5-7B           41.98    51.63  52.17  46.55  36.60  31.43  36.19  30.00
Sailor2-8B           42.62    53.23  47.33  46.64  45.04  30.48  30.48  35.00
Llama-SEA-LION-8B    41.42    44.98  46.25  42.79  43.03  25.71  32.38  23.00
Gemma-SEA-LION-9B    48.67    57.16  49.39  47.16  60.56  25.71  20.00  27.00
Table 2: Open LLM Leaderboard benchmarks across different continued pre-trained models of similar sizes.

Models               Average  MMLU-PRO  BBH    GPQA   MATH Lvl 5  IFEval (EN)  MUSR
Meta-Llama-3.1-8B    13.9     24.95     25.29   6.32   5.14       12.7          8.98
Sailor2-8B           17.71    25.74     27.62   4.87   7.02       21.95        19.03
Gemma-2-9B           21.15    34.48     34.1   10.51  13.14       20.4         14.3
SeaLLMs-v3-7B        24.00    35.71     34.57   9.28  18.81       32.94        12.68
Qwen2.5-7B           24.99    37.39     35.81   9.96  18.88       33.74        14.14
Llama-SEA-LION-8B    16.61    27.6      26.04   7.49   9.89       16.56        12.07
Gemma-SEA-LION-9B    22.41    32.78     37.24  10.29   9.89       30.12        14.11

However, due to maintenance reasons, we cannot reproduce the NLG benchmark of SEACrowd. Therefore, we experiment only with the NLU benchmark (zero-shot), which has 131 data subsets, 7 tasks, and 31 SEA indigenous languages.

English performance. We also evaluated the English performance of the models using the Open LLM Leaderboard (HuggingFace, 2024), since English is also widely used in SEA countries and we therefore need to assess the English understanding and knowledge of the LLMs as well. The leaderboard consists of six benchmarks: IFEval (Zhou et al., 2023), Big Bench Hard (Suzgun et al., 2023), MATH (Hendrycks et al., 2021), GPQA (Rein et al., 2023), MuSR (Sprague et al., 2024) and MMLU-PRO (Wang et al., 2024c). Moreover, we also evaluate the CPT models on SEA-HELM and the Open LLM Leaderboard, since these benchmarks support the CPT evaluation.
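Figure 3 and Table 4 report weighted F1 on the SEACrowd-NLU subsets. For clarity, the per-subset metric and the Table 4 style aggregate can be computed as below; this is a generic scikit-learn sketch, not the SEACrowd evaluation harness itself, and the prompting setup is omitted.

```python
from sklearn.metrics import f1_score

def subset_weighted_f1(gold_labels, predicted_labels):
    """Weighted F1 for one NLU data subset: per-class F1 averaged by class support."""
    return 100.0 * f1_score(gold_labels, predicted_labels, average="weighted")

def average_nlu_score(per_subset_scores):
    """Table 4 style aggregate: mean of the per-subset scores (131 subsets, 31 languages)."""
    return sum(per_subset_scores) / len(per_subset_scores)
```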
5 Experimental Results

To understand the robustness and generalization of our proposed models, we conduct three studies. Section 5.1 evaluates the robustness of the continued pre-trained models using SEA-HELM and the Open LLM Leaderboard. In Section 5.2, we compare our instruction fine-tuned models with competitors on three benchmarks to demonstrate the generalization of our models. Lastly, we discuss the design choices behind our models in Section 5.3.

5.1 Continued Pre-Training Results

SEA performance. The CPT stage is primarily focused on gaining SEA language capabilities and knowledge. Comparing base and CPT models, as shown in Table 1, we observed average SEA-HELM performance increases of 6.05 and 7.19 over Meta-Llama-3.1-8B and Gemma-2-9B for Llama-SEA-LION-8B and Gemma-SEA-LION-9B, respectively. We observed a much larger average increase in instruction following capabilities in particular, which we attribute to the fact that our CPT models are trained from the instruction models rather than from the base models. Moreover, in terms of average performance, Gemma-SEA-LION-9B performs best among the compared models. This emphasizes a strong reason to perform CPT to improve performance on SEA languages, rather than skipping CPT and performing SFT directly.
Table 3: SEA-HELM multilingual benchmark on NLU, NLG, NLR, NLI, instruction following and multi-turn chat for instruct models of similar sizes. The first four language columns report NLU/NLG/NLR/NLI, the next three report instruction following, and the last three report MT-Bench.

Models                 Average  ID     VI     TH     TA     ID     VI     TH     ID     VI     TH
SeaLLMs-v3-7B-Chat     39.19    42.72  48.50  42.59  12.06  57.14  53.33  47.00  59.81  65.24  56.59
Llama-3.1-8B-Instruct  41.48    51.50  51.31  45.32  15.40  77.14  75.24  63.00  56.38  57.59  54.34
Sailor2-8B-Chat        43.13    48.98  48.01  45.44  28.29  49.52  45.71  40.00  69.76  66.97  73.94
Qwen2.5-7B-Instruct    44.58    60.28  53.46  53.43  21.03  81.90  69.52  66.00  65.66  66.80  68.71
Gemma-2-9B-IT          55.33    64.04  59.86  57.22  52.28  88.57  78.10  71.00  68.78  68.37  73.51
Stage-1-Llama          50.76    51.84  51.83  46.23  27.53  69.52  73.33  59.00  42.74  46.41  46.46
Stage-2-Llama          59.49    53.87  55.18  50.92  44.80  77.14  76.19  67.00  50.90  53.72  46.97
Merge-1-Llama          59.36    56.73  56.82  51.71  46.63  81.90  82.86  67.00  57.04  54.01  50.28
Merge-2-Llama          58.01    59.19  52.63  51.89  35.40  87.62  80.95  78.00  56.38  59.32  58.86
Aligned-SimPO-Llama    51.30    54.86  51.69  46.77  26.40  82.86  80.00  68.00  68.20  64.68  64.92
Llama-SEA-LION-8B-IT   61.84    60.50  61.48  55.92  43.61  84.76  85.71  76.00  62.65  68.32  65.13
Stage-1-Gemma          56.56    55.06  54.51  51.96  42.74  66.67  74.29  61.00  47.35  47.26  55.05
Stage-2-Gemma          66.66    64.10  61.76  56.90  57.85  89.52  82.86  76.00  60.54  58.93  58.76
Merge-1-Gemma          69.26    66.25  64.95  59.74  60.41  89.52  91.43  82.00  66.45  64.47  65.00
Aligned-SimPO-Gemma    69.37    65.69  65.47  59.51  57.38  86.67  88.57  78.00  68.89  73.67  73.51
Gemma-SEA-LION-9B-IT   69.35    66.26  64.93  59.23  58.82  94.29  88.57  78.00  65.85  73.27  69.07

Figure 3: Zero-shot model performance (weighted F1 score) across NLU tasks in SEA languages, comparing Gemma-SEA-LION-9B-IT, Llama-SEA-LION-8B-IT, Qwen2.5-7B-Instruct, Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, Llama-3.1-8B-Instruct and Gemma-2-9B-IT.

Table 4: The average NLU performance across 131 data subsets and 31 indigenous languages.

Model                   NLU Score
SeaLLMs-v3-7B-Chat      52.68
Llama-3.1-8B-Instruct   49.94
Sailor2-8B-Chat         60.21
Qwen2.5-7B-Instruct     54.51
Gemma-2-9B-IT           60.21
Llama-SEA-LION-8B-IT    55.10
Gemma-SEA-LION-9B-IT    64.13

English performance. As shown in Table 2, both CPT models also managed to perform competitively against the Meta-Llama-3.1-8B and Gemma-2-9B base models on the Open LLM Leaderboard benchmarks. This indicates that our choice of retraining with a proportion of 25% English tokens has been beneficial in mitigating catastrophic forgetting, which has been shown to stem from CPT (Zheng et al., 2024a). Although our CPT models perform lower than Qwen and SeaLLMs on this benchmark, we outperform them on SEA languages instead, which is the main focus of this work.
5.2 Instruction Fine-tuning Results

In this study, we compare our models with competitors on SEA-HELM, SEACrowd, and the Open LLM Leaderboard.

SEA-HELM. As shown in Table 3, the SEA-HELM benchmark results demonstrate that our instruct models, Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT, attain competitive performance in SEA languages, with Gemma-SEA-LION-9B-IT achieving one of the highest average performances. Moreover, we significantly improve the performance of Llama-3.1-8B-Instruct from 41.48 to 61.84 with Llama-SEA-LION-8B-IT, while Gemma-SEA-LION-9B-IT achieves a 14.02-point improvement over Gemma-2-9B-IT. Both Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT outperform other SEA-focused LLMs, such as Sailor2-8B-Chat and SeaLLMs-v3-7B-Chat, with an average score of 69.35 across all the languages covered by the SEA-HELM benchmark, apart from the SEA-MTBench tasks. This conforms with the previous results on the CPT models (Section 5.1): our CPT model performs best on SEA languages, resulting in the best performer in this experiment.
Table 5: Open LLM Leaderboard benchmarks across different instruct models of similar sizes.

Models                 Average  MMLU-PRO  BBH    GPQA   MATH Lvl 5  IFEval (EN)  MUSR
Sailor2-8B-Chat        16.37    27.93     27.15   3.47   0.00       37.49         2.19
SeaLLMs-v3-7B-Chat     22.49    33.93     24.37   7.27  15.86       44.10         9.38
Llama-3.1-8B-Instruct  27.88    29.36     26.10  10.63  17.45       77.03         6.75
Qwen2.5-7B-Instruct    27.93    37.00     34.72  10.18   0.00       76.34         9.34
Gemma-2-9B-IT          28.86    31.95     42.14  14.77   0.23       74.36         9.74
Stage-1-Llama          24.51    25.87     26.32   7.83  19.26       62.89         4.88
Stage-2-Llama          27.75    28.10     24.64   7.72  19.56       78.78         7.74
Merge-1-Llama          27.49    27.47     26.22   8.28  19.79       76.16         7.04
Merge-2-Llama          29.96    29.92     28.78   9.96  19.94       82.61         8.54
Aligned-SimPO-Llama    30.58    30.84     34.31   8.39  26.59       75.76         7.61
Llama-SEA-LION-8B-IT   30.39    31.01     29.47  10.40  22.58       80.35         8.54
Stage-1-Gemma          29.88    33.34     38.51  10.74  24.17       56.87        15.66
Stage-2-Gemma          33.48    34.67     36.06  11.74  20.77       83.00        14.61
Merge-1-Gemma          35.15    36.22     41.42  15.32  26.28       82.09         9.59
Aligned-SimPO-Gemma    35.31    37.65     42.38  14.99  27.79       80.23         8.82
Gemma-SEA-LION-9B-IT   35.43    36.94     43.39  15.10  24.24       81.85        11.07

SEACrowd. Besides evaluating on the SEA languages covered by SEA-HELM, we also evaluated our models against competitors on 31 SEA indigenous languages using SEACrowd-NLU. Note that, for this study, we use only the best settings of our models from the previous experiment (Table 3). As shown in Table 4, we observe a state-of-the-art result from Gemma-SEA-LION-9B-IT, which achieves 64.13 points on the NLU benchmark, while Llama-SEA-LION-8B-IT improves over its baseline from 49.94 to 55.10 points. Moreover, the results in Figure 3 also emphasize the robustness of our models, which reach more than 80 points on this benchmark, while SeaLLMs and Llama-3.1 have only a few cases where performance exceeds 80 points. These results emphasize the robustness of our models, which achieve the state of the art with fewer than 10B parameters on SEA benchmarks, covering both a classical NLP benchmark (SEACrowd-NLU) and a modern LLM benchmark (SEA-HELM).
English performance. We also evaluate performance in English, a widely used language, to observe the difference between SEA and English results. The Open LLM Leaderboard performance is shown in Table 5. Both Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT performed competitively in English language, math, and reasoning tasks, with Gemma-SEA-LION-9B-IT achieving the highest average score of 35.43. Moreover, we notice that the SEA models (Sailor and SeaLLMs) underperform on the English datasets. This might be because these models are optimized for SEA languages during supervised fine-tuning, and English performance decreased as a result. In contrast, our models balance SEA and English knowledge, resulting in high scores across all benchmarks.

5.3 Performance Analysis

In this study, we discuss the performance improvement from each design decision of our models (Tables 3 and 5).

Stage 1: English instruction fine-tuning. In Stage 1 IFT, the focus is predominantly on gaining general capabilities in math, code and general instruction following in the English language. Although our CPT models are based on the instruct version of Llama-3.1-8B, the CPT process has eroded the instruction following capabilities (see Table 5). We observe increases of 3.86 and 9.72 for Stage-1-Llama and Stage-1-Gemma, respectively, in English instruction following on the IFEval benchmark. We also observe average increases of 7.9 for Stage-1-Llama and 7.47 for Stage-1-Gemma on the SEA-HELM benchmark.
Stage 2: Multilingual instruction fine-tuning. In Stage 2 IFT, the focus is on multilingual and reasoning capabilities. By instruction fine-tuning on SEA languages and higher-complexity English instruction pairs, the Stage 2 models saw average increases of 8.73 for Stage-2-Llama and 10.1 for Stage-2-Gemma over the Stage 1 models on the SEA-HELM benchmark.

Merge 1: Combining Stage 1 and Stage 2. Despite the significant gains observed in Stages 1 and 2, the effects of catastrophic forgetting from earlier stages could still be observed after Stage 2. To mitigate this, we merge the Stage 1 and Stage 2 models into the CPT model, after which we observed an average increase of 2.6 for Merge-1-Gemma. We also observed an increase across all SEA-HELM benchmark tasks for Merge-1-Llama.

Merge 2: Incorporating instruct models. To reintroduce the helpfulness, relevance and informativeness of responses observed in the Llama 3.1 and Gemma 2 models, we perform further merges of open-source instruct models. While we observed significant increases in MT-Bench scores for Vietnamese and Thai, we also observed a slight degradation of average SEA-HELM performance as well as a slight degradation of Indonesian MT-Bench scores, which we view as acceptable trade-offs for the significant performance increases in Vietnamese and Thai.

Alignment steps. In the alignment step, which aligns the models to human preferences, we prioritize SEA MT-Bench performance over the other SEA-HELM benchmark tasks. We observed a broad increase in SEA MT-Bench performance across all languages for both models. However, this comes with minor degradation of instruction following capabilities and of overall Indonesian SEA-HELM performance. The alignment step encourages longer, more helpful and sensitive responses but hurts performance on task-specific benchmarks and instruction following in some languages, an issue we address in the next step.
Final merge: Combining aligned models. To compensate for the capability degradation in the previous steps, we merge Merge-2-Llama and Merge-1-Gemma with Aligned-SimPO-Llama and Aligned-SimPO-Gemma and the various open-source pre-trained models described in Sections 3.2.1 and 3.2.2 for their respective model families. For Llama-SEA-LION-8B-IT, we observed a significant increase in average SEA-HELM performance (61.84) over the alignment stage (51.30), mainly from the increase in performance on the core tasks in SEA-HELM. This performance increase demonstrates the value of empirically selecting the pre-trained models to be merged in, based on each model's strengths and weaknesses, to produce a far superior model. Gemma-SEA-LION-9B-IT easily achieves higher performance than Llama-SEA-LION-8B-IT with fewer post-training steps. We attribute this to the strong performance of the base Gemma 2 model and also to its larger vocabulary size, which has been demonstrated (Takase et al., 2024) to produce better models.
6 Related Works

Recently, researchers have proposed large language models that support multilingual settings. Llama (Dubey et al., 2024) was an early effort to release an open-source large language model for the research community to develop their own models. Qwen (Yang et al., 2024a) and Gemma (Rivière et al., 2024) then introduced open-source LLMs that perform comparably to or better than Llama, with larger amounts of training data and many supported languages in their recent releases. Massively multilingual open-source models like BLOOM (Scao et al., 2022) and Aya (Üstün et al., 2024) also support a very wide range of languages, including some SEA languages. Although these models demonstrate robust performance on English benchmarks, they mostly underperform on SEA benchmarks that test for SEA languages, SEA knowledge and cultural understanding (Lovenia et al., 2024; Susanto et al., 2025), presumably due to a lack of support for certain SEA languages or cultures.

In the SEA community, many works propose large language models designed specifically for SEA languages by adding more SEA tokens to the training process, such as SeaLLMs (Nguyen et al., 2024) and Sailor (Sailor2 Team, 2024). However, the performance of these models is robust only on in-domain datasets or favors only some tasks (i.e., classical NLP datasets). This is because the design choices in the pre-training or fine-tuning of these models are not well studied, e.g., performing a single SFT step with low-quality datasets written in some SEA languages, resulting in only slight improvements on SEA benchmarks. To create a robust SEA LLM, we need to carefully balance language representation and design both pre-training and post-training (i.e., SFT, alignment, and model merging) for SEA contexts.

7 Conclusion
Despite the sizable population and language diversity of Southeast Asia, there remains a scarcity of resources and of accurate linguistic and cultural representation in open-source LLMs. In this paper, we introduce Llama-SEA-LION-8B-IT and Gemma-SEA-LION-9B-IT, two multilingual LLMs comprehensively trained to achieve state-of-the-art performance in SEA languages, based on the Llama and Gemma families of LLMs. SEA-LION represents the next advancement in the development of LLMs that explicitly support SEA languages. Both models are fully open-source and available for commercial use to increase accessibility and innovation in multilingual LLMs in Southeast Asia. We make our resources publicly available, including the dataset, training scripts, training checkpoints, and all fine-tuned models, even those that achieve state-of-the-art performance on the benchmarks, to establish solid baselines, ensure reproducibility, and support future research focused on culturally and professionally relevant SEA applications.

Acknowledgment

This research is supported by the National Research Foundation, Singapore, under its National Large Language Models Funding Initiative. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
Limitation

Although we propose state-of-the-art SEA LLMs, we found that existing benchmarks might not cover all the properties and languages we want to evaluate. For example, SEA-HELM is a robust benchmark but covers only four languages. SEACrowd covers all SEA languages, but it contains only classical NLP datasets (no chat or instruction following datasets). We require a more holistic SEA benchmark that covers LLM-specific tasks written in all SEA languages. Nevertheless, given the current evaluation landscape, these benchmarks are the best available choices for current SEA research.

Moreover, we conduct experiments using only 8 and 9 billion parameter models. We argue that this is the most commonly used model size in real-world scenarios. In addition, our method should also work at larger or smaller model sizes, since the proposed techniques do not rely on model size, as we demonstrated by applying the SFT and alignment techniques to both the Llama and Gemma models.

References

SCB 10X, VISTEC, and SEACrowd. 2024. Thai LLM leaderboard.

AI Singapore AI Products Team. 2024. SEA-HELM.

AI Singapore. 2023. SEA-LION-Pile.

AI Singapore. 2025. SEA-LION-Pile-v2.

Anton Alexandrov, Veselin Raychev, Mark Niklas Mueller, Ce Zhang, Martin Vechev, and Kristina Toutanova. 2024. Mitigating catastrophic forgetting in language transfer via model merging. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17167-17186, Miami, Florida, USA. Association for Computational Linguistics.

Arcee-AI. 2024. Llama-3.1-SuperNova-Lite.

Adrien Barbaresi. 2021. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122-131. Association for Computational Linguistics.

Beijing Academy of Artificial Intelligence (BAAI). 2024. Infinity Instruct.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), virtual.

CommonCrawl. 2024. CommonCrawl.

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, et al. 2024. UltraFeedback: Boosting language models with scaled AI feedback. In Forty-first International Conference on Machine Learning (ICML 2024), Vienna, Austria. OpenReview.net.

DAMO-NLP-SG. 2024. SeaExam.

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open language models for South-East Asia. CoRR, abs/2404.03608.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. 2024. The Llama 3 herd of models. CoRR, abs/2407.21783.

Wikimedia Foundation. 2024. Wikimedia Enterprise HTML dumps downloads.

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. 2024. Arcee's MergeKit: A toolkit for merging large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 477-485, Miami, Florida, USA. Association for Computational Linguistics.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), virtual.

Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, and Yanning Chen. 2024. Liger Kernel: Efficient Triton kernels for LLM training. arXiv preprint arXiv:2410.10989.

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, et al. 2024. MiniCPM: Unveiling the potential of small language models with scalable training strategies. CoRR, abs/2404.06395.

HuggingFace. 2024. Open LLM Leaderboard.

Gabriel Ilharco, Marco Túlio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda. OpenReview.net.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431. Association for Computational Linguistics.
Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer Sarveswaran, and William-Chandra Tjhi. 2023. BHASA: A holistic Southeast Asian linguistic and cultural evaluation suite for large language models. CoRR, abs/2309.06085.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA. OpenReview.net.

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, et al. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), pages 5155-5203, Miami, FL, USA. Association for Computational Linguistics.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, et al. 2024. StarCoder 2 and The Stack v2: The next generation. CoRR, abs/2402.19173.

Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. SimPO: Simple preference optimization with a reference-free reward. CoRR, abs/2405.14734.

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, et al. 2024. SeaLLMs: Large language models for Southeast Asia. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 294-304, Bangkok, Thailand. Association for Computational Linguistics.
OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. CoRR, abs/2406.17557.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-Dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 1882-1892. Association for Computational Linguistics.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3505-3506. ACM.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2023. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022.

Morgane Rivière, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, et al. 2024. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118.
Sailor2 Team. 2024. Sailor2: Sailing in South-East Asia with inclusive multilingual LLM.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. CoRR, abs/2211.05100.
AI Singapore SEA-LION Team. 2024. Llama3 8B CPT SEA-LIONv2.1 Instruct.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Evan Walsh, Luke Zettlemoyer, Noah Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo. 2024. Dolma: An open corpus of three trillion tokens for language model pretraining research. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15725–15788, Bangkok, Thailand. Association for Computational Linguistics.
Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2024. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. SEA-HELM: Southeast Asian holistic evaluation of language models. Preprint, arXiv:2502.14301.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. 2023. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13003–13051. Association for Computational Linguistics.
Sho Takase, Ryokan Ri, Shun Kiyono, and Takuya Kato. 2024. Large vocabulary size improves large language models. CoRR, abs/2406.16508.
Sailor Team. 2024. Sailor2: Sailing in South-East Asia with inclusive multilingual LLMs.
The Mosaic ML Team. 2021. Composer. https://github.com/mosaicml/composer/.
The Mosaic ML Team. 2022. LLM Foundry. https://github.com/mosaicml/llm-foundry.
Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. 2024. OpenMathInstruct-2: Accelerating AI for math with massive open-source instruction data. CoRR, abs/2410.01560.
Ahmet Üstün, Viraat Aryabumi, Zheng Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. 2024. Aya model: An instruction finetuned open-access multilingual language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15894–15939. Association for Computational Linguistics.
Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, AiTi Aw, and Nancy Chen. 2024a. SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 370–390. Association for Computational Linguistics.
Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jiménez, François Fleuret, and Pascal Frossard. 2024b. Localizing task information for improved model merging and compression. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024c. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574.
Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do llamas work in English? On the latent language of multilingual transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15366–15394. Association for Computational Linguistics.
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie E. Everett, Alexander A. Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. 2024. Small-scale proxies for large-scale transformer training instabilities. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2024. Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing. CoRR, abs/2406.08464.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. 2024a. Qwen2 technical report. CoRR, abs/2407.10671.
Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, and Xiaojun Quan. 2024b. Weighted-reward preference optimization for implicit model fusion. CoRR, abs/2412.03187.
Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik Cambria. 2024. Self-training large language models through knowledge detection. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pages 15033–15045. Association for Computational Linguistics.
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. 2024a. SeaLLMs 3: Open foundation and chat multilingual large language models for Southeast Asian languages. CoRR, abs/2407.19672.
Xulang Zhang, Rui Mao, and Erik Cambria. 2024b. Multilingual emotion recognition: Discovering the variations of lexical semantics between languages. In International Joint Conference on Neural Networks, IJCNN 2024, Yokohama, Japan, June 30 - July 5, 2024, pages 1–9. IEEE.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860.
Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, and Ming Zhou. 2024a. Breaking language barriers: Cross-lingual continual pre-training at scale. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 7725–7738. Association for Computational Linguistics.
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024b. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguistics.
Chengzhi Zhong, Fei Cheng, Qianying Liu, Junfeng Jiang, Zhen Wan, Chenhui Chu, Yugo Murawaki, and Sadao Kurohashi. 2024. Beyond English-centric LLMs: What language do multilingual language models think in? CoRR, abs/2408.10811.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. CoRR, abs/2311.07911.

A Appendix

A.1 Continued pre-training (CPT) data

Existing data: We utilize existing datasets as shown in Table 6 (HuggingFace Datasets).

Other data: As shown in Table 6 (the "Source (Others)" section), the listed datasets contain data from a diverse range of domains, including news, books, articles, and poems.

Continued Pre-training Data
Source (HuggingFace Datasets)  Languages  Size (Billions of Tokens)
bigcode/the-stack-v2-dedup  CODE  40
allenai/dolma  EN  37.5
HuggingFaceFW/fineweb-edu  EN  7.5
aisingapore/SEA-PILE-v1  SEA  47.58
aisingapore/SEA-PILE-v2  ID  7
Source (Others)  Languages  Size (Billions of Tokens)
VinBigData  VI  16
WangChanBERTa  TH  8.5
Others - EN  EN  5
Others - SEA  SEA  30.92
Table 6: List of datasets for the continued pre-training stage.
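The per-source token counts in Table 6 add up exactly to the 200-billion-token CPT budget described in the main text. The short Python check below (values copied from Table 6; the script itself is purely illustrative) makes the arithmetic explicit.

```python
# Sanity check: the CPT data mixture in Table 6 should add up to the ~200B-token budget.
# Per-source token counts (in billions of tokens) are copied from Table 6.
cpt_mixture_btokens = {
    "bigcode/the-stack-v2-dedup": 40.0,    # CODE
    "allenai/dolma": 37.5,                 # EN
    "HuggingFaceFW/fineweb-edu": 7.5,      # EN
    "aisingapore/SEA-PILE-v1": 47.58,      # SEA
    "aisingapore/SEA-PILE-v2": 7.0,        # ID
    "VinBigData": 16.0,                    # VI
    "WangChanBERTa": 8.5,                  # TH
    "Others - EN": 5.0,                    # EN
    "Others - SEA": 30.92,                 # SEA
}

total_btokens = sum(cpt_mixture_btokens.values())
print(f"Total CPT budget: {total_btokens:.2f}B tokens")  # -> 200.00B tokens
```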
A.2 Stage 1 IFT data

Stage 1 IFT Datasets
Source (HuggingFace Datasets)  Languages  Size
BAAI/Infinity-Instruct  EN  7,449,106
nvidia/OpenMathInstruct-2  EN  2,000,000
Table 7: List of datasets for Stage-1-IFT. For the BAAI/Infinity-Instruct dataset, any conversation that originally ended with a user turn has had that last turn removed.

A.3 Stage 2 IFT data

Existing data: We utilize existing datasets as shown in Table 9 (HuggingFace Datasets).

Synthetic data: As shown in Table 9 (the "Source (Generated)" section), the synthetic datasets are constructed as follows; a minimal sketch of the generate-then-translate recipe is given after the list.

• qwen_gemma_synthetic datasets are generated first in English with Qwen 32B, utilizing an approach similar to Magpie (Xu et al., 2024). Instructions are then translated into the target language with Gemma 2 27B.
• llama_gemma_synthetic datasets are generated first in English with Llama 3.1 70B, utilizing an approach similar to Magpie (Xu et al., 2024). Instructions are then translated into the target language with Gemma 2 27B.
• gemma_synthetic datasets are generated directly with Gemma 2 27B using Magpie (Xu et al., 2024).
• sea_multilingual_systemchat is a synthetic dataset translated with Gemma 2 27B from the English systemchat dataset.
• rewritten_oasst is a dataset rewritten with Gemma 2 27B based on the English OASST dataset.
• rewritten_helpsteer is a dataset rewritten with Gemma 2 27B based on the English HelpSteer dataset.
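The following is a minimal, hypothetical sketch of the generate-then-translate recipe behind the qwen_gemma_synthetic and llama_gemma_synthetic sets: Magpie-style instruction elicitation (Xu et al., 2024) in English, followed by translation of the instruction into the target language with Gemma 2 27B. The `complete` callable, prompts, and sampling settings are assumptions for illustration rather than the released SEA-LION pipeline, and the response side of the translated pairs is only a placeholder, since the appendix does not spell it out.

```python
# Hypothetical sketch of the Magpie-style generate-then-translate recipe.
# `complete(model, prompt, **kwargs)` stands in for any chat-completion backend
# (vLLM, TGI, etc.); it is an assumed helper, not part of the released code.

GENERATOR = "Qwen-32B"       # Llama 3.1 70B for the llama_gemma_synthetic sets
TRANSLATOR = "Gemma-2-27B"

def elicit_instruction(complete) -> str:
    # Magpie trick (Xu et al., 2024): prompt the aligned generator with only the
    # chat-template prefix of a user turn, so it "fills in" a plausible instruction.
    return complete(GENERATOR, prompt="", user_prefix_only=True, temperature=1.0).strip()

def synthesize_example(complete, target_lang: str) -> dict:
    instruction_en = elicit_instruction(complete)
    # Translate only the instruction into the target SEA language with Gemma 2 27B.
    instruction_tgt = complete(
        TRANSLATOR,
        prompt=f"Translate the following instruction into {target_lang}:\n\n{instruction_en}",
        temperature=0.0,
    ).strip()
    # Produce a response for the translated instruction (placeholder step; the
    # appendix does not describe how responses for the translated pairs are made).
    response_tgt = complete(TRANSLATOR, prompt=instruction_tgt, temperature=0.7)
    return {"instruction": instruction_tgt, "response": response_tgt, "lang": target_lang}

if __name__ == "__main__":
    # Dummy backend so the sketch runs end-to-end without a real inference server.
    dummy = lambda model, prompt, **kw: f"<{model} output for {prompt[:30]!r}>"
    print(synthesize_example(dummy, "Thai"))
```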
A.4 Helpfulness and preference alignment data

As shown in Table 8, we use princeton-nlp/gemma2-ultrafeedback-armorm as the source of the alignment data, and further re-score with the reward model nvidia/Llama-3.1-Nemotron-70B-Reward to create the SEA version. In particular, generated-gemma2-27b-seapref-nemotron-70b takes prompts from seald, wangchan_thaiinstruct, and additional hand-written Southeast Asian cultural prompts collected from native speakers, and then generates responses from them with Gemma 2 27B at varying temperatures. The responses are then scored with nvidia/Llama-3.1-Nemotron-70B-Reward, with the top-scoring response selected as the chosen response and the lowest-scoring one as the rejected response, similar to princeton-nlp/gemma2-ultrafeedback-armorm. A minimal sketch of this sampling-and-scoring loop is given after Table 8.

Preference Data
Source (HuggingFace Datasets)  Languages  Size
princeton-nlp/gemma2-ultrafeedback-armorm  EN  61,510
Source (Generated)  Languages  Size
generated-gemma2-27b-seapref-nemotron-70b  SEA  5,511
Table 8: List of preference datasets used for the alignment stage.
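As a rough illustration of how generated-gemma2-27b-seapref-nemotron-70b is assembled, the sketch below samples several responses per prompt at varying temperatures, scores each with a reward model, and keeps the highest- and lowest-scoring responses as the chosen/rejected pair. The `generate` and `score` callables are assumed stand-ins for Gemma 2 27B and nvidia/Llama-3.1-Nemotron-70B-Reward behind whatever inference stack is available; this is a sketch of the idea, not the released pipeline.

```python
# Hypothetical sketch of building one preference pair in the style of
# generated-gemma2-27b-seapref-nemotron-70b: sample at varying temperatures,
# score with a reward model, keep the best as "chosen" and the worst as "rejected".
from typing import Callable, Sequence

def build_preference_pair(
    prompt: str,
    generate: Callable[[str, float], str],  # assumed wrapper around Gemma 2 27B
    score: Callable[[str, str], float],     # assumed wrapper around the Nemotron-70B reward model
    temperatures: Sequence[float] = (0.3, 0.7, 1.0, 1.2),
) -> dict:
    # 1) Sample one candidate response per temperature.
    candidates = [generate(prompt, t) for t in temperatures]
    # 2) Score every (prompt, response) pair with the reward model.
    ranked = sorted(candidates, key=lambda r: score(prompt, r))
    # 3) Highest-scoring response becomes "chosen", lowest-scoring becomes "rejected".
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

if __name__ == "__main__":
    # Dummy helpers so the sketch runs without real models.
    fake_generate = lambda p, t: f"response at T={t} to {p!r}"
    fake_reward = lambda p, r: len(r)  # placeholder score
    print(build_preference_pair("Describe a Songkran tradition.", fake_generate, fake_reward))
```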
Stage 2 IFT Datasets
Source (HuggingFace Datasets)  Languages  Size
BAAI/Infinity-Instruct*  EN  1,456,927
HuggingFaceTB/smoltalk  EN  409,537
allenai/tulu-3-sft-personas-math  EN  149,960
parinzee/seed-free-synthetic-instruct-thai-v1  TH  118,898
HuggingFaceTB/smoltalk  EN  96,356
HuggingFaceTB/smoltalk  EN  83,144
arcee-ai/EvolKit-75K  EN  74,174
AI-MO/NuminaMath-TIR  EN  72,441
Post-training-Data-Flywheel/AutoIF-instruct-61k  EN  61,492
argilla/ifeval-like-data  EN  56,339
HuggingFaceTB/smoltalk  EN  53,342
ai2-adapt-dev/tulu_v3.9_wildjailbreak_decontaminated_50k  EN  50,000
ai2-adapt-dev/tulu_v3.9_synthetic_finalresp_wildguardmixtrain_decontaminated_50k  EN  50,000
allenai/tulu-3-sft-personas-math-grade  EN  49,980
allenai/tulu-3-sft-personas-code  EN  34,999
HuggingFaceTB/smoltalk  EN  34,424
allenai/tulu-3-sft-personas-instruction-following  EN  29,980
airesearch/WangchanThaiInstruct  TH  25,014
allenai/tulu-3-sft-personas-algebra  EN  20,000
arcee-ai/EvolKit-20k-vi  VI  15,378
allenai/coconot  EN  10,983
ai2-adapt-dev/tulu_v3.9_sciriff_10k  EN  10,000
Source (Generated)  Languages  Size
qwen_gemma_synthetic_tamil  TA  480,000
qwen_gemma_synthetic_thai  TH  480,000
qwen_gemma_synthetic_indonesian  ID  465,019
qwen_gemma_synthetic_vietnamese  VI  465,019
gemma_synthetic_indonesian  ID  458,149
gemma_synthetic_filipino  TL  455,093
gemma_synthetic_viet  VI  291,576
gemma_synthetic_tamil  TA  276,314
gemma_synthetic_thai  TH  186,339
gemma_synthetic_javanese  JV  110,000
gemma_synthetic_sudanese  SU  110,000
llama_gemma_synthetic_thai  TH  88,920
llama_gemma_synthetic_tamil  TA  88,920
llama_gemma_synthetic_vietnamese  VI  88,920
llama_gemma_synthetic_javanese  JV  88,920
llama_gemma_synthetic_indonesian  ID  88,920
llama_gemma_synthetic_filipino  TL  80,000
enrich_27k  SEA  27,463
sea_multilingual_systemchat  SEA  1,903
rewritten_oasst  SEA  841
rewritten_helpsteer  SEA  838
Table 9: List of datasets for Stage-2-IFT.