OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Summary
This paper introduces OPENSEAL, the first fully open-source Southeast Asian large language model (LLM). The authors investigate the effectiveness of parallel data in continual pretraining (CPT) of LLMs to extend their multilingual capabilities. They find that using only parallel data is the most effective strategy for adding new languages, outperforming approaches that mix parallel and monolingual data. OPENSEAL is built by continually training the open-source OLMo 2 model on 34.7B tokens of parallel data across ten Southeast Asian languages and English. The model achieves strong performance on translation and commonsense reasoning benchmarks, rivaling or surpassing existing Southeast Asian LLMs that use much larger multilingual datasets. Key advantages include full transparency (open source code, data, and weights), which supports security, research, and AI sovereignty. The model was trained in 180 hours on 8× NVIDIA H200 GPUs at a cost under $12,000, making it a cost-effective solution. The study highlights the efficiency and effectiveness of parallel data in CPT, offering a practical pathway for developing high-quality, region-focused LLMs.
OPENSEAL: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data

Tan Sang Nguyen, Muhammad Reza Qorib, Hwee Tou Ng
Department of Computer Science, National University of Singapore
e1583535@u.nus.edu, mrqorib@u.nus.edu, dcsnght@nus.edu.sg

Abstract

Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8× NVIDIA H200 GPUs, we built OPENSEAL, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.

1 Introduction

With the democratization of large language model (LLM) development, enabled by open-source efforts from the research community, region-specific LLMs have been proposed to improve performance across particular language groups, especially low-resource languages. Consistent with the no free lunch theorem (Wolpert and Macready, 1997), numerous studies have shown that region-specific LLMs outperform one-size-fits-all models (Sengupta et al., 2023; Nguyen et al., 2024).
The most efficient way to build a region-specific LLM is to continually train a model that has already been pretrained on trillions of tokens, standing on the shoulders of giants. Prior work has shown that this approach is not only more efficient but also more effective, achieving better performance than training from scratch (Zheng et al., 2024). Most existing approaches adapt an LLM by further training it on a mixture of monolingual corpora corresponding to the target languages (Ng et al., 2025).

Recently, increasing effort has been devoted to developing large language models for the Southeast Asian region. These efforts include compiling large-scale training corpora (Ng et al., 2025), building evaluation benchmarks (Lovenia et al., 2024), and training LLMs that support languages spoken in the region (Nguyen et al., 2024). However, none of these models are fully open source. Existing Southeast Asian LLMs are open-weight models: while their model weights are publicly available, their training data (including the training data of their base LLMs) and training code are not fully disclosed.

Without complete transparency regarding their training data, such models risk being poisoned, either deliberately by the model provider or inadvertently through the inclusion of malicious content from the Internet. Zhang et al. (2025b) show that poisoning as little as 0.1% of the training data can induce harmful behavior in an LLM, and that this harm can persist even after additional preference tuning intended to improve safety and helpfulness. This makes non-fully-open-source LLMs risky for security-sensitive applications, such as those in government institutions.
Recent work has demonstrated the effectiveness of parallel data in improving multilingual capabilities in LLMs (Qorib et al., 2025; Fujii et al., 2024). Building on this advancement, we conduct a controlled and comprehensive investigation into the effectiveness of parallel data for extending LLMs to new languages. Based on this investigation, we develop OPENSEAL (Open-source SouthEast Asian LLM), a fully open-source Southeast Asian LLM. OPENSEAL is trained using only 34.7B tokens of parallel data and 180 hours on 8× NVIDIA H200 GPUs. The training cost based on commercial rates charged by cloud providers (for example, AWS p5en.48xlarge) is under US$12,000.

To summarize, our contributions are as follows:

• We propose a good, fast, and cheap way of adapting an LLM to new languages by utilizing parallel data. We show that under the same token budget, utilizing only parallel data is the most efficient. To the best of our knowledge, we are the first to present a comprehensive controlled investigation of the use of parallel data for continual pretraining.

• We built a new Southeast Asian LLM, OPENSEAL, by continually training an open-source English-only LLM (OLMo 2; Walsh et al. 2025) on a relatively small amount of parallel data. On translation and commonsense reasoning benchmarks, OPENSEAL performs at least as well as, and in some cases outperforms, existing Southeast Asian LLMs that were based on high-performing multilingual LLMs and trained on a much larger amount of multilingual data.
• Our model is fully open (open source, open data, and open weight; source code, training data, and model weights will be released upon publication), which allows future research to investigate the multilinguality of LLMs in a more precise and detailed manner. To the best of our knowledge, our model is the first fully open Southeast Asian LLM.

2 Related Work

In this section, we discuss recent work that investigates the effectiveness of parallel data for LLM training (pretraining, continual pretraining, and instruction tuning) and recent efforts in building LLMs for the Southeast Asian region.

2.1 Parallel Data in LLM Training

Parallel data have been commonly used in training encoder-only (Conneau and Lample, 2019; Ouyang et al., 2021) and encoder-decoder (Liu et al., 2020; Chi et al., 2021) LLMs, but early decoder-only LLMs (Scao et al., 2022) did not utilize parallel data and were trained solely on a mixture of monolingual data.

Qorib et al. (2025) conduct a comprehensive investigation into the effects of parallel data in LLM pretraining. They compare pretraining with and without parallel data in a controlled setting by fixing the choice and order of the training data. Their experiments on a 1B-parameter model trained on 167B tokens show that incorporating parallel data improves multilingual capabilities more than monolingual data, especially when parallel data are placed at the end of training. While these findings highlight the value of parallel data in LLM training, the experiments are limited to 1B models trained on a relatively small number of tokens compared to production-scale LLMs.
A recent multilingual LLM, Apertus (Hernández-Cano et al., 2025), likewise places parallel data at the end of training, in the final (fifth) pretraining stage. Although the models have up to 70B parameters, the authors do not investigate the specific effect of parallel data during pretraining.

Fujii et al. (2024) utilize parallel data to build a Japanese LLM by performing continual pretraining (CPT) of Llama 2 (Touvron et al., 2023). They first train Llama 2 on 22 million parallel sentence pairs before further training it on monolingual data. The motivation for this training order is the assumption that parallel sentences can facilitate the transition from English-centric to Japanese-centric training.

In contrast, ALMA (Xu et al., 2024) adapts Llama 2 for machine translation by training the model on parallel data only after it has been trained on monolingual data. The authors argue that monolingual training is necessary to improve the model's proficiency in non-English languages before introducing parallel data.

These two approaches adopt the exact opposite order for utilizing parallel data, and this difference in training order is not inconsequential. We find that such differences can result in models with differing capabilities. Due to the lack of comprehensive empirical studies on the effects of parallel data in CPT, it remains unclear how parallel data influence continual pretraining and how they can be maximally leveraged to extend LLMs to new languages.
Ranaldi et al. (2024) utilize parallel data as an additional signal during instruction tuning. Specifically, they employ a machine translation model to translate instruction-following datasets into multiple languages and incorporate the resulting translation pairs during instruction tuning. They find that this approach promotes better cross-lingual transfer. While instruction tuning is outside the scope of our current work, this method is orthogonal to ours and can be applied sequentially.

Our work extends the investigation of parallel data by Qorib et al. (2025) to continual pretraining of a base LLM (OLMo 2) that is pretrained on more than four trillion tokens. In addition to studying CPT on a much larger scale, we scale the model size to 7B parameters. Although parallel data have been used previously for CPT, to the best of our knowledge, no comprehensive empirical study has systematically evaluated the role of parallel data in continual pretraining. Our work aims to fill this gap and demonstrates that parallel data are highly effective for extending LLMs to new languages.

2.2 Southeast Asian LLMs

Notable LLMs developed for the Southeast Asian region include SEA-LION v3.5 (Ng et al., 2025), Sailor 2 (Dou et al., 2024), and SeaLLM 3 (Zhang et al., 2025a). All of them were built on multilingual open-weight LLMs by performing continual pretraining (CPT) on Southeast Asian data, making the models more attuned to native texts from the region. Adapting high-performing multilingual models allows them to retain strong capabilities in English and Chinese, while also providing a significant head start on high-resource Southeast Asian languages such as Indonesian and Thai. In this paper, we refer to languages using their ISO 639 codes, as listed in Table 1.
SEA-LION v3.5 has two versions: one built by performing CPT on Llama 3.1 8B and the other on Gemma 2 9B (at the time of writing, only the Llama-based version is publicly available). CPT was applied to the instruction-tuned versions of the models using 200B tokens of data, comprising 50B tokens of English, 110B tokens from 10 Southeast Asian languages (id, km, lo, ms, my, ta, th, tl, vi, zh), and 40B tokens of code. The models then underwent instruction tuning and preference tuning on both English and Southeast Asian data. Between training stages, a series of model-merging steps was employed to mitigate catastrophic forgetting. On eight nodes equipped with 8× NVIDIA H200 GPUs per node, the total training time was six days for the Llama-based version and ten days for the Gemma-based version.

A newer version of SEA-LION (v4) has been released on HuggingFace (https://huggingface.co/collections/aisingapore/sea-lion-v4), built by performing CPT on Gemma 3 and Qwen 3. The Gemma version has 27B parameters, while the Qwen version has 32B parameters. The Gemma model was continually pretrained on approximately 500B tokens from 11 languages (en, id, km, lo, ms, my, ta, th, tl, vi, zh), whereas the Qwen model was continually pretrained on approximately 100B tokens from 7 Southeast Asian languages (id, ms, my, ta, th, tl, vi). SEA-LION v4 also includes vision-language model variants adapted from Qwen 4B, Qwen 8B, and Gemma 27B. At the time of writing, no technical report or academic paper on SEA-LION v4 has been released.
Sailor 2 was built by expanding Qwen 2.5 using an approach similar to LlamaPro (Wu et al., 2024), increasing the parameter counts from 0.5B to 1B, from 7B to 8B, and from 14B to 20B. These expansions were designed to mitigate potential catastrophic forgetting on English and Chinese tasks. The models were trained on 500B tokens of data, comprising 400B tokens from 13 Southeast Asian languages (ceb, id, ilo, jv, km, lo, ms, my, su, th, tl, vi, war) and 100B tokens of English text. CPT was performed in two stages. In the first stage, the model was trained on 450B tokens from a comprehensive dataset covering 8 Southeast Asian languages and English. In the second stage, it was trained on high-quality data covering 13 Southeast Asian languages along with English instruction-tuning data. The model then underwent two stages of instruction tuning and two stages of preference tuning.

SeaLLM 3 was built by performing CPT on Qwen 2 using data from 12 Southeast Asian languages (en, id, jv, km, lo, ms, my, ta, th, tl, vi, zh). Unlike the previously discussed Southeast Asian LLMs, SeaLLM updates only a small subset of model parameters identified as language-specific neurons. In addition, CPT was performed for one language at a time, resulting in multiple monolingual model variants, each specialized in a single Southeast Asian language. Subsequently, the monolingual models were merged into a single multilingual model. This training approach requires less data (the exact amount of training data used for SeaLLM is not disclosed).
3 Methods

To evaluate the effectiveness of parallel data, we vary the inclusion and placement of parallel data and try to maintain the choice and order of the other training data, while ensuring each data type is still uniformly distributed. To this end, we define each data source as small blocks of 262,144 tokens, corresponding to 64 sequences of 4,096 tokens (the maximum context window size); see Figure 1. Note that the order of blocks within a batch (eight blocks for the 1B model or sixteen blocks for the 7B model) is randomized.

[Figure 1: Illustration of the training data sequences of our parallel data investigation.]

Monolingual data blocks are constructed from a collection of texts in a single language, whereas parallel data blocks are formed by concatenating an English sentence and its Southeast Asian translation using the format "source language: source sentence\ntarget language: target sentence<|endoftext|>". This format mimics incidental parallel data found in PaLM's (Chowdhery et al., 2023) training set, as reported by Briakou et al. (2023). To ensure exposure to both translation directions, we randomize the order of the source and target languages within each block (e.g., for en-id blocks, some segments begin with English, while others begin with Indonesian).

Replay is crucial for mitigating catastrophic forgetting: even a small replay ratio (e.g., 5%) yields significant advantages over using no replay. The replay proportion represents a trade-off between preserving proficiency in previously acquired domains or languages and adapting to new ones. Ibrahim et al. (2024) empirically analyze replay ratios and recommend a 25% replay rate. Following this recommendation, we ensure that one out of every four blocks in our training data is replay data across all experimental settings.

We define five experimental settings: MULTILINGUAL, MIXED, PARALLEL FIRST, PARALLEL LAST, and PARALLEL ONLY. In the MIXED, PARALLEL FIRST, and PARALLEL LAST settings, we maintain an equal ratio of monolingual and parallel data blocks.
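To make the block construction concrete, the following Python sketch assembles a parallel data block in the format described above. It is an illustration only, not the released OPENSEAL pipeline: the helper names (format_pair, make_parallel_block), the language labels, and the count_tokens callable (a stand-in for the OLMo 2 tokenizer) are all assumptions.

```python
import random

BLOCK_TOKENS = 262_144   # one data block = 64 sequences of 4,096 tokens
EOS = "<|endoftext|>"    # end-of-text marker separating segments

def format_pair(en: str, sea: str, sea_lang: str) -> str:
    """Format one sentence pair in the incidental-translation style.

    The direction is randomized per segment so that a single block
    exposes the model to both EN->SEA and SEA->EN.
    """
    if random.random() < 0.5:
        return f"English: {en}\n{sea_lang}: {sea}{EOS}"
    return f"{sea_lang}: {sea}\nEnglish: {en}{EOS}"

def make_parallel_block(pairs, sea_lang, count_tokens):
    """Concatenate formatted pairs until the block budget is reached.

    `pairs` yields (english, sea) sentence tuples; `count_tokens` is
    any callable returning a token count (hypothetical stand-in for
    the OLMo 2 tokenizer).
    """
    segments, total = [], 0
    for en, sea in pairs:
        segment = format_pair(en, sea, sea_lang)
        total += count_tokens(segment)
        segments.append(segment)
        if total >= BLOCK_TOKENS:
            break
    return "".join(segments)
```

A training stream for, say, the PARALLEL ONLY setting would then interleave three such parallel blocks with one replay block, matching the 25% replay rate above.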
3.1 Multilingual

The MULTILINGUAL setting reflects the standard approach to extending an LLM to new languages, in which the model is continually pretrained solely on a mixture of monolingual corpora. We sample data blocks uniformly across all languages to prevent dominance by any single language. Continual pretraining on a mixture of monolingual corpora is similar to the approach used by SEA-LION and other Southeast Asian LLMs.

3.2 Mixed

In the MIXED setting, monolingual and parallel data blocks are uniformly interleaved. To ensure that each batch contains both data types, we enforce the presence of at least one monolingual block and one parallel block between two replay blocks. Mixing monolingual data and parallel data is also employed by Apertus in its fifth pretraining stage.

3.3 Parallel First

The PARALLEL FIRST setting prioritizes parallel data in the early stages of continual pretraining. Once the parallel blocks are exhausted, training proceeds with multilingual data. This setting is motivated by the hypothesis that early exposure to parallel data facilitates the learning of cross-lingual alignments, which can subsequently be reinforced and generalized through multilingual training. Placing parallel data before monolingual data is also the CPT strategy adopted by Fujii et al. (2024).

3.4 Parallel Last

This setting begins with monolingual blocks and switches to parallel data after the monolingual blocks are exhausted. It is motivated by the hypothesis that extensive monolingual training improves general fluency in new languages before introducing aligned data for enhanced cross-lingual transfer. Placing monolingual data before parallel data is the CPT strategy used by ALMA.
3.5 Parallel Only

In the PARALLEL ONLY setting, continual pretraining relies exclusively on parallel data and replay data, entirely omitting monolingual data. This setting tests whether parallel data alone are sufficient for improving multilingual capabilities and cross-lingual transfer, and potentially superior to monolingual data; it also tests whether parallel data should be used exclusively for CPT when available.
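The five settings differ only in how monolingual, parallel, and replay blocks are ordered. The sketch below enumerates one plausible schedule per setting under the constraints stated in Section 3 (25% replay rate, equal monolingual/parallel ratio where both are present); it is a simplified reading of Figure 1, not the authors' data loader, and it omits the within-batch shuffling described above.

```python
def block_sequence(setting: str, n_blocks: int) -> list[str]:
    """Return a block-type schedule ('replay', 'mono', 'para').

    Every fourth block is replay (25% replay rate), and the remaining
    budget is split between monolingual and parallel blocks according
    to the setting. MIXED alternates the two types so that both occur
    between consecutive replay blocks, as Section 3.2 requires.
    """
    content = n_blocks * 3 // 4          # non-replay blocks
    half = content // 2
    if setting == "multilingual":
        stream = ["mono"] * content
    elif setting == "parallel_only":
        stream = ["para"] * content
    elif setting == "parallel_first":
        stream = ["para"] * half + ["mono"] * (content - half)
    elif setting == "parallel_last":
        stream = ["mono"] * half + ["para"] * (content - half)
    elif setting == "mixed":
        stream = ["para", "mono"] * half + ["para"] * (content - 2 * half)
    else:
        raise ValueError(f"unknown setting: {setting}")
    schedule = []
    for i, block in enumerate(stream):
        if i % 3 == 0:                   # one replay block per three content blocks
            schedule.append("replay")
        schedule.append(block)
    return schedule
```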
4 Experiments

4.1 Model

We build our models via continual pretraining (CPT) of OLMo 2, a fully open, decoder-only transformer language model. We experiment with the 1B and 7B OLMo 2 variants. The 1B model has 16 transformer layers, while the 7B model has 32 layers. OLMo 2 employs an autoregressive architecture designed for stable large-scale training, incorporating bias-free linear layers, rotary positional embeddings, RMS normalization, and SwiGLU activation. We first train the 1B model on 10B tokens to identify the optimal CPT strategy. We then scale the best configuration to the 7B model by training on 34.7B tokens and compare it against the MULTILINGUAL setting, which is the most commonly used CPT setting in prior work.

We keep the OLMo 2 architecture fixed and focus exclusively on the effects of training data composition and ordering during CPT. Starting from an OLMo 2 checkpoint pretrained on four trillion tokens of general-domain data, we further adapt the model using different mixtures of parallel, multilingual, and replay data discussed in Section 3. Aside from the data mixture, all training hyperparameters and model components are identical across the different settings, enabling a controlled comparison of how parallel and multilingual data interact under CPT. Following prior work, we use the WSD scheduler (Hu et al., 2024). Details of the training hyperparameters are provided in Appendix C.
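For reference, a WSD (warmup-stable-decay) schedule in the spirit of Hu et al. (2024) can be sketched as below: a linear warmup to the peak learning rate, a long constant phase, and a decay at the end. The phase fractions and the linear decay shape here are illustrative assumptions; the paper's actual hyperparameters are deferred to Appendix C.

```python
def wsd_lr(step: int, total: int, peak: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay learning-rate schedule (after Hu et al., 2024).

    Linear warmup to `peak`, a long constant phase, then a linear decay
    over the final `decay_frac` of training. Phase fractions are
    illustrative, not the values used in this paper.
    """
    warmup_steps = int(total * warmup_frac)
    decay_steps = int(total * decay_frac)
    stable_end = total - decay_steps
    if step < warmup_steps:                      # warmup phase
        return peak * step / max(1, warmup_steps)
    if step < stable_end:                        # stable phase
        return peak
    return peak * (total - step) / max(1, decay_steps)  # decay phase
```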
4.2 Data

For the parallel data, we construct a large-scale SEA-English corpus covering multiple Southeast Asian languages, comprising approximately 403.8 million parallel sentence pairs in total (Table 1). The corpus contains 17.2 billion tokens in ten Southeast Asian languages and 8.8 billion tokens in English, as measured using the OLMo 2 tokenizer. Wherever possible, we rely on NLLB (Costa-jussà et al., 2022) as the primary source of parallel data to ensure consistency across languages. For Thai, however, NLLB data are not available; therefore, we aggregate parallel data from multiple publicly available sources (bible-uedin, CCAligned, ELRC_2922, GNOME, HPLT, KDE4, OpenSubtitles, QED, Tanzil, and TED2020).

ISO Code  Language     # sent.   # tokens (SEA)  # tokens (EN)
id        Indonesian    70.5M    2.2B            1.9B
km        Khmer          5.8M    0.6B            0.1B
lo        Lao            4.2M    0.4B            0.1B
ms        Malay         56.8M    1.2B            1.0B
my        Burmese       10.0M    1.2B            0.2B
ta        Tamil         42.5M    3.7B            0.7B
th        Thai          29.0M    1.6B            0.6B
tl        Tagalog       63.6M    1.4B            1.2B
vi        Vietnamese    50.1M    2.4B            1.3B
zh        Chinese       71.3M    2.5B            1.7B
Total                  403.8M   17.2B            8.8B

Table 1: Number of sentence pairs (# sent.) and tokens (measured using the OLMo 2 tokenizer) of our parallel data.

For the Southeast Asian multilingual data, we use SEA-PILE-v2 (Ng et al., 2025), a large-scale multilingual corpus covering Southeast Asian languages. Since Chinese is not included in SEA-PILE-v2, we source Chinese monolingual data from the MAP-CC corpus (Du et al., 2024), which consists of high-quality text from diverse sources, including books, encyclopedias, academic papers, and web articles.

For the replay data, we sample from OLMo 2's second-stage pretraining data, which consist of a diverse mixture of high-quality English corpora. They include filtered web data, instruction-tuned data, technical question-answer data, academic papers, encyclopedic text, and mathematics-focused data.

When sampling the data, we aim to sample each data source uniformly by equalizing the number of tokens drawn from each source. The total number of available tokens for each data source is provided in Appendix A.
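One plausible way to implement this equalization is to split the token budget evenly across sources and let undersized sources contribute everything they have, redistributing the leftover budget among the rest. This is a sketch of the stated goal, not the authors' sampler:

```python
def tokens_per_source(budget: int, available: dict[str, int]) -> dict[str, int]:
    """Split a token budget as evenly as possible across data sources.

    Sources smaller than the even share are exhausted first, and the
    remaining budget is redistributed among the larger sources.
    A simplified sketch of "equalizing the number of tokens drawn
    from each source"; the real pipeline may differ.
    """
    alloc = {}
    remaining, pool = budget, dict(available)
    while pool:
        share = remaining // len(pool)
        small = {s: n for s, n in pool.items() if n <= share}
        if not small:
            for s in pool:               # every remaining source can fill an even share
                alloc[s] = share
            break
        for s, n in small.items():       # exhaust undersized sources
            alloc[s] = n
            remaining -= n
            del pool[s]
    return alloc
```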
4.3 Evaluation

We evaluate the models on translation and multilingual commonsense reasoning tasks. Translation quality is assessed using the BLEU metric (Papineni et al., 2002), computed with SacreBLEU (Post, 2018). Translation performance is evaluated in both directions (from Southeast Asian languages to English and from English to Southeast Asian languages) using test sets from FLORES-200 (Costa-jussà et al., 2022). For all translation evaluations, we employ a fixed 5-shot prompting setup across all models reported in this study.

In addition to translation, we evaluate multilingual commonsense reasoning using XNLI (Conneau et al., 2018), XCOPA (Ponti et al., 2020), and PAWS-X (Yang et al., 2019). All experiments are conducted using the lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness). For XNLI, we report results for English, Thai, Vietnamese, and Chinese. XCOPA is evaluated on English, Indonesian, Tamil, Thai, Vietnamese, and Chinese, covering a diverse set of typologically distinct languages. For PAWS-X, we evaluate performance on English and Chinese. This multilingual evaluation setup enables a consistent assessment of commonsense reasoning across languages and tasks. In all experiments, we assess statistical significance using paired bootstrap resampling (Koehn, 2004) with 1,000 samples.
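The significance test can be reproduced in a few lines on top of SacreBLEU. The sketch below draws 1,000 bootstrap resamples of the test set and compares corpus BLEU for two systems on each resample; recent SacreBLEU releases also ship their own paired-test utilities, so treat this as an illustration of the procedure rather than the paper's exact evaluation code.

```python
import random

import sacrebleu  # pip install sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004) over corpus BLEU.

    sys_a, sys_b: lists of hypothesis strings from two systems.
    refs: list of reference strings, parallel to the hypotheses.
    Returns the fraction of resampled test sets on which system B
    scores at least as high as system A (e.g., > 0.99 corresponds
    to significance at p < 0.01 for "B beats A").
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        hyp_a = [sys_a[i] for i in idx]
        hyp_b = [sys_b[i] for i in idx]
        ref_stream = [[refs[i] for i in idx]]       # SacreBLEU expects a list of reference streams
        score_a = sacrebleu.corpus_bleu(hyp_a, ref_stream).score
        score_b = sacrebleu.corpus_bleu(hyp_b, ref_stream).score
        if score_b >= score_a:
            wins += 1
    return wins / n_samples
```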
5 Results

Our controlled experiments on the effect of parallel data show that the PARALLEL ONLY strategy is the most effective for extending LLMs to new languages. It significantly outperforms (p < 0.01) the other settings in both translation directions, except when compared with PARALLEL LAST in the XX → EN direction, where no statistically significant difference is observed (Table 2). In addition, PARALLEL ONLY significantly outperforms PARALLEL LAST on XNLI. As no other configuration significantly outperforms PARALLEL ONLY, this strategy is superior overall.

Model           Param  EN → XX  XX → EN  XNLI   Avg
Multilingual    1B     12.51    14.50    43.28  23.43
Mixed           1B     23.69    25.63    43.28  30.87
Parallel First  1B     19.00    20.85    44.81  28.22
Parallel Last   1B     22.69    26.55    42.84  30.69
Parallel Only   1B     24.12    26.24    44.09  31.48

Table 2: Comparison of translation (BLEU scores) and commonsense reasoning performance under different 1B training settings. The highest score is bolded, while the second-highest score is underlined.

Given the strong performance of PARALLEL ONLY at the 1B scale, we next examine whether this advantage persists as we scale the model size and training data. Table 3 reports results for the larger 7B model, trained on 34.7B tokens. Consistent with the 1B-scale findings, PARALLEL ONLY (OPENSEAL) achieves substantial gains in translation quality, attaining the highest BLEU scores in both EN → XX and XX → EN directions. Improvements of PARALLEL ONLY (OPENSEAL) over other Southeast Asian LLMs are statistically significant (p < 0.01), except when compared with Sailor 2 in the EN → XX direction, where no significant difference is observed.

Beyond translation, PARALLEL ONLY (OPENSEAL) remains competitive on XNLI, XCOPA, and PAWS-X. It achieves the best score on PAWS-X among all compared models, significantly outperforming SeaLLM v3 and MULTILINGUAL and showing no significant differences on XNLI and XCOPA, except when compared with Sailor 2 on XCOPA. Overall, across all evaluated tasks and models, PARALLEL ONLY (OPENSEAL) is significantly outperformed only by Sailor 2 on XCOPA, while it significantly outperforms all models, including Sailor 2, on the translation task. This result is notable because other Southeast Asian LLMs are built from strong multilingual LLMs and continually trained on substantially larger corpora (400B tokens for Sailor 2 and 200B tokens for SEA-LION v3.5), whereas OPENSEAL is built by continual pretraining of an English-only LLM on just 34.7B tokens. These findings highlight the effectiveness and efficiency of parallel data for CPT.

Model                     Param  EN → XX  XX → EN  XNLI   XCOPA  PAWS-X
Multilingual              7B     24.17    30.15    45.73  70.30  60.08
Parallel Only (OPENSEAL)  7B     31.12    38.04    45.40  70.20  64.70
SeaLLM v3                 7B     17.85    25.71    43.28  70.53  60.18
Sailor2                   8B     31.09    36.37    44.59  74.50  64.53
SEA-LION v3.5             8B     25.37    32.03    46.11  70.37  63.58

Table 3: Comparison of translation and commonsense reasoning performance across 7B-8B multilingual models. The highest score is bolded, while the second-highest score is underlined.
6 Analysis

6.1 Impact of Parallel Data

To isolate the contribution of parallel SEA sentences, we introduce an ablated variant of PARALLEL ONLY, denoted as MULTILINGUAL REPLACEMENT, in which the SEA sentences in the parallel data blocks are replaced with multilingual SEA text from SEA-PILE-v2. All other components are kept identical: the replay data, the English-side data, and the adjacency-based training format remain unchanged. This design allows us to assess whether the advantages of PARALLEL ONLY stem from bilingual alignment enabled by genuine parallel sentences.

Table 4 shows that this substitution results in significant (p < 0.05) degradation in translation performance and on PAWS-X. Specifically, the EN → XX translation score drops from 31.12 to 22.03, the XX → EN score from 38.04 to 23.19, and the PAWS-X score from 64.70 to 62.45. In contrast, no significant differences are observed on XNLI or XCOPA.

Model                     Param  EN → XX  XX → EN  XNLI   XCOPA  PAWS-X
Parallel Only             7B     31.12    38.04    45.40  70.20  64.70
Multilingual Replacement  7B     22.03    23.19    46.13  69.83  62.45

Table 4: Comparison of translation and commonsense reasoning performance between PARALLEL ONLY and MULTILINGUAL REPLACEMENT. The highest score is bolded.

Overall, these findings directly address our primary research question, demonstrating that the benefits of PARALLEL ONLY arise from deliberate bilingual alignment provided by authentic parallel sentences. The results further indicate that performing CPT using only parallel data does not cause the model to overfit to translation, as it also yields improved performance on commonsense reasoning tasks.
6.2 Logit Lens

Recent work by Wendler et al. (2024) examined latent language probabilities through the logit lens and demonstrated that multilingual LLMs exhibit a distinctive depth-wise progression at the final training checkpoint: early layers encode minimal language-specific signals, intermediate layers prioritize English representations, and deeper layers shift toward the target language. However, their analysis is limited to a single fully trained model, leaving open questions about how language-specific representations develop over the course of training and how different continual pretraining strategies influence this process.
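The logit lens itself is straightforward to reproduce: decode each layer's hidden state through the model's final norm and unembedding matrix, then sum the probability mass over tokens attributed to each language. The sketch below follows the Hugging Face OLMo 2 implementation; the attribute paths (model.model.norm, model.lm_head) and the caller-supplied lang_token_ids sets are assumptions, not code from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Attribute paths follow the Hugging Face OLMo 2 implementation
# (model.model.norm, model.lm_head); adjust for other architectures.
MODEL = "allenai/OLMo-2-1124-7B"
model = AutoModelForCausalLM.from_pretrained(MODEL)
tok = AutoTokenizer.from_pretrained(MODEL)

def language_probs(prompt: str, lang_token_ids: dict[str, list[int]]):
    """Logit-lens language probabilities at the next-token position.

    `lang_token_ids` maps a language name to the token ids counted as
    belonging to that language (a caller-supplied assumption; Wendler
    et al. (2024) describe how such sets can be derived).
    """
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    per_layer = {}
    for layer, hidden in enumerate(out.hidden_states):
        # Decode this layer's last hidden state through the final
        # norm and the unembedding matrix (the logit lens).
        last = model.model.norm(hidden[:, -1, :])
        probs = torch.softmax(model.lm_head(last), dim=-1).squeeze(0)
        per_layer[layer] = {
            lang: probs[ids].sum().item()
            for lang, ids in lang_token_ids.items()
        }
    return per_layer
```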
Figures 2 and 3 extend this analysis by tracking language probabilities across training checkpoints for the Chinese Cloze task, comparing the PARALLEL ONLY (7B) and MULTILINGUAL (7B) models. Because layers below 20 do not exhibit meaningful probabilities for either Chinese or English throughout training, we focus on layers 20 and above, where non-trivial language signals first emerge. Both models reveal a consistent depth-dependent pattern: layer 20 shows weak and unstable English-leaning behavior; intermediate layers (e.g., layer 24) display a pronounced English-dominant phase across training checkpoints; and deeper layers increasingly transition toward Chinese dominance.

Despite these shared patterns, the temporal dynamics of the transition differ between the two models, particularly around layer 30. Figure 2 shows that the PARALLEL ONLY model undergoes a more gradual shift, with Chinese probability surpassing English only in the later stages of training. In contrast, Figure 3 shows that the MULTILINGUAL model exhibits an earlier and more abrupt transition, characterized by a sharp increase in Chinese probabilities and a corresponding decrease in English probability across checkpoints. English probabilities in the MULTILINGUAL model are also generally lower than those in the PARALLEL ONLY model, especially in layers 30 and 31.

[Figure 2: Language probabilities of the PARALLEL ONLY (7B) model on the Chinese Cloze task across layers and training checkpoints; panels (a)-(e) show layers 20, 24, 30, 31, and 32.]

[Figure 3: Language probabilities of the MULTILINGUAL (7B) model on the Chinese Cloze task across layers and training checkpoints; panels (a)-(e) show layers 20, 24, 30, 31, and 32.]

Together, these observations suggest that the PARALLEL ONLY model "thinks" more in English than the MULTILINGUAL model, a behavior also observed in strong multilingual models such as Llama 2 (Wendler et al., 2024). This may indicate stronger cross-lingual transfer in the PARALLEL ONLY model, which could help explain its superior performance across tasks.
7 Conclusion

Our study investigates data choices for extending an English-only LLM, OLMo 2, to new languages and demonstrates that, under a fixed token budget, training with only parallel data yields the strongest performance. We further validate this finding on a 7B model continually pretrained on 34.7B tokens, where we compare parallel-only continual pretraining with the more conventional multilingual data approach and observe consistent trends. Through ablation experiments and logit lens analyses, we provide evidence that the sentence-level alignment inherent in parallel data plays a key role in its effectiveness, likely by facilitating stronger cross-lingual transfer.

Building on these insights, we introduce OPENSEAL, the first fully open Southeast Asian LLM. Our model is fully transparent, enabling careful inspection for security-sensitive applications and promoting AI sovereignty. By releasing models that are entirely open, we aim to support future in-depth research on multilinguality, bias, and other related properties of LLMs.

Despite being trained with a relatively modest computational budget, OPENSEAL matches or, in some cases, surpasses existing Southeast Asian LLMs that are derived from high-performing multilingual foundations and trained on substantially larger multilingual corpora. Overall, our results suggest that continually pretraining an English LLM with parallel data offers a good, fast, and cheap pathway to building high-quality, region-focused language models.

Limitations

This work focuses less on building the highest-performing LLM and more on empirically investigating and analyzing the effectiveness of parallel data for adapting LLMs to new languages, using Southeast Asian languages as a case study. Our study focuses on the continual pretraining stage and does not include instruction tuning, which could otherwise confound the results. We leave the investigation of instruction tuning to future work. Due to computational resource limitations, our experiments are conducted only at the 1B and 7B parameter scales.
We believe that our work poses no immediate risk to society or to any individuals or organizations; however, we recommend caution when using our models, as they have not undergone any safety or value alignment.

Acknowledgments

This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG3-RP-2022-030). We would like to acknowledge that the computational work involved in this research is partially supported by NUS IT's Research Computing group with grant number NUSREC-HPC-00001. We would also like to thank Raymond Ng, Peerat Limkonchotiwat, Jian Gang Ngui, and William Tjhi for helpful discussions on SEA-LION.

References

Eleftheria Briakou, Colin Cherry, and George Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in PaLM's translation capability. In Proceedings of ACL, pages 9432-9452.

Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Saksham Singhal, Xian-Ling Mao, Heyan Huang, Xia Song, and Furu Wei. 2021. mT6: Multilingual pretrained text-to-text transformer with translation pairs. In Proceedings of EMNLP, pages 1671-1683.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and et al. 2023. PaLM: Scaling language modeling with pathways. JMLR, 24(1).

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proceedings of NeurIPS.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP, pages 2475-2485.
Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, and 19 others. 2022. No language left behind: Scaling human-centered machine translation. Preprint, arXiv:2207.04672.

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Xin Mao, Ziqi Jin, Wei Lu, and Min Lin. 2024. Sailor: Open language models for South-East Asia. In Proceedings of EMNLP, pages 424-435.

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, and Ge Zhang. 2024. Chinese Tiny LLM: Pretraining a Chinese-centric large language model. Preprint, arXiv:2404.04167.

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. 2024. Continual pre-training for cross-lingual LLM adaptation: Enhancing Japanese language capabilities. In Proceedings of COLM.

Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, and 83 others. 2025. Apertus: Democratizing open and compliant LLMs for global language environments. Preprint, arXiv:2509.14233.
Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, and 5 others. 2024. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In Proceedings of COLM.

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. 2024. Simple and scalable strategies to continually pre-train large language models. Preprint, arXiv:2403.08763.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388-395.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. TACL, 8:726-742.

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, and 42 others. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of EMNLP, pages 5155-5203.
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, and 12 others. 2025. SEA-LION: Southeast Asian languages in one network. Preprint, arXiv:2504.05747.

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2024. SeaLLMs - large language models for Southeast Asia. In Proceedings of ACL, pages 294-304.

Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of EMNLP, pages 27-38.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of EMNLP, pages 2362-2376.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of WMT, pages 186-191.

Muhammad Reza Qorib, Junyi Li, and Hwee Tou Ng. 2025. Just go parallel: Improving the multilingual capabilities of large language models. In Proceedings of ACL, pages 33411-33424.

Leonardo Ranaldi, Giulia Pucci, and Andre Freitas. 2024. Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations. In Findings of ACL, pages 7961-7973.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, and 30 others. 2022. BLOOM: A 176B-parameter open-access multilingual language model. Preprint, arXiv:2211.05100.

Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, and 13 others. 2023. Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. Preprint, arXiv:2308.16149.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, and 23 others. 2025. 2 OLMo 2 Furious. In Proceedings of COLM.
Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do Llamas work in English? On the latent language of multilingual transformers. In Proceedings of ACL, pages 15366-15394.

D. H. Wolpert and W. G. Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82.

Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. 2024. LLaMA Pro: Progressive LLaMA with block expansion. In Proceedings of ACL, pages 6518-6537.

Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2024. A paradigm shift in machine translation: Boosting translation performance of large language models. In Proceedings of ICLR.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP-IJCNLP, pages 3687-3692.

Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. 2025a. SeaLLMs 3: Open foundation and chat multilingual large language models for Southeast Asian languages. In Proceedings of ACL, pages 96-105.

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito. 2025b. Persistent pre-training poisoning of LLMs. In Proceedings of ICLR.

Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, and Ming Zhou. 2024. Breaking language barriers: Cross-lingual continual pre-training at scale. In Proceedings of EMNLP, pages 7725-7738.
A Resources

In this section, we report the details of the data sources used in our experiments. We report the statistics of our SEA multilingual data (Table 5), Chinese multilingual data (Table 6), and replay data (Table 7).

ISO Code  # tokens
id        54.6B
km         1.7B
lo         2.1B
ms        11.8B
my         0.8B
ta        46.3B
th        16.5B
tl         2.5B
vi        90.0B

Table 5: Number of tokens of SEA-PILE-v2, measured using the OLMo 2 tokenizer.

Data source              # tokens
Internet (Common Crawl)  677.6B
Books                     56.8B
Academic papers           33.6B
Others (QA, etc.)         29.6B
Encyclopaedia              2.4B
Total                    800.0B

Table 6: Number of tokens of the MAP-CC dataset, measured using the original tokenizer from the paper.

Corpus name          # tokens
Filtered DCLM        752.0B
Decontaminated FLAN   17.0B
StackExchange Q&A      1.3B
peS2o                 58.6B
Wikipedia/Wikibooks    3.7B
Dolmino Math          10.7B
Total                843.3B

Table 7: Number of tokens of replay data sampled from OLMo 2's second-stage pretraining, measured using the OLMo 2 tokenizer.

B Results

In this section, we report the detailed experimental results for each language. We report the models' performance on translating texts from Southeast Asian languages into English (Table 8) and English into Southeast Asian languages (Table 9). We report the detailed performance on commonsense reasoning tasks, which include XNLI (Table 10), XCOPA (Table 11), and PAWS-X (Table 12).

C Training Details

We report the architectural configurations of the 1B and 7B models in Table 13 and summarize the corresponding training hyperparameters in Table 14. All experiments were conducted using 8× NVIDIA H200 GPUs. Training the 1B model on 10B tokens required approximately 11 hours, while training the 7B model on 34.7B tokens took approximately 180 hours.
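As a sanity check, these timings are consistent with the sub-US$12,000 figure quoted in the introduction: dividing the budget by the 7B training time bounds the implied 8× H200 node-hour rate. This is back-of-the-envelope arithmetic on the paper's own numbers, not a quoted cloud price.

```latex
\frac{\text{US}\$12{,}000}{180~\text{hours}} \approx \text{US}\$66.7 \text{ per } 8{\times}\text{H200 node-hour}
```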
Model           Size  id     km     lo     ms     my     ta     th     tl     vi     zh     Avg
Mixed           1B    42.81  21.14  23.87  42.33   4.52  12.98  24.01  32.50  29.13  23.03  25.63
Parallel First  1B    35.31  13.60  18.02  35.26   3.52  11.36  20.50  32.22  23.01  15.71  20.85
Parallel Last   1B    41.00  21.46  23.76  42.02   7.29  17.71  24.35  38.31  27.99  21.59  26.55
Multilingual    1B    29.61   3.07   1.17  30.10   0.30   4.07  12.98  30.53  18.24  14.96  14.50
Parallel Only   1B    42.98  23.10   6.13  42.61   5.85  11.88  27.35  43.86  34.71  23.97  26.24
Multilingual    7B    41.56  26.70  30.67  41.75  10.86  21.83  27.84  44.07  34.42  21.76  30.15
Parallel Only   7B    49.48  32.92  34.94  49.06  22.44  33.45  35.07  51.93  41.00  30.06  38.04
SeaLLM v3       7B    41.53  16.31   8.84  40.68   7.59  12.87  28.97  36.47  34.99  28.80  25.71
Sailor2         8B    46.26  32.70  37.22  46.10  24.70  28.57  32.84  48.16  39.38  27.73  36.37
SEA-LION v3.5   8B    42.78  26.77  26.70  41.72  20.91  26.54  30.13  42.98  35.48  26.32  32.03

Table 8: BLEU scores for translation into English.

Model           Size  id     km     lo     ms     my     ta     th     tl     vi     zh     Avg
Mixed           1B    44.99   8.61   7.75  39.83   2.18   6.21  19.65  33.45  37.93  36.32  23.69
Parallel First  1B    37.44   4.58   7.83  32.97   2.26   2.51  14.60  29.54  30.04  28.19  19.00
Parallel Last   1B    45.73   2.07   3.58  40.57   1.78   6.63  20.05  33.71  38.69  36.77  22.96
Multilingual    1B    28.13   0.19   0.65  25.92   0.61   0.30   6.11  23.17  19.89  20.10  12.51
Parallel Only   1B    45.69   3.19   6.65  40.99   2.33   8.81  21.22  34.31  39.50  38.51  24.12
Multilingual    7B    41.40  13.87  14.61  37.66   5.28   7.73  20.62  34.30  35.30  30.90  24.17
Parallel Only   7B    49.50  17.99  16.47  44.50   9.25  17.58  28.66  37.92  43.44  45.86  31.12
SeaLLM v3       7B    35.27   3.16   0.91  25.81   1.96   1.22  15.92  15.73  35.22  43.27  17.85
Sailor2         8B    47.82  22.01  22.82  41.36  15.77  11.67  27.31  37.15  43.35  41.68  31.09
SEA-LION v3.5   8B    42.03  12.51  12.02  36.50   8.09  12.32  21.01  31.90  39.09  38.18  25.37

Table 9: BLEU scores for translation from English.
Model           Size  en     th     vi     zh     Avg
Mixed           1B    51.54  41.72  44.37  35.47  43.28
Parallel First  1B    52.79  43.31  44.15  39.00  44.81
Parallel Last   1B    51.00  42.91  44.15  33.29  42.84
Multilingual    1B    52.77  39.42  41.86  39.08  43.28
Parallel Only   1B    52.97  44.49  44.23  34.67  44.09
Multilingual    7B    52.61  45.45  47.76  37.11  45.73
Parallel Only   7B    52.10  47.50  47.33  34.65  45.40
SeaLLM v3       7B    53.77  40.32  42.87  36.17  43.28
Sailor2         8B    54.65  44.45  45.55  33.69  44.59
SEA-LION v3.5   8B    52.79  47.05  46.23  38.36  46.11

Table 10: XNLI scores for each language.

Model           Size  en     id     ta     th     vi     zh     Avg
Mixed           1B    77.80  63.80  53.00  56.60  63.60  59.20  62.33
Parallel First  1B    79.20  64.20  54.80  54.20  63.00  58.60  62.33
Parallel Last   1B    77.40  61.40  54.80  56.00  65.00  59.60  62.37
Multilingual    1B    76.60  63.40  55.40  57.20  65.40  58.40  62.73
Parallel Only   1B    77.40  61.20  54.60  55.80  61.60  59.40  61.67
Multilingual    7B    86.00  73.60  61.60  61.60  72.80  66.20  70.30
Parallel Only   7B    85.00  74.00  59.60  60.00  73.80  68.80  70.20
SeaLLM v3       7B    87.80  71.00  53.80  61.20  72.40  77.00  70.53
Sailor2         8B    87.40  79.40  62.60  65.00  78.20  74.40  74.50
SEA-LION v3.5   8B    87.80  72.80  59.60  59.80  73.00  69.20  70.37

Table 11: XCOPA scores for each language.

Model           Size  en     zh     Avg
Mixed           1B    58.80  55.80  57.30
Parallel First  1B    56.90  52.85  54.88
Parallel Last   1B    59.35  52.00  55.68
Multilingual    1B    58.50  47.00  52.75
Parallel Only   1B    61.15  53.30  57.23
Multilingual    7B    66.45  53.70  60.08
Parallel Only   7B    70.10  59.30  64.70
SeaLLM v3       7B    66.85  53.50  60.18
Sailor2         8B    70.90  58.15  64.53
SEA-LION v3.5   8B    66.45  60.70  63.58

Table 12: PAWS-X scores for each language.
Hyperparameter          1B      7B
Number of layers        16      32
Embedding dimension     2048    4096
Intermediate dimension  8192    22016
Attention heads         16      32
Context window          4096    4096
Vocabulary size         100352  100352

Table 13: Architecture of the OLMo 2 1B and 7B models.

Hyperparameter                 1B            7B
Global batch size              512           1024
Micro batch size (per device)  4             2
Learning rate                  2.0 × 10^-4   2.0 × 10^-4
Weight decay                   0.1           0.1
Optimizer                      AdamW         AdamW
(β1, β2)                       (0.9, 0.95)   (0.9, 0.95)
Gradient clip                  1.0           1.0
Max sequence length            4096          4096

Table 14: Training hyperparameters for the 1B and 7B experiments.