OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Summary
This paper introduces OPENSEAL, the first fully open-source Southeast Asian large language model (LLM). The authors investigate the effectiveness of parallel data in continual pretraining (CPT) of LLMs to extend their multilingual capabilities. They find that using only parallel data is the most effective strategy for adding new languages, outperforming approaches that mix parallel and monolingual data. OPENSEAL is built by continually training the open-source OLMo 2 model on 34.7B tokens of parallel data across ten Southeast Asian languages and English. The model achieves strong performance on translation and commonsense reasoning benchmarks, rivaling or surpassing existing Southeast Asian LLMs that use much larger multilingual datasets. Key advantages include full transparency (open source code, data, and weights), which supports security, research, and AI sovereignty. The model was trained in 180 hours on 8× NVIDIA H200 GPUs at a cost under $12,000, making it a cost-effective solution. The study highlights the efficiency and effectiveness of parallel data in CPT, offering a practical pathway for developing high-quality, region-focused LLMs.
OPENSEAL: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data

Tan Sang Nguyen, Muhammad Reza Qorib, Hwee Tou Ng
Department of Computer Science, National University of Singapore
e1583535@u.nus.edu, mrqorib@u.nus.edu, dcsnght@nus.edu.sg

Abstract

Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8× NVIDIA H200 GPUs, we built OPENSEAL, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.

1 Introduction

With the democratization of large language model (LLM) development, enabled by open-source efforts from the research community, region-specific LLMs have been proposed to improve performance across particular language groups, especially low-resource languages. Consistent with the no free lunch theorem (Wolpert and Macready, 1997), numerous studies have shown that region-specific LLMs outperform one-size-fits-all models (Sengupta et al., 2023; Nguyen et al., 2024).
The most efficient way to build a region-specific LLM is to continually train a model that has already been pretrained on trillions of tokens, standing on the shoulders of giants. Prior work has shown that this approach is not only more efficient but also more effective, achieving better performance than training from scratch (Zheng et al., 2024). Most existing approaches adapt an LLM by further training it on a mixture of monolingual corpora corresponding to the target languages (Ng et al., 2025).

Recently, increasing effort has been devoted to developing large language models for the Southeast Asian region. These efforts include compiling large-scale training corpora (Ng et al., 2025), building evaluation benchmarks (Lovenia et al., 2024), and training LLMs that support languages spoken in the region (Nguyen et al., 2024). However, none of these models are fully open source. Existing Southeast Asian LLMs are open-weight models: while their model weights are publicly available, their training data (including the training data of their base LLMs) and training code are not fully disclosed.

Without complete transparency regarding their training data, such models risk being poisoned, either deliberately by the model provider or inadvertently through the inclusion of malicious content from the Internet. Zhang et al. (2025b) show that poisoning as little as 0.1% of the training data can induce harmful behavior in an LLM, and that this harm can persist even after additional preference tuning intended to improve safety and helpfulness. This makes non-fully-open-source LLMs risky for security-sensitive applications, such as those in government institutions.
Recent work has demonstrated the effectiveness of parallel data in improving multilingual capabilities in LLMs (Qorib et al., 2025; Fujii et al., 2024). Building on this advancement, we conduct a controlled and comprehensive investigation into the effectiveness of parallel data for extending LLMs to new languages. Based on this investigation, we develop OPENSEAL (Open-source SouthEast Asian LLM), a fully open-source Southeast Asian LLM. OPENSEAL is trained using only 34.7B tokens of parallel data and 180 hours on 8× NVIDIA H200 GPUs. The training cost based on commercial rates charged by cloud providers (for example, AWS p5en.48xlarge) is under US$12,000.

To summarize, our contributions are as follows:

• We propose a good, fast, and cheap way of adapting an LLM to new languages by utilizing parallel data. We show that under the same token budget, utilizing only parallel data is the most efficient. To the best of our knowledge, we are the first to present a comprehensive controlled investigation of the use of parallel data for continual pretraining.

• We built a new Southeast Asian LLM, OPENSEAL, by continually training an open-source English-only LLM (OLMo 2; Walsh et al. 2025) on a relatively small amount of parallel data. On translation and commonsense reasoning benchmarks, OPENSEAL performs at least as well as, and in some cases outperforms, existing Southeast Asian LLMs that were based on high-performing multilingual LLMs and trained on a much larger amount of multilingual data.
• Our model is fully open (open source, open data, and open weight; source code, training data, and model weights will be released upon publication), which allows future research to investigate the multilinguality of LLMs in a more precise and detailed manner. To the best of our knowledge, our model is the first fully open Southeast Asian LLM.

2 Related Work

In this section, we discuss recent work that investigates the effectiveness of parallel data for LLM training (pretraining, continual pretraining, and instruction tuning) and recent efforts in building LLMs for the Southeast Asian region.

2.1 Parallel Data in LLM Training

Parallel data have been commonly used in training encoder-only (Conneau and Lample, 2019; Ouyang et al., 2021) and encoder-decoder (Liu et al., 2020; Chi et al., 2021) LLMs, but early decoder-only LLMs (Scao et al., 2022) did not utilize parallel data and were trained solely on a mixture of monolingual data.

Qorib et al. (2025) conduct a comprehensive investigation into the effects of parallel data in LLM pretraining. They compare pretraining with and without parallel data in a controlled setting by fixing the choice and order of the training data. Their experiments on a 1B-parameter model trained on 167B tokens show that incorporating parallel data improves multilingual capabilities more than monolingual data, especially when parallel data are placed at the end of training. While these findings highlight the value of parallel data in LLM training, the experiments are limited to 1B models trained on a relatively small number of tokens compared to production-scale LLMs.
A recent multilingual LLM, Apertus (Hernández-Cano et al., 2025), likewise places parallel data at the end of training, in the final (fifth) pretraining stage. Although the models have up to 70B parameters, the authors do not investigate the specific effect of parallel data during pretraining.

Fujii et al. (2024) utilize parallel data to build a Japanese LLM by performing continual pretraining (CPT) of Llama 2 (Touvron et al., 2023). They first train Llama 2 on 22 million parallel sentence pairs before further training it on monolingual data. The motivation for this training order is the assumption that parallel sentences can facilitate the transition from English-centric to Japanese-centric training.

In contrast, ALMA (Xu et al., 2024) adapts Llama 2 for machine translation by training the model on parallel data only after it has been trained on monolingual data. The authors argue that monolingual training is necessary to improve the model's proficiency in non-English languages before introducing parallel data.

These two approaches adopt the exact opposite order for utilizing parallel data, and this difference in training order is not inconsequential. We find that such differences can result in models with differing capabilities. Due to the lack of comprehensive empirical studies on the effects of parallel data in CPT, it remains unclear how parallel data influence continual pretraining and how they can be maximally leveraged to extend LLMs to new languages.
Ranaldi et al. (2024) utilize parallel data as an additional signal during instruction tuning. Specifically, they employ a machine translation model to translate instruction-following datasets into multiple languages and incorporate the resulting translation pairs during instruction tuning. They find that this approach promotes better cross-lingual transfer. While instruction tuning is outside the scope of our current work, this method is orthogonal to ours and can be applied sequentially.

Our work extends the investigation of parallel data by Qorib et al. (2025) to continual pretraining of a base LLM (OLMo 2) that is pretrained on more than four trillion tokens. In addition to studying CPT on a much larger scale, we scale the model size to 7B parameters. Although parallel data have been used previously for CPT, to the best of our knowledge, no comprehensive empirical study has systematically evaluated the role of parallel data in continual pretraining. Our work aims to fill this gap and demonstrates that parallel data are highly effective for extending LLMs to new languages.

2.2 Southeast Asian LLMs

Notable LLMs developed for the Southeast Asian region include SEA-LION v3.5 (Ng et al., 2025), Sailor 2 (Dou et al., 2024), and SeaLLM 3 (Zhang et al., 2025a). All of them were built on multilingual open-weight LLMs by performing continual pretraining (CPT) on Southeast Asian data, making the models more attuned to native texts from the region. Adapting high-performing multilingual models allows them to retain strong capabilities in English and Chinese, while also providing a significant head start on high-resource Southeast Asian languages such as Indonesian and Thai. In this paper, we refer to languages using their ISO 639 codes, as listed in Table 1.
SEA-LION v3.5 has two versions: one built by performing CPT on Llama 3.1 8B and the other on Gemma 2 9B (at the time of writing, only the Llama-based version is publicly available). CPT was applied to the instruction-tuned versions of the models using 200B tokens of data, comprising 50B tokens of English, 110B tokens from 10 Southeast Asian languages (id, km, lo, ms, my, ta, th, tl, vi, zh), and 40B tokens of code. The models then underwent instruction tuning and preference tuning on both English and Southeast Asian data. Between training stages, a series of model-merging steps was employed to mitigate catastrophic forgetting. On eight nodes equipped with 8× NVIDIA H200 GPUs per node, the total training time was six days for the Llama-based version and ten days for the Gemma-based version.

A newer version of SEA-LION (v4) has been released on HuggingFace (https://huggingface.co/collections/aisingapore/sea-lion-v4), built by performing CPT on Gemma 3 and Qwen 3. The Gemma version has 27B parameters, while the Qwen version has 32B parameters. The Gemma model was continually pretrained on approximately 500B tokens from 11 languages (en, id, km, lo, ms, my, ta, th, tl, vi, zh), whereas the Qwen model was continually pretrained on approximately 100B tokens from 7 Southeast Asian languages (id, ms, my, ta, th, tl, vi). SEA-LION v4 also includes vision-language model variants adapted from Qwen 4B, Qwen 8B, and Gemma 27B. At the time of writing, no technical report or academic paper on SEA-LION v4 has been released.
Sailor 2 was built by expanding Qwen 2.5 using an approach similar to LlamaPro (Wu et al., 2024), increasing the parameter counts from 0.5B to 1B, from 7B to 8B, and from 14B to 20B. These expansions were designed to mitigate potential catastrophic forgetting on English and Chinese tasks. The models were trained on 500B tokens of data, comprising 400B tokens from 13 Southeast Asian languages (ceb, id, ilo, jv, km, lo, ms, my, su, th, tl, vi, war) and 100B tokens of English text. CPT was performed in two stages. In the first stage, the model was trained on 450B tokens from a comprehensive dataset covering 8 Southeast Asian languages and English. In the second stage, it was trained on high-quality data covering 13 Southeast Asian languages along with English instruction-tuning data. The model then underwent two stages of instruction tuning and two stages of preference tuning.

SeaLLM 3 was built by performing CPT on Qwen 2 using data from 12 Southeast Asian languages (en, id, jv, km, lo, ms, my, ta, th, tl, vi, zh). Unlike the previously discussed Southeast Asian LLMs, SeaLLM updates only a small subset of model parameters identified as language-specific neurons. In addition, CPT was performed for one language at a time, resulting in multiple monolingual model variants, each specialized in a single Southeast Asian language. Subsequently, the monolingual models were merged into a single multilingual model. This training approach requires less data (the exact amount of training data used for SeaLLM is not disclosed).
3 Methods

To evaluate the effectiveness of parallel data, we vary the inclusion and placement of parallel data and try to maintain the choice and order of the other training data, while ensuring each data type is still uniformly distributed. To this end, we define each data source as small blocks of 262,144 tokens, corresponding to 64 sequences of 4,096 tokens (the maximum context window size); see Figure 1. Note that the order of blocks within a batch (eight blocks for the 1B model or sixteen blocks for the 7B model) is randomized.

[Figure 1: Illustration of the training data sequences of our parallel data investigation.]

Monolingual data blocks are constructed from a collection of texts in a single language, whereas parallel data blocks are formed by concatenating an English sentence and its Southeast Asian translation using the format "source language: source sentence\ntarget language: target sentence<|endoftext|>". This format mimics incidental parallel data found in PaLM's (Chowdhery et al., 2023) training set, as reported by Briakou et al. (2023). To ensure exposure to both translation directions, we randomize the order of the source and target languages within each block (e.g., for en-id blocks, some segments begin with English, while others begin with Indonesian).

Replay is crucial for mitigating catastrophic forgetting: even a small replay ratio (e.g., 5%) yields significant advantages over using no replay. The replay proportion represents a trade-off between preserving proficiency in previously acquired domains or languages and adapting to new ones. Ibrahim et al. (2024) empirically analyze replay ratios and recommend a 25% replay rate. Following this recommendation, we ensure that one out of every four blocks in our training data is replay data across all experimental settings.

We define five experimental settings: MULTILINGUAL, MIXED, PARALLEL FIRST, PARALLEL LAST, and PARALLEL ONLY. In the MIXED, PARALLEL FIRST, and PARALLEL LAST settings, we maintain an equal ratio of monolingual and parallel data blocks.
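To make the block construction concrete, the following Python sketch assembles a parallel data block in the format described above. It is an illustration only, not the released OPENSEAL pipeline: the helper names (format_pair, make_parallel_block), the language labels, and the count_tokens callable (a stand-in for the OLMo 2 tokenizer) are all assumptions.

```python
import random

BLOCK_TOKENS = 262_144   # one data block = 64 sequences of 4,096 tokens
EOS = "<|endoftext|>"    # end-of-text marker separating segments

def format_pair(en: str, sea: str, sea_lang: str) -> str:
    """Format one sentence pair in the incidental-translation style.

    The direction is randomized per segment so that a single block
    exposes the model to both EN->SEA and SEA->EN.
    """
    if random.random() < 0.5:
        return f"English: {en}\n{sea_lang}: {sea}{EOS}"
    return f"{sea_lang}: {sea}\nEnglish: {en}{EOS}"

def make_parallel_block(pairs, sea_lang, count_tokens):
    """Concatenate formatted pairs until the block budget is reached.

    `pairs` yields (english, sea) sentence tuples; `count_tokens` is
    any callable returning a token count (hypothetical stand-in for
    the OLMo 2 tokenizer).
    """
    segments, total = [], 0
    for en, sea in pairs:
        segment = format_pair(en, sea, sea_lang)
        total += count_tokens(segment)
        segments.append(segment)
        if total >= BLOCK_TOKENS:
            break
    return "".join(segments)
```

A training stream for, say, the PARALLEL ONLY setting would then interleave three such parallel blocks with one replay block, matching the 25% replay rate above.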
3.1 Multilingual

The MULTILINGUAL setting reflects the standard approach to extending an LLM to new languages, in which the model is continually pretrained solely on a mixture of monolingual corpora. We sample data blocks uniformly across all languages to prevent dominance by any single language. Continual pretraining on a mixture of monolingual corpora is similar to the approach used by SEA-LION and other Southeast Asian LLMs.

3.2 Mixed

In the MIXED setting, monolingual and parallel data blocks are uniformly interleaved. To ensure that each batch contains both data types, we enforce the presence of at least one monolingual block and one parallel block between two replay blocks. Mixing monolingual data and parallel data is also employed by Apertus in its fifth pretraining stage.

3.3 Parallel First

The PARALLEL FIRST setting prioritizes parallel data in the early stages of continual pretraining. Once the parallel blocks are exhausted, training proceeds with multilingual data. This setting is motivated by the hypothesis that early exposure to parallel data facilitates the learning of cross-lingual alignments, which can subsequently be reinforced and generalized through multilingual training. Placing parallel data before monolingual data is also the CPT strategy adopted by Fujii et al. (2024).

3.4 Parallel Last

This setting begins with monolingual blocks and switches to parallel data after the monolingual blocks are exhausted. It is motivated by the hypothesis that extensive monolingual training improves general fluency in new languages before introducing aligned data for enhanced cross-lingual transfer. Placing monolingual data before parallel data is the CPT strategy used by ALMA.
3.5 Parallel Only

In the PARALLEL ONLY setting, continual pretraining relies exclusively on parallel data and replay data, entirely omitting monolingual data. This setting tests whether parallel data alone are sufficient for improving multilingual capabilities and cross-lingual transfer, and potentially superior to monolingual data; it also tests whether parallel data should be used exclusively for CPT when available.
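The five settings differ only in how monolingual, parallel, and replay blocks are ordered. The sketch below enumerates one plausible schedule per setting under the constraints stated in Section 3 (25% replay rate, equal monolingual/parallel ratio where both are present); it is a simplified reading of Figure 1, not the authors' data loader, and it omits the within-batch shuffling described above.

```python
def block_sequence(setting: str, n_blocks: int) -> list[str]:
    """Return a block-type schedule ('replay', 'mono', 'para').

    Every fourth block is replay (25% replay rate), and the remaining
    budget is split between monolingual and parallel blocks according
    to the setting. MIXED alternates the two types so that both occur
    between consecutive replay blocks, as Section 3.2 requires.
    """
    content = n_blocks * 3 // 4          # non-replay blocks
    half = content // 2
    if setting == "multilingual":
        stream = ["mono"] * content
    elif setting == "parallel_only":
        stream = ["para"] * content
    elif setting == "parallel_first":
        stream = ["para"] * half + ["mono"] * (content - half)
    elif setting == "parallel_last":
        stream = ["mono"] * half + ["para"] * (content - half)
    elif setting == "mixed":
        stream = ["para", "mono"] * half + ["para"] * (content - 2 * half)
    else:
        raise ValueError(f"unknown setting: {setting}")
    schedule = []
    for i, block in enumerate(stream):
        if i % 3 == 0:                   # one replay block per three content blocks
            schedule.append("replay")
        schedule.append(block)
    return schedule
```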
4 Experiments

4.1 Model

We build our models via continual pretraining (CPT) of OLMo 2, a fully open, decoder-only transformer language model. We experiment with the 1B and 7B OLMo 2 variants. The 1B model has 16 transformer layers, while the 7B model has 32 layers. OLMo 2 employs an autoregressive architecture designed for stable large-scale training, incorporating bias-free linear layers, rotary positional embeddings, RMS normalization, and SwiGLU activation. We first train the 1B model on 10B tokens to identify the optimal CPT strategy. We then scale the best configuration to the 7B model by training on 34.7B tokens and compare it against the MULTILINGUAL setting, which is the most commonly used CPT setting in prior work.

We keep the OLMo 2 architecture fixed and focus exclusively on the effects of training data composition and ordering during CPT. Starting from an OLMo 2 checkpoint pretrained on four trillion tokens of general-domain data, we further adapt the model using different mixtures of parallel, multilingual, and replay data discussed in Section 3. Aside from the data mixture, all training hyperparameters and model components are identical across the different settings, enabling a controlled comparison of how parallel and multilingual data interact under CPT. Following prior work, we use the WSD scheduler (Hu et al., 2024). Details of the training hyperparameters are provided in Appendix C.
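For reference, a WSD (warmup-stable-decay) schedule in the spirit of Hu et al. (2024) can be sketched as below: a linear warmup to the peak learning rate, a long constant phase, and a decay at the end. The phase fractions and the linear decay shape here are illustrative assumptions; the paper's actual hyperparameters are deferred to Appendix C.

```python
def wsd_lr(step: int, total: int, peak: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay learning-rate schedule (after Hu et al., 2024).

    Linear warmup to `peak`, a long constant phase, then a linear decay
    over the final `decay_frac` of training. Phase fractions are
    illustrative, not the values used in this paper.
    """
    warmup_steps = int(total * warmup_frac)
    decay_steps = int(total * decay_frac)
    stable_end = total - decay_steps
    if step < warmup_steps:                      # warmup phase
        return peak * step / max(1, warmup_steps)
    if step < stable_end:                        # stable phase
        return peak
    return peak * (total - step) / max(1, decay_steps)  # decay phase
```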
4.2 Data

For the parallel data, we construct a large-scale SEA-English corpus covering multiple Southeast Asian languages, comprising approximately 403.8 million parallel sentence pairs in total (Table 1). The corpus contains 17.2 billion tokens in ten Southeast Asian languages and 8.8 billion tokens in English, as measured using the OLMo 2 tokenizer. Wherever possible, we rely on NLLB (Costa-jussà et al., 2022) as the primary source of parallel data to ensure consistency across languages. For Thai, however, NLLB data are not available; therefore, we aggregate parallel data from multiple publicly available sources (bible-uedin, CCAligned, ELRC_2922, GNOME, HPLT, KDE4, OpenSubtitles, QED, Tanzil, and TED2020).

ISO Code  Language     # sent.   # tokens (SEA)  # tokens (EN)
id        Indonesian    70.5M    2.2B            1.9B
km        Khmer          5.8M    0.6B            0.1B
lo        Lao            4.2M    0.4B            0.1B
ms        Malay         56.8M    1.2B            1.0B
my        Burmese       10.0M    1.2B            0.2B
ta        Tamil         42.5M    3.7B            0.7B
th        Thai          29.0M    1.6B            0.6B
tl        Tagalog       63.6M    1.4B            1.2B
vi        Vietnamese    50.1M    2.4B            1.3B
zh        Chinese       71.3M    2.5B            1.7B
Total                  403.8M   17.2B            8.8B

Table 1: Number of sentence pairs (# sent.) and tokens (measured using the OLMo 2 tokenizer) of our parallel data.

For the Southeast Asian multilingual data, we use SEA-PILE-v2 (Ng et al., 2025), a large-scale multilingual corpus covering Southeast Asian languages. Since Chinese is not included in SEA-PILE-v2, we source Chinese monolingual data from the MAP-CC corpus (Du et al., 2024), which consists of high-quality text from diverse sources, including books, encyclopedias, academic papers, and web articles.

For the replay data, we sample from OLMo 2's second-stage pretraining data, which consist of a diverse mixture of high-quality English corpora. They include filtered web data, instruction-tuned data, technical question-answer data, academic papers, encyclopedic text, and mathematics-focused data.

When sampling the data, we aim to sample each data source uniformly by equalizing the number of tokens drawn from each source. The total number of available tokens for each data source is provided in Appendix A.
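One plausible way to implement this equalization is to split the token budget evenly across sources and let undersized sources contribute everything they have, redistributing the leftover budget among the rest. This is a sketch of the stated goal, not the authors' sampler:

```python
def tokens_per_source(budget: int, available: dict[str, int]) -> dict[str, int]:
    """Split a token budget as evenly as possible across data sources.

    Sources smaller than the even share are exhausted first, and the
    remaining budget is redistributed among the larger sources.
    A simplified sketch of "equalizing the number of tokens drawn
    from each source"; the real pipeline may differ.
    """
    alloc = {}
    remaining, pool = budget, dict(available)
    while pool:
        share = remaining // len(pool)
        small = {s: n for s, n in pool.items() if n <= share}
        if not small:
            for s in pool:               # every remaining source can fill an even share
                alloc[s] = share
            break
        for s, n in small.items():       # exhaust undersized sources
            alloc[s] = n
            remaining -= n
            del pool[s]
    return alloc
```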
4.3 Evaluation

We evaluate the models on translation and multilingual commonsense reasoning tasks. Translation quality is assessed using the BLEU metric (Papineni et al., 2002), computed with SacreBLEU (Post, 2018). Translation performance is evaluated in both directions (from Southeast Asian languages to English and from English to Southeast Asian languages) using test sets from FLORES-200 (Costa-jussà et al., 2022). For all translation evaluations, we employ a fixed 5-shot prompting setup across all models reported in this study.

In addition to translation, we evaluate multilingual commonsense reasoning using XNLI (Conneau et al., 2018), XCOPA (Ponti et al., 2020), and PAWS-X (Yang et al., 2019). All experiments are conducted using the lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness). For XNLI, we report results for English, Thai, Vietnamese, and Chinese. XCOPA is evaluated on English, Indonesian, Tamil, Thai, Vietnamese, and Chinese, covering a diverse set of typologically distinct languages. For PAWS-X, we evaluate performance on English and Chinese. This multilingual evaluation setup enables a consistent assessment of commonsense reasoning across languages and tasks. In all experiments, we assess statistical significance using paired bootstrap resampling (Koehn, 2004) with 1,000 samples.
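The significance test can be reproduced in a few lines on top of SacreBLEU. The sketch below draws 1,000 bootstrap resamples of the test set and compares corpus BLEU for two systems on each resample; recent SacreBLEU releases also ship their own paired-test utilities, so treat this as an illustration of the procedure rather than the paper's exact evaluation code.

```python
import random

import sacrebleu  # pip install sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004) over corpus BLEU.

    sys_a, sys_b: lists of hypothesis strings from two systems.
    refs: list of reference strings, parallel to the hypotheses.
    Returns the fraction of resampled test sets on which system B
    scores at least as high as system A (e.g., > 0.99 corresponds
    to significance at p < 0.01 for "B beats A").
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        hyp_a = [sys_a[i] for i in idx]
        hyp_b = [sys_b[i] for i in idx]
        ref_stream = [[refs[i] for i in idx]]       # SacreBLEU expects a list of reference streams
        score_a = sacrebleu.corpus_bleu(hyp_a, ref_stream).score
        score_b = sacrebleu.corpus_bleu(hyp_b, ref_stream).score
        if score_b >= score_a:
            wins += 1
    return wins / n_samples
```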
5 Results

Our controlled experiments on the effect of parallel data show that the PARALLEL ONLY strategy is the most effective for extending LLMs to new languages. It significantly outperforms (p < 0.01) the other settings in both translation directions, except when compared with PARALLEL LAST in the XX → EN direction, where no statistically significant difference is observed (Table 2). In addition, PARALLEL ONLY significantly outperforms PARALLEL LAST on XNLI. As no other configuration significantly outperforms PARALLEL ONLY, this strategy is superior overall.

Model           Param  EN → XX  XX → EN  XNLI   Avg
Multilingual    1B     12.51    14.50    43.28  23.43
Mixed           1B     23.69    25.63    43.28  30.87
Parallel First  1B     19.00    20.85    44.81  28.22
Parallel Last   1B     22.69    26.55    42.84  30.69
Parallel Only   1B     24.12    26.24    44.09  31.48

Table 2: Comparison of translation (BLEU scores) and commonsense reasoning performance under different 1B training settings. The highest score is bolded, while the second-highest score is underlined.

Given the strong performance of PARALLEL ONLY at the 1B scale, we next examine whether this advantage persists as we scale the model size and training data. Table 3 reports results for the larger 7B model, trained on 34.7B tokens. Consistent with the 1B-scale findings, PARALLEL ONLY (OPENSEAL) achieves substantial gains in translation quality, attaining the highest BLEU scores in both EN → XX and XX → EN directions. Improvements of PARALLEL ONLY (OPENSEAL) over other Southeast Asian LLMs are statistically significant (p < 0.01), except when compared with Sailor 2 in the EN → XX direction, where no significant difference is observed.

Beyond translation, PARALLEL ONLY (OPENSEAL) remains competitive on XNLI, XCOPA, and PAWS-X. It achieves the best score on PAWS-X among all compared models, significantly outperforming SeaLLM v3 and MULTILINGUAL and showing no significant differences on XNLI and XCOPA, except when compared with Sailor 2 on XCOPA. Overall, across all evaluated tasks and models, PARALLEL ONLY (OPENSEAL) is significantly outperformed only by Sailor 2 on XCOPA, while it significantly outperforms all models, including Sailor 2, on the translation task. This result is notable because other Southeast Asian LLMs are built from strong multilingual LLMs and continually trained on substantially larger corpora (400B tokens for Sailor 2 and 200B tokens for SEA-LION v3.5), whereas OPENSEAL is built by continual pretraining of an English-only LLM on just 34.7B tokens. These findings highlight the effectiveness and efficiency of parallel data for CPT.

Model                     Param  EN → XX  XX → EN  XNLI   XCOPA  PAWS-X
Multilingual              7B     24.17    30.15    45.73  70.30  60.08
Parallel Only (OPENSEAL)  7B     31.12    38.04    45.40  70.20  64.70
SeaLLM v3                 7B     17.85    25.71    43.28  70.53  60.18
Sailor2                   8B     31.09    36.37    44.59  74.50  64.53
SEA-LION v3.5             8B     25.37    32.03    46.11  70.37  63.58

Table 3: Comparison of translation and commonsense reasoning performance across 7B-8B multilingual models. The highest score is bolded, while the second-highest score is underlined.
6 Analysis

6.1 Impact of Parallel Data

To isolate the contribution of parallel SEA sentences, we introduce an ablated variant of PARALLEL ONLY, denoted as MULTILINGUAL REPLACEMENT, in which the SEA sentences in the parallel data blocks are replaced with multilingual SEA text from SEA-PILE-v2. All other components are kept identical: the replay data, the English-side data, and the adjacency-based training format remain unchanged. This design allows us to assess whether the advantages of PARALLEL ONLY stem from bilingual alignment enabled by genuine parallel sentences.

Table 4 shows that this substitution results in significant (p < 0.05) degradation in translation performance and on PAWS-X. Specifically, the EN → XX translation score drops from 31.12 to 22.03, the XX → EN score from 38.04 to 23.19, and the PAWS-X score from 64.70 to 62.45. In contrast, no significant differences are observed on XNLI or XCOPA.

Model                     Param  EN → XX  XX → EN  XNLI   XCOPA  PAWS-X
Parallel Only             7B     31.12    38.04    45.40  70.20  64.70
Multilingual Replacement  7B     22.03    23.19    46.13  69.83  62.45

Table 4: Comparison of translation and commonsense reasoning performance between PARALLEL ONLY and MULTILINGUAL REPLACEMENT. The highest score is bolded.

Overall, these findings directly address our primary research question, demonstrating that the benefits of PARALLEL ONLY arise from deliberate bilingual alignment provided by authentic parallel sentences. The results further indicate that performing CPT using only parallel data does not cause the model to overfit to translation, as it also yields improved performance on commonsense reasoning tasks.
6.2 Logit Lens

Recent work by Wendler et al. (2024) examined latent language probabilities through the logit lens and demonstrated that multilingual LLMs exhibit a distinctive depth-wise progression at the final training checkpoint: early layers encode minimal language-specific signals, intermediate layers prioritize English representations, and deeper layers shift toward the target language. However, their analysis is limited to a single fully trained model, leaving open questions about how language-specific representations develop over the course of training and how different continual pretraining strategies influence this process.
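The logit lens itself is straightforward to reproduce: decode each layer's hidden state through the model's final norm and unembedding matrix, then sum the probability mass over tokens attributed to each language. The sketch below follows the Hugging Face OLMo 2 implementation; the attribute paths (model.model.norm, model.lm_head) and the caller-supplied lang_token_ids sets are assumptions, not code from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Attribute paths follow the Hugging Face OLMo 2 implementation
# (model.model.norm, model.lm_head); adjust for other architectures.
MODEL = "allenai/OLMo-2-1124-7B"
model = AutoModelForCausalLM.from_pretrained(MODEL)
tok = AutoTokenizer.from_pretrained(MODEL)

def language_probs(prompt: str, lang_token_ids: dict[str, list[int]]):
    """Logit-lens language probabilities at the next-token position.

    `lang_token_ids` maps a language name to the token ids counted as
    belonging to that language (a caller-supplied assumption; Wendler
    et al. (2024) describe how such sets can be derived).
    """
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    per_layer = {}
    for layer, hidden in enumerate(out.hidden_states):
        # Decode this layer's last hidden state through the final
        # norm and the unembedding matrix (the logit lens).
        last = model.model.norm(hidden[:, -1, :])
        probs = torch.softmax(model.lm_head(last), dim=-1).squeeze(0)
        per_layer[layer] = {
            lang: probs[ids].sum().item()
            for lang, ids in lang_token_ids.items()
        }
    return per_layer
```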
Figures 2 and 3 extend this analysis by tracking language probabilities across training checkpoints for the Chinese Cloze task, comparing the PARALLEL ONLY (7B) and MULTILINGUAL (7B) models. Because layers below 20 do not exhibit meaningful probabilities for either Chinese or English throughout training, we focus on layers 20 and above, where non-trivial language signals first emerge. Both models reveal a consistent depth-dependent pattern: layer 20 shows weak and unstable English-leaning behavior; intermediate layers (e.g., layer 24) display a pronounced English-dominant phase across training checkpoints; and deeper layers increasingly transition toward Chinese dominance.

Despite these shared patterns, the temporal dynamics of the transition differ between the two models, particularly around layer 30. Figure 2 shows that the PARALLEL ONLY model undergoes a more gradual shift, with Chinese probability surpassing English only in the later stages of training. In contrast, Figure 3 shows that the MULTILINGUAL model exhibits an earlier and more abrupt transition, characterized by a sharp increase in Chinese probabilities and a corresponding decrease in English probability across checkpoints. English probabilities in the MULTILINGUAL model are also generally lower than those in the PARALLEL ONLY model, especially in layers 30 and 31.

[Figure 2: Language probabilities of the PARALLEL ONLY (7B) model on the Chinese Cloze task across layers and training checkpoints; panels (a)-(e) show layers 20, 24, 30, 31, and 32.]

[Figure 3: Language probabilities of the MULTILINGUAL (7B) model on the Chinese Cloze task across layers and training checkpoints; panels (a)-(e) show layers 20, 24, 30, 31, and 32.]

Together, these observations suggest that the PARALLEL ONLY model "thinks" more in English than the MULTILINGUAL model, a behavior also observed in strong multilingual models such as Llama 2 (Wendler et al., 2024). This may indicate stronger cross-lingual transfer in the PARALLEL ONLY model, which could help explain its superior performance across tasks.
7 Conclusion

Our study investigates data choices for extending an English-only LLM, OLMo 2, to new languages and demonstrates that, under a fixed token budget, training with only parallel data yields the strongest performance. We further validate this finding on a 7B model continually pretrained on 34.7B tokens, where we compare parallel-only continual pretraining with the more conventional multilingual data approach and observe consistent trends. Through ablation experiments and logit lens analyses, we provide evidence that the sentence-level alignment inherent in parallel data plays a key role in its effectiveness, likely by facilitating stronger cross-lingual transfer.

Building on these insights, we introduce OPENSEAL, the first fully open Southeast Asian LLM. Our model is fully transparent, enabling careful inspection for security-sensitive applications and promoting AI sovereignty. By releasing models that are entirely open, we aim to support future in-depth research on multilinguality, bias, and other related properties of LLMs.

Despite being trained with a relatively modest computational budget, OPENSEAL matches or, in some cases, surpasses existing Southeast Asian LLMs that are derived from high-performing multilingual foundations and trained on substantially larger multilingual corpora. Overall, our results suggest that continually pretraining an English LLM with parallel data offers a good, fast, and cheap pathway to building high-quality, region-focused language models.

Limitations

This work focuses less on building the highest-performing LLM and more on empirically investigating and analyzing the effectiveness of parallel data for adapting LLMs to new languages, using Southeast Asian languages as a case study. Our study focuses on the continual pretraining stage and does not include instruction tuning, which could otherwise confound the results. We leave the investigation of instruction tuning to future work. Due to computational resource limitations, our experiments are conducted only at the 1B and 7B parameter scales.
We believe that our work poses no immediate risk to society or to any individuals or organizations; however, we recommend caution when using our models, as they have not undergone any safety or value alignment.

Acknowledgments

This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG3-RP-2022-030). We would like to acknowledge that the computational work involved in this research is partially supported by NUS IT's Research Computing group with grant number NUSREC-HPC-00001. We would also like to thank Raymond Ng, Peerat Limkonchotiwat, Jian Gang Ngui, and William Tjhi for helpful discussions on SEA-LION.

References

Eleftheria Briakou, Colin Cherry, and George Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in PaLM's translation capability. In Proceedings of ACL, pages 9432-9452.

Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Saksham Singhal, Xian-Ling Mao, Heyan Huang, Xia Song, and Furu Wei. 2021. mT6: Multilingual pretrained text-to-text transformer with translation pairs. In Proceedings of EMNLP, pages 1671-1683.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and et al. 2023. PaLM: Scaling language modeling with pathways. JMLR, 24(1).

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proceedings of NeurIPS.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP, pages 2475-2485.
Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, and 19 others. 2022. No language left behind: Scaling human-centered machine translation. Preprint, arXiv:2207.04672.

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Xin Mao, Ziqi Jin, Wei Lu, and Min Lin. 2024. Sailor: Open language models for South-East Asia. In Proceedings of EMNLP, pages 424-435.

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, and Ge Zhang. 2024. Chinese Tiny LLM: Pretraining a Chinese-centric large language model. Preprint, arXiv:2404.04167.

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. 2024. Continual pre-training for cross-lingual LLM adaptation: Enhancing Japanese language capabilities. In Proceedings of COLM.

Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, and 83 others. 2025. Apertus: Democratizing open and compliant LLMs for global language environments. Preprint, arXiv:2509.14233.
Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, and 5 others. 2024. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In Proceedings of COLM.

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish. 2024. Simple and scalable strategies to continually pre-train large language models. Preprint, arXiv:2403.08763.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, pages 388-395.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. TACL, 8:726-742.

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, and 42 others. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of EMNLP, pages 5155-5203.
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, and 12 others. 2025. SEA-LION: Southeast Asian languages in one network. Preprint, arXiv:2504.05747.

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2024. SeaLLMs - large language models for Southeast Asia. In Proceedings of ACL, pages 294-304.

Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of EMNLP, pages 27-38.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of EMNLP, pages 2362-2376.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of WMT, pages 186-191.

Muhammad Reza Qorib, Junyi Li, and Hwee Tou Ng. 2025. Just go parallel: Improving the multilingual capabilities of large language models. In Proceedings of ACL, pages 33411-33424.

Leonardo Ranaldi, Giulia Pucci, and Andre Freitas. 2024. Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations. In Findings of ACL, pages 7961-7973.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, and 30 others. 2022. BLOOM: A 176B-parameter open-access multilingual language model. Preprint, arXiv:2211.05100.

Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, and 13 others. 2023. Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. Preprint, arXiv:2308.16149.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, and 23 others. 2025. 2 OLMo 2 Furious. In Proceedings of COLM.
Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do Llamas work in English? On the latent language of multilingual transformers. In Proceedings of ACL, pages 15366-15394.

D. H. Wolpert and W. G. Macready. 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82.

Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. 2024. LLaMA Pro: Progressive LLaMA with block expansion. In Proceedings of ACL, pages 6518-6537.

Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2024. A paradigm shift in machine translation: Boosting translation performance of large language models. In Proceedings of ICLR.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP-IJCNLP, pages 3687-3692.

Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. 2025a. SeaLLMs 3: Open foundation and chat multilingual large language models for Southeast Asian languages. In Proceedings of ACL, pages 96-105.

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito. 2025b. Persistent pre-training poisoning of LLMs. In Proceedings of ICLR.

Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, and Ming Zhou. 2024. Breaking language barriers: Cross-lingual continual pre-training at scale. In Proceedings of EMNLP, pages 7725-7738.
A Resources

In this section, we report the details of the data sources used in our experiments. We report the statistics of our SEA multilingual data (Table 5), Chinese multilingual data (Table 6), and replay data (Table 7).

ISO Code  # tokens
id        54.6B
km         1.7B
lo         2.1B
ms        11.8B
my         0.8B
ta        46.3B
th        16.5B
tl         2.5B
vi        90.0B

Table 5: Number of tokens of SEA-PILE-v2, measured using the OLMo 2 tokenizer.

Data source              # tokens
Internet (Common Crawl)  677.6B
Books                     56.8B
Academic papers           33.6B
Others (QA, etc.)         29.6B
Encyclopaedia              2.4B
Total                    800.0B

Table 6: Number of tokens of the MAP-CC dataset, measured using the original tokenizer from the paper.

Corpus name          # tokens
Filtered DCLM        752.0B
Decontaminated FLAN   17.0B
StackExchange Q&A      1.3B
peS2o                 58.6B
Wikipedia/Wikibooks    3.7B
Dolmino Math          10.7B
Total                843.3B

Table 7: Number of tokens of replay data sampled from OLMo 2's second-stage pretraining, measured using the OLMo 2 tokenizer.

B Results

In this section, we report the detailed experimental results for each language. We report the models' performance on translating texts from Southeast Asian languages into English (Table 8) and English into Southeast Asian languages (Table 9). We report the detailed performance on commonsense reasoning tasks, which include XNLI (Table 10), XCOPA (Table 11), and PAWS-X (Table 12).

C Training Details

We report the architectural configurations of the 1B and 7B models in Table 13 and summarize the corresponding training hyperparameters in Table 14. All experiments were conducted using 8× NVIDIA H200 GPUs. Training the 1B model on 10B tokens required approximately 11 hours, while training the 7B model on 34.7B tokens took approximately 180 hours.
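As a sanity check, these timings are consistent with the sub-US$12,000 figure quoted in the introduction: dividing the budget by the 7B training time bounds the implied 8× H200 node-hour rate. This is back-of-the-envelope arithmetic on the paper's own numbers, not a quoted cloud price.

```latex
\frac{\text{US}\$12{,}000}{180~\text{hours}} \approx \text{US}\$66.7 \text{ per } 8{\times}\text{H200 node-hour}
```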
Model           Size  id     km     lo     ms     my     ta     th     tl     vi     zh     Avg
Mixed           1B    42.81  21.14  23.87  42.33   4.52  12.98  24.01  32.50  29.13  23.03  25.63
Parallel First  1B    35.31  13.60  18.02  35.26   3.52  11.36  20.50  32.22  23.01  15.71  20.85
Parallel Last   1B    41.00  21.46  23.76  42.02   7.29  17.71  24.35  38.31  27.99  21.59  26.55
Multilingual    1B    29.61   3.07   1.17  30.10   0.30   4.07  12.98  30.53  18.24  14.96  14.50
Parallel Only   1B    42.98  23.10   6.13  42.61   5.85  11.88  27.35  43.86  34.71  23.97  26.24
Multilingual    7B    41.56  26.70  30.67  41.75  10.86  21.83  27.84  44.07  34.42  21.76  30.15
Parallel Only   7B    49.48  32.92  34.94  49.06  22.44  33.45  35.07  51.93  41.00  30.06  38.04
SeaLLM v3       7B    41.53  16.31   8.84  40.68   7.59  12.87  28.97  36.47  34.99  28.80  25.71
Sailor2         8B    46.26  32.70  37.22  46.10  24.70  28.57  32.84  48.16  39.38  27.73  36.37
SEA-LION v3.5   8B    42.78  26.77  26.70  41.72  20.91  26.54  30.13  42.98  35.48  26.32  32.03

Table 8: BLEU scores for translation into English.

Model           Size  id     km     lo     ms     my     ta     th     tl     vi     zh     Avg
Mixed           1B    44.99   8.61   7.75  39.83   2.18   6.21  19.65  33.45  37.93  36.32  23.69
Parallel First  1B    37.44   4.58   7.83  32.97   2.26   2.51  14.60  29.54  30.04  28.19  19.00
Parallel Last   1B    45.73   2.07   3.58  40.57   1.78   6.63  20.05  33.71  38.69  36.77  22.96
Multilingual    1B    28.13   0.19   0.65  25.92   0.61   0.30   6.11  23.17  19.89  20.10  12.51
Parallel Only   1B    45.69   3.19   6.65  40.99   2.33   8.81  21.22  34.31  39.50  38.51  24.12
Multilingual    7B    41.40  13.87  14.61  37.66   5.28   7.73  20.62  34.30  35.30  30.90  24.17
Parallel Only   7B    49.50  17.99  16.47  44.50   9.25  17.58  28.66  37.92  43.44  45.86  31.12
SeaLLM v3       7B    35.27   3.16   0.91  25.81   1.96   1.22  15.92  15.73  35.22  43.27  17.85
Sailor2         8B    47.82  22.01  22.82  41.36  15.77  11.67  27.31  37.15  43.35  41.68  31.09
SEA-LION v3.5   8B    42.03  12.51  12.02  36.50   8.09  12.32  21.01  31.90  39.09  38.18  25.37

Table 9: BLEU scores for translation from English.
Model           Size  en     th     vi     zh     Avg
Mixed           1B    51.54  41.72  44.37  35.47  43.28
Parallel First  1B    52.79  43.31  44.15  39.00  44.81
Parallel Last   1B    51.00  42.91  44.15  33.29  42.84
Multilingual    1B    52.77  39.42  41.86  39.08  43.28
Parallel Only   1B    52.97  44.49  44.23  34.67  44.09
Multilingual    7B    52.61  45.45  47.76  37.11  45.73
Parallel Only   7B    52.10  47.50  47.33  34.65  45.40
SeaLLM v3       7B    53.77  40.32  42.87  36.17  43.28
Sailor2         8B    54.65  44.45  45.55  33.69  44.59
SEA-LION v3.5   8B    52.79  47.05  46.23  38.36  46.11

Table 10: XNLI scores for each language.

Model           Size  en     id     ta     th     vi     zh     Avg
Mixed           1B    77.80  63.80  53.00  56.60  63.60  59.20  62.33
Parallel First  1B    79.20  64.20  54.80  54.20  63.00  58.60  62.33
Parallel Last   1B    77.40  61.40  54.80  56.00  65.00  59.60  62.37
Multilingual    1B    76.60  63.40  55.40  57.20  65.40  58.40  62.73
Parallel Only   1B    77.40  61.20  54.60  55.80  61.60  59.40  61.67
Multilingual    7B    86.00  73.60  61.60  61.60  72.80  66.20  70.30
Parallel Only   7B    85.00  74.00  59.60  60.00  73.80  68.80  70.20
SeaLLM v3       7B    87.80  71.00  53.80  61.20  72.40  77.00  70.53
Sailor2         8B    87.40  79.40  62.60  65.00  78.20  74.40  74.50
SEA-LION v3.5   8B    87.80  72.80  59.60  59.80  73.00  69.20  70.37

Table 11: XCOPA scores for each language.

Model           Size  en     zh     Avg
Mixed           1B    58.80  55.80  57.30
Parallel First  1B    56.90  52.85  54.88
Parallel Last   1B    59.35  52.00  55.68
Multilingual    1B    58.50  47.00  52.75
Parallel Only   1B    61.15  53.30  57.23
Multilingual    7B    66.45  53.70  60.08
Parallel Only   7B    70.10  59.30  64.70
SeaLLM v3       7B    66.85  53.50  60.18
Sailor2         8B    70.90  58.15  64.53
SEA-LION v3.5   8B    66.45  60.70  63.58

Table 12: PAWS-X scores for each language.
Hyperparameter          1B      7B
Number of layers        16      32
Embedding dimension     2048    4096
Intermediate dimension  8192    22016
Attention heads         16      32
Context window          4096    4096
Vocabulary size         100352  100352

Table 13: Architecture of the OLMo 2 1B and 7B models.

Hyperparameter                 1B            7B
Global batch size              512           1024
Micro batch size (per device)  4             2
Learning rate                  2.0 × 10^-4   2.0 × 10^-4
Weight decay                   0.1           0.1
Optimizer                      AdamW         AdamW
(β1, β2)                       (0.9, 0.95)   (0.9, 0.95)
Gradient clip                  1.0           1.0
Max sequence length            4096          4096

Table 14: Training hyperparameters for the 1B and 7B experiments.