Mangosteen: An Open Thai Corpus for Language Model Pretraining
Summary
This paper introduces Mangosteen, a 47.4 billion-token open Thai corpus for language model pretraining. Existing large-scale corpora rely on English-centric or language-agnostic pipelines that fail to address Thai script and cultural nuances, often leaving risky content like gambling material untreated. Mangosteen addresses this by adapting the Dolma pipeline with custom rule-based language identification, revised C4/Gopher quality filters, and Thai-trained content filters. It incorporates diverse sources including Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline reduces Common Crawl from 202M to 25M documents while raising SEA-HELM NLG scores from 3 to 11. An 8B-parameter SEA-LION model continually pre-trained on Mangosteen outperforms SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. The full pipeline code, cleaning manifests, corpus snapshot, and checkpoints are released, providing a reproducible foundation for Thai and regional LLM research.
Wannaphong Phatthiyaphaibun♠†, Can Udomcharoenchaikit♠†, Pakpoom Singkorapoom♠, Kunat Pipatanakul♣, Ekapol Chuangsuwanich♢, Peerat Limkonchotiwat♡‡, Sarana Nutanong♠‡
♠Vidyasirimedhi Institute of Science and Technology, ♣SCB10X, ♢Chulalongkorn University, ♡AI Singapore
†Co-first Authors, ‡Corresponding Authors

Abstract

Pre-training data shapes a language model's quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims Common Crawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.[1]

[1] All artifacts in this paper ...

1 Introduction

Pre-training datasets are an important part of training a language model. The size, the diversity, and the quality of the pre-training data play a crucial role in obtaining a high-performing language model.
These datasets are compiled by scraping vast amounts of web text, augmented with diverse sources like books, scholarly articles, and code. These datasets are enormous, often reaching hundreds of billions of tokens. Raw web corpora are messy, containing noise, duplicates, and harmful content. If not carefully filtered, these issues can lead to undesirable model behavior. To improve the quality of the dataset, a filtering pipeline is needed to remove these unwanted samples.

Most large-scale English pre-training datasets, such as CC-100 (Wenzek et al., 2020), C4 (Raffel et al., 2020), RefinedWeb (Penedo et al., 2023), and Dolma (Soldaini et al., 2024), are mainly derived from internet data such as the Common Crawl corpus (https://commoncrawl.org/), and hence require extensive cleaning and filtering. However, most data-cleaning pipelines for web pages are optimized for the English language using known best practices. For multilingual datasets that support Thai, such as CC-100, FineWeb2 (Penedo et al., 2024b), and HPLT v2 (Burchell et al., 2025), the data cleaning pipelines are usually language-agnostic. Although being language-agnostic can lead to a generalized pipeline design, this approach prioritizes broad applicability over incorporating local, cultural, and language-specific knowledge. Consequently, language-specific filtering steps are necessary to prevent unwanted information from being used to train language models. For example, the FineWeb2 dataset contains content from gambling websites, which are illegal in Thailand.
For language-specific datasets, many data collection projects adapt existing data cleaning pipelines to make them more suitable for their target languages. This ranges from making a small modification to a single step, as in the Latxa project (Etxaniz et al., 2024) for Basque, which adapted the Dolma and Corpus Cleaner v2 pipelines with a specific change in the filtering step, to customizing the whole pipeline. Examples of more extensive customization include the FinGPT (Luukkonen et al., 2023) and IndicLLMSuite (Khan et al., 2024) projects, which built their own dedicated pipelines for Finnish and Indian languages, respectively. For the Thai language, the approach ranges from using a standard language-agnostic pipeline, as in the Typhoon-2 project, which uses Trafilatura (Barbaresi, 2021) and text-dedup (Mou et al., 2023) directly while creating its own heuristic filtering, to customizing the whole pipeline to specifically handle Thai. An example of a dedicated Thai data cleaning pipeline is OpenThaiGPT (Yuenyong et al., 2025), which builds its data preprocessing pipeline from scratch. However, these projects focus on open models rather than the openness of their training data and collection methods. While this provides a valuable resource, the pre-training corpora remain inaccessible, and the pipelines' design choices and empirical backing are not detailed, which is an important consideration for researchers focused on reproducibility. This raises a central question: What does it take to build a high-quality pre-training corpus for a language like Thai, and how should existing pipelines be adapted to meet its linguistic and cultural specifics?
To address the lack of Thai pre-training resources and to increase the exposure of Thai within the open-source NLP community, we have developed Mangosteen, a large-scale pre-training dataset for the Thai language. This dataset contains 47.4 billion tokens and is available under an open-source license. In addition, we provide an in-depth analysis of our data cleaning pipeline, designed specifically for Thai, including an ablation study that confirms the effectiveness of each processing step. We customized the Dolma (Soldaini et al., 2024) pipeline to curate variants of Thai Common Crawl data through a systematic cleaning process, because the Dolma pipeline is easy to customize and supports parallel processing. For language identification, we developed a rule-based script to accurately detect Thai content. The quality filter was adapted by modifying the C4 and Gopher filter rules to address specific Thai language nuances. Furthermore, we implemented a content filter composed of two components, an obscene-content filter and a gambling-related content filter, both of which were trained on Thai-specific data to effectively remove undesirable content. In addition to Common Crawl data, we include data from several other sources to ensure diversity, including Wikipedia, YouTube subtitles, and text extracted from open-access books using OCR.
To determine the most effective data pipeline for the Thai language, we systematically evaluate the impact of each processing step on language model performance. We test each step in our pipeline to find the optimal settings for Thai. This process involves pre-training a GPT-2 model (Radford et al., 2019) with 124M parameters on 10B tokens for each configuration. The effectiveness of each step is then confirmed by evaluating the trained model against benchmarks like SEACrowd and SEA-HELM. Compared to the uncleaned Common Crawl data, data processed through our pipeline achieves better performance despite being multiple times smaller. Furthermore, we can improve the already-cleaned FineWeb2 data by passing it through our pipeline, resulting in a much smaller dataset that maintains a similar level of performance. Finally, to confirm the robustness of our data, we developed a new Thai large language model by further pre-training a Southeast Asian model, SEA-LION (Ng et al., 2025), on our dataset. Our LLM outperforms other Llama-3.1-based models on both the Thai LLM Leaderboard and SEA-HELM benchmarks.

We summarize the contributions of our paper as follows:

• We introduce a data cleaning pipeline for improving data quality, which adapts the Dolma pipeline to the language's specific needs. We also describe the customization steps taken in this process. Moreover, we conduct an ablation study to demonstrate the effectiveness of our data cleaning pipeline.

• Beyond Common Crawl data, through an extensive curation effort, we curate and extract high-quality texts from numerous other sources, including Wikipedia and related encyclopedic resources, YouTube subtitles, open-access books whose text we extracted via OCR, official documents from the Royal Gazette, open government data, and existing Thai datasets on Hugging Face.
• In a large-scale effort to advance open data and open models for the Thai language, we introduce a high-quality 47B-token pre-training corpus named "Mangosteen". To demonstrate its effectiveness, we also present WangchanLION-V3-8B, a new, fully open-source Thai LLM, developed by further pre-training a SEA-LION-based model on our dataset. In line with our commitment to open science, this entire ecosystem, that is, the dataset, the model, and all related code, is available under permissive open-source licenses.

2 Related Work

2.1 Text Pretrained Corpora

A critical design choice for any pre-training corpus is data cleaning, which ensures the quality of the data and its effectiveness for the pre-training process. Prior efforts (Wenzek et al., 2020; Raffel et al., 2020; Xue et al., 2021; Gao et al., 2020) focused on improving data quality using simple techniques such as language identification, document deduplication, and quality filtering with a perplexity filter model on web corpora, ranging from monolingual (English only) to multilingual (more than 100 languages) settings. Later works demonstrated more refined text processing using a separate model to classify text quality (Ortiz Suárez et al., 2020). The C4 corpus (Raffel et al., 2023) was created with a set of heuristics later known as the "C4" rules, followed by the Gopher rules (Rae et al., 2021); more elaborate pipelines, e.g., RefinedWeb (Penedo et al., 2023), demonstrate that combining URL filtering, text extraction, language identification, repetition removal, document-wise filtering, line-wise corrections, fuzzy deduplication, and exact deduplication outperforms previous data cleaning processes.
Recently, the size of pre-training corpora has increased significantly along with model size, which has grown from 100 million parameters (i.e., BERT-base (Devlin et al., 2018)) to large language models (i.e., Llama-8B (Grattafiori et al., 2024)). Therefore, most recent works propose large-scale pre-training datasets together with methods for creating clean, high-quality pre-training data to improve model performance. The Dolma corpus (Soldaini et al., 2024) is an English pre-training dataset of 3 trillion tokens mixed from many sources: web pages, academic publications, code, books, and encyclopedic materials. The authors released both the dataset and a pipeline to promote dataset transparency for language models; the pipeline includes language filtering, URL and text-overlap deduplication, quality filters, and content filters. FineWeb (Penedo et al., 2024a) cleans and deduplicates English web data from Common Crawl, using a linear regression model, trained on Llama-3 synthetic annotations, to classify educational texts. FineWeb2 (Penedo et al., 2024b) covers over 1,000 languages for pre-training models. Its methodology is similar to FineWeb, but data is deduplicated per language globally, and C4 filters are not used.

In general, these pipelines are similar and mostly open-source, but the majority do not include instructions on how to apply them to other languages. Furthermore, the aforementioned pipelines fail to explain the processes that make the preprocessing steps effective for the target language, and these pipelines are not modified to an extent that makes them suitable for the target language. For instance, some languages necessitate either language-specific word segmentation tools or a newly trained model to filter out offensive content, including pornography and gambling material.
2.2 Thai-text pre-training datasets and models

The most prominent pre-training datasets for Thai are typically included as subsets within multilingual corpora. Early examples include the multilingual BERT (Devlin et al., 2018) model, which uses Wikipedia and includes Thai content. This was followed by the CC-100 corpus (Wenzek et al., 2020), which supports Thai and is used in models such as XLM and XLM-R (Conneau et al., 2020). The OSCAR corpus (Ortiz Suárez et al., 2020) also includes Thai and forms part of the dataset used in GPT-2 Thai by Flax's GPT-2 base (https://huggingface.co/flax-community/gpt2-base-thai). More recently, trillion-scale multilingual pre-training datasets such as CulturaX (Nguyen et al., 2023), FineWeb2 (Penedo et al., 2024b), and HPLT v2 (Burchell et al., 2025) have also included Thai among hundreds of supported languages. However, their language-agnostic approach disregards language-specific and local knowledge vital for accurate data cleansing.

Moreover, the Thai community has also proposed Thai continual pre-training (CPT) projects to increase the amount of data and the models' robustness. OpenThaiGPT (Yuenyong et al., 2025) performs continual pre-training on Llama models for Thai by extending the vocabulary and continually pre-training on Thai language datasets. SambaLingo (Csaki et al., 2024) is a Llama-2 model that does vocabulary expansion and continuous pre-training for nine languages, including Thai, using data sourced from CulturaX (Nguyen et al., 2023).
Typhoon (Pipatanakul et al., 2023) performs CPT on cleaned and deduplicated Common Crawl and extends the vocabulary of Mistral-7B, and Typhoon 2 (Pipatanakul et al., 2024) applies CPT to Qwen 2.5 7B and Llama 3.1, combining the original pipeline with a multi-classifier model for data filtering. OpenThaiLLM-Prebuilt-7B (https://medium.com/nectec/openthaillm-prebuilt-release-f1b0e22be6a5) continues pre-training on Qwen 2 with Thai and Chinese datasets.

Although previous works have demonstrated the possibility of improving Thai performance using CPT, they only partially open-sourced their pipelines or did not study the effect of each individual component. For example, the Typhoon models use heuristic filtering like RefinedWeb (Penedo et al., 2023) and DCLM (Li et al., 2025), but do not open-source their pipeline code or models to enable reproducibility. OpenThaiGPT open-sourced their data processing pipeline (https://github.com/OpenThaiGPT/data-processing); however, they do not provide a thorough study of the pipeline, including the necessary analyses and resources required for full reproducibility. For example, they do not share the dataset or the model used for perplexity-based filtering. As a result, the Thai research community struggles to make forward progress without open-source data and a study of its design decisions.
2.3 Gap Summary

Existing data cleaning pipelines like C4, RefinedWeb, and Dolma, while scalable, are designed primarily for English or applied in a language-agnostic way, requiring further adjustments to fit distinct scripts and cultural norms. Thai pre-training efforts, such as OpenThaiGPT and Typhoon, reuse such pipelines or apply ad hoc modifications without empirical validation, and rarely release their datasets or pipelines, limiting reproducibility. This paper addresses these gaps by systematically applying the principles found in C4, RefinedWeb, and Dolma to the Thai-specific context.

3 Data Curation

The goal of data curation is to build a Thai pre-training corpus that covers the broadest possible set of domains, enabling adaptation to various downstream tasks. We collect datasets from various sources that can be classified into two categories: (i) Common Crawl-derived and (ii) curated non-Common Crawl. In total, we have 30.1M documents and 47.4B tokens, as shown in Table 1.

Source | Documents | Tokens (B)
Common Crawl-Derived | 29.7M | 45.9
Curated Non-Common Crawl | 425,304 | 1.5
Total | 30.1M | 47.4

Table 1: Document and token statistics (using the Llama 3 tokenizer).

3.1 Common Crawl-derived Dataset

Following best practices from other pre-training corpora (Raffel et al., 2023; Rae et al., 2021; Penedo et al., 2023; Soldaini et al., 2024), our Common Crawl-derived data consists of two main sources: raw Common Crawl snapshots and the FineWeb2 dataset. We focus specifically on the Thai-language subset of both sources.

Common Crawl. We collect the Common Crawl dataset by processing each Common Crawl snapshot from 2018-30 to 2023-23 and extracting only Thai-language content using Common Crawl metadata. For text extraction, we use trafilatura (Barbaresi, 2021), followed by our data cleaning pipeline (see Section 4 for full details) to filter and deduplicate the result.
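To make the extraction step concrete, here is a minimal sketch of pulling main text from one crawled HTML page with trafilatura; the snapshot iteration, WARC parsing, and metadata-based language pre-filtering in the released pipeline are omitted, and the URL below is only a placeholder.

```python
from typing import Optional

import trafilatura

def extract_main_text(html: str) -> Optional[str]:
    """Extract the main article text from raw HTML, dropping
    boilerplate such as navigation, ads, and comments."""
    return trafilatura.extract(html)

# Example: fetch and extract a single live page. The real pipeline
# reads Common Crawl WARC records instead of fetching URLs.
downloaded = trafilatura.fetch_url("https://example.com/")
if downloaded:
    print(extract_main_text(downloaded))
```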
FineWeb2. To further increase the quantity of our Common Crawl-derived dataset, we incorporate FineWeb2, a large-scale pre-training corpus built from Common Crawl data, which employs multiple filtering techniques to improve dataset quality. FineWeb2 spans from summer 2013 to April 2024. The Thai subset contains approximately 51.4 billion words across 35 million documents. We further process the cleaned FineWeb2 dataset with our data cleaning pipeline (Section 4) to deduplicate overlapping URLs and text inherited from the Common Crawl source while also enhancing language-specific quality. After this step, the final cleaned Thai FineWeb2 subset consists of around 7.3 billion words and 4.6 million documents.

3.2 Curated Non-Common Crawl Dataset

In addition to the Common Crawl-based data, we enhance data diversity by incorporating harder-to-reach data as follows. (1) A significant subset of our source documents consists of scanned or image-based PDFs, whose textual content is not directly accessible. To overcome this, we perform Optical Character Recognition (OCR) to extract the text from these files using Marker (https://github.com/VikParuchuri/marker). Due to the diverse publication formats of the scanned documents, we must customize the OCR pipeline for each document format and clean the results manually. (2) We also extract text from YouTube video subtitles using the provided metadata, building a pipeline that selects suitable Thai video subtitles released under a Creative Commons (CC) license.
To extract subtitles, we modified Scrapetube's source code (https://github.com/dermasmid/scrapetube) to exclusively retrieve Creative Commons licensed videos. Our two-step methodology involved:

1. Compiling Thai keywords across domains (mathematics, investing, fitness, gaming, films, etc.) and applying CC/subtitle filters to gather up to 1,000 video metadata entries per keyword.
2. Using quoted channel names as queries to identify predominantly CC-licensed content.

Video IDs were processed via YouTube's Transcription API (https://github.com/jdepoix/youtube-transcript-api) to generate textual JSON Lines. As a final validation step, non-CC content was removed via yt_dlp (https://github.com/yt-dlp/yt-dlp).
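A condensed sketch of the subtitle-fetching step is shown below, using the classic static interface of the youtube-transcript-api package; the modified Scrapetube crawler, keyword lists, and yt_dlp license re-check are omitted, and video_ids stands in for the CC-filtered IDs gathered in step 1.

```python
import json

from youtube_transcript_api import YouTubeTranscriptApi

# Assumed input: video IDs surviving the CC-license metadata filter.
video_ids = ["dQw4w9WgXcQ"]  # placeholder ID

with open("subtitles.jsonl", "w", encoding="utf-8") as out:
    for vid in video_ids:
        try:
            # Request Thai subtitle tracks only.
            segments = YouTubeTranscriptApi.get_transcript(vid, languages=["th"])
        except Exception:
            continue  # no Thai subtitles available for this video
        text = " ".join(seg["text"] for seg in segments)
        # One JSON Lines record per video, matching the paper's format.
        out.write(json.dumps({"id": vid, "text": text}, ensure_ascii=False) + "\n")
```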
The curated non-Common Crawl dataset spans a large number of sources, covering the following categories:

• Encyclopedic: We collect Thai data from Wikipedia, Wikibooks, Wikiquote, and Wikisource, then use WikiExtractor (Attardi, 2015) to extract text, remove the HTML tags, and adjust the formatting of the WikiExtractor outputs.

• Finance: We use the Financial Text Data Collection (https://huggingface.co/datasets/airesearch/CMDF_VISTEC) by the VISTEC-depa Thailand artificial intelligence research institute. This dataset includes financial reports from Thai companies and Thai financial news.

• Legal: We collect Thai legal data from multiple Hugging Face repositories, including acts and statutes, constitutions, and Creative Commons licenses.

• Government: We collect data publicly released by government agencies from multiple sources. This includes the Open Government Data of Thailand, a Thai government effort to improve transparency and engagement through access to open data. Government data often comes in scanned PDF format, which requires extraction through OCR. Furthermore, we have to extract text from various formats such as DOCX and CSV.

• Education: We collect educational materials ranging from informative articles and classical Thai literature to advanced academic research.

• YouTube: We collect Thai YouTube subtitles from videos with the CC BY 3.0 license.
Table 2 shows the document count distribution for each category. The curated non-Common Crawl corpus comprises 425,304 Thai-language documents from sources not present in Common Crawl. After deduplication, we split the corpus into a training set of 397,488 documents and a validation set of 4,044 documents. The non-Common Crawl dataset was curated from a wide variety of web sources and includes data from encyclopedic websites, financial reports, education, legal corpora, academic literature, and YouTube transcripts. In total, the non-Common Crawl corpus contributes an additional 1.54 billion tokens to the dataset, tokenized using the meta-llama/Llama-3.2-1B tokenizer. All sources were verified to ensure the use of permissive content licenses. The full list of text sources can be found at https://github.com/vistec-AI/Mangosteen, and a more detailed table on the sources of the non-Common Crawl dataset is provided in the appendix.

Domain | Count | Percentage
Encyclopedic | 166,187 | 41.34
Finance | 86,813 | 21.59
Government | 72,879 | 18.13
Legal | 52,343 | 13.02
YouTube | 17,837 | 4.43
Education | 5,911 | 1.47

Table 2: Document counts and percentages in the non-Common Crawl data by domain.

We also use a deduplication process to ensure this new data does not overlap with the web data collected earlier. By incorporating this non-Common Crawl data, we gain an additional 1.5B tokens of high-quality text. Although these data are of high quality, we still need to clean the noisy web data (e.g., Common Crawl and FineWeb2) to ensure the overall quality of our collected dataset. Therefore, we still require a data cleaning pipeline to mitigate these problems.

4 Data cleaning pipeline for web data

4.1 Overview

A data cleaning pipeline is a crucial component for ensuring the quality of pre-training data. For the Thai NLP community, previous Thai LLMs have also demonstrated how to gather a large-scale dataset for pre-training an LLM. However, as we discuss in Section 2.2, the focus of existing works has been on the model release rather than the training data and collection methods. Consequently, the pre-training corpora often remain inaccessible, and pipeline design choices are not discussed in detail, which is an important consideration for researchers focused on reproducibility and adaptability.

In this work, we propose a novel data collection and cleaning pipeline tailored to the unique characteristics and challenges of Thai data. In particular, we present a Thai adaptation of the well-known Dolma (Soldaini et al., 2024) data curation pipeline, featuring a new data processing design customized for the Thai language. After applying our data filtering pipeline, we obtained a large-scale dataset containing 45.9B tokens. The Dolma pipeline consists of four steps: language identification, quality filters, deduplication, and content filters, described as follows.
4.2 Language Identification

The first step of our data preprocessing is language identification, where we remove all non-Thai data from the pre-training corpus. In particular, we discard any document containing text in other languages, such as Lao or Burmese.

Dolma used FastText (Joulin et al., 2017) as a language identifier for English corpora. However, when we compared three language identifiers (langdetect, lingua, and the FastText language identifier model), all three failed to perform this task on Thai text, as shown in Appendix B. Therefore, we use a simple and efficient rule-based approach that identifies Thai text with a regular expression based on the Thai Unicode character range. This filter allows only documents in which at least half of the characters are Thai Unicode characters to pass.
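The rule fits in a few lines. Below is a minimal sketch under two assumptions of ours: the Thai Unicode block U+0E00–U+0E7F is used as the character range, and whitespace is excluded from the count; the released pipeline may define both slightly differently.

```python
import re

# Assumed character range: the Thai Unicode block U+0E00-U+0E7F.
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")

def keep_thai_document(text: str, min_ratio: float = 0.5) -> bool:
    """Pass a document only if at least `min_ratio` of its
    non-whitespace characters are Thai Unicode characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    n_thai = sum(1 for c in chars if THAI_CHAR.match(c))
    return n_thai / len(chars) >= min_ratio
```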
4.3 Quality Filters

As a core step in our pipeline, we filter out low-quality data. To do this, we follow the approach used by Dolma, applying quality filtering rules such as C4 (Raffel et al., 2023) and Gopher (Rae et al., 2021). Moreover, we customize these rules to better account for the unique characteristics of the Thai language and to integrate insights from our observations.

C4: We adopt the C4 rules from Dolma, flagging text that:

• Contains curly braces (e.g., { or })
• Includes the placeholder text "lorem ipsum"
• Contains JavaScript code or references
• Includes offensive or inappropriate language (using a filter we built specifically for Thai)
• Has lines that lack ending punctuation
• Contains lines with fewer than three words
To detect offensive content, we employ a custom Thai lexicon. Because Thai text is not delimited by spaces, we segment it using the nlpO3 tokenizer (Suntorntip et al., 2024) from PyThaiNLP (Phatthiyaphaibun et al., 2023). We also implement a rule to remove Unicode replacement characters, as these stand for unknown or unrepresentable characters. Although we retain the remaining original Dolma rules, we omit the c4_no_punc rule, since Thai sentences do not conventionally end with punctuation such as a period.
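As an illustration, the lexicon check reduces to tokenize-then-intersect. The sketch below uses PyThaiNLP's default "newmm" engine so it runs self-contained (the paper uses the nlpO3 tokenizer), and the lexicon itself is a placeholder, since the paper's word list is not reproduced here.

```python
from pythainlp.tokenize import word_tokenize

# Placeholder: populate with the custom Thai offensive-word lexicon.
OFFENSIVE_LEXICON: set = set()

def contains_offensive_thai(text: str) -> bool:
    # Thai is written without spaces between words, so a plain
    # substring or split() check would misfire; segment first.
    tokens = word_tokenize(text, engine="newmm")
    return any(tok in OFFENSIVE_LEXICON for tok in tokens)
```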
Gopher: The Dolma pipeline uses the Gopher rules as part of its quality filtering process for English text. However, based on our analysis of the Common Crawl-based data, we modified some of these rules and adjusted their corresponding thresholds to make them suitable for Thai (a simplified sketch of the resulting checks follows the list below).
• First, we raise the minimum document length from 50 to 200 words and cap it at 100,000 words, since Thai sub-200-word texts in our Common Crawl sample of 200,000 examples were predominantly low quality or sourced from pornographic and gambling sites.

• Second, we exclude any document in which Thai consonants account for less than 80% of all characters, reflecting a bespoke characteristic of the Thai language.

• Next, we discard texts where over 30% of lines end with an ellipsis ("..." or three dots). Ellipses are often used to mark the end of incomplete snippets of an article or excerpt.

• We also introduce a new rule to filter out any document containing markers of truncated content in Thai, e.g., "continue reading", "read more", or "read more at", and we change the list of stop words to Thai.
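The following sketch combines these Thai-adjusted checks into one predicate. It is illustrative only: the consonant range, the whitespace handling, and the two Thai truncation markers ("อ่านต่อ" and "อ่านเพิ่มเติม", roughly "continue reading"/"read more") are our assumptions, and words is expected to be the tokenized word list produced by the tokenizer discussed just below.

```python
import re

# Assumed Thai consonant range: U+0E01 (ko kai) - U+0E2E (ho nokhuk).
THAI_CONSONANT = re.compile(r"[\u0E01-\u0E2E]")
# Illustrative truncation markers; the pipeline's actual list may differ.
TRUNCATION_MARKERS = ("อ่านต่อ", "อ่านเพิ่มเติม")

def passes_thai_gopher_rules(text: str, words: list) -> bool:
    # Rule 1: document length between 200 and 100,000 words.
    if not (200 <= len(words) <= 100_000):
        return False
    # Rule 2: Thai consonants must make up at least 80% of characters
    # (whitespace is excluded from the count as an assumption).
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    if sum(1 for c in chars if THAI_CONSONANT.match(c)) / len(chars) < 0.8:
        return False
    # Rule 3: discard if over 30% of lines end with an ellipsis.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines:
        n_ellipsis = sum(1 for ln in lines
                         if ln.rstrip().endswith(("...", "…")))
        if n_ellipsis / len(lines) > 0.3:
            return False
    # Rule 4: discard documents carrying truncated-content markers.
    return not any(m in text for m in TRUNCATION_MARKERS)
```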
To justify the higher minimum document length threshold, we compared WangchanBART (airesearch/wangchanbart-base) perplexity statistics at lower bounds of 50 versus 171 words and found both mean and median perplexities to be lower at the higher threshold; therefore, we set the bound at 200 words. Finally, to speed up processing, we employ the ICU tokenizer rather than the slower nlpO3 tokenizer.
4.4 Deduplication

Deduplication by URL. A common text preprocessing step is the removal of duplicate content. For this, we utilize the deduplication technique from the Dolma pipeline (Soldaini et al., 2024), which uses a Bloom filter to identify duplicate URLs. We found this method to be effective on our Thai web dataset and therefore kept this step of the pipeline unmodified.

Deduplication on text overlap. We perform document-level deduplication using the Dolma Bloom filter. Dolma's paragraph-level deduplication is ineffective for Thai web data, since the UTF-8 newline (\n) is not always a good indicator of a paragraph boundary in Thai text. Therefore, we only apply deduplication at the document level.
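For intuition, a Bloom filter answers "have I probably seen this before?" in constant memory, which is what makes URL-level dedup cheap at corpus scale. The toy sketch below is ours for illustration; the actual pipeline uses Dolma's own Bloom-filter deduper, not this code.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item over a fixed bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()

def is_duplicate_url(url: str) -> bool:
    """First occurrence passes; later occurrences are flagged."""
    if url in seen:
        return True
    seen.add(url)
    return False
```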
4.5 Content Filters

Following Dolma's approach, we use FastText for language-specific filtering due to its excellent speed-accuracy balance for large-scale data cleaning. We employ pre-trained FastText vectors for 157 languages (Grave et al., 2018) to train filters for adult and gambling content. To train our two content filters, we create dedicated datasets for binary classifiers. We label documents as belonging to a specific class if they contain three or more distinct words from a predefined list for that class. For gambling content, we also include documents that an LLM identified as promoting gambling. Finally, we randomly sample a subset of the data for human validation, ensuring the quality of our content filtering datasets. Moreover, we changed the phone-number rule in the Personally Identifiable Information (PII) filter to ensure compatibility with Thai phone numbers.
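A compressed sketch of this weak-labeling plus classifier-training recipe is shown below, using the fasttext Python package. The lexicon, file names, and training hyperparameters are placeholders of ours; only the "three or more distinct lexicon words" rule and the use of the 157-language pre-trained vectors come from the text.

```python
import fasttext

# Placeholder seed lexicon for one class (e.g., gambling).
CLASS_LEXICON: set = set()

def weak_label(tokens: list) -> str:
    """Paper's rule: positive if the document contains three or more
    distinct words from the class's predefined word list."""
    hits = len(set(tokens) & CLASS_LEXICON)
    return "__label__gambling" if hits >= 3 else "__label__clean"

# Training file lines look like: "__label__gambling <document text>".
model = fasttext.train_supervised(
    input="gambling_train.txt",          # assumed file name
    pretrainedVectors="cc.th.300.vec",   # Thai vectors (Grave et al., 2018)
    dim=300,                             # must match the vector dimension
    epoch=5,                             # assumed hyperparameters
    lr=0.5,
)
labels, probs = model.predict("ตัวอย่างข้อความภาษาไทย")  # "example Thai text"
```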
5 Experimental Settings

5.1 CPT and SFT Details

To assess the effectiveness of our pre-training data (47.4B tokens, as shown in Section 3), we use it for continual pre-training (CPT) of SEA-LIONv3-Llama-8B-instruction (Ng et al., 2025). We use the following training setup:

• max_seq_len: 8192
• learning rate: 5.0e-6
• optimizer: decoupled_lionw
• lr_scheduler_type: cosine_with_warmup
• num_train_epochs: 1
• GPU: H100 (64 GPUs)
• Time: 1d 12h 24m

After CPT, we conduct supervised fine-tuning (SFT) using QLoRA (Dettmers et al., 2023) on the Thai instruction dataset Wangchan-FLAN-6M (https://huggingface.co/datasets/airesearch/WangchanX-FLAN-v6.1). We call the resulting model WangchanLION-V3-8B. For comparison, we apply the same SFT setting to Llama 3.1 8B base, SEA-LION-V3-8B-base, and Typhoon 2 8B base. For consistency, all models have 8 billion parameters and use Llama-3.1-base as the original base model.
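For orientation, here is what a QLoRA SFT setup typically looks like with the Hugging Face transformers/peft/bitsandbytes stack. This is our illustration: the model path is a placeholder for the post-CPT checkpoint, and the LoRA rank/alpha/dropout values are assumptions, as the paper does not specify them.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: quantize the frozen base model to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "path/to/cpt-checkpoint",        # placeholder: the post-CPT model
    quantization_config=bnb_config,
)

# Train small low-rank adapters on top of the quantized weights.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```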
5.2 Evaluation Benchmark

To evaluate our trained models, including both Llama and GPT-2, we use two benchmarks:

• Thai LLM Benchmark (10X et al., 2024) is a benchmark suite for Southeast Asian languages (Lovenia et al., 2024). It can evaluate LLMs on NLG and NLU tasks. For GPT-2, we use the NLG tasks only.

• SEA-HELM (Susanto et al., 2025) is an LLM benchmark designed to evaluate SEA languages, including Thai. We remove NLR from the benchmark because it uses a low-quality machine-translated dataset, XNLI (Conneau et al., 2018), whose results are unreliable (Agrawal et al., 2024; Singh et al., 2025).

These benchmarks allow us to evaluate the base and instruction models. For the evaluation of GPT-2, we opted not to include metrics related to NLU, MT-Bench, and safety. This decision is based on the observation that the scores for GPT-2 were significantly lower than the established baseline, resulting in normalized scores of zero. The primary reason lies in the limited capacity of the 124-million-parameter GPT-2 model, which is insufficient for effectively processing complex prompts for the NLU tasks. Consequently, our GPT-2 evaluation focuses exclusively on the NLG and Instruction-Following Evaluation (IF) tasks.

6 Experimental Results

6.1 Data Ablation Study

In this study, we explore the effectiveness of our data pipeline using GPT-2's downstream task performance as the indicator. In particular, we apply our pipeline to unclean data (Common Crawl) and cleaned data (FineWeb2). Our data cleaning pipeline should yield improvements on both, since we designed it specifically for Thai.

6.1.1 Ablation Design

We describe the setup of the data ablation studies on Common Crawl and FineWeb2 as follows:

Common Crawl. We train a GPT-2 model on each of five dataset variations, each containing 10 billion tokens. Each dataset variation builds upon the previous one by incrementally adding a new cleaning component, allowing us to measure each step's impact.

• Baseline: The raw Thai Common Crawl corpus.
• + Language Identification: Baseline with only Thai-language documents retained.
• + Quality Filters: Adds the quality filtering from Section 4.3.
• + Deduplication: Adds URL and text deduplication.
• + Content Filters: Adds filters for adult content, gambling content, and PII, as proposed in Section 4.5.

FineWeb2. We aim to assess the effectiveness of our data cleaning pipeline on already-cleaned data. To do this, we compare the performance of a model trained on a clean baseline dataset with the performance on the same data after our pipeline has further processed it. We train a separate GPT-2 model on 10B tokens from each of the two versions of the FineWeb2 dataset, as follows:

• FineWeb2: The original data from FineWeb2, already cleaned using their pipeline.
• FineWeb2 + our pipeline: We perform a second cleaning with our data cleaning pipeline.

6.1.2 Before and after applying our pipeline

As shown in Table 4, our pipeline cleans and substantially reduces the size of each dataset. For Common Crawl, we reduce the dataset from 202 million documents to 25.1 million, resulting in a cleaner dataset than the raw source. Similarly, we reduce the size of FineWeb2 by half. We found that most of the removed text was filtered by our C4, Gopher, and gambling-related content filters; these samples were low quality, and removing them did not hurt downstream task performance. In the following, we discuss the effect of our pipeline when applied to both datasets on downstream tasks.
6.1.3 Using our data cleaning pipeline with unclean data (Common Crawl)

Table 4 shows the effect of each filtering step: selecting Thai-language documents removed about 27 million documents, applying quality filters removed 139.4 million, deduplicating based on URL and text overlap removed 9.6 million, and filtering for inappropriate content (e.g., adult material, gambling, PII) removed 0.9 million.

As shown in Table 3, the Common Crawl data that has gone through all of our cleaning components yields the best results on both SEA-HELM and the Thai LLM Benchmark. When incrementally adding each new component, we generally observe an improvement on both benchmarks, except for the deduplication step, where we see a drop in the average SEA-HELM score. However, the score increases and surpasses the dip after adding the content filters, yielding the best performance. This implies that our cleaning pipeline can filter out low-quality data from an uncleaned dataset, resulting in better performance on downstream tasks.

Training Data | NLG | IF | Avg. | ENG->TH | TH->ENG | XLSum | iApp | Avg.
Common Crawl baseline | 3.09 | 11.00 | 7.04 | 0.13 | 0.17 | 6.98 | 1.05 | 2.08
+ Language Identification | 4.43 | 14.00 | 9.21 | 0.08 | 0.21 | 6.92 | 1.08 | 2.07
+ Quality Filters | 12.29 | 23.00 | 17.64 | 0.09 | 0.24 | 8.11 | 1.54 | 2.50
+ Deduplication | 6.59 | 18.00 | 12.29 | 0.09 | 0.21 | 8.37 | 1.43 | 2.52
+ Content Filters | 10.60 | 25.00 | 17.80 | 0.07 | 0.21 | 8.66 | 1.17 | 2.53
FineWeb2 | 9.13 | 13.56 | 11.34 | 0.08 | 0.13 | 8.71 | 1.28 | 2.55
FineWeb2 + our pipeline | 15.37 | 16.00 | 15.68 | 0.07 | 0.18 | 8.33 | 1.37 | 2.49

Table 3: SEA-HELM (NLG, IF, Avg.) and Thai LLM Benchmark (ENG->TH, TH->ENG, XLSum, iApp, Avg.) evaluation results for GPT-2.

Step | Common Crawl | FineWeb2
Baseline | 231Bt / 202Md | 51.4Bt / 35.9Md
Language Identification | 206Bt / 175Md | 51.3Bt / 35.9Md
Quality Filters | 54.6Bt / 35.6Md | 30.4Bt / 19.1Md
Deduplication | 40.7Bt / 26.0Md | 27.8Bt / 17.9Md
Content Filters | 38.6Bt / 25.1Md | 26.0Bt / 17.1Md

Table 4: Dataset size after each step of our pipeline (Bt is billions of tokens and Md is millions of documents).
6.1.4 Using our data cleaning pipeline with cleaned data (FineWeb2)

Next, we ask whether our data cleaning pipeline can improve data that has already been cleaned, like FineWeb2. To answer this question, we compare the original FineWeb2 with FineWeb2 processed by our pipeline. Our findings indicate that the original FineWeb2 performs slightly better overall than the version further cleaned with our pipeline on the Thai LLM Benchmark, as shown in Table 3. However, FineWeb2 processed with our pipeline substantially outperforms the original FineWeb2 on average in the SEA-HELM evaluation, also shown in Table 3. Moreover, we found that, before applying our data pipeline, FineWeb2 contained unwanted text, e.g., low-quality content, duplicated URLs and overlapping text, adult content, gambling content, and PII. By further processing it with our pipeline, we significantly reduce its size from 35.9 million documents to 17.1 million, as shown in Table 4. This implies that our pipeline can improve data quality by filtering low-quality samples from already-cleaned data; it also suggests that FineWeb2's original cleaning pipeline may not be fully suitable for Thai text.
6.2 Downstream task results

We assess the effectiveness of our training data by comparing the model trained on it with models trained on other pre-training corpora. We hypothesize that the model trained on cleaner, higher-quality data should improve more than the others.

SEA-HELM. As shown in Table 5, we evaluate all the base models using the same SFT data, as mentioned in Section 5.1. Our model achieves the best overall performance among the four models. In particular, we outperform our base model, SEA-LION V3 8B, in most cases. Furthermore, we perform a more detailed analysis on the chat benchmark, MT-Bench, in Figure 1. We found that, when using our base model, performance in the Knowledge III (cultural evaluation) category is higher than that of the other models. This emphasizes the importance of our data, which also yields improvements in Thai cultural knowledge. In addition, we gain improvements in the Roleplay and Reasoning categories.

Task | Llama 3.1 8B | SEA-LION V3 8B | Typhoon 2 8B | WangchanLION-V3-8B
NLU | 46.59 | 54.35 | 60.55 | 62.33
Safety | 2.47 | 17.21 | 23.74 | 6.62
NLG | 54.09 | 51.01 | 51.69 | 54.79
IF | 31.00 | 44.00 | 38.00 | 52.00
MT-Bench | 23.58 | 30.00 | 33.33 | 41.36
Avg. | 31.54 | 39.31 | 41.46 | 43.42

Table 5: SEA-HELM performance comparison between existing base models and the WangchanLION-V3-8B model when utilizing the same SFT dataset.

Figure 1: Thai MT-Bench results (radar plot over the Coding, Knowledge III, Social Science, Math, Extraction, Reasoning, STEM, Writing, and Roleplay categories) for four instruction models trained on the same data: Llama 3.1 8B, SEA-LION V3 8B, WangchanLION-V3-8B, and Typhoon V2 8B.
Thai LLM Leaderboard. We also conducted an experiment using a benchmark formulated specifically for Thai contexts, namely the Thai LLM Leaderboard. As shown in Table 6, our model outperforms the other models on the NLG tasks but performs lower on the NLU tasks. This is because we added extensive Thai pre-training corpora, yielding more Thai fluency and knowledge and thus improving NLG. In contrast, some world knowledge fades from our model, resulting in lower NLU performance (the majority of the NLU datasets are not created in Thai contexts, unlike the NLG datasets). This is consistent with the results on the chat dataset, MT-Bench: the WangchanLION-V3-8B model is more fluent in Thai but falls behind on the world-knowledge subset (STEM).

Task | Subset | Llama 3.1 8B | SEA-LION V3 8B | Typhoon 2 8B | WangchanLION-V3-8B
NLU | Wisesight | 34.82 | 52.30 | 62.11 | 52.64
NLU | XCOPA | 69.40 | 78.20 | 81.00 | 71.40
NLU | Belebele | 54.67 | 63.44 | 65.44 | 58.56
NLU | Avg. | 52.96 | 64.64 | 69.51 | 60.86
NLG | XLSum | 48.18 | 48.88 | 54.33 | 57.01
NLG | iApp | 74.45 | 79.81 | 80.15 | 80.99
NLG | ENG->TH | 21.15 | 21.54 | 26.09 | 29.00
NLG | TH->ENG | 49.23 | 51.20 | 51.94 | 52.34
NLG | Avg. | 48.25 | 50.36 | 53.13 | 54.84

Table 6: Thai LLM Leaderboard performance comparison between existing base models and the WangchanLION-V3-8B model when utilizing the same SFT dataset.

7 Conclusion and Outlook

We propose Mangosteen, an openly released Thai pre-training data pipeline and corpus that closes the transparency gap in Thai continual pre-training (CPT) models. Our Thai-specific filters shrink raw Common Crawl data from 202 million to 25 million documents and raise SEA-HELM NLG from about 3 to roughly 11 points. An 8-billion-parameter model trained on the resulting 47-billion-token corpus surpasses SEA-LION-v3 and Llama-3.1 on Thai benchmarks by about two points. To ensure full reproducibility, we release the pipeline code, the cleaned corpus, and all CPT and SFT checkpoints.
Open-data expansion. Mangosteen currently includes only public-domain or explicitly Creative Commons licensed text. We chose this approach to respect the legal rights of content owners. We are working with agencies and rights-holders to unlock more than 10 billion additional tokens under permissive licenses.

Pre-training Cost and Shared Knowledge. Computational expense is still the main obstacle to studying the impact of data pipeline design. For this project, we secured 2,000 GPU-hours on H100 hardware, giving us effectively one full trial. Until reliable low-compute pre-training methods emerge, maximizing the public value of every costly run through complete knowledge sharing is essential to the collective progress of this field. With this limited experimentation budget, our model attains strong overall scores (approximately 2 points above recent Thai baselines), confirming the effectiveness of Thai-aware curation. The central contribution, however, is not the lead itself but the openly released pipeline, corpus, and checkpoints that let others build on and quickly surpass these results.

Acknowledgement

This research is supported by the National Research Foundation, Singapore, under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
References

SCB 10X, VISTEC, and SEACrowd. 2024. Thai LLM Leaderboard.

Ashish Agrawal, Barah Fazili, and Preethi Jyothi. 2024. Translation errors significantly impact low-resource languages in cross-lingual learning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 319–329, St. Julian's, Malta. Association for Computational Linguistics.

Giuseppe Attardi. 2015. WikiExtractor. https://github.com/attardi/wikiextractor.

Adrien Barbaresi. 2021. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–131. Association for Computational Linguistics.

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, and 16 others. 2025. An expanded massive multilingual dataset for high-performance language technologies. Preprint, arXiv:2503.10267.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, and Urmish Thakker. 2024. SambaLingo: Teaching large language models new languages. Preprint, arXiv:2404.05829.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Julen Etxaniz, Oscar Sainz, Naiara Miguel, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, and Aitor Soroa. 2024. Latxa: An open language model and evaluation suite for Basque. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14952–14972, Bangkok, Thailand. Association for Computational Linguistics.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The Llama 3 herd of models. Preprint, arXiv:2407.21783.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, and Mitesh M. Khapra. 2024. IndicLLMSuite: A blueprint for creating pre-training and fine-tuning datasets for Indian languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15831–15879, Bangkok, Thailand. Association for Computational Linguistics.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, and 40 others. 2025. DataComp-LM: In search of the next generation of training sets for language models. Preprint, arXiv:2406.11794.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, and 42 others. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics.
Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, and 2 others. 2023. FinGPT: Large generative models for a small language. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2710–2726, Singapore. Association for Computational Linguistics.
Chenghao Mou, Chris Ha, Kenneth Enevoldsen, and Peiyuan Liu. 2023. Chenghaomou/text-dedup: Reference snapshot.
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, and 12 others. 2025. SEA-LION: Southeast Asian languages in one network. Preprint, arXiv:2504.05747.
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. Preprint, arXiv:2309.09400.
Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024a. The FineWeb datasets: Decanting the web for the finest text data at scale. Preprint, arXiv:2406.17557.
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Martin Jaggi, Leandro von Werra, and Thomas Wolf. 2024b. FineWeb2: A sparkling update with 1000s of languages.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. Preprint, arXiv:2306.01116.
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai natural language processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore. Association for Computational Linguistics.
Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, and Kasima Tharnpipitchai. 2023. Typhoon: Thai large language models. Preprint, arXiv:2312.13951.
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, and Kasima Tharnpipitchai. 2024. Typhoon 2: A family of open text and multimodal Thai large language models. Preprint, arXiv:2412.13702.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, and 61 others. 2021. Scaling language models: Methods, analysis & insights from training Gopher. ArXiv, abs/2112.11446.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint, arXiv:1910.10683.
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, and 5 others. 2025. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. Preprint, arXiv:2412.03304.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, and 17 others. 2024. Dolma: An open corpus of three trillion tokens for language model pretraining research. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15725–15788, Bangkok, Thailand. Association for Computational Linguistics.
Thanathip Suntorntip, Arthit Suriyawongkul, and Wannaphong Phatthiyaphaibun. 2024. nlpo3.
Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. SEA-HELM: Southeast Asian holistic evaluation of language models. Preprint, arXiv:2502.14301.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Sumeth Yuenyong, Kobkrit Viriyayudhakorn, Apivadee Piyatumrong, and Jillaphat Jaroenkantasima. 2025. OpenThaiGPT 1.5: A Thai-centric open source large language model. Preprint, arXiv:2411.07238.

A GPT-2 Config
We use the following GPT-2 configuration: sequence length 1024, n_layer 12, n_head 12, n_embd 768, vocab_size 50257, learning rate 0.0006, and num_train_epochs 1.
B Language Identification
In this section, we compare three language identification libraries (langdetect, lingua, and fastText) and present our findings. The Dolma codebase supports five language identifiers: cld3, pycld2, langdetect, Lingua, and fastText. While fastText was used in the original Dolma implementation, its effectiveness on Thai text remains uncertain. We therefore focus our evaluation on langdetect, lingua, and fastText, excluding cld3 and pycld2 due to their outdated models and limited adoption.
We conducted our experiments on a subsample of the raw Thai Common Crawl dataset, consisting of the first 200,000 documents. Each of the three language identifiers was run separately on this subsample to generate a language confidence score for each document. We used a threshold of 0.5, the same cut-off used in the Dolma English corpus, to determine language inclusion. The number of documents retained by each language identifier under this threshold is shown in Table 7.

Language Identifier | Documents Remaining
Lingua | 177,221
Langdetect | 181,872
FastText | 187,772

Table 7: Number of documents remaining after applying a 0.5 threshold.
In our analysis, we observed that lingua excluded the highest number of documents of the three identifiers. Regarding processing time, langdetect was by far the slowest, taking approximately 40 minutes to process all 200,000 documents, whereas lingua and fastText each completed in roughly 3 minutes.
Subsequently, we calculated the proportions of Thai characters, vowels, and intonation marks in the documents retained by each identifier and compared their distributions. As illustrated in Figure 2, the majority of documents exhibit a high proportion of Thai characters, vowels, and intonation marks, indicating that the language identifiers are generally effective at detecting Thai text. However, fastText flags a larger number of documents as Thai even when those documents contain a relatively low proportion of Thai-specific characters (less than 40%). This suggests that fastText is more permissive, or less precise, in its classification than Lingua and Langdetect.
To further evaluate fastText's performance, we isolated the subset of documents in which the combined proportion of Thai characters, vowels, and intonation marks was at most 40%, and analyzed the distribution of the language confidence scores fastText assigned to them. This analysis assesses the reliability of its language identification when Thai-specific character usage is minimal. As illustrated in Figure 3, fastText assigns high confidence scores, often approaching 100%, to documents whose proportion of Thai characters, vowels, and intonation marks does not exceed 40%. This indicates a tendency toward overconfidence, with potential misclassification of documents containing minimal Thai-specific content. Notably, 8,008 documents (4% of the 200,000-document sample) were identified by fastText as Thai with confidence scores above 0.5 despite containing 40% or fewer Thai-specific characters.
Given these results, Lingua might seem to be the only language identifier good enough for Thai. However, considering Thai's distinctive script, we concluded that relying on the Lingua language identifier may not be optimal.
Figure 2: Comparison of the distribution of Thai character ratios for three language identifiers: Lingua, Langdetect, and FastText. Notably, FastText assigns a language score of 0.5 or higher to a portion of documents whose Thai character ratio is only between 10% and 40%, indicating potential misclassification.
Figure 3: Scatter plot showing that even when the fastText score is high, the proportion of Thai characters can be very low. The overall trend deviates from what would be expected in an ideal language classification scenario.
Figure 4: Comparison between Thai character ratio and fastText language scores across documents, highlighting inconsistencies in the confidence scores relative to Thai character content.

Therefore, we developed the ThaiCharRatioTagger, a custom language identifier that calculates the percentage of Thai characters, vowels, and intonation marks within a document. This approach simplifies the codebase and offers a more straightforward alternative to complex machine learning models.
C Gopher Rules
As stated in Section 3.1.3, the original English Dolma corpus relies heavily on the Gopher Rules, using all of them. In this work, we aim to follow the same methodology as the original while adjusting the thresholds of certain rules to better suit the Thai language. The original Gopher Rules used in the English Dolma corpus consist of 11 rules:
• Fraction of characters in the most common n-gram exceeds a threshold
• Fraction of characters in duplicate n-grams exceeds a threshold
• Contains fewer than 50 or more than 100K words
• Median word length is less than 3 or greater than 10
• Symbol-to-word ratio exceeds 0.10
• Fraction of words containing alphabetic characters is less than 0.80
• Contains fewer than 2 of a predefined set of required words
• Fraction of lines in the document starting with a bullet point exceeds 0.90
• Fraction of lines in the document ending with an ellipsis exceeds 0.30
• Fraction of lines in the document that are duplicated exceeds 0.30
• Fraction of characters in duplicated lines exceeds 0.30

We begin by examining the correlations between the rules. We first use the Dolma pipeline to obtain scores on a 200,000-document subsample. Specifically, instead of tokenizing by spaces, we use the ICU tokenizer to obtain a list of tokens for each document. We use the set of required words provided by Stopwords ISO (https://github.com/stopwords-iso/stopwords-th), which essentially corresponds to a list of Thai stopwords. In addition, we divide each document into sentences using the delimiter \n+.
In the Dolma work, the authors noted that stacking quality filtering, content filtering, and deduplication has a positive compounding effect, since these steps overlap very little in the texts they remove. Our study reaches a similar conclusion for the Gopher rules. As shown in Figure 5 (Spearman correlation heatmap) and Figure 6 (Pearson correlation heatmap), we observe only low to moderate correlations between most rules.
Some exceptions exist; for example, the "median word length" rule and the "fraction of words with alpha characters" rule are somewhat correlated. This is expected, as a higher median word length generally implies a higher number of alphabetic characters. A similar relationship is observed between the "word count" rule and the "required word count" rule. Based on this analysis, we choose to adjust the threshold of each rule individually. Note that for the rules "fraction of characters in most common n-gram" and "fraction of characters in duplicate n-grams," we use the average value within each group in our analysis, as similar rules tend to exhibit high intra-group correlation.
The next question we address is which rules should be adjusted: all of them, or only a subset? To explore this, we examine the top 10 combinations of rules that remove the largest number of documents. As shown in Table 8, approximately 78% of the documents are filtered out by these rules or combinations of rules. The rules that appear most frequently in the table are: fraction of words with alpha characters, fraction of characters in duplicate n-grams, median word length, fraction of duplicate lines, and required word count. Among these, we choose to investigate the "median word length" rule further, as the other rules can be applied directly to Thai without modification.
Figure 5: Spearman correlation heatmap of Gopher Rule scores.
Figure 6: Pearson correlation heatmap of Gopher Rule scores.
Combination of Rules | Count | Percentage | Cumulative %
fraction of words with alpha character | 50,712 | 29.3318 | 29.3318
fraction of characters in duplicate n-grams, fraction of words with alpha character | 17,823 | 10.3088 | 39.6406
median word length, fraction of words with alpha character | 17,028 | 9.8490 | 49.4896
fraction of characters in duplicate n-grams, median word length, fraction of words with alpha character | 12,812 | 7.4104 | 56.9000
fraction of characters in duplicate n-grams, median word length, fraction of words with alpha character, fraction of duplicate lines, fraction of characters in duplicate lines | 9,103 | 5.2652 | 62.1652
fraction of characters in duplicate n-grams, fraction of words with alpha character, fraction of duplicate lines, fraction of characters in duplicate lines | 8,447 | 4.8857 | 67.0509
fraction of characters in duplicate n-grams, median word length, fraction of words with alpha character, fraction of duplicate lines | 7,631 | 4.4138 | 71.4647
median word length, fraction of words with alpha character, required word count | 5,268 | 3.0470 | 74.5117
fraction of characters in duplicate n-grams, fraction of words with alpha character, fraction of duplicate lines | 3,694 | 2.1366 | 76.6483
median word length, fraction of words with alpha character, fraction of duplicate lines | 2,772 | 1.6033 | 78.2516

Table 8: Top combinations of Gopher Rules by number of filtered documents.

Statistic | Value
Count | 200,000
Mean | 675.47
Standard Deviation | 1,964.60
Minimum | 0
10th Percentile | 56
30th Percentile | 131
50th Percentile (Median) | 283
70th Percentile | 600
90th Percentile | 1,556
95th Percentile | 2,384
99th Percentile | 4,823.01
Maximum | 195,713

Table 9: Descriptive statistics of word counts in the sub-document dataset, including the mean, standard deviation, median, minimum, maximum, and selected percentiles.

Although not included in the table, we also specifically examine the "word count" rule, as we believe that typical word counts differ significantly between Thai and English documents. Furthermore, we observe that low word count documents in Thai often correspond to low-quality text data. Some rules can be modified with minimal data exploration, as the two adjustments below show (sketched in code after the list):
• The first is the rule "fraction of words with alpha characters." We modify it to "fraction of words with Thai letters" by counting Thai characters instead of alphabetic characters.
• For the rule "fraction of lines in document ending with ellipsis," we additionally detect three plain dots (...) at the end of each line, as this pattern appears frequently in our corpus.
C.1 Word Count
To begin our experiment, we use the same dataset described in Appendix B, the 200,000-document subsample. Table 9 presents descriptive statistics of the word counts in this dataset, where each document is tokenized using the ICU tokenizer. The distribution of word counts is clearly right-skewed. If we apply the original Gopher cutoffs of 50 and 100,000 words, very few documents are filtered out, and we suspect that many low-quality documents would remain. Upon closer inspection through data binning (Table 10), we find that the interval from 21 to 170 words contains a high proportion of documents, approximately 35% of the subsample, and we suspect this interval holds a disproportionately large amount of low-quality data.
To confirm our assumption, we conduct an experiment using perplexity scores to assess data quality, computed with the airesearch/wangchanbart-base model. Table 11 presents the perplexity statistics under four settings. In the Baseline setting, no filtering is applied; the perplexity scores cover the entire 200,000-document subsample. In the Gopher Rules setting, documents are filtered using the original Gopher cutoffs of 50 and 100,000 words. The Experiment setting raises the lower cutoff from 50 to 171 in order to filter out all low word count documents, which we suspect may be of lower quality. Finally, the Short Text setting isolates documents with word counts between 21 and 170; notably, the perplexity scores in this group are nearly as high as those in the baseline, supporting our hypothesis that short documents in this range are likely to be of lower quality.
From these statistics, we observe that the Experiment setting yields the lowest mean and median perplexity scores, suggesting that removing very short documents improves overall data quality. To further support this conclusion, we report the Kruskal–Wallis H-test statistics and corresponding p-values in Table 12. In all cases, the p-values indicate a statistically significant difference, leading us to reject the null hypothesis that the medians across settings are equal.
Figure 7: Histograms of word counts in the sub-document dataset. The left plot shows the distribution using automatic binning for finer granularity, while the right plot uses a fixed bin size of 50 to provide a more uniform view.
Figure 8: Trends in the mean and median perplexity scores across the experimental settings (Baseline, Gopher rules, Experiment, Short text).

Bin | Frequency | Cumulative Frequency | Proportion (%) | Cumulative Proportion (%)
[0.0000, 21.3056] | 2,762 | 2,762 | 1.3810 | 1.3810
[21.3056, 42.6111] | 11,137 | 13,899 | 5.5685 | 6.9495
[42.6111, 63.9167] | 10,778 | 24,677 | 5.3890 | 12.3385
[63.9167, 85.2223] | 14,169 | 38,846 | 7.0845 | 19.4230
[85.2223, 106.5279] | 11,247 | 50,093 | 5.6235 | 25.0465
[106.5279, 127.8334] | 8,541 | 58,634 | 4.2705 | 29.3170
[127.8334, 149.1390] | 7,834 | 66,468 | 3.9170 | 33.2340
[149.1390, 170.4446] | 6,929 | 73,397 | 3.4645 | 36.6985
[170.4446, 191.7502] | 5,854 | 79,251 | 2.9270 | 39.6255
[191.7502, 213.0557] | 5,426 | 84,677 | 2.7130 | 42.3385
[213.0557, 234.3613] | 5,083 | 89,760 | 2.5415 | 44.8800
[234.3613, 255.6669] | 4,680 | 94,440 | 2.3400 | 47.2200
[255.6669, 276.9725] | 4,286 | 98,726 | 2.1430 | 49.3630
[276.9725, 298.2780] | 4,169 | 102,895 | 2.0845 | 51.4475
[298.2780, 319.5836] | 3,622 | 106,517 | 1.8110 | 53.2585
[319.5836, 340.8892] | 3,400 | 109,917 | 1.7000 | 54.9585
[340.8892, 362.1948] | 3,639 | 113,556 | 1.8195 | 56.7780
[362.1948, 383.5003] | 3,117 | 116,673 | 1.5585 | 58.3365
[383.5003, 404.8059] | 2,939 | 119,612 | 1.4695 | 59.8060
[404.8059, 426.1115] | 2,598 | 122,210 | 1.2990 | 61.1050
[426.1115, 447.4170] | 2,432 | 124,642 | 1.2160 | 62.3210
[447.4170, 468.7226] | 2,186 | 126,828 | 1.0930 | 63.4140
[468.7226, 490.0282] | 2,288 | 129,116 | 1.1440 | 64.5580
[490.0282, 511.3338] | 2,114 | 131,230 | 1.0570 | 65.6150
[511.3338, 532.6393] | 2,169 | 133,399 | 1.0845 | 66.6995
[532.6393, 553.9449] | 2,082 | 135,481 | 1.0410 | 67.7405
[553.9449, 575.2505] | 2,034 | 137,515 | 1.0170 | 68.7575
[575.2505, 596.5561] | 2,113 | 139,628 | 1.0565 | 69.8140
[596.5561, 617.8616] | 1,843 | 141,471 | 0.9215 | 70.7355
[617.8616, 639.1672] | 1,803 | 143,274 | 0.9015 | 71.6370
Table 10: Binning analysis of word counts in the sub-document dataset. A notably high proportion of documents fall within the 21–170 word range. Only the first 30 bins are shown for display purposes.

Statistic | Baseline | Gopher Rules | Experiment | Short Text
Count | 200,000 | 182,976 | 126,596 | 70,635
Mean | 1143.80 | 1101.69 | 1074.90 | 1137.07
Standard Deviation | 4414.26 | 2162.41 | 2359.05 | 1583.46
Minimum | 1.37 | 1.37 | 1.37 | 5.37
10th Percentile | 176.05 | 171.94 | 169.55 | 187.32
30th Percentile | 402.48 | 397.45 | 390.92 | 416.09
50th Percentile (Median) | 710.95 | 706.76 | 693.37 | 724.38
70th Percentile | 1136.51 | 1126.69 | 1081.74 | 1240.15
90th Percentile | 2163.38 | 2135.86 | 1985.92 | 2372.85
95th Percentile | 3153.02 | 3096.17 | 2922.14 | 3298.03
99th Percentile | 7660.35 | 7435.88 | 8193.99 | 5539.36
Maximum | 814,671.56 | 539,081.63 | 539,081.63 | 51,177.91

Table 11: Perplexity statistics across four experimental settings. Applying the Gopher Rules results in a noticeable decrease in perplexity scores, with a further reduction when the lower bound is increased in our experiment. The "Short Text" setting is included to illustrate that, despite comprising roughly 35% of the data, it yields perplexity scores nearly as high as the baseline, indicating lower quality.

Settings | H-statistic | p-value
Baseline vs Experiment | 135.063 | 0.00
Gopher rules vs Experiment | 73.337 | 0.00
Baseline vs Gopher rules vs Experiment | 138.922 | 0.00

Table 12: Kruskal–Wallis H-test results for perplexity scores across different experimental settings. The p-values for all comparisons are 0.00, indicating statistically significant differences in the population medians and rejecting the null hypothesis that the medians are equal.
Word Count | Text
13 | "Amornrat ฟรีแลนซมืออาชีพ โดนจางแลว 2 ครั้ง | Fastwork.co"
6 | "ตรวจสอบขอมูล ตรวจสอบขอมูล"
18 | "รูปสินคา หนุมไนซหัวใจสุดแซบ : ชุด Hot Girl สวยแสบซาส #1"
5 | "www.lampangsporttime.com คลิกเขาเว็บไซตฺ"
4 | "ศูนยชวยเหลือ Workplace"
9 | "ไอเหนอ หัวใจใหญ (wasin_nildee) on Pinterest"
11 | "Toyota Hilux Revo 2.8MT ป 2019 รถตอนเดียวมือสอง"
17 | "งบการเงิน ไตรมาสที่ 3/2565 (สอบทานแลว) | PTT Global Chemical"
20 | "พัฒนาโดย นายวีระยุทธ ประชุมรักษ พนักงานคอมพิวเตอร ประจําศาลแขวงอุบลราชธานีหากโปรแกรมมีปญหา กรุณาโทร 0876188331"
4 | "ตลาดหลักทรัพยแหงประเทศไทย -"
28 | "Login เขาสูระบบ Pantipmarket ประกาศนี้สามารถเลื่อนวันหมดอายุไดโดยสมาชิกเทานั้น กรุณา Login สมาชิกเพื่อเลื่อนวันหมดอายุ Username * Password *"
38 | "ตั๋วเครื่องบิน กิจกรรม TH JP EN CN TW KR TH プライバシーポリシー ・ 利用規約 に同意の上、ボタンを押してください。 ログイン (無料)するとより便利に利用できます"
34 | "สํารวจ บทความ รีวิวปายยา ขอมูลโปรดักส ซิสคอลแลปส โปรไฟล SistaCafe เขาสูระบบ สํารวจ บทความ Original Content ซิสปายยา ขอมูลโปรดักส ซิสคอลแลปส"
36 | "โคตรสวยของดี สด ซิงจมมิดควยงามจัดขอเย็ดบางเถอะละเลงลิ้น สิงหาคม 21, 2017 โคตรสวยของดี สด ซิงจมมิดควยงามจัดขอเย็ดบางเถอะละเลงลิ้น"
40 | "WALAI AutoLib is library automation system produced by Informatic Innovation Research Unit (IIRU), Walailak University. Starting from date 14 Dec 2561 2019 © WALAI AutoLib. ALL Rights Reserved. Privacy Policy | Terms of Service"
35 | "| BANANA GAS PLUS āļĻāļđāļāļĒāđāļāļīāļāļāļąāđāļāđāļāđāļŠāļĢāļāļĒāļāļāđ LPG āđāļĨāļ° NGV āđāļāļĢāļ°āļāļąāļāļĄāļēāļāļĢāļāļēāļ āđāļāļĒāļāđāļēāļāļāļĩāđāļĄāļĩāļāļ§āļēāļĄāļāļģāļāļēāļāļāđāļēāļāļāļēāļĢāļĢāļąāļāļĢāļāļ āļāļāļĒāļāļ§āļĨāļāļąāļ 44 āđāļĨāļĩāļĒāļāļāļēāļāļāđāļ§āļāļĢāļēāļĄāļāļīāļāļāļĢāļē āļĄāļ·āļāļāļ·āļ āļāļīāļāļāđāļ āļāļļāļāđāļāļāļĢāđ āļĄāļ·āļāļāļ·āļ 086-323-5305,086-308-6869"
42 | "9 The Green mile ปาฏิหาริยแดนประหาร The Green mile ปาฏิหาริยแดนประหาร IMDb: 9 189 นาที min 72 views The Green Mile : ปาฏิหาริยแดนประหาร หนังเรื่องนี้เปนเ […] Crim อาชญากรรมDrama ดรามาInter Movie หนังผรั่ง"
25 | "› เขาสูระบบ ชื่อผูใช รหัสผาน บันทึกการใชงานของฉัน คุณจํารหัสผานไมได? ← กลับไป"
24 | "UFAPOWERBET เว็บขาวกีฬาที่ดีที่สุดในไทย นําเสนอขอมูลขาวสารเกี่ยวกับ ขาวกีฬาทั่วโลก ทั้งหมด สดใหมกอนใคร"
29 | "กระชับผิวหนา, แกปญหาเรื่องผิวพรรณ 5 วิธี และ 5 การใชอายครีมขั้นเทพลดริ้วรอย By TopClinicThailand.com On พฤศจิกายน 8, 2020"
85 | "Skip to content Menu Close หนาแรก เกี่ยวกับเรา สินคา สินคา สั่งซื้อสินคา ประชาสัมพันธ ประกาศ/ขาวสาร กิจกรรม โปรโมชั่นทั่วไป สมาชิก ติดตอเรา THAILAND LAOS VIETNAM CAMBODIA facebook You Tube Search for: Menu Search for: Menu Add custom text here or remove it Search for: หนาแรก เกี่ยวกับเรา สินคา สินคา สั่งซื้อสินคา ประชาสัมพันธ ประกาศ/ขาวสาร กิจกรรม โปรโมชั่นทั่วไป สมาชิก ติดตอเรา THAILAND LAOS VIETNAM CAMBODIA facebook You Tube Button darkblurbg.jpg"
79 | "- - Step 1 หยิบใสตะกรา - Step 2 ระบุขอมูลจัดสง - Step 3 ยืนยันขอมูล - Step 4 เลือกวิธีการชําระเงิน - Step 5 เสร็จสิ้น - - Step 1 / 5 หยิบใสตะกรา |สินคา |ราคา |จํานวน |รวมราคา เลือกซื้อสินคาตอ ขั้นตอนตอไป ตะกราสินคาราคาพิเศษ |สินคา |ราคา |จํานวน |รวมราคา เลือกซื้อสินคาตอ ขั้นตอนตอไป"
81 | "【M98】-galaxy 88 เครดิต ฟรี 【M98】-galaxy 88 เครดิต ฟรีเว็บคาสิโนเปดใหม google playinferno joker slot review 855ufabet biz 【M98】-galaxy 88 เครดิต ฟรีalotto888 qq joker สล็อต777 zen 【M98】-galaxy 88 เครดิต ฟรีพนันออนไลน เว็บไหนดี pantip english ทาง เขา บา คา รา รอยัล wikiเว็บคาสิโนออนไลนอันดับ1 usb ที่มาของลิ้ง:【M98】-galaxy 88 เครดิต ฟรี"
80 | "{movie_title} Kingmaker (2022) หนังป 2022 IMDB 6.7 Genres: ดรามา เรื่องยอ ระหวางชวงการหาเสียง อยูดีๆก็เกิดเหตุระเบิดขึ้นที่บานของ คิมอุนบอม (ซอลคยองกู) 1 ในนักการเมืองตัวเต็งที่จะไดเปนประธานาธิบดี เมื่อสืบหาตนตอ กลับพบวาผูลงมืออาจจะเปน ซอชางแด (อีซอนกยุน) 1 ในทีมงานของเขาเอง"
84 | "รายการขาววันใหม “ชวง AEC Energy with Kasemsant” ตอนที่ 5 นโยบายพลังงานแตละประเทศ ออกอากาศ ทุกวันจันทร – อังคาร เวลา 00.20 น. (เที่ยงคืนยี่สิบนาที) ทางชอง 3 [smartslider3 slider="9"] รายการขาววันใหม “ชวง AEC Energy with Kasemsant” ตอนที่ 5 นโยบายพลังงานแตละประเทศ ออกอากาศ ทุกวันจันทร – อังคาร เวลา 00.20 น. (เที่ยงคืนยี่สิบนาที) ทางชอง 3"
82 | "[Album] เปดโลกชะนีใหมีมากกวาเครื่องสําอาง เมื่อผูชายชวนคุณมาอานหนังสือ รวมหนังสือจาก "#จินยองอาน" Image Tags ถาชอบบทความนี้ กด Like ไดเลยนาาา มีบทความดีๆ อีกเพียบ Gallery ที่เกี่ยวของ #ลากซิสเขาดอม! สองวงไอดอล 'BXW (Black X White)' หลอเท ฟนกรุบ จากเรียลลิตี้ PRODUCE 101 JAPAN ♥ Mollacake| กิจกรรม SistaCafe"
66 | "You may also like มังกรฟา🥳 ขอแสดงความยิ […] Bluedragontary.com สลากมังกรฟา รหัส BD00209F ตัวแทนจําหนายลอตเตอรี่มังกรฟา ลอตเตอรี่ออนไลน สลากกินแบงออนไลน หวยมังกรฟา ขายลอตเตอรี่ออนไลน มีใหเลือกมากกวา 1 ลานหมายเลข คุณกดซื้อกับแพลตฟอรมม […]"
77 | "A B C D E F G H I J K L M N O P R S T V W Y 4.8 5.0 5.0 4.7 4.6 4.8 4.8 4.9 4.7 4.7 4.8 4.6 4.8 4.6 CUSTOMER SERVICE KONVY ฝายบริการลูกคา 02-105-4235 support@konvy.com จันทร - อาทิตย (ยกเวนวันหยุดนักขัตฤกษ)8.00AM - 6.00PM การสั่งสินคาเปน 0 สินคาทั้งหมด 0 ชิ้น จํานวนเงิน: ฿0"
78 | "【Look618】-บา คา รา 6699 【Look618】-บา คา รา 6699ขนาด ไซส เสื้อ กีฬาall bet prediction ตัว จริง ลิเวอรพูล【Look618】-บา คา รา 6699สกรีน เสื้อ กีฬา ราคา ยู ฟา ยูโร ปา ลีก 2019 【Look618】-บา คา รา 6699gclub asia 88 link รองเทา ส ตั๊ ด ไน กี้ ใหม ลาสุดรองเทา ส ตั๊ ด หนา กวาง ที่มาของลิ้ง:【Look618】-บา คา รา 6699"
Figure 9: Examples of low word count documents. These examples illustrate the generally low quality of such content. Notably, some documents include material from illegal gambling websites and adult content sources.

Statistic | Value
Count | 200,000
Mean | 3.23
Standard Deviation | 0.72
Minimum | 0
10th Percentile | 3
30th Percentile | 3
50th Percentile (Median) | 3
70th Percentile | 3.5
90th Percentile | 4
95th Percentile | 4
99th Percentile | 5
Maximum | 54

Table 13: Descriptive statistics of median word length in the sub-document dataset, including the mean, standard deviation, median, minimum, maximum, and selected percentiles.

Based on this experiment, and specifically for the Thai language, we conclude that the upper bound of the word count rule can remain unchanged, while the lower bound should be increased. In our pipeline, we set the lower bound to 200 and retain the original upper bound of 100,000. Ideally, the optimal value for the lower bound should be determined through a series of data ablation experiments; due to limited resources and computational constraints, we were unable to perform such extensive evaluations.

C.2 Median Word Length
The original Gopher Rules filter out documents with a median word length less than 3 or greater than 10. In our Thai sub-document dataset, we observe that the majority of documents have a median word length between 3 and 4, or more broadly, between 2 and 5.
We conducted the same experiment described in Appendix C.1 to compute perplexity statistics. However, in this case, we modified the upper bound of
the median word length rule from 10 to 5, as 5 corresponds to the 99th percentile of our dataset. The
results of this experiment are presented below.
As shown in Table 15, we found that the perplexity statistics between the two settings did not differ
significantly, which suggests that the original Gopher Rule threshold may also be applicable to Thai.
Consequently, we decided to retain the original threshold values. The Kruskal–Wallis H-statistic comparing the Gopher Rules and Experiment settings is 0.027, with a p-value of 0.869, indicating no
statistically significant difference between the medians of the two samples.
In our Dolma pipeline, we therefore chose not to modify the threshold for this rule. However, we
recommend that readers with sufficient computational resources consider conducting a data ablation
study to better understand the impact of this rule and identify the optimal threshold for their specific use
case.
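For concreteness, median word length under ICU word segmentation can be computed as follows (a sketch using PyICU; the Dolma taggers wrap tokenization differently):

```python
import statistics
from icu import BreakIterator, Locale  # PyICU

def median_word_length(text: str) -> float:
    """Median token length under Thai ICU word segmentation (Thai has no spaces)."""
    bi = BreakIterator.createWordInstance(Locale("th"))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:  # iterate over word boundaries
        word = text[start:end].strip()
        if word:
            tokens.append(word)
        start = end
    return float(statistics.median(len(t) for t in tokens)) if tokens else 0.0
```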
C.3 Gopher Rules Changes Summary
To provide a clear comparison between the original and adapted Gopher Rules used in our work, we present a detailed summary of all modifications in Table 16. These changes reflect adjustments made to better accommodate the characteristics of the Thai language.
D C4 Rules
Unlike the Gopher rules, we made minimal modifications to the C4 rules taggers. The original source
code for the C4 rules taggers returns the following attributes for a document:
• has_curly_brace: Indicates the presence of the character { in the document.
• has_lorem_ipsum: Checks if the phrase "lorem ipsum" is present in the document.
• has_javascript: Detects the presence of JavaScript code within the document.
• has_naughty_word: Identifies the presence of inappropriate or profane words.
• lines_with_no_ending_punctuation: Flags lines that do not end with standard punctuation marks.
• lines_with_too_few_words: Flags lines that contain fewer words than a predefined threshold.
• line_count: Records the total number of lines in the document.

First, we updated the corpus used for detecting inappropriate words, adopting the list provided by AI Singapore, which includes both English and Thai terms. Furthermore, we introduced a new attribute, corrupt_unicode, which identifies spans of corrupted Unicode characters in the text. Such spans are frequently encountered in our dataset; in the quality filtering pipeline, they are replaced with empty strings. We included corrupt_unicode in the C4 rules taggers for convenience. Note that although lines_with_no_ending_punctuation is the primary attribute of this rule set, we did not use it in our pipeline because sentence-ending punctuation differs between Thai and English: Thai sentences often do not end with punctuation marks, making this attribute less applicable.
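A corrupt_unicode check of this kind can be sketched as follows (the precise heuristic in our tagger is not spelled out here; this version flags replacement characters and the Latin Extended-A runs typical of Thai mojibake, as in the BANANA GAS example of Figure 9):

```python
import re

# U+FFFD replacement characters, plus runs of Latin Extended-A letters
# ("āļ...") that appear when Thai UTF-8 bytes are decoded as a single-byte codec.
CORRUPT_SPAN = re.compile(r"\uFFFD+|[\u0100-\u017F]{3,}")

def corrupt_unicode_spans(text: str) -> list[tuple[int, int]]:
    """Character offsets of likely-corrupted spans."""
    return [m.span() for m in CORRUPT_SPAN.finditer(text)]

def strip_corrupt_unicode(text: str) -> str:
    """Replace corrupted spans with empty strings, as done in the pipeline."""
    return CORRUPT_SPAN.sub("", text)
```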
Figure 10: Histogram of the distribution of median word lengths in the sub-document dataset. The distribution is right-skewed. Vertical lines are drawn at x = 3 and x = 10 to indicate the threshold values used in the original Gopher Rules; an additional line at x = 5 marks the 99th percentile in our dataset.
Median Word Length | Count | Percentage | Cumulative %
0 | 4 | 0.0020 | 0.0020
1 | 4,985 | 2.4925 | 2.4945
2 | 9,909 | 4.9545 | 7.4490
3 | 126,156 | 63.0780 | 70.5270
4 | 55,471 | 27.7355 | 98.2625
5 | 2,880 | 1.4400 | 99.7025
6 | 433 | 0.2165 | 99.9190
7 | 83 | 0.0415 | 99.9605
8 | 26 | 0.0130 | 99.9735
9 | 10 | 0.0050 | 99.9785
10 | 11 | 0.0055 | 99.9840
11 | 7 | 0.0035 | 99.9875
12 | 2 | 0.0010 | 99.9885
13 | 3 | 0.0015 | 99.9900
14 | 1 | 0.0005 | 99.9905
15 | 3 | 0.0015 | 99.9920
17 | 6 | 0.0030 | 99.9950
18 | 1 | 0.0005 | 99.9955
20 | 3 | 0.0015 | 99.9970
21 | 1 | 0.0005 | 99.9975
23 | 1 | 0.0005 | 99.9980
30 | 1 | 0.0005 | 99.9985
33 | 1 | 0.0005 | 99.9990
34 | 1 | 0.0005 | 99.9995
54 | 1 | 0.0005 | 100.0000

Table 14: Value counts, percentages, and cumulative percentages of median word lengths in the sub-document dataset. The distribution clearly shows that the majority of documents have a median word length of 3 or 4.
Statistic | Baseline | Gopher Rules | Experiment
Count | 200,000 | 185,069 | 184,370
Mean | 1143.80 | 1170.90 | 1168.76
Standard Deviation | 4414.26 | 2227.70 | 2203.26
Minimum | 1.37 | 8.71 | 8.71
10th Percentile | 176.05 | 206.64 | 206.77
30th Percentile | 402.48 | 434.22 | 434.07
50th Percentile (Median) | 710.95 | 747.78 | 747.27
70th Percentile | 1136.51 | 1181.81 | 1181.60
90th Percentile | 2163.38 | 2238.32 | 2237.23
95th Percentile | 3153.02 | 3264.85 | 3262.91
99th Percentile | 7660.35 | 7933.48 | 7917.08
Maximum | 814,671.56 | 191,404.94 | 191,404.94

Table 15: Perplexity statistics across three experimental settings. Applying the Gopher Rules results in a reduction in perplexity scores. However, further tightening the upper bound of the median word length rule (from 10 to 5) has no significant additional effect on perplexity in our experiment.

Original Rules | Our Rules
Fraction of characters in most common n-gram greater than a threshold | Fraction of characters in most common n-gram greater than a threshold
Fraction of characters in duplicate n-grams greater than a threshold | Fraction of characters in duplicate n-grams greater than a threshold
Contains fewer than 50 or more than 100K words | Contains fewer than 200 or more than 100K words
Median word length is less than 3 or greater than 10 | Median word length is less than 3 or greater than 10
Symbol to word ratio greater than 0.10 | Symbol to word ratio greater than 0.10
Fraction of words with alpha character less than 0.80 | Fraction of words with Thai letters less than 0.80
Contains fewer than 2 of a set of required words | Contains fewer than 2 of a set of required words
Fraction of lines in document starting with bullet point greater than 0.90 | Fraction of lines in document starting with bullet point greater than 0.90
Fraction of lines in document ending with ellipsis greater than 0.30 | Fraction of lines ending with ellipsis or "..." greater than 0.30
Fraction of lines in document that are duplicated greater than 0.30 | Fraction of lines that are duplicated greater than 0.30
Fraction of characters in duplicated lines greater than 0.30 | Fraction of characters in duplicated lines greater than 0.30
– | Contains one of the following: read more

Table 16: Summary of modifications to Gopher Rules for the Thai language.