Mangosteen: An Open Thai Corpus for Language Model Pretraining
Summary
This paper introduces Mangosteen, a 47.4 billion-token open Thai corpus for language model pretraining. Existing large-scale corpora rely on English-centric or language-agnostic pipelines that fail to address Thai script and cultural nuances, often leaving risky content like gambling material untreated. Mangosteen addresses this by adapting the Dolma pipeline with custom rule-based language identification, revised C4/Gopher quality filters, and Thai-trained content filters. It incorporates diverse sources including Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline reduces Common Crawl from 202M to 25M documents while raising SEA-HELM NLG scores from 3 to 11. An 8B-parameter SEA-LION model continually pre-trained on Mangosteen outperforms SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. The full pipeline code, cleaning manifests, corpus snapshot, and checkpoints are released, providing a reproducible foundation for Thai and regional LLM research.
Wannaphong Phatthiyaphaibun♠†, Can Udomcharoenchaikit♠†, Pakpoom Singkorapoom♠, Kunat Pipatanakul♣, Ekapol Chuangsuwanich♢, Peerat Limkonchotiwat♡‡, Sarana Nutanong♠‡
♠Vidyasirimedhi Institute of Science and Technology, ♣SCB10X, ♢Chulalongkorn University, ♡AI Singapore
†Co-first Authors, ‡Corresponding Authors

Abstract

Pre-training data shapes a language model's quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims Common Crawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.[1]

[1] All artifacts in this paper ...

1 Introduction

Pre-training datasets are an important part of training a language model. The size, the diversity, and the quality of the pre-training data play a crucial role in obtaining a high-performing language model.
These datasets are compiled by scraping vast amounts of web text, augmented with diverse sources like books, scholarly articles, and code. These datasets are enormous, often reaching hundreds of billions of tokens. Raw web corpora are messy, containing noise, duplicates, and harmful content. If not carefully filtered, these issues can lead to undesirable model behavior. To improve the quality of the dataset, a filtering pipeline is needed to remove these unwanted samples.

Most large-scale English pre-training datasets, such as CC-100 (Wenzek et al., 2020), C4 (Raffel et al., 2020), RefinedWeb (Penedo et al., 2023), and Dolma (Soldaini et al., 2024), are mainly derived from internet data such as the Common Crawl corpus (https://commoncrawl.org/), and hence require extensive cleaning and filtering. However, most data-cleaning pipelines for web pages are optimized for the English language using known best practices. For multilingual datasets that support Thai, such as CC-100, FineWeb2 (Penedo et al., 2024b), and HPLT v2 (Burchell et al., 2025), the data cleaning pipelines are usually language-agnostic. Although being language-agnostic can lead to a generalized pipeline design, this approach prioritizes broad applicability over incorporating local, cultural, and language-specific knowledge. Consequently, language-specific filtering steps are necessary to prevent unwanted information from being used to train language models. For example, the FineWeb2 dataset contains content from gambling websites, which are illegal in Thailand.
For language-specific datasets, many data collection projects adapt existing data cleaning pipelines to make them more suitable for their target languages. This ranges from making a small modification to a single step, as in the Latxa project (Etxaniz et al., 2024) for Basque, which adapted the Dolma and Corpus Cleaner v2 pipelines with a specific change in the filtering step, to customizing the whole pipeline. Examples of more extensive customization include the FinGPT (Luukkonen et al., 2023) and IndicLLMSuite (Khan et al., 2024) projects, which built their own dedicated pipelines for Finnish and Indian languages, respectively. For the Thai language, the approach ranges from using a standard language-agnostic pipeline, as in the Typhoon-2 project, which uses Trafilatura (Barbaresi, 2021) and text-dedup (Mou et al., 2023) directly while creating its own heuristic filtering, to customizing the whole pipeline to specifically handle Thai. An example of a dedicated Thai data cleaning pipeline is OpenThaiGPT (Yuenyong et al., 2025), which builds its data preprocessing pipeline from scratch. However, these projects focus on open models rather than the openness of their training data and collection methods. While this provides a valuable resource, the pre-training corpora remain inaccessible, and the pipelines' design choices and empirical backing are not detailed, which is an important consideration for researchers focused on reproducibility. This raises a central question: What does it take to build a high-quality pre-training corpus for a language like Thai, and how should existing pipelines be adapted to meet its linguistic and cultural specifics?
To address the lack of Thai pre-training resources and to increase the exposure of Thai within the open-source NLP community, we have developed Mangosteen, a large-scale pre-training dataset for the Thai language. This dataset contains 47.4 billion tokens and is available under an open-source license. In addition, we provide an in-depth analysis of our data cleaning pipeline, designed specifically for Thai, including an ablation study that confirms the effectiveness of each processing step. We customized the Dolma (Soldaini et al., 2024) pipeline to curate variants of Thai Common Crawl data through a systematic cleaning process, because the Dolma pipeline is easy to customize and supports parallel processing. For language identification, we developed a rule-based script to accurately detect Thai content. The quality filter was adapted by modifying the C4 and Gopher filter rules to address specific Thai language nuances. Furthermore, we implemented a content filter composed of two components, an obscene-content filter and a gambling-related content filter, both of which were trained on Thai-specific data to effectively remove undesirable content. In addition to Common Crawl data, we include data from several other sources to ensure diversity, including Wikipedia, YouTube subtitles, and text extracted from open-access books using OCR.
To determine the most effective data pipeline for the Thai language, we systematically evaluate the impact of each processing step on language model performance. We test each step in our pipeline to find the optimal settings for Thai. This process involves pre-training a GPT-2 model (Radford et al., 2019) with 124M parameters on 10B tokens for each configuration. The effectiveness of each step is then confirmed by evaluating the trained model against benchmarks like SEACrowd and SEA-HELM. Compared to the uncleaned Common Crawl data, data processed through our pipeline achieves better performance despite being multiple times smaller. Furthermore, we can improve the already-cleaned FineWeb2 data by passing it through our pipeline, resulting in a much smaller dataset that maintains a similar level of performance. Finally, to confirm the robustness of our data, we developed a new Thai large language model by further pre-training a Southeast Asian model, SEA-LION (Ng et al., 2025), on our dataset. Our LLM outperforms other Llama-3.1-based models on both the Thai LLM Leaderboard and SEA-HELM benchmarks.

We summarize the contributions of our paper as follows:

• We introduce a data cleaning pipeline for improving data quality, which adapts the Dolma pipeline to the language's specific needs. We also describe the customization steps taken in this process. Moreover, we conduct an ablation study to demonstrate the effectiveness of our data cleaning pipeline.

• Beyond Common Crawl data, through an extensive curation effort, we curate and extract high-quality texts from numerous other sources, including Wikipedia and related encyclopedic resources, YouTube subtitles, open-access books whose text we extracted via OCR, official documents from the Royal Gazette, open government data, and existing Thai datasets on Hugging Face.
• In a large-scale effort to advance open data and open models for the Thai language, we introduce a high-quality 47B-token pre-training corpus named "Mangosteen". To demonstrate its effectiveness, we also present WangchanLION-V3-8B, a new, fully open-source Thai LLM, developed by further pre-training a SEA-LION-based model on our dataset. In line with our commitment to open science, this entire ecosystem, that is, the dataset, the model, and all related code, is available under permissive open-source licenses.

2 Related Work

2.1 Text Pretrained Corpora

A critical design choice for any pre-training corpus is data cleaning, which ensures the quality of the data and its effectiveness for the pre-training process. Prior efforts (Wenzek et al., 2020; Raffel et al., 2020; Xue et al., 2021; Gao et al., 2020) focused on improving data quality using simple techniques such as language identification, document deduplication, and quality filtering with a perplexity filter model on web corpora, ranging from monolingual (English only) to multilingual (more than 100 languages) settings. Later works demonstrated more refined text processing using a separate model to classify text quality (Ortiz Suárez et al., 2020). The C4 corpus (Raffel et al., 2023) was created with a set of heuristics later known as the "C4" rules, followed by the Gopher rules (Rae et al., 2021); more elaborate pipelines, e.g., RefinedWeb (Penedo et al., 2023), demonstrate that combining URL filtering, text extraction, language identification, repetition removal, document-wise filtering, line-wise corrections, fuzzy deduplication, and exact deduplication outperforms previous data cleaning processes.
Recently, the size of pre-training corpora has increased significantly along with model size, which has grown from 100 million parameters (i.e., BERT-base (Devlin et al., 2018)) to large language models (i.e., Llama-8B (Grattafiori et al., 2024)). Therefore, most recent works propose large-scale pre-training datasets together with methods for creating clean, high-quality pre-training data to improve model performance. The Dolma corpus (Soldaini et al., 2024) is an English pre-training dataset of 3 trillion tokens mixed from many sources: web pages, academic publications, code, books, and encyclopedic materials. The authors released both the dataset and a pipeline to promote dataset transparency for language models; the pipeline includes language filtering, URL and text-overlap deduplication, quality filters, and content filters. FineWeb (Penedo et al., 2024a) cleans and deduplicates English web data from Common Crawl, using a linear regression model, trained on Llama-3 synthetic annotations, to classify educational texts. FineWeb2 (Penedo et al., 2024b) covers over 1,000 languages for pre-training models. Its methodology is similar to FineWeb, but data is deduplicated per language globally, and C4 filters are not used.

In general, these pipelines are similar and mostly open-source, but the majority do not include instructions on how to apply them to other languages. Furthermore, the aforementioned pipelines fail to explain the processes that make the preprocessing steps effective for the target language, and these pipelines are not modified to an extent that makes them suitable for the target language. For instance, some languages necessitate either language-specific word segmentation tools or a newly trained model to filter out offensive content, including pornography and gambling material.
2.2 Thai-text pre-training datasets and models

The most prominent pre-training datasets for Thai are typically included as subsets within multilingual corpora. Early examples include the multilingual BERT (Devlin et al., 2018) model, which uses Wikipedia and includes Thai content. This was followed by the CC-100 corpus (Wenzek et al., 2020), which supports Thai and is used in models such as XLM and XLM-R (Conneau et al., 2020). The OSCAR corpus (Ortiz Suárez et al., 2020) also includes Thai and forms part of the dataset used in GPT-2 Thai by Flax's GPT-2 base (https://huggingface.co/flax-community/gpt2-base-thai). More recently, trillion-scale multilingual pre-training datasets such as CulturaX (Nguyen et al., 2023), FineWeb2 (Penedo et al., 2024b), and HPLT v2 (Burchell et al., 2025) have also included Thai among hundreds of supported languages. However, their language-agnostic approach disregards language-specific and local knowledge vital for accurate data cleansing.

Moreover, the Thai community has also proposed Thai continual pre-training (CPT) projects to increase the amount of data and the models' robustness. OpenThaiGPT (Yuenyong et al., 2025) performs continual pre-training on Llama models for Thai by extending the vocabulary and continually pre-training on Thai language datasets. SambaLingo (Csaki et al., 2024) is a Llama-2 model that does vocabulary expansion and continuous pre-training for nine languages, including Thai, using data sourced from CulturaX (Nguyen et al., 2023).
Typhoon (Pipatanakul et al., 2023) performs CPT on cleaned and deduplicated Common Crawl and extends the vocabulary of Mistral-7B, and Typhoon 2 (Pipatanakul et al., 2024) applies CPT to Qwen 2.5 7B and Llama 3.1, combining the original pipeline with a multi-classifier model for data filtering. OpenThaiLLM-Prebuilt-7B (https://medium.com/nectec/openthaillm-prebuilt-release-f1b0e22be6a5) continues pre-training on Qwen 2 with Thai and Chinese datasets.

Although previous works have demonstrated the possibility of improving Thai performance using CPT, they only partially open-sourced their pipelines or did not study the effect of each individual component. For example, the Typhoon models use heuristic filtering like RefinedWeb (Penedo et al., 2023) and DCLM (Li et al., 2025), but do not open-source their pipeline code or models to enable reproducibility. OpenThaiGPT open-sourced their data processing pipeline (https://github.com/OpenThaiGPT/data-processing); however, they do not provide a thorough study of the pipeline, including the necessary analyses and resources required for full reproducibility. For example, they do not share the dataset or the model used for perplexity-based filtering. As a result, the Thai research community struggles to make forward progress without open-source data and a study of its design decisions.
2.3 Gap Summary

Existing data cleaning pipelines like C4, RefinedWeb, and Dolma, while scalable, are designed primarily for English or applied in a language-agnostic way, requiring further adjustments to fit distinct scripts and cultural norms. Thai pre-training efforts, such as OpenThaiGPT and Typhoon, reuse such pipelines or apply ad hoc modifications without empirical validation, and rarely release their datasets or pipelines, limiting reproducibility. This paper addresses these gaps by systematically applying the principles found in C4, RefinedWeb, and Dolma to the Thai-specific context.

3 Data Curation

The goal of data curation is to build a Thai pre-training corpus that covers the broadest possible set of domains, enabling adaptation to various downstream tasks. We collect datasets from various sources that can be classified into two categories: (i) Common Crawl-derived and (ii) curated non-Common Crawl. In total, we have 30.1M documents and 47.4B tokens, as shown in Table 1.

Source | Documents | Tokens (B)
Common Crawl-Derived | 29.7M | 45.9
Curated Non-Common Crawl | 425,304 | 1.5
Total | 30.1M | 47.4

Table 1: Document and token statistics (using the Llama 3 tokenizer).

3.1 Common Crawl-derived Dataset

Following best practices from other pre-training corpora (Raffel et al., 2023; Rae et al., 2021; Penedo et al., 2023; Soldaini et al., 2024), our Common Crawl-derived data consists of two main sources: raw Common Crawl snapshots and the FineWeb2 dataset. We focus specifically on the Thai-language subset of both sources.

Common Crawl. We collect the Common Crawl dataset by processing each Common Crawl snapshot from 2018-30 to 2023-23 and extracting only Thai-language content using Common Crawl metadata. For text extraction, we use trafilatura (Barbaresi, 2021), followed by our data cleaning pipeline (see Section 4 for full details) to filter and deduplicate the result.
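To make the extraction step concrete, here is a minimal sketch of pulling main text from one crawled HTML page with trafilatura; the snapshot iteration, WARC parsing, and metadata-based language pre-filtering in the released pipeline are omitted, and the URL below is only a placeholder.

```python
from typing import Optional

import trafilatura

def extract_main_text(html: str) -> Optional[str]:
    """Extract the main article text from raw HTML, dropping
    boilerplate such as navigation, ads, and comments."""
    return trafilatura.extract(html)

# Example: fetch and extract a single live page. The real pipeline
# reads Common Crawl WARC records instead of fetching URLs.
downloaded = trafilatura.fetch_url("https://example.com/")
if downloaded:
    print(extract_main_text(downloaded))
```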
FineWeb2. To further increase the quantity of our Common Crawl-derived dataset, we incorporate FineWeb2, a large-scale pre-training corpus built from Common Crawl data, which employs multiple filtering techniques to improve dataset quality. FineWeb2 spans from summer 2013 to April 2024. The Thai subset contains approximately 51.4 billion words across 35 million documents. We further process the cleaned FineWeb2 dataset with our data cleaning pipeline (Section 4) to deduplicate overlapping URLs and text inherited from the Common Crawl source while also enhancing language-specific quality. After this step, the final cleaned Thai FineWeb2 subset consists of around 7.3 billion words and 4.6 million documents.

3.2 Curated Non-Common Crawl Dataset

In addition to the Common Crawl-based data, we enhance data diversity by incorporating harder-to-reach data as follows. (1) A significant subset of our source documents consists of scanned or image-based PDFs, whose textual content is not directly accessible. To overcome this, we perform Optical Character Recognition (OCR) to extract the text from these files using Marker (https://github.com/VikParuchuri/marker). Due to the diverse publication formats of the scanned documents, we must customize the OCR pipeline for each document format and clean the results manually. (2) We also extract text from YouTube video subtitles using the provided metadata, building a pipeline that selects suitable Thai video subtitles released under a Creative Commons (CC) license.
To extract subtitles, we modified Scrapetube's source code (https://github.com/dermasmid/scrapetube) to exclusively retrieve Creative Commons licensed videos. Our two-step methodology involved:

1. Compiling Thai keywords across domains (mathematics, investing, fitness, gaming, films, etc.) and applying CC/subtitle filters to gather up to 1,000 video metadata entries per keyword.
2. Using quoted channel names as queries to identify predominantly CC-licensed content.

Video IDs were processed via YouTube's Transcription API (https://github.com/jdepoix/youtube-transcript-api) to generate textual JSON Lines. As a final validation step, non-CC content was removed via yt_dlp (https://github.com/yt-dlp/yt-dlp).
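A condensed sketch of the subtitle-fetching step is shown below, using the classic static interface of the youtube-transcript-api package; the modified Scrapetube crawler, keyword lists, and yt_dlp license re-check are omitted, and video_ids stands in for the CC-filtered IDs gathered in step 1.

```python
import json

from youtube_transcript_api import YouTubeTranscriptApi

# Assumed input: video IDs surviving the CC-license metadata filter.
video_ids = ["dQw4w9WgXcQ"]  # placeholder ID

with open("subtitles.jsonl", "w", encoding="utf-8") as out:
    for vid in video_ids:
        try:
            # Request Thai subtitle tracks only.
            segments = YouTubeTranscriptApi.get_transcript(vid, languages=["th"])
        except Exception:
            continue  # no Thai subtitles available for this video
        text = " ".join(seg["text"] for seg in segments)
        # One JSON Lines record per video, matching the paper's format.
        out.write(json.dumps({"id": vid, "text": text}, ensure_ascii=False) + "\n")
```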
The curated non-Common Crawl dataset spans a large number of sources, covering the following categories:

• Encyclopedic: We collect Thai data from Wikipedia, Wikibooks, Wikiquote, and Wikisource, then use WikiExtractor (Attardi, 2015) to extract text, remove the HTML tags, and adjust the formatting of the WikiExtractor outputs.

• Finance: We use the Financial Text Data Collection (https://huggingface.co/datasets/airesearch/CMDF_VISTEC) by the VISTEC-depa Thailand artificial intelligence research institute. This dataset includes financial reports from Thai companies and Thai financial news.

• Legal: We collect Thai legal data from multiple Hugging Face repositories, including acts and statutes, constitutions, and Creative Commons licenses.

• Government: We collect data publicly released by government agencies from multiple sources. This includes the Open Government Data of Thailand, a Thai government effort to improve transparency and engagement through access to open data. Government data often comes in scanned PDF format, which requires extraction through OCR. Furthermore, we have to extract text from various formats such as DOCX and CSV.

• Education: We collect educational materials ranging from informative articles and classical Thai literature to advanced academic research.

• YouTube: We collect Thai YouTube subtitles from videos with the CC BY 3.0 license.
Table 2 shows the document count distribution for each category. The curated non-Common Crawl corpus comprises 425,304 Thai-language documents from sources not present in Common Crawl. After deduplication, we split the corpus into a training set of 397,488 documents and a validation set of 4,044 documents. The non-Common Crawl dataset was curated from a wide variety of web sources and includes data from encyclopedic websites, financial reports, education, legal corpora, academic literature, and YouTube transcripts. In total, the non-Common Crawl corpus contributes an additional 1.54 billion tokens to the dataset, tokenized using the meta-llama/Llama-3.2-1B tokenizer. All sources were verified to ensure the use of permissive content licenses. The full list of text sources can be found at https://github.com/vistec-AI/Mangosteen, and a more detailed table on the sources of the non-Common Crawl dataset is provided in the appendix.

Domain | Count | Percentage
Encyclopedic | 166,187 | 41.34
Finance | 86,813 | 21.59
Government | 72,879 | 18.13
Legal | 52,343 | 13.02
YouTube | 17,837 | 4.43
Education | 5,911 | 1.47

Table 2: Document counts and percentages in the non-Common Crawl data by domain.

We also use a deduplication process to ensure this new data does not overlap with the web data collected earlier. By incorporating this non-Common Crawl data, we gain an additional 1.5B tokens of high-quality text. Although these data are of high quality, we still need to clean the noisy web data (e.g., Common Crawl and FineWeb2) to ensure the overall quality of our collected dataset. Therefore, we still require a data cleaning pipeline to mitigate these problems.

4 Data cleaning pipeline for web data

4.1 Overview

A data cleaning pipeline is a crucial component for ensuring the quality of pre-training data. For the Thai NLP community, previous Thai LLMs have also demonstrated how to gather a large-scale dataset for pre-training an LLM. However, as we discuss in Section 2.2, the focus of existing works has been on the model release rather than the training data and collection methods. Consequently, the pre-training corpora often remain inaccessible, and pipeline design choices are not discussed in detail, which is an important consideration for researchers focused on reproducibility and adaptability.

In this work, we propose a novel data collection and cleaning pipeline tailored to the unique characteristics and challenges of Thai data. In particular, we present a Thai adaptation of the well-known Dolma (Soldaini et al., 2024) data curation pipeline, featuring a new data processing design customized for the Thai language. After applying our data filtering pipeline, we obtained a large-scale dataset containing 45.9B tokens. The Dolma pipeline consists of four steps: language identification, quality filters, deduplication, and content filters, described as follows.
4.2 Language Identification

The first step of our data preprocessing is language identification, where we remove all non-Thai data from the pre-training corpus. In particular, we discard any document containing text in other languages, such as Lao or Burmese.

Dolma used FastText (Joulin et al., 2017) as a language identifier for English corpora. However, when we compared three language identifiers (langdetect, lingua, and the FastText language identifier model), all three failed to perform this task on Thai text, as shown in Appendix B. Therefore, we use a simple and efficient rule-based approach that identifies Thai text with a regular expression based on the Thai Unicode character range. This filter allows only documents in which at least half of the characters are Thai Unicode characters to pass.
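The rule fits in a few lines. Below is a minimal sketch under two assumptions of ours: the Thai Unicode block U+0E00–U+0E7F is used as the character range, and whitespace is excluded from the count; the released pipeline may define both slightly differently.

```python
import re

# Assumed character range: the Thai Unicode block U+0E00-U+0E7F.
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")

def keep_thai_document(text: str, min_ratio: float = 0.5) -> bool:
    """Pass a document only if at least `min_ratio` of its
    non-whitespace characters are Thai Unicode characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    n_thai = sum(1 for c in chars if THAI_CHAR.match(c))
    return n_thai / len(chars) >= min_ratio
```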
4.3 Quality Filters

As a core step in our pipeline, we filter out low-quality data. To do this, we follow the approach used by Dolma, applying quality filtering rules such as C4 (Raffel et al., 2023) and Gopher (Rae et al., 2021). Moreover, we customize these rules to better account for the unique characteristics of the Thai language and to integrate insights from our observations.

C4: We adopt the C4 rules from Dolma, flagging text that:

• Contains curly braces (e.g., { or })
• Includes the placeholder text "lorem ipsum"
• Contains JavaScript code or references
• Includes offensive or inappropriate language (using a filter we built specifically for Thai)
• Has lines that lack ending punctuation
• Contains lines with fewer than three words
To detect offensive content, we employ a custom Thai lexicon. Because Thai text is not delimited by spaces, we segment it using the nlpO3 tokenizer (Suntorntip et al., 2024) from PyThaiNLP (Phatthiyaphaibun et al., 2023). We also implement a rule to remove Unicode replacement characters, as these stand for unknown or unrepresentable characters. Although we retain the remaining original Dolma rules, we omit the c4_no_punc rule, since Thai sentences do not conventionally end with punctuation such as a period.
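As an illustration, the lexicon check reduces to tokenize-then-intersect. The sketch below uses PyThaiNLP's default "newmm" engine so it runs self-contained (the paper uses the nlpO3 tokenizer), and the lexicon itself is a placeholder, since the paper's word list is not reproduced here.

```python
from pythainlp.tokenize import word_tokenize

# Placeholder: populate with the custom Thai offensive-word lexicon.
OFFENSIVE_LEXICON: set = set()

def contains_offensive_thai(text: str) -> bool:
    # Thai is written without spaces between words, so a plain
    # substring or split() check would misfire; segment first.
    tokens = word_tokenize(text, engine="newmm")
    return any(tok in OFFENSIVE_LEXICON for tok in tokens)
```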
Gopher: The Dolma pipeline uses the Gopher rules as part of its quality filtering process for English text. However, based on our analysis of the Common Crawl-based data, we modified some of these rules and adjusted their corresponding thresholds to make them suitable for Thai (a simplified sketch of the resulting checks follows the list below).
• First, we raise the minimum document length from 50 to 200 words and cap it at 100,000 words, since Thai sub-200-word texts in our Common Crawl sample of 200,000 examples were predominantly low quality or sourced from pornographic and gambling sites.

• Second, we exclude any document in which Thai consonants account for less than 80% of all characters, reflecting a bespoke characteristic of the Thai language.

• Next, we discard texts where over 30% of lines end with an ellipsis ("..." or three dots). Ellipses are often used to mark the end of incomplete snippets of an article or excerpt.

• We also introduce a new rule to filter out any document containing markers of truncated content in Thai, e.g., "continue reading", "read more", or "read more at", and we change the list of stop words to Thai.
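The following sketch combines these Thai-adjusted checks into one predicate. It is illustrative only: the consonant range, the whitespace handling, and the two Thai truncation markers ("อ่านต่อ" and "อ่านเพิ่มเติม", roughly "continue reading"/"read more") are our assumptions, and words is expected to be the tokenized word list produced by the tokenizer discussed just below.

```python
import re

# Assumed Thai consonant range: U+0E01 (ko kai) - U+0E2E (ho nokhuk).
THAI_CONSONANT = re.compile(r"[\u0E01-\u0E2E]")
# Illustrative truncation markers; the pipeline's actual list may differ.
TRUNCATION_MARKERS = ("อ่านต่อ", "อ่านเพิ่มเติม")

def passes_thai_gopher_rules(text: str, words: list) -> bool:
    # Rule 1: document length between 200 and 100,000 words.
    if not (200 <= len(words) <= 100_000):
        return False
    # Rule 2: Thai consonants must make up at least 80% of characters
    # (whitespace is excluded from the count as an assumption).
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    if sum(1 for c in chars if THAI_CONSONANT.match(c)) / len(chars) < 0.8:
        return False
    # Rule 3: discard if over 30% of lines end with an ellipsis.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines:
        n_ellipsis = sum(1 for ln in lines
                         if ln.rstrip().endswith(("...", "…")))
        if n_ellipsis / len(lines) > 0.3:
            return False
    # Rule 4: discard documents carrying truncated-content markers.
    return not any(m in text for m in TRUNCATION_MARKERS)
```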
To justify the higher minimum document length threshold, we compared WangchanBART (airesearch/wangchanbart-base) perplexity statistics at lower bounds of 50 versus 171 words and found both mean and median perplexities to be lower at the higher threshold; therefore, we set the bound at 200 words. Finally, to speed up processing, we employ the ICU tokenizer rather than the slower nlpO3 tokenizer.
4.4 Deduplication

Deduplication by URL. A common text preprocessing step is the removal of duplicate content. For this, we utilize the deduplication technique from the Dolma pipeline (Soldaini et al., 2024), which uses a Bloom filter to identify duplicate URLs. We found this method to be effective on our Thai web dataset and therefore kept this step of the pipeline unmodified.

Deduplication on text overlap. We perform document-level deduplication using the Dolma Bloom filter. Dolma's paragraph-level deduplication is ineffective for Thai web data, since the UTF-8 newline (\n) is not always a good indicator of a paragraph boundary in Thai text. Therefore, we only apply deduplication at the document level.
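For intuition, a Bloom filter answers "have I probably seen this before?" in constant memory, which is what makes URL-level dedup cheap at corpus scale. The toy sketch below is ours for illustration; the actual pipeline uses Dolma's own Bloom-filter deduper, not this code.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item over a fixed bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()

def is_duplicate_url(url: str) -> bool:
    """First occurrence passes; later occurrences are flagged."""
    if url in seen:
        return True
    seen.add(url)
    return False
```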
4.5 Content Filters

Following Dolma's approach, we use FastText for language-specific filtering due to its excellent speed-accuracy balance for large-scale data cleaning. We employ pre-trained FastText vectors for 157 languages (Grave et al., 2018) to train filters for adult and gambling content. To train our two content filters, we create dedicated datasets for binary classifiers. We label documents as belonging to a specific class if they contain three or more distinct words from a predefined list for that class. For gambling content, we also include documents that an LLM identified as promoting gambling. Finally, we randomly sample a subset of the data for human validation, ensuring the quality of our content filtering datasets. Moreover, we changed the phone-number rule in the Personally Identifiable Information (PII) filter to ensure compatibility with Thai phone numbers.
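A compressed sketch of this weak-labeling plus classifier-training recipe is shown below, using the fasttext Python package. The lexicon, file names, and training hyperparameters are placeholders of ours; only the "three or more distinct lexicon words" rule and the use of the 157-language pre-trained vectors come from the text.

```python
import fasttext

# Placeholder seed lexicon for one class (e.g., gambling).
CLASS_LEXICON: set = set()

def weak_label(tokens: list) -> str:
    """Paper's rule: positive if the document contains three or more
    distinct words from the class's predefined word list."""
    hits = len(set(tokens) & CLASS_LEXICON)
    return "__label__gambling" if hits >= 3 else "__label__clean"

# Training file lines look like: "__label__gambling <document text>".
model = fasttext.train_supervised(
    input="gambling_train.txt",          # assumed file name
    pretrainedVectors="cc.th.300.vec",   # Thai vectors (Grave et al., 2018)
    dim=300,                             # must match the vector dimension
    epoch=5,                             # assumed hyperparameters
    lr=0.5,
)
labels, probs = model.predict("ตัวอย่างข้อความภาษาไทย")  # "example Thai text"
```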
5 Experimental Settings

5.1 CPT and SFT Details

To assess the effectiveness of our pre-training data (47.4B tokens, as shown in Section 3), we use it for continual pre-training (CPT) of SEA-LIONv3-Llama-8B-instruction (Ng et al., 2025). We use the following training setup:

• max_seq_len: 8192
• learning rate: 5.0e-6
• optimizer: decoupled_lionw
• lr_scheduler_type: cosine_with_warmup
• num_train_epochs: 1
• GPU: H100 (64 GPUs)
• Time: 1d 12h 24m

After CPT, we conduct supervised fine-tuning (SFT) using QLoRA (Dettmers et al., 2023) on the Thai instruction dataset Wangchan-FLAN-6M (https://huggingface.co/datasets/airesearch/WangchanX-FLAN-v6.1). We call the resulting model WangchanLION-V3-8B. For comparison, we apply the same SFT setting to Llama 3.1 8B base, SEA-LION-V3-8B-base, and Typhoon 2 8B base. For consistency, all models have 8 billion parameters and use Llama-3.1-base as the original base model.
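For orientation, here is what a QLoRA SFT setup typically looks like with the Hugging Face transformers/peft/bitsandbytes stack. This is our illustration: the model path is a placeholder for the post-CPT checkpoint, and the LoRA rank/alpha/dropout values are assumptions, as the paper does not specify them.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: quantize the frozen base model to 4-bit NF4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "path/to/cpt-checkpoint",        # placeholder: the post-CPT model
    quantization_config=bnb_config,
)

# Train small low-rank adapters on top of the quantized weights.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```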
5.2 Evaluation Benchmark

To evaluate our trained models, including both Llama and GPT-2, we use two benchmarks:

• Thai LLM Benchmark (10X et al., 2024) is a benchmark suite for Southeast Asian languages (Lovenia et al., 2024). It can evaluate LLMs on NLG and NLU tasks. For GPT-2, we use the NLG tasks only.

• SEA-HELM (Susanto et al., 2025) is an LLM benchmark designed to evaluate SEA languages, including Thai. We remove NLR from the benchmark because it uses a low-quality machine-translated dataset, XNLI (Conneau et al., 2018), whose results are unreliable (Agrawal et al., 2024; Singh et al., 2025).

These benchmarks allow us to evaluate the base and instruction models. For the evaluation of GPT-2, we opted not to include metrics related to NLU, MT-Bench, and safety. This decision is based on the observation that the scores for GPT-2 were significantly lower than the established baseline, resulting in normalized scores of zero. The primary reason lies in the limited capacity of the 124-million-parameter GPT-2 model, which is insufficient for effectively processing complex prompts for the NLU tasks. Consequently, our GPT-2 evaluation focuses exclusively on the NLG and Instruction-Following Evaluation (IF) tasks.

6 Experimental Results

6.1 Data Ablation Study

In this study, we explore the effectiveness of our data pipeline using GPT-2's downstream task performance as the indicator. In particular, we apply our pipeline to unclean data (Common Crawl) and cleaned data (FineWeb2). Our data cleaning pipeline should yield improvements on both, since we designed it specifically for Thai.

6.1.1 Ablation Design

We describe the setup of the data ablation studies on Common Crawl and FineWeb2 as follows:

Common Crawl. We train a GPT-2 model on each of five dataset variations, each containing 10 billion tokens. Each dataset variation builds upon the previous one by incrementally adding a new cleaning component, allowing us to measure each step's impact.

• Baseline: The raw Thai Common Crawl corpus.
• + Language Identification: Baseline with only Thai-language documents retained.
• + Quality Filters: Adds the quality filtering from Section 4.3.
• + Deduplication: Adds URL and text deduplication.
• + Content Filters: Adds filters for adult content, gambling content, and PII, as proposed in Section 4.5.

FineWeb2. We aim to assess the effectiveness of our data cleaning pipeline on already-cleaned data. To do this, we compare the performance of a model trained on a clean baseline dataset with the performance on the same data after our pipeline has further processed it. We train a separate GPT-2 model on 10B tokens from each of the two versions of the FineWeb2 dataset, as follows:

• FineWeb2: The original data from FineWeb2, already cleaned using their pipeline.
• FineWeb2 + our pipeline: We perform a second cleaning with our data cleaning pipeline.

6.1.2 Before and after applying our pipeline

As shown in Table 4, our pipeline cleans and substantially reduces the size of each dataset. For Common Crawl, we reduce the dataset from 202 million documents to 25.1 million, resulting in a cleaner dataset than the raw source. Similarly, we reduce the size of FineWeb2 by half. We found that most of the removed text was filtered by our C4, Gopher, and gambling-related content filters; these samples were low quality, and removing them did not hurt downstream task performance. In the following, we discuss the effect of our pipeline when applied to both datasets on downstream tasks.
6.1.3 Using our data cleaning pipeline with unclean data (Common Crawl)

Table 4 shows the effect of each filtering step: selecting Thai-language documents removed about 27 million documents, applying quality filters removed 139.4 million, deduplicating based on URL and text overlap removed 9.6 million, and filtering for inappropriate content (e.g., adult material, gambling, PII) removed 0.9 million.

As shown in Table 3, the Common Crawl data that has gone through all of our cleaning components yields the best results on both SEA-HELM and the Thai LLM Benchmark. When incrementally adding each new component, we generally observe an improvement on both benchmarks, except for the deduplication step, where we see a drop in the average SEA-HELM score. However, the score increases and surpasses the dip after adding the content filters, yielding the best performance. This implies that our cleaning pipeline can filter out low-quality data from an uncleaned dataset, resulting in better performance on downstream tasks.

Training Data | NLG | IF | Avg. | ENG->TH | TH->ENG | XLSum | iApp | Avg.
Common Crawl baseline | 3.09 | 11.00 | 7.04 | 0.13 | 0.17 | 6.98 | 1.05 | 2.08
+ Language Identification | 4.43 | 14.00 | 9.21 | 0.08 | 0.21 | 6.92 | 1.08 | 2.07
+ Quality Filters | 12.29 | 23.00 | 17.64 | 0.09 | 0.24 | 8.11 | 1.54 | 2.50
+ Deduplication | 6.59 | 18.00 | 12.29 | 0.09 | 0.21 | 8.37 | 1.43 | 2.52
+ Content Filters | 10.60 | 25.00 | 17.80 | 0.07 | 0.21 | 8.66 | 1.17 | 2.53
FineWeb2 | 9.13 | 13.56 | 11.34 | 0.08 | 0.13 | 8.71 | 1.28 | 2.55
FineWeb2 + our pipeline | 15.37 | 16.00 | 15.68 | 0.07 | 0.18 | 8.33 | 1.37 | 2.49

Table 3: SEA-HELM (NLG, IF, Avg.) and Thai LLM Benchmark (ENG->TH, TH->ENG, XLSum, iApp, Avg.) evaluation results for GPT-2.

Step | Common Crawl | FineWeb2
Baseline | 231Bt / 202Md | 51.4Bt / 35.9Md
Language Identification | 206Bt / 175Md | 51.3Bt / 35.9Md
Quality Filters | 54.6Bt / 35.6Md | 30.4Bt / 19.1Md
Deduplication | 40.7Bt / 26.0Md | 27.8Bt / 17.9Md
Content Filters | 38.6Bt / 25.1Md | 26.0Bt / 17.1Md

Table 4: Dataset size after each step of our pipeline (Bt is billions of tokens and Md is millions of documents).
6.1.4 Using our data cleaning pipeline with cleaned data (FineWeb2)

Next, we ask whether our data cleaning pipeline can improve data that has already been cleaned, like FineWeb2. To answer this question, we compare the original FineWeb2 with FineWeb2 processed by our pipeline. Our findings indicate that the original FineWeb2 performs slightly better overall than the version further cleaned with our pipeline on the Thai LLM Benchmark, as shown in Table 3. However, FineWeb2 processed with our pipeline substantially outperforms the original FineWeb2 on average in the SEA-HELM evaluation, also shown in Table 3. Moreover, we found that, before applying our data pipeline, FineWeb2 contained unwanted text, e.g., low-quality content, duplicated URLs and overlapping text, adult content, gambling content, and PII. By further processing it with our pipeline, we significantly reduce its size from 35.9 million documents to 17.1 million, as shown in Table 4. This implies that our pipeline can improve data quality by filtering low-quality samples from already-cleaned data; it also suggests that FineWeb2's original cleaning pipeline may not be fully suitable for Thai text.
6.2 Downstream task results

We assess the effectiveness of our training data by comparing the model trained on it with models trained on other pre-training corpora. We hypothesize that the model trained on cleaner, higher-quality data should improve more than the others.

SEA-HELM. As shown in Table 5, we evaluate all the base models using the same SFT data, as mentioned in Section 5.1. Our model achieves the best overall performance among the four models. In particular, we outperform our base model, SEA-LION V3 8B, in most cases. Furthermore, we perform a more detailed analysis on the chat benchmark, MT-Bench, in Figure 1. We found that, when using our base model, performance in the Knowledge III (cultural evaluation) category is higher than that of the other models. This emphasizes the importance of our data, which also yields improvements in Thai cultural knowledge. In addition, we gain improvements in the Roleplay and Reasoning categories.

Task | Llama 3.1 8B | SEA-LION V3 8B | Typhoon 2 8B | WangchanLION-V3-8B
NLU | 46.59 | 54.35 | 60.55 | 62.33
Safety | 2.47 | 17.21 | 23.74 | 6.62
NLG | 54.09 | 51.01 | 51.69 | 54.79
IF | 31.00 | 44.00 | 38.00 | 52.00
MT-Bench | 23.58 | 30.00 | 33.33 | 41.36
Avg. | 31.54 | 39.31 | 41.46 | 43.42

Table 5: SEA-HELM performance comparison between existing base models and the WangchanLION-V3-8B model when utilizing the same SFT dataset.

Figure 1: Thai MT-Bench results (radar plot over the Coding, Knowledge III, Social Science, Math, Extraction, Reasoning, STEM, Writing, and Roleplay categories) for four instruction models trained on the same data: Llama 3.1 8B, SEA-LION V3 8B, WangchanLION-V3-8B, and Typhoon V2 8B.
Thai LLM Leaderboard. We also conducted an experiment using a benchmark formulated specifically for Thai contexts, namely the Thai LLM Leaderboard. As shown in Table 6, our model outperforms the other models on the NLG tasks but performs lower on the NLU tasks. This is because we added extensive Thai pre-training corpora, yielding more Thai fluency and knowledge and thus improving NLG. In contrast, some world knowledge fades from our model, resulting in lower NLU performance (the majority of the NLU datasets are not created in Thai contexts, unlike the NLG datasets). This is consistent with the results on the chat dataset, MT-Bench: the WangchanLION-V3-8B model is more fluent in Thai but falls behind on the world-knowledge subset (STEM).

Task | Subset | Llama 3.1 8B | SEA-LION V3 8B | Typhoon 2 8B | WangchanLION-V3-8B
NLU | Wisesight | 34.82 | 52.30 | 62.11 | 52.64
NLU | XCOPA | 69.40 | 78.20 | 81.00 | 71.40
NLU | Belebele | 54.67 | 63.44 | 65.44 | 58.56
NLU | Avg. | 52.96 | 64.64 | 69.51 | 60.86
NLG | XLSum | 48.18 | 48.88 | 54.33 | 57.01
NLG | iApp | 74.45 | 79.81 | 80.15 | 80.99
NLG | ENG->TH | 21.15 | 21.54 | 26.09 | 29.00
NLG | TH->ENG | 49.23 | 51.20 | 51.94 | 52.34
NLG | Avg. | 48.25 | 50.36 | 53.13 | 54.84

Table 6: Thai LLM Leaderboard performance comparison between existing base models and the WangchanLION-V3-8B model when utilizing the same SFT dataset.

7 Conclusion and Outlook

We propose Mangosteen, an openly released Thai pre-training data pipeline and corpus that closes the transparency gap in Thai continual pre-training (CPT) models. Our Thai-specific filters shrink raw Common Crawl data from 202 million to 25 million documents and raise SEA-HELM NLG from about 3 to roughly 11 points. An 8-billion-parameter model trained on the resulting 47-billion-token corpus surpasses SEA-LION-v3 and Llama-3.1 on Thai benchmarks by about two points. To ensure full reproducibility, we release the pipeline code, the cleaned corpus, and all CPT and SFT checkpoints.
Open-data expansion. Mangosteen currently includes only public-domain or explicitly Creative Commons licensed text. We chose this approach to respect the legal rights of content owners. We are working with agencies and rights-holders to unlock more than 10 billion additional tokens under permissive licenses.

Pre-training Cost and Shared Knowledge. Computational expense is still the main obstacle to studying the impact of data pipeline design. For this project, we secured 2,000 GPU-hours on H100 hardware, giving us effectively one full trial. Until reliable low-compute pre-training methods emerge, maximizing the public value of every costly run through complete knowledge sharing is essential to the collective progress of this field. With this limited experimentation budget, our model attains strong overall scores (approximately 2 points above recent Thai baselines), confirming the effectiveness of Thai-aware curation. The central contribution, however, is not the lead itself but the openly released pipeline, corpus, and checkpoints that let others build on and quickly surpass these results.

Acknowledgement

This research is supported by the National Research Foundation, Singapore, under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
References

SCB 10X, VISTEC, and SEACrowd. 2024. Thai LLM Leaderboard.

Ashish Agrawal, Barah Fazili, and Preethi Jyothi. 2024. Translation errors significantly impact low-resource languages in cross-lingual learning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 319–329, St. Julian's, Malta. Association for Computational Linguistics.

Giuseppe Attardi. 2015. WikiExtractor. https://github.com/attardi/wikiextractor.

Adrien Barbaresi. 2021. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–131. Association for Computational Linguistics.

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, and 16 others. 2025. An expanded massive multilingual dataset for high-performance language technologies. Preprint, arXiv:2503.10267.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, and Urmish Thakker. 2024. SambaLingo: Teaching large language models new languages. Preprint, arXiv:2404.05829.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Julen Etxaniz, Oscar Sainz, Naiara Miguel, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, and Aitor Soroa. 2024. Latxa: An open language model and evaluation suite for Basque. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14952–14972, Bangkok, Thailand. Association for Computational Linguistics.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The Llama 3 herd of models. Preprint, arXiv:2407.21783.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, and Mitesh M. Khapra. 2024. IndicLLMSuite: A blueprint for creating pre-training and fine-tuning datasets for Indian languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15831–15879, Bangkok, Thailand. Association for Computational Linguistics.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, and 40 others. 2025. DataComp-LM: In search of the next generation of training sets for language models. Preprint, arXiv:2406.11794.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, and 42 others. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics.
Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, and 2 others. 2023. FinGPT: Large generative models for a small language. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2710–2726, Singapore. Association for Computational Linguistics.
Chenghao Mou, Chris Ha, Kenneth Enevoldsen, and Peiyuan Liu. 2023. Chenghaomou/text-dedup: Reference snapshot.
Raymond Ng, Thanh Ngan Nguyen, Yuli Huang, Ngee Chia Tai, Wai Yi Leong, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Wayne Lau, Choon Meng Tan, and 12 others. 2025. SEA-LION: Southeast Asian languages in one network. Preprint, arXiv:2504.05747.
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. Preprint, arXiv:2309.09400.
Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024a. The FineWeb datasets: Decanting the web for the finest text data at scale. Preprint, arXiv:2406.17557.
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Martin Jaggi, Leandro von Werra, and Thomas Wolf. 2024b. FineWeb2: A sparkling update with 1000s of languages.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. Preprint, arXiv:2306.01116.
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai natural language processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore. Association for Computational Linguistics.
Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, and Kasima Tharnpipitchai. 2023. Typhoon: Thai large language models. Preprint, arXiv:2312.13951.
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, and Kasima Tharnpipitchai. 2024. Typhoon 2: A family of open text and multimodal Thai large language models. Preprint, arXiv:2412.13702.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, and 61 others. 2021. Scaling language models: Methods, analysis & insights from training Gopher. ArXiv, abs/2112.11446.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint, arXiv:1910.10683.
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, and 5 others. 2025. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. Preprint, arXiv:2412.03304.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, and 17 others. 2024. Dolma: An open corpus of three trillion tokens for language model pretraining research. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15725–15788, Bangkok, Thailand. Association for Computational Linguistics.
Thanathip Suntorntip, Arthit Suriyawongkul, and Wannaphong Phatthiyaphaibun. 2024. nlpo3.
Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. SEA-HELM: Southeast Asian holistic evaluation of language models. Preprint, arXiv:2502.14301.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Sumeth Yuenyong, Kobkrit Viriyayudhakorn, Apivadee Piyatumrong, and Jillaphat Jaroenkantasima. 2025. OpenThaiGPT 1.5: A Thai-centric open source large language model. Preprint, arXiv:2411.07238.

A GPT-2 Config
We use the following GPT-2 configuration: sequence length 1024, n_layer 12, n_head 12, n_embd 768, vocab_size 50257, learning rate 0.0006, and num_train_epochs 1.
B Language Identification
In this section, we compare three language identification libraries (langdetect, lingua, and fastText) and present our findings. The Dolma codebase supports five language identifiers: cld3, pycld2, langdetect, Lingua, and fastText. While fastText was used in the original Dolma implementation, its effectiveness on Thai text remains uncertain. We therefore focus our evaluation on langdetect, lingua, and fastText, excluding cld3 and pycld2 due to their outdated models and limited adoption.
We conducted our experiments on a subsample of the raw Thai Common Crawl dataset, consisting of the first 200,000 documents. Each of the three language identifiers was run separately on this subsample to generate a language confidence score for each document. We used a threshold of 0.5, the same cut-off used in the Dolma English corpus, to determine language inclusion. The number of documents retained by each language identifier under this threshold is shown in Table 7.

Language Identifier | Documents Remaining
Lingua | 177,221
Langdetect | 181,872
FastText | 187,772

Table 7: Number of documents remaining after applying a 0.5 threshold.
In our analysis, we observed that lingua excluded the highest number of documents of the three identifiers. Regarding processing time, langdetect was by far the slowest, taking approximately 40 minutes to process all 200,000 documents, whereas lingua and fastText each completed in roughly 3 minutes.
Subsequently, we calculated the proportions of Thai characters, vowels, and intonation marks in the documents retained by each identifier and compared their distributions. As illustrated in Figure 2, the majority of documents exhibit a high proportion of Thai characters, vowels, and intonation marks, indicating that the language identifiers are generally effective at detecting Thai text. However, fastText flags a larger number of documents as Thai even when those documents contain a relatively low proportion of Thai-specific characters (less than 40%). This suggests that fastText is more permissive, or less precise, in its classification than Lingua and Langdetect.
To further evaluate fastText's performance, we isolated the subset of documents in which the combined proportion of Thai characters, vowels, and intonation marks was at most 40%, and analyzed the distribution of the language confidence scores fastText assigned to them. This analysis assesses the reliability of its language identification when Thai-specific character usage is minimal. As illustrated in Figure 3, fastText assigns high confidence scores, often approaching 100%, to documents whose proportion of Thai characters, vowels, and intonation marks does not exceed 40%. This indicates a tendency toward overconfidence, with potential misclassification of documents containing minimal Thai-specific content. Notably, 8,008 documents (4% of the 200,000-document sample) were identified by fastText as Thai with confidence scores above 0.5 despite containing 40% or fewer Thai-specific characters.
Given these results, Lingua might seem to be the only language identifier good enough for Thai. However, considering Thai's distinctive script, we concluded that relying on the Lingua language identifier may not be optimal.
Figure 2: Comparison of the distribution of Thai character ratios for three language identifiers: Lingua, Langdetect, and FastText. Notably, FastText assigns a language score of 0.5 or higher to a portion of documents whose Thai character ratio is only between 10% and 40%, indicating potential misclassification.
Figure 3: Scatter plot showing that even when the fastText score is high, the proportion of Thai characters can be very low. The overall trend deviates from what would be expected in an ideal language classification scenario.
Figure 4: Comparison between Thai character ratio and fastText language scores across documents, highlighting inconsistencies in the confidence scores relative to Thai character content.

Therefore, we developed the ThaiCharRatioTagger, a custom language identifier that calculates the percentage of Thai characters, vowels, and intonation marks within a document. This approach simplifies the codebase and offers a more straightforward alternative to complex machine learning models.
C Gopher Rules
As stated in Section 3.1.3, the original English Dolma corpus relies heavily on the Gopher Rules, using all of them. In this work, we aim to follow the same methodology as the original while adjusting the thresholds of certain rules to better suit the Thai language. The original Gopher Rules used in the English Dolma corpus consist of 11 rules:
• Fraction of characters in the most common n-gram exceeds a threshold
• Fraction of characters in duplicate n-grams exceeds a threshold
• Contains fewer than 50 or more than 100K words
• Median word length is less than 3 or greater than 10
• Symbol-to-word ratio exceeds 0.10
• Fraction of words containing alphabetic characters is less than 0.80
• Contains fewer than 2 of a predefined set of required words
• Fraction of lines in the document starting with a bullet point exceeds 0.90
• Fraction of lines in the document ending with an ellipsis exceeds 0.30
• Fraction of lines in the document that are duplicated exceeds 0.30
• Fraction of characters in duplicated lines exceeds 0.30

We begin by examining the correlations between the rules. We first use the Dolma pipeline to obtain scores on a 200,000-document subsample. Specifically, instead of tokenizing by spaces, we use the ICU tokenizer to obtain a list of tokens for each document. We use the set of required words provided by Stopwords ISO (https://github.com/stopwords-iso/stopwords-th), which essentially corresponds to a list of Thai stopwords. In addition, we divide each document into sentences using the delimiter \n+.
In the Dolma work, the authors noted that stacking quality filtering, content filtering, and deduplication has a positive compounding effect, since these steps overlap very little in the texts they remove. Our study reaches a similar conclusion for the Gopher rules. As shown in Figure 5 (Spearman correlation heatmap) and Figure 6 (Pearson correlation heatmap), we observe only low to moderate correlations between most rules.
Some exceptions exist; for example, the "median word length" rule and the "fraction of words with alpha characters" rule are somewhat correlated. This is expected, as a higher median word length generally implies a higher number of alphabetic characters. A similar relationship is observed between the "word count" rule and the "required word count" rule. Based on this analysis, we choose to adjust the threshold of each rule individually. Note that for the rules "fraction of characters in most common n-gram" and "fraction of characters in duplicate n-grams," we use the average value within each group in our analysis, as similar rules tend to exhibit high intra-group correlation.
The next question we address is which rules should be adjusted: all of them, or only a subset? To explore this, we examine the top 10 combinations of rules that remove the largest number of documents. As shown in Table 8, approximately 78% of the documents are filtered out by these rules or combinations of rules. The rules that appear most frequently in the table are: fraction of words with alpha characters, fraction of characters in duplicate n-grams, median word length, fraction of duplicate lines, and required word count. Among these, we choose to investigate the "median word length" rule further, as the other rules can be applied directly to Thai without modification.
Figure 5: Spearman correlation heatmap of Gopher Rule scores.
Figure 6: Pearson correlation heatmap of Gopher Rule scores.
Combination of Rules | Count | Percentage | Cumulative %
fraction of words with alpha character | 50,712 | 29.3318 | 29.3318
fraction of characters in duplicate n-grams, fraction of words with alpha character | 17,823 | 10.3088 | 39.6406
median word length, fraction of words with alpha character | 17,028 | 9.8490 | 49.4896
fraction of characters in duplicate n-grams, median word length, fraction of words with alpha character | 12,812 | 7.4104 | 56.9000
fraction of characters in duplicate n-grams, median word length, fraction of words with alpha character, fraction of duplicate lines, fraction of characters in duplicate lines | 9,103 | 5.2652 | 62.1652
fraction of characters in duplicate n-grams, fraction of words with alpha character, fraction of duplicate lines, fraction of characters in duplicate lines | 8,447 | 4.8857 | 67.0509
fraction of characters in duplicate n-grams, median word length, fraction of words with alpha character, fraction of duplicate lines | 7,631 | 4.4138 | 71.4647
median word length, fraction of words with alpha character, required word count | 5,268 | 3.0470 | 74.5117
fraction of characters in duplicate n-grams, fraction of words with alpha character, fraction of duplicate lines | 3,694 | 2.1366 | 76.6483
median word length, fraction of words with alpha character, fraction of duplicate lines | 2,772 | 1.6033 | 78.2516

Table 8: Top combinations of Gopher Rules by number of filtered documents.

Statistic | Value
Count | 200,000
Mean | 675.47
Standard Deviation | 1,964.60
Minimum | 0
10th Percentile | 56
30th Percentile | 131
50th Percentile (Median) | 283
70th Percentile | 600
90th Percentile | 1,556
95th Percentile | 2,384
99th Percentile | 4,823.01
Maximum | 195,713

Table 9: Descriptive statistics of word counts in the sub-document dataset, including the mean, standard deviation, median, minimum, maximum, and selected percentiles.

Although not included in the table, we also specifically examine the "word count" rule, as we believe that typical word counts differ significantly between Thai and English documents. Furthermore, we observe that low word count documents in Thai often correspond to low-quality text data. Some rules can be modified with minimal data exploration, as the two adjustments below show (sketched in code after the list):
• The first is the rule "fraction of words with alpha characters." We modify it to "fraction of words with Thai letters" by counting Thai characters instead of alphabetic characters.
• For the rule "fraction of lines in document ending with ellipsis," we additionally detect three plain dots (...) at the end of each line, as this pattern appears frequently in our corpus.
C.1 Word Count
To begin our experiment, we use the same dataset described in Appendix B, the 200,000-document subsample. Table 9 presents descriptive statistics of the word counts in this dataset, where each document is tokenized using the ICU tokenizer. The distribution of word counts is clearly right-skewed. If we apply the original Gopher cutoffs of 50 and 100,000 words, very few documents are filtered out, and we suspect that many low-quality documents would remain. Upon closer inspection through data binning (Table 10), we find that the interval from 21 to 170 words contains a high proportion of documents, approximately 35% of the subsample, and we suspect this interval holds a disproportionately large amount of low-quality data.
To confirm our assumption, we conduct an experiment using perplexity scores to assess data quality, computed with the airesearch/wangchanbart-base model. Table 11 presents the perplexity statistics under four settings. In the Baseline setting, no filtering is applied; the perplexity scores cover the entire 200,000-document subsample. In the Gopher Rules setting, documents are filtered using the original Gopher cutoffs of 50 and 100,000 words. The Experiment setting raises the lower cutoff from 50 to 171 in order to filter out all low word count documents, which we suspect may be of lower quality. Finally, the Short Text setting isolates documents with word counts between 21 and 170; notably, the perplexity scores in this group are nearly as high as those in the baseline, supporting our hypothesis that short documents in this range are likely to be of lower quality.
From these statistics, we observe that the Experiment setting yields the lowest mean and median perplexity scores, suggesting that removing very short documents improves overall data quality. To further support this conclusion, we report the Kruskal–Wallis H-test statistics and corresponding p-values in Table 12. In all cases, the p-values indicate a statistically significant difference, leading us to reject the null hypothesis that the medians across settings are equal.
Figure 7: Histograms of word counts in the sub-document dataset. The left plot shows the distribution using automatic binning for finer granularity, while the right plot uses a fixed bin size of 50 to provide a more uniform view.
Figure 8: Trends in the mean and median perplexity scores across the experimental settings (Baseline, Gopher rules, Experiment, Short text).

Bin | Frequency | Cumulative Frequency | Proportion (%) | Cumulative Proportion (%)
[0.0000, 21.3056] | 2,762 | 2,762 | 1.3810 | 1.3810
[21.3056, 42.6111] | 11,137 | 13,899 | 5.5685 | 6.9495
[42.6111, 63.9167] | 10,778 | 24,677 | 5.3890 | 12.3385
[63.9167, 85.2223] | 14,169 | 38,846 | 7.0845 | 19.4230
[85.2223, 106.5279] | 11,247 | 50,093 | 5.6235 | 25.0465
[106.5279, 127.8334] | 8,541 | 58,634 | 4.2705 | 29.3170
[127.8334, 149.1390] | 7,834 | 66,468 | 3.9170 | 33.2340
[149.1390, 170.4446] | 6,929 | 73,397 | 3.4645 | 36.6985
[170.4446, 191.7502] | 5,854 | 79,251 | 2.9270 | 39.6255
[191.7502, 213.0557] | 5,426 | 84,677 | 2.7130 | 42.3385
[213.0557, 234.3613] | 5,083 | 89,760 | 2.5415 | 44.8800
[234.3613, 255.6669] | 4,680 | 94,440 | 2.3400 | 47.2200
[255.6669, 276.9725] | 4,286 | 98,726 | 2.1430 | 49.3630
[276.9725, 298.2780] | 4,169 | 102,895 | 2.0845 | 51.4475
[298.2780, 319.5836] | 3,622 | 106,517 | 1.8110 | 53.2585
[319.5836, 340.8892] | 3,400 | 109,917 | 1.7000 | 54.9585
[340.8892, 362.1948] | 3,639 | 113,556 | 1.8195 | 56.7780
[362.1948, 383.5003] | 3,117 | 116,673 | 1.5585 | 58.3365
[383.5003, 404.8059] | 2,939 | 119,612 | 1.4695 | 59.8060
[404.8059, 426.1115] | 2,598 | 122,210 | 1.2990 | 61.1050
[426.1115, 447.4170] | 2,432 | 124,642 | 1.2160 | 62.3210
[447.4170, 468.7226] | 2,186 | 126,828 | 1.0930 | 63.4140
[468.7226, 490.0282] | 2,288 | 129,116 | 1.1440 | 64.5580
[490.0282, 511.3338] | 2,114 | 131,230 | 1.0570 | 65.6150
[511.3338, 532.6393] | 2,169 | 133,399 | 1.0845 | 66.6995
[532.6393, 553.9449] | 2,082 | 135,481 | 1.0410 | 67.7405
[553.9449, 575.2505] | 2,034 | 137,515 | 1.0170 | 68.7575
[575.2505, 596.5561] | 2,113 | 139,628 | 1.0565 | 69.8140
[596.5561, 617.8616] | 1,843 | 141,471 | 0.9215 | 70.7355
[617.8616, 639.1672] | 1,803 | 143,274 | 0.9015 | 71.6370
Table 10: Binning analysis of word counts in the sub-document dataset. A notably high proportion of documents fall within the 21–170 word range. Only the first 30 bins are shown for display purposes.

Statistic | Baseline | Gopher Rules | Experiment | Short Text
Count | 200,000 | 182,976 | 126,596 | 70,635
Mean | 1143.80 | 1101.69 | 1074.90 | 1137.07
Standard Deviation | 4414.26 | 2162.41 | 2359.05 | 1583.46
Minimum | 1.37 | 1.37 | 1.37 | 5.37
10th Percentile | 176.05 | 171.94 | 169.55 | 187.32
30th Percentile | 402.48 | 397.45 | 390.92 | 416.09
50th Percentile (Median) | 710.95 | 706.76 | 693.37 | 724.38
70th Percentile | 1136.51 | 1126.69 | 1081.74 | 1240.15
90th Percentile | 2163.38 | 2135.86 | 1985.92 | 2372.85
95th Percentile | 3153.02 | 3096.17 | 2922.14 | 3298.03
99th Percentile | 7660.35 | 7435.88 | 8193.99 | 5539.36
Maximum | 814,671.56 | 539,081.63 | 539,081.63 | 51,177.91

Table 11: Perplexity statistics across four experimental settings. Applying the Gopher Rules results in a noticeable decrease in perplexity scores, with a further reduction when the lower bound is increased in our experiment. The "Short Text" setting is included to illustrate that, despite comprising roughly 35% of the data, it yields perplexity scores nearly as high as the baseline, indicating lower quality.

Settings | H-statistic | p-value
Baseline vs Experiment | 135.063 | 0.00
Gopher rules vs Experiment | 73.337 | 0.00
Baseline vs Gopher rules vs Experiment | 138.922 | 0.00

Table 12: Kruskal–Wallis H-test results for perplexity scores across different experimental settings. The p-values for all comparisons are 0.00, indicating statistically significant differences in the population medians and rejecting the null hypothesis that the medians are equal.
Word Count | Text
13 | "Amornrat ฟรีแลนซมืออาชีพ โดนจางแลว 2 ครั้ง | Fastwork.co"
6 | "ตรวจสอบขอมูล ตรวจสอบขอมูล"
18 | "รูปสินคา หนุมไนซหัวใจสุดแซบ : ชุด Hot Girl สวยแสบซาส #1"
5 | "www.lampangsporttime.com คลิกเขาเว็บไซตฺ"
4 | "ศูนยชวยเหลือ Workplace"
9 | "ไอเหนอ หัวใจใหญ (wasin_nildee) on Pinterest"
11 | "Toyota Hilux Revo 2.8MT ป 2019 รถตอนเดียวมือสอง"
17 | "งบการเงิน ไตรมาสที่ 3/2565 (สอบทานแลว) | PTT Global Chemical"
20 | "พัฒนาโดย นายวีระยุทธ ประชุมรักษ พนักงานคอมพิวเตอร ประจําศาลแขวงอุบลราชธานีหากโปรแกรมมีปญหา กรุณาโทร 0876188331"
4 | "ตลาดหลักทรัพยแหงประเทศไทย -"
28 | "Login เขาสูระบบ Pantipmarket ประกาศนี้สามารถเลื่อนวันหมดอายุไดโดยสมาชิกเทานั้น กรุณา Login สมาชิกเพื่อเลื่อนวันหมดอายุ Username * Password *"
38 | "ตั๋วเครื่องบิน กิจกรรม TH JP EN CN TW KR TH プライバシーポリシー ・ 利用規約 に同意の上、ボタンを押してください。 ログイン (無料)するとより便利に利用できます"
34 | "สํารวจ บทความ รีวิวปายยา ขอมูลโปรดักส ซิสคอลแลปส โปรไฟล SistaCafe เขาสูระบบ สํารวจ บทความ Original Content ซิสปายยา ขอมูลโปรดักส ซิสคอลแลปส"
36 | "โคตรสวยของดี สด ซิงจมมิดควยงามจัดขอเย็ดบางเถอะละเลงลิ้น สิงหาคม 21, 2017 โคตรสวยของดี สด ซิงจมมิดควยงามจัดขอเย็ดบางเถอะละเลงลิ้น"
40 | "WALAI AutoLib is library automation system produced by Informatic Innovation Research Unit (IIRU), Walailak University. Starting from date 14 Dec 2561 2019 © WALAI AutoLib. ALL Rights Reserved. Privacy Policy | Terms of Service"
35 | "| BANANA GAS PLUS āļĻāļđāļāļĒāđāļāļīāļāļāļąāđāļāđāļāđāļŠāļĢāļāļĒāļāļāđ LPG āđāļĨāļ° NGV āđāļāļĢāļ°āļāļąāļāļĄāļēāļāļĢāļāļēāļ āđāļāļĒāļāđāļēāļāļāļĩāđāļĄāļĩāļāļ§āļēāļĄāļāļģāļāļēāļāļāđāļēāļāļāļēāļĢāļĢāļąāļāļĢāļāļ āļāļāļĒāļāļ§āļĨāļāļąāļ 44 āđāļĨāļĩāļĒāļāļāļēāļāļāđāļ§āļāļĢāļēāļĄāļāļīāļāļāļĢāļē āļĄāļ·āļāļāļ·āļ āļāļīāļāļāđāļ āļāļļāļāđāļāļāļĢāđ āļĄāļ·āļāļāļ·āļ 086-323-5305,086-308-6869"
42 | "9 The Green mile ปาฏิหาริยแดนประหาร The Green mile ปาฏิหาริยแดนประหาร IMDb: 9 189 นาที min 72 views The Green Mile : ปาฏิหาริยแดนประหาร หนังเรื่องนี้เปนเ […] Crim อาชญากรรมDrama ดรามาInter Movie หนังผรั่ง"
25 | "› เขาสูระบบ ชื่อผูใช รหัสผาน บันทึกการใชงานของฉัน คุณจํารหัสผานไมได? ← กลับไป"
24 | "UFAPOWERBET เว็บขาวกีฬาที่ดีที่สุดในไทย นําเสนอขอมูลขาวสารเกี่ยวกับ ขาวกีฬาทั่วโลก ทั้งหมด สดใหมกอนใคร"
29 | "กระชับผิวหนา, แกปญหาเรื่องผิวพรรณ 5 วิธี และ 5 การใชอายครีมขั้นเทพลดริ้วรอย By TopClinicThailand.com On พฤศจิกายน 8, 2020"
85 | "Skip to content Menu Close หนาแรก เกี่ยวกับเรา สินคา สินคา สั่งซื้อสินคา ประชาสัมพันธ ประกาศ/ขาวสาร กิจกรรม โปรโมชั่นทั่วไป สมาชิก ติดตอเรา THAILAND LAOS VIETNAM CAMBODIA facebook You Tube Search for: Menu Search for: Menu Add custom text here or remove it Search for: หนาแรก เกี่ยวกับเรา สินคา สินคา สั่งซื้อสินคา ประชาสัมพันธ ประกาศ/ขาวสาร กิจกรรม โปรโมชั่นทั่วไป สมาชิก ติดตอเรา THAILAND LAOS VIETNAM CAMBODIA facebook You Tube Button darkblurbg.jpg"
79 | "- - Step 1 หยิบใสตะกรา - Step 2 ระบุขอมูลจัดสง - Step 3 ยืนยันขอมูล - Step 4 เลือกวิธีการชําระเงิน - Step 5 เสร็จสิ้น - - Step 1 / 5 หยิบใสตะกรา |สินคา |ราคา |จํานวน |รวมราคา เลือกซื้อสินคาตอ ขั้นตอนตอไป ตะกราสินคาราคาพิเศษ |สินคา |ราคา |จํานวน |รวมราคา เลือกซื้อสินคาตอ ขั้นตอนตอไป"
81 | "【M98】-galaxy 88 เครดิต ฟรี 【M98】-galaxy 88 เครดิต ฟรีเว็บคาสิโนเปดใหม google playinferno joker slot review 855ufabet biz 【M98】-galaxy 88 เครดิต ฟรีalotto888 qq joker สล็อต777 zen 【M98】-galaxy 88 เครดิต ฟรีพนันออนไลน เว็บไหนดี pantip english ทาง เขา บา คา รา รอยัล wikiเว็บคาสิโนออนไลนอันดับ1 usb ที่มาของลิ้ง:【M98】-galaxy 88 เครดิต ฟรี"
80 | "{movie_title} Kingmaker (2022) หนังป 2022 IMDB 6.7 Genres: ดรามา เรื่องยอ ระหวางชวงการหาเสียง อยูดีๆก็เกิดเหตุระเบิดขึ้นที่บานของ คิมอุนบอม (ซอลคยองกู) 1 ในนักการเมืองตัวเต็งที่จะไดเปนประธานาธิบดี เมื่อสืบหาตนตอ กลับพบวาผูลงมืออาจจะเปน ซอชางแด (อีซอนกยุน) 1 ในทีมงานของเขาเอง"
84 | "รายการขาววันใหม “ชวง AEC Energy with Kasemsant” ตอนที่ 5 นโยบายพลังงานแตละประเทศ ออกอากาศ ทุกวันจันทร – อังคาร เวลา 00.20 น. (เที่ยงคืนยี่สิบนาที) ทางชอง 3 [smartslider3 slider="9"] รายการขาววันใหม “ชวง AEC Energy with Kasemsant” ตอนที่ 5 นโยบายพลังงานแตละประเทศ ออกอากาศ ทุกวันจันทร – อังคาร เวลา 00.20 น. (เที่ยงคืนยี่สิบนาที) ทางชอง 3"
82 | "[Album] เปดโลกชะนีใหมีมากกวาเครื่องสําอาง เมื่อผูชายชวนคุณมาอานหนังสือ รวมหนังสือจาก "#จินยองอาน" Image Tags ถาชอบบทความนี้ กด Like ไดเลยนาาา มีบทความดีๆ อีกเพียบ Gallery ที่เกี่ยวของ #ลากซิสเขาดอม! สองวงไอดอล 'BXW (Black X White)' หลอเท ฟนกรุบ จากเรียลลิตี้ PRODUCE 101 JAPAN ♥ Mollacake| กิจกรรม SistaCafe"
66 | "You may also like มังกรฟา🥳 ขอแสดงความยิ […] Bluedragontary.com สลากมังกรฟา รหัส BD00209F ตัวแทนจําหนายลอตเตอรี่มังกรฟา ลอตเตอรี่ออนไลน สลากกินแบงออนไลน หวยมังกรฟา ขายลอตเตอรี่ออนไลน มีใหเลือกมากกวา 1 ลานหมายเลข คุณกดซื้อกับแพลตฟอรมม […]"
77 | "A B C D E F G H I J K L M N O P R S T V W Y 4.8 5.0 5.0 4.7 4.6 4.8 4.8 4.9 4.7 4.7 4.8 4.6 4.8 4.6 CUSTOMER SERVICE KONVY ฝายบริการลูกคา 02-105-4235 support@konvy.com จันทร - อาทิตย (ยกเวนวันหยุดนักขัตฤกษ)8.00AM - 6.00PM การสั่งสินคาเปน 0 สินคาทั้งหมด 0 ชิ้น จํานวนเงิน: ฿0"
78 | "【Look618】-บา คา รา 6699 【Look618】-บา คา รา 6699ขนาด ไซส เสื้อ กีฬาall bet prediction ตัว จริง ลิเวอรพูล【Look618】-บา คา รา 6699สกรีน เสื้อ กีฬา ราคา ยู ฟา ยูโร ปา ลีก 2019 【Look618】-บา คา รา 6699gclub asia 88 link รองเทา ส ตั๊ ด ไน กี้ ใหม ลาสุดรองเทา ส ตั๊ ด หนา กวาง ที่มาของลิ้ง:【Look618】-บา คา รา 6699"
Figure 9: Examples of low word count documents. These examples illustrate the generally low quality of such content. Notably, some documents include material from illegal gambling websites and adult content sources.

Statistic | Value
Count | 200,000
Mean | 3.23
Standard Deviation | 0.72
Minimum | 0
10th Percentile | 3
30th Percentile | 3
50th Percentile (Median) | 3
70th Percentile | 3.5
90th Percentile | 4
95th Percentile | 4
99th Percentile | 5
Maximum | 54

Table 13: Descriptive statistics of median word length in the sub-document dataset, including the mean, standard deviation, median, minimum, maximum, and selected percentiles.

Based on this experiment, and specifically for the Thai language, we conclude that the upper bound of the word count rule can remain unchanged, while the lower bound should be increased. In our pipeline, we set the lower bound to 200 and retain the original upper bound of 100,000. Ideally, the optimal value for the lower bound should be determined through a series of data ablation experiments; due to limited resources and computational constraints, we were unable to perform such extensive evaluations.

C.2 Median Word Length
The original Gopher Rules filter out documents with a median word length less than 3 or greater than 10. In our Thai sub-document dataset, we observe that the majority of documents have a median word length between 3 and 4, or more broadly, between 2 and 5.
We conducted the same experiment described in Appendix C.1 to compute perplexity statistics. However, in this case, we modified the upper bound of
the median word length rule from 10 to 5, as 5 corresponds to the 99th percentile of our dataset. The
results of this experiment are presented below.
As shown in Table 15, we found that the perplexity statistics between the two settings did not differ
significantly, which suggests that the original Gopher Rule threshold may also be applicable to Thai.
Consequently, we decided to retain the original threshold values. The Kruskal–Wallis H-statistic comparing the Gopher Rules and Experiment settings is 0.027, with a p-value of 0.869, indicating no
statistically significant difference between the medians of the two samples.
In our Dolma pipeline, we therefore chose not to modify the threshold for this rule. However, we
recommend that readers with sufficient computational resources consider conducting a data ablation
study to better understand the impact of this rule and identify the optimal threshold for their specific use
case.
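For concreteness, median word length under ICU word segmentation can be computed as follows (a sketch using PyICU; the Dolma taggers wrap tokenization differently):

```python
import statistics
from icu import BreakIterator, Locale  # PyICU

def median_word_length(text: str) -> float:
    """Median token length under Thai ICU word segmentation (Thai has no spaces)."""
    bi = BreakIterator.createWordInstance(Locale("th"))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:  # iterate over word boundaries
        word = text[start:end].strip()
        if word:
            tokens.append(word)
        start = end
    return float(statistics.median(len(t) for t in tokens)) if tokens else 0.0
```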
C.3 Gopher Rules Changes Summary
To provide a clear comparison between the original and adapted Gopher Rules used in our work, we present a detailed summary of all modifications in Table 16. These changes reflect adjustments made to better accommodate the characteristics of the Thai language.
D C4 Rules
Unlike the Gopher rules, we made minimal modifications to the C4 rules taggers. The original source
code for the C4 rules taggers returns the following attributes for a document:
• has_curly_brace: Indicates the presence of the character { in the document.
• has_lorem_ipsum: Checks if the phrase "lorem ipsum" is present in the document.
• has_javascript: Detects the presence of JavaScript code within the document.
• has_naughty_word: Identifies the presence of inappropriate or profane words.
• lines_with_no_ending_punctuation: Flags lines that do not end with standard punctuation marks.
• lines_with_too_few_words: Flags lines that contain fewer words than a predefined threshold.
• line_count: Records the total number of lines in the document.

First, we updated the corpus used for detecting inappropriate words, adopting the list provided by AI Singapore, which includes both English and Thai terms. Furthermore, we introduced a new attribute, corrupt_unicode, which identifies spans of corrupted Unicode characters in the text. Such spans are frequently encountered in our dataset; in the quality filtering pipeline, they are replaced with empty strings. We included corrupt_unicode in the C4 rules taggers for convenience. Note that although lines_with_no_ending_punctuation is the primary attribute of this rule set, we did not use it in our pipeline because sentence-ending punctuation differs between Thai and English: Thai sentences often do not end with punctuation marks, making this attribute less applicable.
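A corrupt_unicode check of this kind can be sketched as follows (the precise heuristic in our tagger is not spelled out here; this version flags replacement characters and the Latin Extended-A runs typical of Thai mojibake, as in the BANANA GAS example of Figure 9):

```python
import re

# U+FFFD replacement characters, plus runs of Latin Extended-A letters
# ("āļ...") that appear when Thai UTF-8 bytes are decoded as a single-byte codec.
CORRUPT_SPAN = re.compile(r"\uFFFD+|[\u0100-\u017F]{3,}")

def corrupt_unicode_spans(text: str) -> list[tuple[int, int]]:
    """Character offsets of likely-corrupted spans."""
    return [m.span() for m in CORRUPT_SPAN.finditer(text)]

def strip_corrupt_unicode(text: str) -> str:
    """Replace corrupted spans with empty strings, as done in the pipeline."""
    return CORRUPT_SPAN.sub("", text)
```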
Figure 10: Histogram of the distribution of median word lengths in the sub-document dataset. The distribution is right-skewed. Vertical lines are drawn at x = 3 and x = 10 to indicate the threshold values used in the original Gopher Rules; an additional line at x = 5 marks the 99th percentile in our dataset.
Median Word Length | Count | Percentage | Cumulative %
0 | 4 | 0.0020 | 0.0020
1 | 4,985 | 2.4925 | 2.4945
2 | 9,909 | 4.9545 | 7.4490
3 | 126,156 | 63.0780 | 70.5270
4 | 55,471 | 27.7355 | 98.2625
5 | 2,880 | 1.4400 | 99.7025
6 | 433 | 0.2165 | 99.9190
7 | 83 | 0.0415 | 99.9605
8 | 26 | 0.0130 | 99.9735
9 | 10 | 0.0050 | 99.9785
10 | 11 | 0.0055 | 99.9840
11 | 7 | 0.0035 | 99.9875
12 | 2 | 0.0010 | 99.9885
13 | 3 | 0.0015 | 99.9900
14 | 1 | 0.0005 | 99.9905
15 | 3 | 0.0015 | 99.9920
17 | 6 | 0.0030 | 99.9950
18 | 1 | 0.0005 | 99.9955
20 | 3 | 0.0015 | 99.9970
21 | 1 | 0.0005 | 99.9975
23 | 1 | 0.0005 | 99.9980
30 | 1 | 0.0005 | 99.9985
33 | 1 | 0.0005 | 99.9990
34 | 1 | 0.0005 | 99.9995
54 | 1 | 0.0005 | 100.0000

Table 14: Value counts, percentages, and cumulative percentages of median word lengths in the sub-document dataset. The distribution clearly shows that the majority of documents have a median word length of 3 or 4.
Statistic | Baseline | Gopher Rules | Experiment
Count | 200,000 | 185,069 | 184,370
Mean | 1143.80 | 1170.90 | 1168.76
Standard Deviation | 4414.26 | 2227.70 | 2203.26
Minimum | 1.37 | 8.71 | 8.71
10th Percentile | 176.05 | 206.64 | 206.77
30th Percentile | 402.48 | 434.22 | 434.07
50th Percentile (Median) | 710.95 | 747.78 | 747.27
70th Percentile | 1136.51 | 1181.81 | 1181.60
90th Percentile | 2163.38 | 2238.32 | 2237.23
95th Percentile | 3153.02 | 3264.85 | 3262.91
99th Percentile | 7660.35 | 7933.48 | 7917.08
Maximum | 814,671.56 | 191,404.94 | 191,404.94

Table 15: Perplexity statistics across three experimental settings. Applying the Gopher Rules results in a reduction in perplexity scores. However, further tightening the upper bound of the median word length rule (from 10 to 5) has no significant additional effect on perplexity in our experiment.

Original Rules | Our Rules
Fraction of characters in most common n-gram greater than a threshold | Fraction of characters in most common n-gram greater than a threshold
Fraction of characters in duplicate n-grams greater than a threshold | Fraction of characters in duplicate n-grams greater than a threshold
Contains fewer than 50 or more than 100K words | Contains fewer than 200 or more than 100K words
Median word length is less than 3 or greater than 10 | Median word length is less than 3 or greater than 10
Symbol to word ratio greater than 0.10 | Symbol to word ratio greater than 0.10
Fraction of words with alpha character less than 0.80 | Fraction of words with Thai letters less than 0.80
Contains fewer than 2 of a set of required words | Contains fewer than 2 of a set of required words
Fraction of lines in document starting with bullet point greater than 0.90 | Fraction of lines in document starting with bullet point greater than 0.90
Fraction of lines in document ending with ellipsis greater than 0.30 | Fraction of lines ending with ellipsis or "..." greater than 0.30
Fraction of lines in document that are duplicated greater than 0.30 | Fraction of lines that are duplicated greater than 0.30
Fraction of characters in duplicated lines greater than 0.30 | Fraction of characters in duplicated lines greater than 0.30
– | Contains one of the following: read more

Table 16: Summary of modifications to Gopher Rules for the Thai language.