Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Summary

This study presents the first systematic comparison of equitable tokenizers for multilingual large language models, focusing on eleven Southeast Asian languages. It addresses the structural bias of standard Byte-level Byte-Pair Encoding (BPE), which favors Latin scripts and inflates inference costs for underrepresented languages. The authors evaluate three alternative approaches: Parity-aware BPE, Morphology-Driven Byte Encoding (MYTE), and Byte Latent Transformer (BLT), alongside a BPE baseline. Intrinsic metrics reveal that Parity-aware BPE achieves the best balance between compression efficiency and cross-lingual equity, lying on the Pareto frontier. MYTE offers superior semantic reasoning and machine translation performance due to morphologically rich representations but incurs higher computational costs and lower compression rates. Conversely, BLT underperforms on downstream tasks, likely because its entropy-driven patch segmentation fails to correct corpus imbalances for low-resource languages. Extrinsic evaluations using 1.5B-parameter models show that while standard BPE leads in English classification tasks, MYTE excels in multilingual inference and translation. Statistical analysis confirms that tokenizer choice, rather than script type, is the primary determinant of tokenization parity. The authors recommend Parity-aware BPE as a responsible default for Southeast Asian applications due to its favorable efficiency-equity trade-off, while suggesting MYTE for scenarios prioritizing translation quality where computational budgets allow. These findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally conflicting goals.

PDF viewer

Chunks(32)

Chunk 0 · 1,993 chars

Equity with Efficiency: An Empirical Study of Tokenizers for
Multilingual Large Language Models
Kieron Seven Jun Wei Lee1 Muhammad Reza Qorib2
Andrew Ivan Soegeng1,3 Hwee Tou Ng1
1National University of Singapore 2Carnegie Mellon University 3SAP
e0968891@u.nus.edu, mrqorib@cmu.edu,
andrew.soegeng@u.nus.edu, dcsnght@nus.edu.sg
Abstract
Multilingual large language models (LLMs)
depend on subword tokenization to bridge dis-
crete text and continuous neural representa-
tion. State-of-the-art multilingual LLMs often
use Byte-level Byte-Pair Encoding (BPE) to-
kenizers that structurally favor high-resource
languages and Latin scripts. For speakers of
underrepresented languages, particularly those
across Southeast Asia, this bias inflates infer-
ence costs and widens cross-lingual capability
gaps. We present the first systematic compari-
son of equitable tokenizers on a unified bench-
mark spanning 11 Southeast Asian languages.
Beyond tokenizer-level analysis of compres-
sion efficiency and cross-lingual equity, we as-
sess downstream task performance through con-
trolled 1.5B-parameter language model training
using the same training data. Our results show
that Parity-aware BPE lies on the Pareto fron-
tier of the efficiency-equity trade-off, achiev-
ing strong compression parity at competitive
cost. Morphology-Driven Byte Encoding de-
livers the best semantic reasoning performance
through morphologically richer representations,
albeit at a higher computational expense. Byte
Latent Transformer underperforms on down-
stream tasks, possibly because its architectural
assumptions misalign with the constraints of
limited low-resource training data. Together,
our findings demonstrate that cross-lingual fair-
ness and tokenization efficiency are not funda-
mentally at odds, and offer practical guidance
for designing equitable multilingual models.1
1 Introduction
Multilingual large language models (LLMs) are
central to cross-lingual information access, yet
their performance

Chunk 1 · 1,997 chars

our findings demonstrate that cross-lingual fair-
ness and tokenization efficiency are not funda-
mentally at odds, and offer practical guidance
for designing equitable multilingual models.1
1 Introduction
Multilingual large language models (LLMs) are
central to cross-lingual information access, yet
their performance remains deeply uneven across
languages and scripts. A key driver of this disparity
is tokenization: how raw text is segmented into
subword units shapes model capacity, sequence
1Source code will be publicly released upon paper publi-
cation.
length, and effective context window across lan-
guages (Petrov et al., 2023) .
Byte-level Byte-Pair Encoding (BPE) (Sennrich
et al., 2016) is a widely used tokenization strat-
egy in state-of-the-art LLMs, including the GPT
(OpenAI, 2025) and Llama (Touvron et al., 2023)
families, due to its simplicity and compression effi-
ciency. Byte-level BPE encodes characters as UTF-
8 bytes (Consortium, 2011) and iteratively learns
byte-pair merges based on global co-occurrence
frequency. This procedure introduces a structural
bias as one Latin character is encoded as a single
byte, while one non-Latin character requires two
or more bytes. Combined with English-centric pre-
training corpora, BPE’s merge operations dispro-
portionately favor Latin scripts and high-resource
languages (Arnett et al., 2024).
The practical consequences of such bias are sig-
nificant. Petrov et al. (2023) demonstrated that
GPT-4’s Byte-level BPE tokenizer produces se-
quence length disparities of up to 15×, with Chi-
nese requiring 1.9× more tokens than English, Viet-
namese 2.5×, and Burmese 11.7×. For speakers of
low-resource non-Latin languages such as Khmer
and Lao, these disparities translate directly into
higher inference costs, degraded long-context rea-
soning, and diminished downstream task accuracy
(Tamang and Bora, 2024).
Several tokenizers have been proposed to ad-
dress these inequities. Parity-aware Byte-Pair En-
coding rebalances

Chunk 2 · 1,999 chars

-Latin languages such as Khmer
and Lao, these disparities translate directly into
higher inference costs, degraded long-context rea-
soning, and diminished downstream task accuracy
(Tamang and Bora, 2024).
Several tokenizers have been proposed to ad-
dress these inequities. Parity-aware Byte-Pair En-
coding rebalances merge frequencies across scripts
(Foroutan et al., 2025). Morphology-Driven Byte
Encoding (MYTE) grounds segmentation in mor-
phological structure (Limisiewicz et al., 2024).
Byte Latent Transformer (BLT) sidesteps a fixed
vocabulary by operating directly over dynamic byte
patches (Pagnoni et al., 2025). Each work evalu-
ates its approach against BPE baselines, reporting
improvements in equity and multilingual capability.
However, these methods have never been compared
against each other under uniform experimental con-
1
arXiv:2606.15044v1 [cs.CL] 13 Jun 2026

-- 1 of 15 --

ditions.
In this paper, we present a benchmarking study
to address this gap with the first systematic analysis
of equitable tokenizers. We compare them across
eleven Southeast Asian (SEA) languages: English,
Burmese, Chinese, Indonesian, Khmer, Lao, Malay,
Tagalog, Tamil, Thai, and Vietnamese. Using Byte-
level BPE as a baseline, and controlling for train-
ing data, vocabulary size, and computational bud-
get, we evaluate intrinsic tokenizer metrics and
examine downstream LLM performance by train-
ing 1.5B-parameter decoder-only language models
from scratch. Our study provides a direct empiri-
cal comparison of equitable tokenization methods,
offering actionable insights for NLP practitioners
to build fairer multilingual LLMs.
2 Related Work
2.1 Subword Tokenization
Subword tokenization has become the standard pre-
processing step in multilingual LLMs, to uniformly
segment text in any language into tokens. How-
ever, when trained on heterogeneous multilingual
corpora, these approaches allocate vocabulary ca-
pacity toward languages with high resource or writ-
ten in Latin scripts,

Chunk 3 · 1,991 chars

ubword tokenization has become the standard pre-
processing step in multilingual LLMs, to uniformly
segment text in any language into tokens. How-
ever, when trained on heterogeneous multilingual
corpora, these approaches allocate vocabulary ca-
pacity toward languages with high resource or writ-
ten in Latin scripts, embedding structural bias and
inequity into the vocabulary.
The downstream consequences are well-
documented. Bostrom and Durrett (2020) showed
that BPE tokens frequently diverge from linguis-
tically motivated morpheme boundaries. More
recently, Selvamurugan et al. (2025) quantified
cross-lingual tokenization inequity through
normalized sequence length and subword fertility,
demonstrating that the gap is most pronounced for
underrepresented scripts. These findings motivate
moving beyond global frequency optimization
as the main design criterion for multilingual
tokenizers.
2.2 Parity-aware Byte-Pair Encoding
Parity-aware BPE (PA BPE; Foroutan et al., 2025)
modifies Byte-level BPE by optimizing the worst-
case compression rate across languages. Each
merge iteration selects the pair that most improves
the worst-performing language, trading marginal
global efficiency for tokenization equity.
The approach requires minimal implementation
changes to existing BPE pipelines. On a 30-
language unbalanced dataset, it achieves a lower
Gini coefficient of 0.011 versus 0.064 for Byte-
level BPE, while remaining competitive on com-
pression and outperforming or matching Byte-level
BPE baselines across 13 multilingual benchmarks.
2.3 Morphology-Driven Byte Encoding
MYTE (Limisiewicz et al., 2024) replaces UTF-8’s
character-based convention with morpheme-based
byte codes, as morphemes exhibit more consistent
sequence lengths than characters across languages.
It learns a per-language morpheme inventory to
achieve balanced morphological coverage via Mor-
fessor 2.0 (Smit et al., 2014), and assigns shorter
byte sequences to linguistically meaningful units.
MYTE

Chunk 4 · 1,994 chars

morpheme-based
byte codes, as morphemes exhibit more consistent
sequence lengths than characters across languages.
It learns a per-language morpheme inventory to
achieve balanced morphological coverage via Mor-
fessor 2.0 (Smit et al., 2014), and assigns shorter
byte sequences to linguistically meaningful units.
MYTE produces shorter encoding compared to
UTF-8 for all 99 languages tested, with gains rang-
ing from 1% for Vietnamese and Chinese to nearly
70% for Burmese. Its worst-case tokenizer par-
ity relative to English is 1.7, versus 3.5 for UTF-
8. MyT5, a MYTE-encoded variant of ByT5
(Xue et al., 2022), demonstrated reduced cross-
language perplexity disparity compared to its byte-
level counterpart. It achieves 75.3 F1 on XTREME-
UP (Ruder et al., 2023) question answering versus
73.2 for ByT5.
2.4 Byte Latent Transformer
BLT (Pagnoni et al., 2025) eliminates explicit to-
kenization entirely and comprises three modules:
a lightweight local encoder producing patches, a
large latent transformer processing them, and a
lightweight local decoder reconstructing bytes. An
entropy model drives patch segmentation, allocat-
ing computation proportional to data complexity.
BLT enables a 50% reduction in inference
FLOPs relative to Llama 3’s original tokenizer
without sacrificing downstream task performance.
(Grattafiori et al., 2024). By avoiding a static vo-
cabulary from tokenization, BLT sidesteps mul-
tilingual inequity that arises when high-resource
language tokens dominate and outperforms Llama
3 by 2 BLEU points (Papineni et al., 2002) on trans-
lation into English.
3 Methods
We compare the three tokenizer families discussed
above to a baseline Byte-level BPE tokenizer. We
train all tokenizers on the same dataset to evaluate
their efficiency and cross-lingual equity. We then
train language models from scratch using these
tokenizers and evaluate their downstream task per-
formance. For fairness and reproducibility, data
2

-- 2 of 15 --

sizes are reported in

Chunk 5 · 1,998 chars

Byte-level BPE tokenizer. We
train all tokenizers on the same dataset to evaluate
their efficiency and cross-lingual equity. We then
train language models from scratch using these
tokenizers and evaluate their downstream task per-
formance. For fairness and reproducibility, data
2

-- 2 of 15 --

sizes are reported in number of sentences and bytes,
rather than tokens.
3.1 Training Data
For tokenizer training, we sample a total of 1 mil-
lion sentences (3.5GB) across eleven SEA lan-
guages from multilingual C4 (mC4) (Xue et al.,
2021). Sampling is performed randomly without
replacement following the language proportions
in mC4 to approximate realistic multilingual data
distribution. The resulting per-language sentence
counts are detailed in Appendix A.1.
For language model training, we adopt the same
training dataset as Foroutan et al. (2025) and sam-
ple 100 million sentences (203 GB) from FineWeb2
(Penedo et al., 2025). This dataset size is compara-
ble to what Foroutan et al. (2025) and Limisiewicz
et al. (2024) used to train their language models.
FineWeb2 is a multilingual web corpus with qual-
ity filtering already applied, and we did not apply
further preprocessing before training. Language
proportions are controlled using temperature sam-
pling with τ = 1.21 to boost the representation
of low-resource languages (Foroutan et al., 2025).
The details are provided in Appendix A.2.
Vocabulary sizes are controlled where possible
to enable a fair comparison of the four tokenizers.
MYTE was designed to have 4,096 morphemes
per language to avoid over-segmentation. Thus,
we train tokenizers at three scales: 4,096, 8,192,
and 12,288 tokens per language, across all eleven
SEA languages. For MYTE, this translates to total
morpheme inventories of 45k, 90k, and 135k mor-
phemes. The vocabulary sizes of Byte-level BPE
and Parity-aware BPE are matched to MYTE’s total
morpheme counts at each scale.
BLT’s patch-based representation is not directly
comparable since it does not

Chunk 6 · 1,996 chars

cross all eleven
SEA languages. For MYTE, this translates to total
morpheme inventories of 45k, 90k, and 135k mor-
phemes. The vocabulary sizes of Byte-level BPE
and Parity-aware BPE are matched to MYTE’s total
morpheme counts at each scale.
BLT’s patch-based representation is not directly
comparable since it does not learn a fixed vocab-
ulary. Following the approach of Pagnoni et al.
(2025), we configure BLT’s entropy model to yield
average patch sizes of 4.5, 6, and 8 bytes per patch.
We use tokenizers with vocabulary size of 90k
to train language models, placing them close to the
100k–128k vocabulary size of most LLM tokeniz-
ers (Wegmann et al., 2025). For BLT, we adopt the
entropy model with an average patch size of 4.5
bytes, following the setup of Pagnoni et al. (2025).
Note that BLT is not a tokenizer in the traditional
sense, but is referred to as one here for ease of
comparison.
3.2 Implementation Details
Training of tokenizers and tokenization of language
model training data for MYTE and BPE-based al-
gorithms were performed on a single AMD EPYC
9554P CPU (128 threads). For BLT, the entropy-
based tokenizer was trained on 4× NVIDIA H100
GPUs, and the language model training dataset was
tokenized on 8× NVIDIA H200 GPUs. Statistics
of the language model training dataset after tok-
enization are reported in Table 1.
Tokenizer Duration # Tokens File size
(size) (hour) (billion) (GB)
BLT (4.5) 33 42 204
MYTE (90k) 50 269 538
PA BPE (90k) 3 82 329
BPE (90k) 3 72 288
Table 1: Statistics of language model training dataset
after tokenization by the four tokenizers. Legend: size =
patch size for BLT, morpheme inventory size for MYTE,
vocabulary size for all other models; File Size = Size of
dataset files after tokenization; PA BPE = Parity-aware
BPE; BPE = Byte-level BPE.
Language model training is carried out on 4–8×
NVIDIA H100/H200 GPUs. To enable a fair com-
parison of computational cost, training durations
are converted to an 8× NVIDIA H200 equivalent,
as

Chunk 7 · 1,990 chars

ze for all other models; File Size = Size of
dataset files after tokenization; PA BPE = Parity-aware
BPE; BPE = Byte-level BPE.
Language model training is carried out on 4–8×
NVIDIA H100/H200 GPUs. To enable a fair com-
parison of computational cost, training durations
are converted to an 8× NVIDIA H200 equivalent,
as reported in Table 2. MYTE incurs the highest
training cost at 300 normalized hours due to its sub-
stantially larger token count (269B tokens), while
Byte-level BPE is the most efficient at 68 hours
(72B tokens). Additionally, we trained and com-
pared all language models at an equal token count
of 38B tokens as measured by their respective tok-
enizers. These experiments yielded the same con-
clusions as the models trained on the same dataset,
so we omit them for the sake of brevity.
Model Duration # Tokens
(size) (hour) (billion)
BLT (4.5) 160 42
MYTE (90k) 300 269
PA BPE (90k) 87 82
BPE (90k) 68 72
Table 2: Statistics of language model training.
3

-- 3 of 15 --

3.3 Evaluation Metrics
3.3.1 Intrinsic Metrics
Quantifying tokenizer efficiency and cross-lingual
fairness requires metrics that are agnostic to both
language and model architecture. We identify three
such metrics from recent literature and provide
brief descriptions below. Detailed definitions and
formulae can be found in Appendix B.
Tokenizer parity measures the ratio of the num-
ber of tokens per sentence in a given language rel-
ative to English (Petrov et al., 2023). A tokenizer
parity close to 1 indicates that the tokenizer im-
poses roughly equal computational cost across the
given language and English.
Gini coefficient adapts the income inequality
measure to the domain of tokenization fairness
(Foroutan et al., 2025). It quantifies the distribu-
tion of per-language tokenization costs, with values
ranging from 0 (perfect equality) to 1 (maximal in-
equality). A lower Gini coefficient reflects a more
equitable tokenizer.
Compression rate measures how efficiently a
tokenizer

Chunk 8 · 1,992 chars

he domain of tokenization fairness
(Foroutan et al., 2025). It quantifies the distribu-
tion of per-language tokenization costs, with values
ranging from 0 (perfect equality) to 1 (maximal in-
equality). A lower Gini coefficient reflects a more
equitable tokenizer.
Compression rate measures how efficiently a
tokenizer compresses text (Foroutan et al., 2025). A
higher compression rate indicates that the tokenizer
is more efficient and produces fewer tokens for the
same amount of text.
3.3.2 Extrinsic Metrics
We evaluate trained language models on English
and multilingual classification benchmarks using
Language Model Evaluation Harness (Biderman
et al., 2024) with zero-shot prompting. Details of
the benchmarks can be found in Appendix C. For
machine translation, we evaluate fine-tuned mod-
els with five-shot prompts drawn from the training
dataset of the multi-way parallel FLORES+ corpus
(Costa-jussà et al., 2024), following the setup of
Limisiewicz et al. (2024).
To assess English language understanding, mod-
els are evaluated on three English classification
benchmarks used by Pagnoni et al. (2025): PIQA
(Bisk et al., 2020), HellaSwag (Zellers et al., 2019),
and Arc-C (Clark et al., 2018). These benchmarks
test commonsense reasoning, sentence completion,
and science question answering respectively.
Cross-lingual generalization is assessed through
three multilingual classification benchmarks used
by Foroutan et al. (2025). XNLI (Conneau et al.,
2018) evaluates natural language inference across
multiple languages, XCOPA (Ponti et al., 2020)
tests causal commonsense reasoning in a multilin-
gual setting, and XStoryCloze (Lin et al., 2022)
assesses story completion across languages.
Machine translation is assessed after fine-tuning
via continual pre-training to ensure that results re-
flect the task-adapted performance of the models.
Fine-tuning details and the resulting per-language
sentence counts of the fine-tuning dataset are pro-
vided in Appendix A.3. BLEU

Chunk 9 · 1,997 chars

story completion across languages.
Machine translation is assessed after fine-tuning
via continual pre-training to ensure that results re-
flect the task-adapted performance of the models.
Fine-tuning details and the resulting per-language
sentence counts of the fine-tuning dataset are pro-
vided in Appendix A.3. BLEU (Papineni et al.,
2002) and chrF (Popovi´c, 2015) scores are com-
puted in both EN → XX and XX → EN directions
for all ten non-English SEA languages. These two
metrics range from 0 to 100 and higher scores indi-
cate better translation quality.
4 Experiments
4.1 Intrinsic Evaluation
We evaluate the trained tokenizers on FLORES+ de-
vtest set, which consists of 1,012 aligned sentences
across all eleven SEA languages. For Parity-aware
BPE, we train the base variant with the FLORES+
training dataset as the development corpus follow-
ing the setup of Foroutan et al. (2025). It is the
only tokenizer among those evaluated that requires
parallel data during training.
4.2 Extrinsic Evaluation
4.2.1 Language Model
The base architecture of our language model is
OLMo-2-1B (Walsh et al., 2025), a decoder-only
transformer comprising 16 layers, a hidden dimen-
sion of 2,048, and approximately 1.5 billion param-
eters. We train all models from scratch to ensure
that differences in downstream task performance
are largely attributed to tokenizer choice.
4.2.2 Training Configuration
Models are trained using the AdamW (Loshchilov
and Hutter, 2019) optimizer with a peak learning
rate of 4.0 × 10−4, weight decay of 0.1, β1 = 0.9,
β2 = 0.95, and gradient clipping of 1.0. The learn-
ing rate follows a Warmup-Stable-Decay (WSD)
schedule (Hu et al., 2024): a linear warmup over
the first 1% of training tokens, a stable phase over
the next 89%, and a linear decay over the final
10%. The global batch size is 512 sequences with
a maximum sequence length of 4,096 tokens, cor-
responding to approximately 2 million tokens per
training step.
4.2.3 Statistical Significance
Extrinsic

Chunk 10 · 1,996 chars

near warmup over
the first 1% of training tokens, a stable phase over
the next 89%, and a linear decay over the final
10%. The global batch size is 512 sequences with
a maximum sequence length of 4,096 tokens, cor-
responding to approximately 2 million tokens per
training step.
4.2.3 Statistical Significance
Extrinsic metrics are assessed for statistical sig-
nificance using paired bootstrap resampling with
4

-- 4 of 15 --

1,000 iterations at p < 0.05 (Koehn, 2004). This
ensures that reported performance differences be-
tween models reflect meaningful systematic effects
rather than sampling variation across examples.
5 Results
5.1 Intrinsic Evaluation
Tokenizer CR Gini TP
(size)
BLT (4.5) 0.0127 0.212 2.87
BLT (6) 0.0145 0.219 2.96
BLT (8) 0.0161 0.227 3.06
MYTE (45k) 0.0085 0.085 1.23
MYTE (90k) 0.0089 0.086 1.23
MYTE (135k) 0.0089 0.095 1.26
PA BPE (45k) 0.0250 0.021 1.15
PA BPE (90k) 0.0272 0.028 1.24
PA BPE (135k) 0.0280 0.029 1.25
BPE (45k) 0.0257 0.243 1.93
BPE (90k) 0.0293 0.220 1.73
BPE (135k) 0.0314 0.203 1.61
Table 3: Intrinsic evaluation of tokenizers on identical
training data. Compression rate and tokenizer parity
values are macro-averaged across languages. The best
result is in bold and the second-best is underlined. Leg-
end: CR = Compression Rate, TP = Tokenizer Parity.
Table 3 reveals that Parity-aware BPE achieves
the lowest Gini coefficient across all vocabulary
sizes. This equity gain does not come at the cost of
tokenizer efficiency, as Parity-aware BPE achieves
competitive compression rates relative to Byte-
level BPE.
MYTE has a relatively low Gini coefficient and
tokenizer parity, but its compression rate is the
worst. This indicates that morpheme-based seg-
mentation produces longer token sequences. The
trade-off is consistent with the inflated token counts
observed during language model training.
BLT exhibits poor equity across all vocabulary
sizes despite operating at the byte level without a
fixed vocabulary. Its tokenizer parity is

Chunk 11 · 1,990 chars

indicates that morpheme-based seg-
mentation produces longer token sequences. The
trade-off is consistent with the inflated token counts
observed during language model training.
BLT exhibits poor equity across all vocabulary
sizes despite operating at the byte level without a
fixed vocabulary. Its tokenizer parity is the high-
est among all approaches, suggesting that entropy-
driven patch segmentation provides no built-in
mechanism to correct for corpus imbalances.
Figure 1 situates each tokenizer family within
a two-dimensional efficiency-equity space, where
the ideal direction is toward a higher compression
rate and lower Gini coefficient (bottom-right of the
plot). Parity-aware BPE lies on the Pareto front of
the efficiency-equity space across all vocabulary
sizes, highlighting that cross-lingual fairness and
compression efficiency are not at odds. Byte-level
BPE at its largest vocabulary size of 135k lies on
the Pareto front, as it has the highest compression
rate. On the other hand, both BLT and MYTE are
Pareto-dominated.
5.2 Extrinsic Evaluation
5.2.1 English Classification Benchmarks
Model PIQA HellaSwag Arc-C
(size) (50.00) (25.00) (25.00)
BLT (4.5) 66.10 44.81 24.74
MYTE (90k) 67.14 44.47 27.39
PA BPE (90k) 71.55 53.22 26.19
BPE (90k) 72.31 54.42 28.41
Table 4: Accuracies on English classification bench-
marks. Parenthesized values indicate the expected accu-
racy of a random classifier. The highest score is in bold
and the second-highest is underlined.
Table 4 shows that Byte-level BPE achieves the
highest scores on all three English classification
benchmarks. Its advantage is statistically signif-
icant across all comparisons, except over Parity-
aware BPE on PIQA and MYTE on Arc-C, where
its numerical lead does not reach significance.
Parity-aware BPE is a strong runner-up, perform-
ing significantly better than BLT and MYTE on
both PIQA and HellaSwag. For commonsense rea-
soning and sentence completion benchmarks, BPE-
based tokenizers have a

Chunk 12 · 1,996 chars

pt over Parity-
aware BPE on PIQA and MYTE on Arc-C, where
its numerical lead does not reach significance.
Parity-aware BPE is a strong runner-up, perform-
ing significantly better than BLT and MYTE on
both PIQA and HellaSwag. For commonsense rea-
soning and sentence completion benchmarks, BPE-
based tokenizers have a consistent advantage over
alternative approaches in our evaluation setting.
5.2.2 Multilingual Classification Benchmarks
Table 5 reveals task-dependent performance across
tokenizers, with no consistent winner across all
three benchmarks. MYTE significantly outper-
forms all other tokenizers on XNLI, consistent with
its morphological representations providing richer
cross-lingual semantic signals for inference. Byte-
level BPE achieves the highest scores on XCOPA
and XStoryCloze, suggesting that its efficiency is
best leveraged on tasks requiring causal and narra-
tive reasoning.
5

-- 5 of 15 --

Figure 1: Efficiency-equity Pareto front of the evaluated tokenizers. Values beside markers indicate the patch size
for BLT, morpheme inventory size for MYTE, and vocabulary size for BPE and PA BPE.
Model XNLI XCOPA XStoryCloze
(size) (33.33) (50.00) (50.00)
BLT (4.5) 36.29 53.30 56.52
MYTE (90k) 42.49 54.93 50.55
PA BPE (90k) 40.43 58.70 56.70
BPE (90k) 40.56 61.03 57.18
Table 5: Averaged per-language accuracies across multi-
lingual classification benchmarks. Parenthesized values
indicate the expected accuracy of a random classifier.
The highest score is in bold and the second-highest is
underlined. Detailed results for each language are re-
ported in Appendix D.1.
Model (size) EN → XX XX → EN
BLT (4.5) 10.82 11.70
MYTE (90k) 14.77 13.81
PA BPE (90k) 11.36 12.19
BPE (90k) 13.39 12.78
Table 6: Averaged BLEU scores across ten SEA lan-
guages. The highest score is in bold and the second-
highest is underlined.
5.2.3 Machine Translation
Table 6 aggregates the BLEU translation scores
across ten SEA languages. We also provide the
chrF scores in Appendix D.2.2.

Chunk 13 · 1,996 chars

BPE (90k) 11.36 12.19
BPE (90k) 13.39 12.78
Table 6: Averaged BLEU scores across ten SEA lan-
guages. The highest score is in bold and the second-
highest is underlined.
5.2.3 Machine Translation
Table 6 aggregates the BLEU translation scores
across ten SEA languages. We also provide the
chrF scores in Appendix D.2.2. MYTE achieves
the highest BLEU scores in both translation di-
rections. A consistent directional asymmetry is
observed across both metrics for MYTE, where it
is systematically stronger in EN → XX translation
than XX → EN. MYTE’s morpheme-level segmen-
tation enables finer-grained generation of morpho-
logically complex target word forms in SEA lan-
guages, an advantage that narrows when translating
into English.
6 Analysis
6.1 Effect of Scaling Vocabulary Size
Increasing vocabulary size produces distinct behav-
ior across tokenizers, as seen in Table 3. Byte-level
BPE improves consistently across both efficiency
and cross-lingual fairness with a larger vocabulary
size. In contrast, BLT gains in efficiency but sacri-
fices cross-lingual equity as vocabulary size grows.
MYTE is relatively insensitive to morpheme in-
ventory size scaling within the evaluated range.
Compression rate and tokenizer parity remain
nearly constant across all three morpheme inven-
tory sizes, indicating that its morphological seg-
mentation approach saturates at around 4,096 mor-
phemes per language.
Parity-aware BPE becomes more inequitable as
vocabulary size increases. While it achieves the
lowest Gini coefficient among all models, both
6

-- 6 of 15 --

its Gini coefficient and tokenizer parity worsen
at larger vocabulary sizes, increasing to 0.029 and
1.25 respectively at 135k vocabulary size.
6.2 Fairness Regression in Parity-aware BPE
It seems counterintuitive that Parity-aware BPE
tokenizers are less equitable as vocabulary size
increases. To investigate this, we analyzed per-
language token counts produced by Parity-aware
BPE tokenizers of different vocabulary sizes.

Chunk 14 · 1,997 chars

espectively at 135k vocabulary size.
6.2 Fairness Regression in Parity-aware BPE
It seems counterintuitive that Parity-aware BPE
tokenizers are less equitable as vocabulary size
increases. To investigate this, we analyzed per-
language token counts produced by Parity-aware
BPE tokenizers of different vocabulary sizes. We
use the FLORES+ training dataset (rather than the
test dataset) because it serves as the development
corpus during tokenizer training. Examining this
dataset would isolate the effect of vocabulary scal-
ing and avoid confounding factors from unseen
data such as differing vocabulary distributions.
Figure 2: Per-language token counts on the FLORES+
training dataset by Parity-aware BPE tokenizers of vary-
ing vocabulary sizes.
Figure 2 shows that increasing the vocabulary
size results in a larger reduction in the tokens re-
quired for English sentences compared to the other
SEA languages. This results in token count dispar-
ity between English and the other SEA languages to
widen by 36% as vocabulary size scales from 45k
to 135k, causing tokenizer parity to increase (i.e.,
worsen). We observe that Parity-aware BPE only
limits worst-case per-language tokenizer parity, so
its fairness mechanism does not prevent English
from acquiring more merges as vocabulary size
increases.
6.3 Effect of Tokenizer Choice and Script
Type
To investigate whether tokenizer parity outcomes
are driven by the choice of tokenizer, the script
type of the target language, or their interaction,
we apply a two-way mixed ANOVA test (Meyers
et al., 2009). Running separate pairwise t-tests for
each tokenizer–script combination would inflate
the Type I error rate multiplicatively.
The between-subject factor is script type, cate-
gorized as Latin (Indonesian, Malay, Tagalog, Viet-
namese) or Abugida (Burmese, Khmer, Lao, Tamil,
Thai) according to Limisiewicz et al. (2024). Chi-
nese is excluded from this comparison as it is the
only CJK-script language. The within-subject fac-
tor is

Chunk 15 · 1,995 chars

licatively.
The between-subject factor is script type, cate-
gorized as Latin (Indonesian, Malay, Tagalog, Viet-
namese) or Abugida (Burmese, Khmer, Lao, Tamil,
Thai) according to Limisiewicz et al. (2024). Chi-
nese is excluded from this comparison as it is the
only CJK-script language. The within-subject fac-
tor is tokenizer choice, as measuring the same lan-
guage across tokenizers introduces within-subject
correlation. We assess the effect of these factors on
per-language tokenizer parity, measure statistical
significance at α = 0.05, and report partial η2 as
the effect size measure.
Tokenizer Script type
(size) Latin Abugida
BLT (4.5) 2.36 3.38
MYTE (90k) 1.19 1.32
PA BPE (90k) 1.26 1.22
BPE (90k) 1.30 2.18
Table 7: Per-language tokenizer parity, macro-averaged
by tokenizer and script type. Latin scripts exclude En-
glish. The lowest value is in bold and the second-lowest
is underlined.
Tokenizer choice is the only statistically signif-
icant factor affecting tokenizer parity (p < 0.001,
partial η2 = 0.752). The large effect size indi-
cates that tokenizer choice explains the majority
of variance in tokenizer parity, regardless of script
type. At the same time, script type alone does
not reach significance (p = 0.090). As shown in
Table 7, Parity-aware BPE and MYTE achieve near-
uniform tokenizer parity across both script types.
In contrast, Byte-level BPE imposes a 1.68× tok-
enizer parity penalty on Abugida scripts relative to
Latin scripts. BLT also exhibits a high cross-script
disparity (1.43×), consistent with its entropy model
being undertrained on non-Latin scripts.
Fundamentally, tokenizer parity directly deter-
mines inference cost. A tokenizer parity of k for a
7

-- 7 of 15 --

given language implies that a user pays k times the
per-token API cost relative to English to process
the same semantic content. Under Byte-level BPE,
Abugida-script users incur 1.68× higher costs on
average than English users for semantically equiva-
lent prompts. These

Chunk 16 · 1,995 chars

A tokenizer parity of k for a
7

-- 7 of 15 --

given language implies that a user pays k times the
per-token API cost relative to English to process
the same semantic content. Under Byte-level BPE,
Abugida-script users incur 1.68× higher costs on
average than English users for semantically equiva-
lent prompts. These results confirm that equitable
tokenizer design, and not script similarity to En-
glish, is the main determinant of whether a tok-
enizer imposes uniform computational costs across
languages.
7 Conclusion
We present the first systematic, dataset-controlled
comparison of BLT, MYTE, Parity-aware BPE,
and Byte-level BPE across eleven SEA languages,
evaluating tokenizer equity, compression efficiency,
and downstream task performance under the same
experimental conditions. Our intrinsic evaluation
demonstrates that cross-lingual equity and tokeniza-
tion efficiency are not fundamentally at odds.
Among the equitable tokenizers analyzed,
MYTE delivers the strongest semantic inference
and machine translation performance through
richer morphological representations, though at the
cost of a higher computational budget and lower
compression efficiency. Despite its architectural
novelty in eliminating fixed tokenizer vocabulary,
we found that BLT underperforms on downstream
tasks as its entropy model receives insufficient ex-
posure to low-resource languages under realistic
multilingual data distribution.
The appropriate choice of tokenizer is ultimately
use-case dependent. We recommend Parity-aware
BPE as a responsible default for multilingual mod-
els targeting SEA languages, given its favorable
position in the efficiency-equity space and rela-
tively strong downstream task performance. How-
ever, the base variant of Parity-aware BPE requires
multi-way parallel data, which may be scarce or un-
available in low-resource settings (Foroutan et al.,
2025). MYTE is preferred when morphology and
translation are critical, and the computational bud-
get permits the

Chunk 17 · 1,991 chars

tively strong downstream task performance. How-
ever, the base variant of Parity-aware BPE requires
multi-way parallel data, which may be scarce or un-
available in low-resource settings (Foroutan et al.,
2025). MYTE is preferred when morphology and
translation are critical, and the computational bud-
get permits the associated training overhead.
Our ANOVA results establish that tokenizer
choice is the primary factor affecting tokenizer
parity. The difference in inference costs between
English and SEA-language users can be addressed
through the choice of the tokenizer. Equitable to-
kenization has direct, quantifiable consequences
for the 671 million speakers across Southeast Asia
(Lovenia et al., 2024), for whom token-priced APIs
create unequal access costs. How tokenization is
carried out across languages shapes the economic
accessibility of multilingual models for underrep-
resented language communities.
Limitations
First, all language models are trained at the 1.5B-
parameter scale due to computational resource con-
straints. We leave the investigation of larger model
sizes to future work. Next, BLT cannot be scaled
along the same vocabulary dimension as the other
tokenizers since it operates without a fixed vocab-
ulary, which prevents direct matched-vocabulary
size comparison. Finally, we evaluate only base
pretrained models. We believe this is sufficient, as
supervised fine-tuning or alignment training does
not affect the tokenizer’s fairness or efficiency.
We do not anticipate any immediate societal or
individual harm arising from this work. Neverthe-
less, we advise users to exercise caution, as our
models have not been subjected to safety or value
alignment procedures.
References
Catherine Arnett, Tyler A. Chang, and Benjamin Bergen.
2024. A bit of a problem: Measurement disparities
in dataset sizes across languages. In Proceedings of
SIGUL, pages 1–9.
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika,
Leo Gao, Jonathan Tow, Baber Abbasi, Alham

Chunk 18 · 1,998 chars

fety or value
alignment procedures.
References
Catherine Arnett, Tyler A. Chang, and Benjamin Bergen.
2024. A bit of a problem: Measurement disparities
in dataset sizes across languages. In Proceedings of
SIGUL, pages 1–9.
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika,
Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri
Aji, Pawan Sasanka Ammanamanchi, Sidney Black,
Jordan Clive, Anthony DiPofi, Julen Etxaniz, Ben-
jamin Fattori, Jessica Zosa Forde, Charles Foster, Jef-
frey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan
Li, and 11 others. 2024. Lessons from the trenches on
reproducible evaluation of language models. arXiv
preprint arXiv:2405.14782.
Yonatan Bisk, Rowan Zellers, Ronan LeBras, Jianfeng
Gao, and Yejin Choi. 2020. PIQA: Reasoning about
physical commonsense in natural language. In Pro-
ceedings of AAAI, pages 7432–7439.
Kaj Bostrom and Greg Durrett. 2020. Byte pair encod-
ing is suboptimal for language model pretraining. In
Findings of EMNLP, pages 4617–4624.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question an-
swering? try ARC, the AI2 reasoning challenge.
arXiv preprint arXiv:1803.05457.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina
Williams, Samuel Bowman, Holger Schwenk, and
8

-- 8 of 15 --

Veselin Stoyanov. 2018. XNLI: Evaluating cross-
lingual sentence representations. In Proceedings of
EMNLP, pages 2475–2485.
The Unicode Consortium. 2011. The Unicode standard.
Technical Report Version 6.0.0, Unicode Consortium.
Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha
Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe
Kalbassi, Janice Lam, Daniel Licht, Jean Mail-
lard, Anna Sun, Skyler Wang, Guillaume Wen-
zek, Al Youngblood, Bapi Akula, Loic Barrault,
Gabriel Mejia Gonzalez, Prangthip Hansanti, John
Hoffman, and 19 others. 2024. Scaling neural ma-
chine translation to 200 languages. Nature, pages
841–846.
Negar Foroutan, Clara Meister, Debjit

Chunk 19 · 1,992 chars

ice Lam, Daniel Licht, Jean Mail-
lard, Anna Sun, Skyler Wang, Guillaume Wen-
zek, Al Youngblood, Bapi Akula, Loic Barrault,
Gabriel Mejia Gonzalez, Prangthip Hansanti, John
Hoffman, and 19 others. 2024. Scaling neural ma-
chine translation to 200 languages. Nature, pages
841–846.
Negar Foroutan, Clara Meister, Debjit Paul, Joel
Niklaus, Sina Ahmadi, Antoine Bosselut, and Rico
Sennrich. 2025. Parity-aware Byte-Pair Encod-
ing: Improving cross-lingual fairness in tokenization.
arXiv preprint arXiv:2508.04796.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri,
Abhinav Pandey, Abhishek Kadian, Ahmad Al-
Dahle, Aiesha Letman, Akhil Mathur, Alan Schel-
ten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh
Goyal, Anthony Hartshorn, Aobo Yang, Archi Mi-
tra, Archie Sravankumar, Artem Korenev, Arthur
Hinsvark, and 542 others. 2024. The Llama 3 herd
of models. arXiv preprint arXiv:2407.21783.
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu
Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang
Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng
Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao,
Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai,
and 6 others. 2024. MiniCPM: Unveiling the poten-
tial of small language models with scalable training
strategies. arXiv preprint arXiv:2404.06395.
Sheriff Issaka, Erick Rosas Gonzalez, Lieqi Liu,
Evans Kofi Agyei, Lucas Bandarkar, Nanyun Peng,
David Ifeoluwa Adelani, Francisco Guzmán, and
Saadia Gabriel. 2026. Translation as a scalable
proxy for multilingual evaluation. arXiv preprint
arXiv:2601.11778.
Philipp Koehn. 2004. Statistical significance tests for
machine translation evaluation. In Proceedings of
EMNLP, pages 388–395.
Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Ore-
vaoghene Ahia, and Luke Zettlemoyer. 2024. MYTE:
Morphology-driven byte encoding for better and
fairer multilingual language modeling. In Proceed-
ings of ACL, pages 15059–15076.
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu
Wang, Shuohui Chen, Daniel Simig, Myle Ott,

Chunk 20 · 1,993 chars

isiewicz, Terra Blevins, Hila Gonen, Ore-
vaoghene Ahia, and Luke Zettlemoyer. 2024. MYTE:
Morphology-driven byte encoding for better and
fairer multilingual language modeling. In Proceed-
ings of ACL, pages 15059–15076.
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu
Wang, Shuohui Chen, Daniel Simig, Myle Ott, Na-
man Goyal, Shruti Bhosale, Jingfei Du, Ramakanth
Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav
Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettle-
moyer, Zornitsa Kozareva, Mona Diab, and 2 others.
2022. Few-shot learning with multilingual generative
language models. In Proceedings of EMNLP, pages
9019–9052.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In Proceedings of ICLR.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana
Akbar, Lester James V. Miranda, Jennifer San-
toso, Elyanah Aco, Akhdan Fadhilah, Jonibek
Mansurov, Joseph Marvin Imperial, Onno P. Kamp-
man, Joel Ruben Antony Moniz, Muhammad
Ravi Shulthan Habibi, Frederikus Hudi, Railey Mon-
talan, Ryan Ignatius, Joanito Agili Lopo, William
Nixon, Börje F. Karlsson, James Jaya, and 42 oth-
ers. 2024. SEACrowd: A multilingual multimodal
data hub and benchmark suite for Southeast Asian
languages. In Proceedings of EMNLP, pages 5155–
5203.
Lawrence S. Meyers, Glenn Gamst, and A. J. Guarino.
2009. Two-way mixed ANOVA design, page 253–266.
Cambridge University Press.
Tan Sang Nguyen, Muhammad Reza Qorib, and
Hwee Tou Ng. 2026. OpenSeal: Good, fast,
and cheap construction of an open-source South-
east Asian LLM via parallel data. arXiv preprint
arXiv:2602.02266.
OpenAI. 2025. OpenAI/Tiktoken: Tiktoken is a fast
BPE tokeniser for use with OpenAI’s models.
Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Ro-
driguez, John Nguyen, Benjamin Muller, Margaret
Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke
Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtz-
man, and Srini Iyer. 2025. Byte Latent Transformer:
Patches scale better than tokens. In Proceedings of
ACL, pages

Chunk 21 · 1,999 chars

’s models.
Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Ro-
driguez, John Nguyen, Benjamin Muller, Margaret
Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke
Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtz-
man, and Srini Iyer. 2025. Byte Latent Transformer:
Patches scale better than tokens. In Proceedings of
ACL, pages 9238–9258.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. BLEU: A method for automatic
evaluation of machine translation. In Proceedings of
ACL, pages 311–318.
Guilherme Penedo, Hynek Kydlíˇcek, Vinko Sabolˇcec,
Bettina Messmer, Negar Foroutan, Amir Hossein
Kargaran, Colin Raffel, Martin Jaggi, Leandro Von
Werra, and Thomas Wolf. 2025. FineWeb2: One
pipeline to scale them all – adapting pre-training
data processing to every language. arXiv preprint
arXiv:2506.20920.
Aleksandar Petrov, Emanuele La Malfa, Philip H. S.
Torr, and Adel Bibi. 2023. Language model tok-
enizers introduce unfairness between languages. In
Proceedings of NeurIPS, pages 36963–36990.
Edoardo Maria Ponti, Goran Glavaš, Olga Majewska,
Qianchu Liu, Ivan Vuli´c, and Anna Korhonen. 2020.
XCOPA: A multilingual dataset for causal common-
sense reasoning. In Proceedings of EMNLP, pages
2362–2376.
Maja Popovi´c. 2015. chrF: Character n-gram F-score
for automatic MT evaluation. In Proceedings of
WMT, pages 392–395.
9

-- 9 of 15 --

Muhammad Reza Qorib, Junyi Li, and Hwee Tou Ng.
2025. Just go parallel: Improving the multilingual ca-
pabilities of large language models. In Proceedings
of ACL, pages 33411–33424.
Sebastian Ruder, Jonathan Clark, Alexander Gutkin, Mi-
hir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani,
Parker Riley, Jean-Michel Sarr, Xinyi Wang, John Wi-
eting, Nitish Gupta, Anna Katanova, Christo Kirov,
Dana Dickinson, Brian Roark, Bidisha Samanta,
Connie Tao, David Adelani, and 8 others. 2023.
XTREME-UP: A user-centric scarce-data benchmark
for under-represented languages. In Findings of
EMNLP, pages 1856–1884.
Aishwarya Selvamurugan, Raj Dandekar, Rajat

Chunk 22 · 1,999 chars

g, John Wi-
eting, Nitish Gupta, Anna Katanova, Christo Kirov,
Dana Dickinson, Brian Roark, Bidisha Samanta,
Connie Tao, David Adelani, and 8 others. 2023.
XTREME-UP: A user-centric scarce-data benchmark
for under-represented languages. In Findings of
EMNLP, pages 1856–1884.
Aishwarya Selvamurugan, Raj Dandekar, Rajat Dan-
dekar, and Sreedath Panat. 2025. From bias to bal-
ance: How multilingual dataset composition affects
tokenizer performance across languages. In Work-
shop on LM4UC.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In Proceedings of ACL, pages 1715–
1725.
Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and
Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for sta-
tistical morphological segmentation. In Proceedings
of EACL, pages 21–24.
Sagar Tamang and Dibya Jyoti Bora. 2024. Evaluat-
ing tokenizer performance of large language mod-
els across official Indian languages. arXiv preprint
arXiv:2411.12240.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton
Ferrer, Moya Chen, Guillem Cucurull, David Esiobu,
Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 oth-
ers. 2023. Llama 2: Open foundation and fine-tuned
chat models. arXiv preprint arXiv:2307.09288.
Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld,
Kyle Lo, Shane Arora, Akshita Bhagia, Yuling
Gu, Shengyi Huang, Matt Jordan, Nathan Lambert,
Dustin Schwenk, Oyvind Tafjord, Taira Anderson,
David Atkinson, Faeze Brahman, Christopher Clark,
Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, and
23 others. 2025. 2 OLMo 2 furious. In Proceedings
of COLM.
Anna Wegmann, Dong Nguyen, and David Jurgens.
2025. Tokenization is sensitive to language variation.
In Findings of ACL, pages 10958–10983.
Linting Xue, Aditya Barua, Noah Constant, Rami Al-
Rfou, Sharan Narang, Mihir Kale, Adam Roberts,
and Colin Raffel.

Chunk 23 · 1,992 chars

and
23 others. 2025. 2 OLMo 2 furious. In Proceedings
of COLM.
Anna Wegmann, Dong Nguyen, and David Jurgens.
2025. Tokenization is sensitive to language variation.
In Findings of ACL, pages 10958–10983.
Linting Xue, Aditya Barua, Noah Constant, Rami Al-
Rfou, Sharan Narang, Mihir Kale, Adam Roberts,
and Colin Raffel. 2022. ByT5: Towards a token-free
future with pre-trained byte-to-byte models. Trans-
actions of ACL, 10:291–306.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale,
Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and
Colin Raffel. 2021. mT5: A massively multilingual
pre-trained text-to-text transformer. In Proceedings
of NAACL, pages 483–498.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag: Can a
machine really finish your sentence? In Proceedings
of ACL, pages 4791–4800.
10

-- 10 of 15 --

A Training Data Details
A.1 Tokenizer Training Data
ISO Language # Sentences
Code
en English 604,793
id Indonesian 129,906
th Thai 115,159
vi Vietnamese 90,470
zh Chinese 25,683
ms Malay 21,872
ta Tamil 5,799
tl Tagalog 3,480
my Burmese 1,349
km Khmer 1,253
lo Lao 236
Total 1,000,000
Table 8: Per-language sentence counts in the tokenizer
training dataset.
A.2 Language Model Training Data
We source pretraining data from the FineWeb2 cor-
pus (Penedo et al., 2025), randomly sampling a
subset of 100 million sentences (203 GB) spanning
all eleven SEA languages. To balance coverage be-
tween high-resource and low-resource languages,
we control the language proportions via tempera-
ture sampling, following the approach of Foroutan
et al. (2025).
Temperature Sampling Sampling each lan-
guage proportionally to its word count in FineWeb2
would overwhelmingly favor English at 94.3%. As
such, we sample according to a temperature-scaled
probability:
p(L) ∝ |L|1/τ , (1)
where p(L) is the probability of sampling text from
language L during pre-training, |L| is the number
of words in that language in the corpus, and τ is
a temperature

Chunk 24 · 1,996 chars

word count in FineWeb2
would overwhelmingly favor English at 94.3%. As
such, we sample according to a temperature-scaled
probability:
p(L) ∝ |L|1/τ , (1)
where p(L) is the probability of sampling text from
language L during pre-training, |L| is the number
of words in that language in the corpus, and τ is
a temperature parameter. When τ = 1, sampling
is purely proportional to word frequency. As τ
increases, the distribution becomes increasingly
uniform, thereby boosting the relative sampling
probability of low-resource languages.
We configure τ = 1.21 so that English con-
stitutes 89.7% of the resulting training sentences,
matching the proportion of English data used in the
Llama 2 pretraining dataset (Touvron et al., 2023).
Table 9 shows the raw word frequency of each
language in FineWeb2, along with the adjusted fre-
quency after temperature sampling is applied. For
each language L, we compute |L|1/τ and normal-
ize across all eleven SEA languages to obtain the
final sampling proportion. These proportions are
then used to determine the number of sentences
drawn from each language in our 100M-sentence
subset (Table 10).
A.3 Machine Translation Fine-tuning
Machine translation fine-tuning is performed via
continual pre-training on English-XX parallel sen-
tences for one epoch. The fine-tuning dataset con-
sists of up to 13 million parallel sentence pairs per
language, which were randomly sampled without
replacement from NLLB (Costa-jussà et al., 2024)
where possible, following the approach of Nguyen
et al. (2026). The resulting per-language sentence
counts are shown in Table 11.
Each parallel sentence pair is formatted using
the template by Qorib et al. (2025): "{source
language}: {source sentence} \n {target
language}: {target sentence}". To prevent
the model from developing a bias toward a fixed
source language, the language order of each pair
was randomized independently with equal prob-
ability. This means each example has an equal
chance of being presented as

Chunk 25 · 1,992 chars

l. (2025): "{source
language}: {source sentence} \n {target
language}: {target sentence}". To prevent
the model from developing a bias toward a fixed
source language, the language order of each pair
was randomized independently with equal prob-
ability. This means each example has an equal
chance of being presented as English-first or target
language-first.
B Intrinsic Tokenizer Evaluation Metrics
B.1 Cross-lingual Equity Metrics
Tokenizer parity measures the ratio of the average
number of tokens per sentence in a given language
relative to English (Petrov et al., 2023). For a spe-
cific language L with k aligned sentences, we can
compute its average number of tokens per sentence,
average_tokensL:
1
k
k	X
i=1
# tokens in sentence i (2)
The tokenizer parity for language L, parityL, is
defined as:
average_tokensL
average_tokensEnglish
(3)
The macro-average tokenizer parity for all n non-
English languages is computed as:
1
n
n	X
j=1
parityj (4)
11

-- 11 of 15 --

Language Word frequency (billion), |L| Relative frequency, |L|1/τ Proportion
English 11,500.0 2269.6 89.68%
Chinese 543.5 182.2 7.20%
Indonesian 60.3 29.6 1.17%
Vietnamese 50.9 25.7 1.02%
Thai 24.7 14.1 0.56%
Malay 5.6 4 .2 0.17%
Tamil 1.9 1.7 0.07%
Tagalog 1.6 1.5 0.06%
Burmese 0.9 0.9 0.03%
Khmer 0.7 0.7 0.03%
Lao 0.2 0.3 0.01%
Total 12,190.3 2,530.5 100.00%
Table 9: Word frequencies in FineWeb2 and relative frequencies after temperature sampling (τ = 1.21). Temperature
sampling boosts the relative proportion of low-resource SEA languages while keeping English at 89.7%, matching
Llama 2’s pretraining data distribution.
ISO Language # Sentences
Code
en English 89,689,566
zh Chinese 7,199,747
id Indonesian 1,169,277
vi Vietnamese 1,016,743
th Thai 558,780
ms Malay 165,279
ta Tamil 68,252
tl Tagalog 59,364
my Burmese 34,699
km Khmer 28,295
lo Lao 9,998
Total 100,000,000
Table 10: Per-language sentence counts in the language
model training dataset after temperature sampling.
This ratio measures whether

Chunk 26 · 1,990 chars

9,747
id Indonesian 1,169,277
vi Vietnamese 1,016,743
th Thai 558,780
ms Malay 165,279
ta Tamil 68,252
tl Tagalog 59,364
my Burmese 34,699
km Khmer 28,295
lo Lao 9,998
Total 100,000,000
Table 10: Per-language sentence counts in the language
model training dataset after temperature sampling.
This ratio measures whether tokenizers impose
computational costs unequally across languages.
A macro-average tokenizer parity closer to 1 indi-
cates a more equitable tokenizer across languages.
Gini coefficient assesses tokenization equity by
treating token costs as a distribution (Foroutan
et al., 2025). The token cost for a language, c, is de-
fined as the average number of tokens per sentence
for the language in the parallel corpus. For token
costs c1 ≤ c2 ≤ · · · ≤ cn across n languages, the
Gini coefficient is computed as:
1
n

n + 1 − 2
Pn
i=1(n + 1 − i) ci
Pn
i=1 ci

(5)
ISO Language # Sentences
Code (million)
zh Chinese 13.0
id Indonesian 13.0
vi Vietnamese 13.0
th Thai 13.0
ms Malay 13.0
ta Tamil 13.0
tl Tagalog 13.0
my Burmese 10.0
km Khmer 5.8
lo Lao 4.2
Total 111.0
Table 11: Per-language sentence counts in the fine-
tuning dataset.
Values range from 0 to 1. A lower Gini coefficient
(closer to 0) indicates a more equitable tokenizer
across languages.
B.2 Tokenizer Efficiency Metrics
Compression rate measures how efficiently a tok-
enizer compresses text. It is defined as the average
of the inverse token count per sentence (Foroutan
et al., 2025). For a specific language L, its com-
pression rate, rateL is computed as:
1
k
k	X
i=1
1
# tokens in sentence i (6)
12

-- 12 of 15 --

where k is the number of aligned sentences for
language L in the parallel corpus. Essentially, we
compute a language’s compression rate by evaluat-
ing the inverse of the number of tokens per sentence
and then averaging it.
The macro-average compression rate for all n
languages is computed as:
1
n
n	X
j=1
ratej (7)
Utilizing a parallel corpus controls for seman-
tic differences, by

Chunk 27 · 1,998 chars

parallel corpus. Essentially, we
compute a language’s compression rate by evaluat-
ing the inverse of the number of tokens per sentence
and then averaging it.
The macro-average compression rate for all n
languages is computed as:
1
n
n	X
j=1
ratej (7)
Utilizing a parallel corpus controls for seman-
tic differences, by comparing token counts over
semantically equivalent content. A higher macro-
average compression rate indicates a more efficient
tokenizer across languages. This metric is infor-
mative when viewed alongside tokenizer parity, as
a high overall compression rate can mask under-
compression of individual low-resource languages.
C Extrinsic Metrics
C.1 English Classification Benchmarks
Classification benchmarks evaluate a model’s abil-
ity to understand, analyze, and select the correct cat-
egory from a set of options. The following English
classification benchmarks are used by Pagnoni et al.
(2025) to evaluate a model’s commonsense reason-
ing and general world knowledge.
Physical Intuition Question Answering
(PIQA) (Bisk et al., 2020) probes a model’s under-
standing of everyday physical interactions and how
objects behave in the real world. Each example
presents a goal and two solution candidates, with
the model tasked to identify the more physically
plausible option.
HellaSwag (Zellers et al., 2019) is a common-
sense natural language inference benchmark where
a model must select the most plausible continua-
tion of a given scenario from four candidate end-
ings. The dataset is constructed using adversarial
filtering to ensure that a model possesses genuine
contextual understanding.
Arc-Challenge (Arc-C) (Clark et al., 2018)
evaluates a model’s scientific reasoning ability
through multiple-choice questions drawn from
grade-school science exams. The Challenge subset
selects questions that simple retrieval-based and
word co-occurrence methods fail to answer cor-
rectly, making it a reliable indicator of deeper rea-
soning capabilities.
C.2 Multilingual

Chunk 28 · 1,997 chars

model’s scientific reasoning ability
through multiple-choice questions drawn from
grade-school science exams. The Challenge subset
selects questions that simple retrieval-based and
word co-occurrence methods fail to answer cor-
rectly, making it a reliable indicator of deeper rea-
soning capabilities.
C.2 Multilingual Classification Benchmarks
The following multilingual classification bench-
marks are used by Foroutan et al. (2025) and
they collectively span several of our target lan-
guages. They enable a comprehensive evaluation
of a model’s cross-lingual performance on SEA
languages.
Cross-lingual Natural Language Inference
(XNLI) (Conneau et al., 2018) extends the
MultiNLI dataset to 15 languages and serves as
a standard benchmark for cross-lingual natural lan-
guage understanding. Models must classify the
logical relationship between each pair as one of
three categories: entailment, contradiction, or neu-
tral. This benchmark covers English, Chinese, Thai,
and Vietnamese.
Cross-lingual Choice of Plausible Alternatives
(XCOPA) (Ponti et al., 2020) is a multilingual
benchmark targeting causal commonsense reason-
ing. Given a premise, a model must identify either
the most plausible cause or effect from two candi-
date sentences. XCOPA is evaluated in a zero-shot
setting to assess cross-lingual transfer without fine-
tuning. This benchmark covers English, Chinese,
Indonesian, Tamil, Thai, and Vietnamese.
XStoryCloze (Lin et al., 2022) requires a model
to select the correct ending for a four-sentence nar-
rative from two candidate conclusions in a multi-
lingual setting. It evaluates cross-lingual narrative
understanding and commonsense reasoning. This
benchmark covers English, Burmese, Chinese, and
Indonesian.
C.3 Machine Translation
Machine translation is a natural benchmark for eval-
uating multilingual LLMs, as it tests a model’s
ability to understand and generate text across lan-
guages. For SEA languages, translation quality
serves as a proxy for how well a

Chunk 29 · 1,995 chars

is
benchmark covers English, Burmese, Chinese, and
Indonesian.
C.3 Machine Translation
Machine translation is a natural benchmark for eval-
uating multilingual LLMs, as it tests a model’s
ability to understand and generate text across lan-
guages. For SEA languages, translation quality
serves as a proxy for how well a model has inter-
nalized low-resource linguistic structure (Issaka
et al., 2026).
Machine translation performance is measured
by comparing the machine-generated output to hu-
man reference translations. Two complementary
metrics are typically used, BLEU (Bilingual Evalu-
ation Understudy) (Papineni et al., 2002) and chrF
(character-level F-score) (Popovi´c, 2015). Both
metrics have scores ranging from 0 to 100, with
higher scores indicating better translation qual-
ity. BLEU measures word-level n-gram preci-
sion against reference translations and was used
13

-- 13 of 15 --

by Pagnoni et al. (2025). chrF computes character
n-gram F-score and is suitable for morphologically
rich languages where word-level overlap may be
sparse and was used by Limisiewicz et al. (2024).
The detailed BLEU and chrF translation scores
are shown in Appendix D.2. The original MYTE
paper (Limisiewicz et al., 2024) reports scores only
for English-to-Vietnamese and English-to-Tamil
translation among SEA languages, and our scores
are higher than their reported scores.
D Detailed Results
D.1 Multilingual Classification Benchmarks
Model en zh vi th AVG
(size)
BLT (4.5) 42.10 33.99 34.29 34.79 36.29
MYTE (90k) 50.00 33.99 44.99 40.98 42.49
PA BPE (90k) 47.94 33.91 43.07 36.81 40.43
BPE (90k) 49.30 33.51 43.09 36.35 40.56
Table 12: Per-language XNLI scores. Expected accuracy of a random classifier = 33.33.
Model en zh id vi th ta AVG
(size)
BLT (4.5) 64.20 52.40 52.60 48.60 52.60 49.40 53.30
MYTE (90k) 54.40 51.00 55.80 56.60 55.00 56.80 54.93
PA BPE (90k) 71.40 55.60 58.60 60.20 53.40 53.00 58.70
BPE (90k) 71.60 59.00 61.60 62.80 56.20 55.00 61.03
Table 13: Per-language XCOPA

Chunk 30 · 1,999 chars

ed accuracy of a random classifier = 33.33.
Model en zh id vi th ta AVG
(size)
BLT (4.5) 64.20 52.40 52.60 48.60 52.60 49.40 53.30
MYTE (90k) 54.40 51.00 55.80 56.60 55.00 56.80 54.93
PA BPE (90k) 71.40 55.60 58.60 60.20 53.40 53.00 58.70
BPE (90k) 71.60 59.00 61.60 62.80 56.20 55.00 61.03
Table 13: Per-language XCOPA scores. Expected accuracy of a random classifier = 50.00.
Model en zh id my AVG
(size)
BLT (4.5) 65.45 54.20 55.06 51.36 56.52
MYTE (90k) 52.95 49.83 50.89 48.51 50.55
PA BPE (90k) 65.25 55.06 55.92 50.56 56.70
BPE (90k) 65.78 54.80 57.64 50.50 57.18
Table 14: Per-language XStoryCloze scores. Expected accuracy of a random classifier = 50.00.
14

-- 14 of 15 --

D.2 Machine Translation
D.2.1 BLEU scores
Model zh id vi th ms ta tl my km lo AVG
(size)
BLT (4.5) 9.18 27.01 19.30 7.24 24.98 3.65 11.04 1.92 2.31 1.53 10.82
MYTE (90k) 18.43 34.92 25.36 9.25 27.66 5.61 16.73 2.49 3.40 3.90 14.77
PA BPE (90k) 22.78 26.47 18.30 5.64 21.04 4.38 9.11 1.30 2.56 2.01 11.36
BPE (90k) 30.20 27.72 32.51 6.90 23.10 2.56 8.05 0.58 1.45 0.82 13.39
Table 15: Per-language BLEU scores (EN → XX).
Model zh id vi th ms ta tl my km lo AVG
(size)
BLT (4.5) 10.33 25.63 21.94 8.47 18.30 4.65 17.10 3.92 4.31 2.33 11.70
MYTE (90k) 16.65 30.96 18.71 13.70 23.90 3.76 21.23 3.87 2.40 2.90 13.81
PA BPE (90k) 14.75 26.06 22.82 12.17 17.29 5.35 16.73 2.15 2.10 2.47 12.19
BPE (90k) 14.42 29.75 23.24 11.26 19.20 5.62 15.05 2.39 3.13 3.77 12.78
Table 16: Per-language BLEU scores (XX → EN).
D.2.2 chrF scores
Model zh id vi th ms ta tl my km lo AVG
(size)
BLT (4.5) 13.16 48.75 42.03 30.48 45.46 31.52 31.35 19.91 19.92 20.63 30.32
MYTE (90k) 44.13 59.82 54.22 40.37 47.27 26.22 42.02 22.46 26.29 25.89 38.87
PA BPE (90k) 15.29 57.87 46.46 24.16 51.92 32.62 44.76 20.00 17.82 19.58 33.05
BPE (90k) 27.86 67.47 54.80 30.81 62.05 30.07 51.56 14.89 15.87 13.80 36.92
Table 17: Per-language chrF scores (EN → XX).
Model zh id vi th ms ta tl my km lo AVG
(size)
BLT (4.5) 31.26 54.78 43.12 34.10 44.98 23.80

Chunk 31 · 646 chars

2 22.46 26.29 25.89 38.87
PA BPE (90k) 15.29 57.87 46.46 24.16 51.92 32.62 44.76 20.00 17.82 19.58 33.05
BPE (90k) 27.86 67.47 54.80 30.81 62.05 30.07 51.56 14.89 15.87 13.80 36.92
Table 17: Per-language chrF scores (EN → XX).
Model zh id vi th ms ta tl my km lo AVG
(size)
BLT (4.5) 31.26 54.78 43.12 34.10 44.98 23.80 36.71 19.33 19.54 18.05 32.57
MYTE (90k) 28.18 60.04 40.16 31.13 59.96 23.45 42.58 19.38 20.56 20.32 34.58
PA BPE (90k) 41.60 51.99 50.27 33.43 42.87 25.37 36.85 20.05 20.66 21.99 34.51
BPE (90k) 43.16 62.22 53.37 35.38 45.52 28.89 38.39 21.93 24.15 24.30 37.73
Table 18: Per-language chrF scores (XX → EN).
15

-- 15 of 15 --