The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Summary

This survey investigates why multilingual language models (LMs) show uneven performance across languages, asking whether these gaps stem from intrinsic linguistic difficulty or modeling choices. The authors find that most disparities arise from design decisions, not inherent complexity. Key factors include tokenization, which can fragment morphemes and disadvantage non-Latin scripts; data sampling, which often penalizes byte-heavy languages; and shared-parameter training, which causes negative transfer between typologically diverse languages. Normalizing segmentation, encoding, and data exposure significantly reduces performance gaps. The paper recommends morphology-aware tokenization, byte-normalized sampling, and typology-aware model architectures to achieve more balanced multilingual performance. It also emphasizes the need for evaluation metrics that disentangle tokenization artifacts from true modeling capability. Overall, the study concludes that current modeling paradigms, rather than linguistic diversity itself, drive most performance disparities in multilingual LMs.

PDF viewer

Chunks(50)

Chunk 0 · 1,988 chars

The Roots of Performance Disparity in Multilingual Language Models:
Intrinsic Modeling Difficulty or Design Choices?
Chen Shani1, Yuval Reif2, Nathan Roll1, Dan Jurafsky1, Ekaterina Shutova3
1Stanford University, 2The Hebrew University of Jerusalem, 3University of Amsterdam
Abstract
Multilingual language models (LMs) promise
broader NLP access, yet current systems de-
liver uneven performance across the world’s
languages. This survey examines why these
gaps persist and whether they reflect intrinsic
linguistic difficulty or modeling artifacts. We
organize the literature around two questions:
do linguistic disparities arise from representa-
tion and allocation choices (e.g., tokenization,
encoding, data exposure, parameter sharing)
rather than inherent complexity; and which de-
sign choices mitigate inequities across typolog-
ically diverse languages. We review linguis-
tic features, such as orthography, morphology,
lexical diversity, syntax, information density,
and typological distance, linking each to con-
crete modeling mechanisms. Gaps often shrink
when segmentation, encoding, and data expo-
sure are normalized, suggesting much apparent
difficulty stems from current modeling choices.
We synthesize these insights into design recom-
mendations for tokenization, sampling, archi-
tectures, and evaluation to support more bal-
anced multilingual LMs.
1 Introduction
Multilingual LMs have expanded NLP’s reach by
enabling a single model to perform tasks across
many languages. They are pretrained on text from
hundreds of languages, sharing parameters and rep-
resentations (Devlin et al., 2019; Conneau et al.,
2020a; Le Scao et al., 2022; Imani et al., 2023;
Dang et al., 2024). This enables cross-lingual trans-
fer, where patterns learned in one language improve
performance in others (Pires et al., 2019; Conneau
et al., 2020b; Lauscher et al., 2020; Malkin et al.,
2022; Blevins et al., 2024). Despite these advan-
tages, persistent performance disparities across

Chunk 1 · 1,997 chars

al., 2023;
Dang et al., 2024). This enables cross-lingual trans-
fer, where patterns learned in one language improve
performance in others (Pires et al., 2019; Conneau
et al., 2020b; Lauscher et al., 2020; Malkin et al.,
2022; Blevins et al., 2024). Despite these advan-
tages, persistent performance disparities across lan-
guages limit the practical reach of multilingual
models (Wang et al., 2025; Ghosh et al., 2025).
These disparities systematically follow cross-
linguistic patterns: higher-resource languages and
those structurally similar to dominant training lan-
guages generally perform better than low-resource
or typologically distant ones (Zhao et al., 2025;
Akindotuni, 2025). The disparities often per-
sist even with large-scale pretraining, suggesting
that scaling alone cannot ensure equitable perfor-
mance (Hoffmann et al., 2022; He et al., 2025).
This raises a central question: are some languages
inherently harder to model, or do performance
gaps reflect engineering artifacts and design
choices?
We review how linguistic structure interacts with
multilingual design choices to shape performance
gaps via two questions: whether disparities stem
from intrinsic difficulty or modeling artifacts (e.g.,
tokenization, data allocation, shared-parameter in-
terference); and which design choices mitigate in-
equities. We consolidate our findings into a set of
recommendations for tokenization, data sampling,
model architectures, and evaluation, highlighting
where evaluations confound learnability with tok-
enization or encoding artifacts (Table 1).
Our synthesis suggests that cross-linguistic
gaps rarely reflect intrinsic modeling complex-
ity. Instead, they arise via three mechanisms: (1)
shared-parameter training induces negative transfer
when typological diversity exceeds effective capac-
ity (Pfeiffer et al., 2022; Blevins et al., 2024; Chang
et al., 2024a); (2) tokenization and encoding frag-
ment words or penalize byte-heavy scripts, inflat-
ing sequence length

Chunk 2 · 1,991 chars

they arise via three mechanisms: (1)
shared-parameter training induces negative transfer
when typological diversity exceeds effective capac-
ity (Pfeiffer et al., 2022; Blevins et al., 2024; Chang
et al., 2024a); (2) tokenization and encoding frag-
ment words or penalize byte-heavy scripts, inflat-
ing sequence length without added meaning (Rust
et al., 2021; Arnett et al., 2024; Lundin et al., 2025;
Land and Arnett, 2025); and (3) data sampling and
evaluation misrepresent semantic exposure. Gaps
shrink when normalizing segmentation, encoding,
and exposure or explicitly allocating capacity, indi-
cating that difficulty stems from modeling choices.
This first systematic review of cross-linguistic
modeling difficulty research offers practical design
recommendations for multilingual LMs to achieve
balanced performance across diverse languages.
arXiv:2601.07220v3 [cs.CL] 10 Apr 2026

-- 1 of 17 --

Linguistic Factor 	Observed Artifact 	Modeling Mechanism Design Levers
Orthography and encoding gran-
ularity (§2.1)
Encoding inefficiency (byte pre-
mium); reduced effective expo-
sure under fixed budgets; incon-
sistent written signal
UTF-8 byte-length asymmetries
inflate sequence length and re-
duce effective training signal for
many non-Latin scripts
Byte-normalized sampling; script-
aware or language-adaptive tok-
enization; alternative encodings;
tokenizer-free models (§3.1, §3.2)
Morphology: productivity and
compounding (§2.2, §2.3)
Tokenization (over-segmentation,
longer sequences); diluted train-
ing signal across surface forms
BPE subwords misalign with mor-
pheme boundaries, yielding in-
consistent tokenization
Morphology-aware tokenization;
language-aware vocabularies (§3.1)
Information density and redun-
dancy (§2.5)
Unequal semantic coverage under
fixed token budgets
Token-based budgeting allocates
unequal information per unit
of training, confounding cross-
language comparisons
Information-, byte- or morpheme-
normalized sampling; adaptive scal-
ing of

Chunk 3 · 1,997 chars

anguage-aware vocabularies (§3.1)
Information density and redun-
dancy (§2.5)
Unequal semantic coverage under
fixed token budgets
Token-based budgeting allocates
unequal information per unit
of training, confounding cross-
language comparisons
Information-, byte- or morpheme-
normalized sampling; adaptive scal-
ing of effective exposure to lan-
guage data (§3.2, §3.4)
Typological and syntactic diver-
gence (§2.6, §2.4)
Negative transfer under shared pa-
rameters; degraded syntactic gen-
eralization
Shared capacity induces gradi-
ent conflict and representation
collapse when languages differ
strongly in structure
Modular capacity and language
adapters; typology-aware routing;
controlled sharing (§3.4)
Evaluation sensitivity to tok-
enization and encoding (§2.1,
§2.5)
Perplexity comparability con-
founded by segmentation and byte
length
Subword-level metrics conflate
segmentation decisions with pre-
dictability
Report character/morpheme-level
metrics; tokenization diagnostics
and typology-aware probes (§3.3)
Table 1: Linking linguistic properties to multilingual modeling artifacts, mechanisms, and design levers; section
references point to the supporting evidence discussed in the survey.
2 Linguistic Properties
Human languages evolve under multiple, some-
times competing objectives, producing systematic
trade-offs across morphology, syntax, and phonol-
ogy (Gibson et al., 2019). Information may be
densely packed within words or distributed across
syntax; flexible word order can be balanced by
overt marking such as case or agreement. These
features preserve overall communicative efficiency
despite wide typological diversity (Gibson et al.,
2019; Lian et al., 2023). Human acquisition aligns
with this view: children reliably acquire their am-
bient language, though the timing and difficulty
of specific constructions vary by typology rather
than defining a universal hierarchy of “hard” lan-
guages (Slobin, 1987; Berman, 2014).
We review linguistic properties associated

Chunk 4 · 1,996 chars

al., 2023). Human acquisition aligns
with this view: children reliably acquire their am-
bient language, though the timing and difficulty
of specific constructions vary by typology rather
than defining a universal hierarchy of “hard” lan-
guages (Slobin, 1987; Berman, 2014).
We review linguistic properties associated with
cross-linguistic performance variation, where
learnability refers to sample efficiency and pre-
dictive performance (perplexity, downstream ac-
curacy). Drawing on NLP, computational linguis-
tics, typology, and information theory, we show
how these properties influence tokenization, data
allocation, and architecture in multilingual LMs.
Each subsection defines a property, summarizes
evidence, and discusses factors affecting modeling
success.
2.1 Orthography
Orthography drives cross-linguistic dispari-
ties by shaping encoding efficiency and sur-
face inconsistency across writing systems.
Orthography concerns how linguistic content is
represented in writing. Writing systems differ in
granularity—whether symbols correspond roughly
to phonemes, syllables, or morphemes—and in
transparency, or how predictably written forms
map to sounds (Wydell and Butterworth, 1999;
Ziegler and Goswami, 2005). In humans, these
differences can influence the difficulty of literacy
acquisition, rather than spoken-language learning
itself (Verhoeven and Perfetti, 2022; Chang et al.,
2020). Although children learn to read at different
rates across languages (Lai et al., 2024; Seymour
et al., 2003), skilled adult reading is broadly sim-
ilar across orthographies (Schroeder et al., 2022;
Liversedge et al., 2016).
Language models, however, acquire language
directly from text, without prior phonological, lex-
ical, or semantic knowledge. Thus, orthographic
differences matter mostly as differences in repre-
sentation efficiency—how meaning is encoded into
bytes and tokens. We focus on three consequences
of orthography for modeling: encoding efficiency,
vocabulary allocation

Chunk 5 · 1,993 chars

rectly from text, without prior phonological, lex-
ical, or semantic knowledge. Thus, orthographic
differences matter mostly as differences in repre-
sentation efficiency—how meaning is encoded into
bytes and tokens. We focus on three consequences
of orthography for modeling: encoding efficiency,
vocabulary allocation in multilingual tokenizers,
and the surface consistency of written forms.
First, writing systems differ in how much infor-
mation they express per written symbol (Huang
et al., 2024) and in how those symbols are en-
coded under UTF-8 (Yergeau, 2003). Alphabetic
systems such as English distribute meaning across
sequences of letters that roughly track phonolog-
ical units, whereas logographic systems such as
Chinese often express comparable content with
fewer, denser characters (Tan et al., 2001). In abugi-
das such as Devanagari, characters are organized

-- 2 of 17 --

around consonantal bases, with vowels often ex-
pressed through attached diacritics (Velayuthan and
Sarveswaran, 2025). These differences lead to dis-
parities at the byte level: English characters oc-
cupy one byte, Arabic characters two, and Chinese
characters three; in Devanagari, what readers per-
ceive as a single written unit may consist of a base
consonant plus multiple diacritics, each separately
encoded in three bytes (Lemire and Muła, 2022;
Lavanya et al., 2005).
This creates a byte premium: under a fixed to-
ken or context budget, equal numbers of bytes do
not correspond to equal amounts of linguistic con-
tent across languages (Arnett et al., 2024). For
languages written in multi-byte scripts, the same
content occupies longer encoded sequences, reduc-
ing effective exposure during training and shrink-
ing the amount of text that fits within a context
window (Moon et al., 2025). The problem is not
only that some scripts yield longer sequences, but
that equal budgets systematically allocate unequal
amounts of usable input across languages.
Second, orthography affects how

Chunk 6 · 1,999 chars

effective exposure during training and shrink-
ing the amount of text that fits within a context
window (Moon et al., 2025). The problem is not
only that some scripts yield longer sequences, but
that equal budgets systematically allocate unequal
amounts of usable input across languages.
Second, orthography affects how efficiently to-
kenizers can build reusable units. Most modern
language models operate on subword vocabular-
ies learned from corpus statistics, often via byte-
pair encoding (BPE; Sennrich et al., 2016) starting
from a byte-level vocabulary. Tokenization effi-
ciency therefore depends not only on how much
information each written symbol carries, but also
on how easily recurring sequences can be merged
into reusable units, which tends to disadvantage
scripts whose characters are encoded with multiple
bytes (Sennrich et al., 2016; Zouhar et al., 2023b;
Kargaran et al., 2024). Thus, even under the same
vocabulary budget, this can produce large dispari-
ties in compression across languages.
In multilingual settings, shared-vocabulary to-
kenization can allocate capacity unevenly: high-
resource Latin-script languages tend to receive
larger and more informative subwords, while many
non-Latin scripts are segmented into shorter frag-
ments with less information per token (Petrov et al.,
2023; Ahia et al., 2023). Conversely, sharing or
aligning scripts can improve transfer. For unseen
or low-resource languages, transliteration into a
script already well represented in the model can
improve downstream performance (Muller et al.,
2021; Moosa et al., 2023); it can also improve
cross-lingual alignment, while recent work iden-
tifies script mismatch itself as a major barrier to
cross-script knowledge transfer (Moosa et al., 2023;
Bandarkar et al., 2026).
Byte-level tokenization also introduces distor-
tions specific to multi-byte scripts. Because BPE
merges frequent byte sequences rather than linguis-
tically meaningful units, tokens may split charac-
ters across

Chunk 7 · 1,999 chars

itself as a major barrier to
cross-script knowledge transfer (Moosa et al., 2023;
Bandarkar et al., 2026).
Byte-level tokenization also introduces distor-
tions specific to multi-byte scripts. Because BPE
merges frequent byte sequences rather than linguis-
tically meaningful units, tokens may split charac-
ters across byte boundaries or contain only partial
UTF-8 sequences (Firestone et al., 2025; Jang et al.,
2025; Land and Arnett, 2025). Such fragments
need not correspond to phonological, morphologi-
cal, or semantic structure. Unrelated symbols may
therefore partially overlap in tokenization simply
because they share bytes, while meaningful sub-
character structure may be obscured when token
boundaries fail to align with it (Haslett, 2025). A
straightforward character-level vocabulary would
avoid some of these artifacts, but in multilingual
settings even a base vocabulary of Unicode char-
acters would already be extremely large—on the
order of 130k types (Petrov et al., 2023; Ahia et al.,
2023). Consequently, high information density per
character does not necessarily yield equally effi-
cient tokenization.
Third, orthography can shape the surface con-
sistency of the training signal, because the same
underlying content may appear in multiple writ-
ten forms. This can arise from optional diacritics,
as in Arabic and Hebrew (Inoue et al., 2026; Gor-
man and Pinter, 2025); from widespread spelling
inconsistencies (Obeid et al., 2020; Adouane et al.,
2019); and from routine script alternation, such as
Simplified versus Traditional Chinese (Lyu et al.,
2025) or South Asian languages commonly written
in both native and Latin scripts (Roark et al., 2020).
As a result, models may observe the same mean-
ing dispersed across several orthographic variants
rather than concentrated in a single stable form.
Tokenizer-free models can reduce some of these
disparities (Pagnoni et al., 2025; Clark et al., 2022),
and recent methods such as MYTE mitigate script-
specific penalties

Chunk 8 · 1,997 chars

As a result, models may observe the same mean-
ing dispersed across several orthographic variants
rather than concentrated in a single stable form.
Tokenizer-free models can reduce some of these
disparities (Pagnoni et al., 2025; Clark et al., 2022),
and recent methods such as MYTE mitigate script-
specific penalties by introducing alternatives to
UTF-8 (Limisiewicz et al., 2024; Land and Arnett,
2025). However, these approaches do not elim-
inate orthographic asymmetries: byte premiums
still lengthen sequences even for tokenizer-free
models, character granularity remains incompara-
ble across scripts, and surface variation can still
fragment training signal. Overall, orthography con-
tributes to cross-linguistic disparities by making
encoding efficiency unequal across writing systems
and written signal more or less consistent before
modeling even begins.

-- 3 of 17 --

2.2 Morphological Complexity
Apparent performance gaps from rich mor-
phology often stem from segmentation qual-
ity, vocabulary budget, and data allocation.
Morphological complexity concerns how lan-
guages change or combine words to express distinc-
tions such as tense, number, case, or word meaning,
through processes such as inflection, derivation,
and compounding (Haspelmath and Sims, 2013).
Children ambiently learn the word-building pat-
terns of their native language, although the pace of
acquisition varies with the regularity and typology
of the system (Clark, 2017). For adult second-
language learners, these patterns can remain diffi-
cult, especially when they differ substantially from
those of the learner’s first language (Ellis, 2022).
Perhaps because of this, morphology has often
been assumed to increase language modeling diffi-
culty (Cotterell et al., 2018; Gerz et al., 2018; Park
et al., 2021; Mielke et al., 2019).
A common explanation is sparsity: morphologi-
cally rich languages realize each lexeme in many
surface forms, lowering the frequency of individ-
ual forms and increasing

Chunk 9 · 1,994 chars

as often
been assumed to increase language modeling diffi-
culty (Cotterell et al., 2018; Gerz et al., 2018; Park
et al., 2021; Mielke et al., 2019).
A common explanation is sparsity: morphologi-
cally rich languages realize each lexeme in many
surface forms, lowering the frequency of individ-
ual forms and increasing the burden on models to
generalize across paradigms even when the under-
lying rules are regular (Park et al., 2021). Early
multilingual studies seemed to support this view:
Cotterell et al. (2018) found that lower performance
correlates with morphological richness across 21
languages, and that this effect was largely removed
by lemmatization. Similarly, Gerz et al. (2018) re-
ported substantial morphology-related differences
across 50 typologically diverse languages.
Later work, however, suggests that much of this
apparent effect is not intrinsic to morphology itself.
Instead, it is amplified by how current pipelines seg-
ment, encode, and sample morphologically com-
plex languages (Mielke et al., 2019). In particu-
lar, morphology-aware segmentation substantially
reduces surprisal or performance gaps induced by
standard BPE (Park et al., 2021; Mager et al., 2022),
and analyses of WordPiece and BPE show that they
often fail to preserve morpheme structure in lan-
guages with complex inflection or derivation (Klein
and Tsarfaty, 2020; Lerner and Yvon, 2025). Re-
cent work further shows that once tokenization
or effective exposure are controlled, morpholog-
ical complexity is a much weaker predictor of LM
performance than previously assumed (Arnett and
Bergen, 2025; Asgari et al., 2025; Rust et al., 2021).
Taken together, these results suggest that mor-
phology affects language modeling through three
interacting mechanisms. First, tokenization deter-
mines whether recurring morphemes are preserved
as reusable units or broken into arbitrary fragments;
when morpheme boundaries are obscured, mod-
els can generalize less effectively across related
word

Chunk 10 · 1,998 chars

suggest that mor-
phology affects language modeling through three
interacting mechanisms. First, tokenization deter-
mines whether recurring morphemes are preserved
as reusable units or broken into arbitrary fragments;
when morpheme boundaries are obscured, mod-
els can generalize less effectively across related
word forms (Mager et al., 2022; Park et al., 2021;
Bostrom and Durrett, 2020; Gazit et al., 2025).
Second, rich morphology can increase sequence
cost when grammatical information is distributed
across more fragmented token sequences, so the
same content consumes more of the model’s con-
text, and equal token budgets result in less effec-
tive data exposure (Arnett et al., 2024; Foroutan
et al., 2025; Asgari et al., 2025). Third, rich mor-
phology can spread training signal across many
low-frequency forms, while multilingual vocabu-
lary learning may allocate less useful capacity to
the morphemes needed to represent them, leaving
less training signal for each individual form (Reif
et al., 2025; Park et al., 2021; Rust et al., 2021).
Morphology is therefore best treated as an in-
teraction effect rather than a uniform source of
difficulty for language modeling. The same ty-
pological feature can appear harmful under one
tokenizer or training budget and largely disappear
under another. When these factors are removed
(e.g., exposure is normalized for sequence-length
and vocabulary-allocation effects) the gap between
morphologically simpler and richer languages be-
comes substantially smaller.
2.3 Lexical Diversity and Vocabulary Size
Lexical diversity effects reflect tokenization
misalignment, not linguistic complexity.
Lexical diversity captures how many distinct lex-
ical types (lexemes and multiword expressions)
a corpus contains and how evenly their frequen-
cies are distributed. In human language acquisi-
tion, learning is strongly frequency-driven: high-
frequency forms are acquired earlier, while low-
frequency items are acquired later and remain
harder

Chunk 11 · 1,994 chars

how many distinct lex-
ical types (lexemes and multiword expressions)
a corpus contains and how evenly their frequen-
cies are distributed. In human language acquisi-
tion, learning is strongly frequency-driven: high-
frequency forms are acquired earlier, while low-
frequency items are acquired later and remain
harder to access (Ambridge et al., 2015), reflecting
the long-tailed (Zipfian) distribution of words (Zipf,
1935). This connects lexical diversity to learnabil-
ity: larger effective vocabularies entail longer tails
of rare items, raising sample complexity for learn-
ing word meanings even if speakers ultimately mas-
ter them.

-- 4 of 17 --

Cross-linguistic differences in lexical diversity
reflect lexicalization choices (what is expressed
as a single word versus a multiword expression)
and word-formation productivity (derivation and
compounding) (Booij, 2005; Baayen, 2009). For
instance, languages differ in how motion events are
lexicalized (e.g., encoding manner versus path in
the verb) (Talmy, 2000; Allen et al., 2007). Lex-
ical diversity is typically measured from word-
segmented corpora via type-frequency distribu-
tions, using indices like Type-Token Ratio and
its length-normalized variants (Covington and Mc-
Fall, 2010; McCarthy and Jarvis, 2010; Kettunen,
2014).1
In multilingual LM analyses, lexical diversity
predicts perplexity and transfer quality (Mielke
et al., 2019; Pelloni et al., 2022). However, perplex-
ity alone does not reveal which linguistic attributes
are learned (Meister and Cotterell, 2021). Output-
side analyses also examine linguistic diversity in
generations: Guo et al. (2025) evaluate model out-
puts along lexical, syntactic, and semantic diversity
dimensions and find that current LLMs fall short
of human-level linguistic diversity.
More generally, lexical diversity is a robust
predictor of difficulty for LMs: Head-POS en-
tropy (Dehouck and Denis, 2018) and raw type
counts can outperform typological features in pre-
dicting

Chunk 12 · 1,994 chars

al, syntactic, and semantic diversity
dimensions and find that current LLMs fall short
of human-level linguistic diversity.
More generally, lexical diversity is a robust
predictor of difficulty for LMs: Head-POS en-
tropy (Dehouck and Denis, 2018) and raw type
counts can outperform typological features in pre-
dicting language modeling difficulty (Mielke et al.,
2019), and tokenization-sensitive measures such as
Subword Evenness predict cross-lingual transfer
and multilingual perplexity (Pelloni et al., 2022).
Vocabulary-richness features also predict GPT-2
perplexity in English and interact with segmenta-
tion choices across typologies (Miaschi et al., 2021;
Parra, 2024).
However, much of this effect reflects segmen-
tation artifacts: in morphologically complex lan-
guages, frequency-based subwords fragment words
into many pieces, inflating sequence length and
reducing effective exposure per unit of semantic
content (Lundin et al., 2025). When training data
is equalized by byte premium or when tokenization
artifacts are otherwise controlled, apparent lexical-
diversity effects weaken substantially (Arnett and
Bergen, 2025). Lexical diversity, therefore, chal-
lenges LMs mainly under tokenization schemes
that misalign with linguistic structure. Vocabulary
size remains a strong predictor, mainly due to seg-
1In corpus linguistics, these indices typically treat “tokens”
as word tokens in a word-segmented corpus (not subword
tokens produced by NLP tokenizers).
mentation and data sparsity rather than inherent
lexical complexity.
2.4 Syntactic Features
Syntactic features affect modeling difficulty
indirectly through interactions with mor-
phology, vocabulary size, and tokenization.
Syntactic features describe how languages or-
ganize words into phrases and clauses, including
word order, case marking, and dependency struc-
ture. Syntax and morphology often provide alter-
native encodings for the same grammatical distinc-
tions: a language may rely more on word order

Chunk 13 · 1,995 chars

bulary size, and tokenization.
Syntactic features describe how languages or-
ganize words into phrases and clauses, including
word order, case marking, and dependency struc-
ture. Syntax and morphology often provide alter-
native encodings for the same grammatical distinc-
tions: a language may rely more on word order or
more on overt marking (case, agreement) to signal
roles and relations while preserving overall com-
municative efficiency (Sinnemäki, 2008; Lian et al.,
2023; Levshina, 2021; Fedzechkina et al., 2017). In
humans, these trade-offs influence which cues must
be tracked rather than creating a global difficulty
hierarchy.
The evidence on the effects of word order and
syntactic variation on language modelling difficulty
remains mixed (Mielke et al., 2019; Miaschi et al.,
2021), and syntax is generally less investigated
than morphology and tokenization in this context.
Analyses based on typological features typically
find that syntactic typology explains less variance
in surprisal and perplexity than tokenization or lex-
ical measures, with the largest effects occurring
when critical syntactic cues rely on morphemes
that subword tokenizers fragment (Mielke et al.,
2019). Case marking illustrates this interaction: un-
der standard BPE, languages with productive case
systems show higher surprisal, but morphology-
aware segmentation reduces the gap by segmenting
case morphemes more consistently (Park et al.,
2021), increasing their effective frequency and pre-
serving cues for syntactic roles. Word order effects
are mixed: basic order alone is not a reliable predic-
tor of perplexity (Mielke et al., 2019), and reducing
word-order-specific encoding can improve cross-
lingual adaptation (Liu et al., 2021). Dependency
distance metrics or embedding depth could, in prin-
ciple, affect language modelling difficulty, but to
the best of our knowledge, they have not yet been
studied in this context.
In sum, syntactic differences rarely govern lan-
guage modelling

Chunk 14 · 1,996 chars

coding can improve cross-
lingual adaptation (Liu et al., 2021). Dependency
distance metrics or embedding depth could, in prin-
ciple, affect language modelling difficulty, but to
the best of our knowledge, they have not yet been
studied in this context.
In sum, syntactic differences rarely govern lan-
guage modelling difficulty when taken in isolation;
rather they interact with tokenization artifacts that
inflate sequence length or obscure morphological

-- 5 of 17 --

cues (Arnett and Bergen, 2025). Consequently,
syntax-related performance gaps often reflect archi-
tectural constraints and English-centric positional
heuristics rather than inherent modelling difficulty.
Cognitively-motivated inductive biases, such as rel-
ative position encodings and syntactically informed
attention, can mitigate these issues (Shaw et al.,
2018; Dufter et al., 2022; Strubell et al., 2018;
Kuribayashi et al., 2024), but positional design
choices matter and multilingual evidence for newer
schemes (ALiBi, RoPE) remains mixed (Ravis-
hankar and Søgaard, 2021; Press et al., 2022; Su
et al., 2024).
Overall, syntactic variation shapes cross-
linguistic gaps mainly through interactions with
morphology, tokenization, and vocabulary size;
once normalized, syntax alone explains less of the
variance, though it remains important for general-
ization and cross-lingual transfer.
2.5 Information-Theoretic Measures
Entropy differences largely capture repre-
sentational choices, like word length or mor-
phological encoding, rather than the intrin-
sic learnability of a language.
Information-theoretic metrics quantify pre-
dictability and redundancy, but they also reflect
morphology, orthography, and other representa-
tional choices rather than pure learnability. Some
potentially informative metrics remain difficult
to define or measure, leaving room for future
work. Information-theoretic measures quantify
predictability and redundancy: entropy captures
average uncertainty, surprisal measures the

Chunk 15 · 1,989 chars

orthography, and other representa-
tional choices rather than pure learnability. Some
potentially informative metrics remain difficult
to define or measure, leaving room for future
work. Information-theoretic measures quantify
predictability and redundancy: entropy captures
average uncertainty, surprisal measures the nega-
tive log probability of an observed unit, and com-
pression rate approximates achievable code length
under efficient encoding. These metrics provide
a principled way to compare languages in terms
of predictability and coding efficiency, linking
cross-entropy in LMs to fundamental data statis-
tics (Shannon, 1948). In human processing, sur-
prisal theory formalizes the connection between
predictability and cognitive difficulty (Hale, 2001;
Smith and Levy, 2013).
A central insight from psycholinguistics and
quantitative linguistics is that languages main-
tain stable information rates through compensatory
trade-offs. Spoken languages converge on near-
constant bits-per-second rates (Coupé et al., 2019;
Jaeger, 2010), and morphologically rich languages
exhibit higher per-word entropy because they en-
code more information per word (Bentz et al., 2017;
Koplenig et al., 2025).
Large-scale studies show systematic differences
in entropy at the character and word levels, bal-
anced by structural features like word length (Ko-
plenig et al., 2025). This reflects Uniform Infor-
mation Density (UID), where languages spread
information to keep local surprisal relatively sta-
ble (Jaeger, 2010; Levy and Jaeger, 2006), though
UID might not be a universal law (Meister et al.,
2021).
For LMs, entropy interacts with tokenization and
sampling: high-entropy sequences require more
data, and token-based budgets can exaggerate dif-
ficulty when scripts or tokenizers inflate sequence
length. Byte-inefficient scripts and fragmented to-
kenization can inflate apparent entropy without
adding semantic content (Rust et al., 2021). At
the human-processing level, LM

Chunk 16 · 1,990 chars

g: high-entropy sequences require more
data, and token-based budgets can exaggerate dif-
ficulty when scripts or tokenizers inflate sequence
length. Byte-inefficient scripts and fragmented to-
kenization can inflate apparent entropy without
adding semantic content (Rust et al., 2021). At
the human-processing level, LM surprisal esti-
mates can predict reading times across multiple
languages, suggesting that surprisal is a useful–but
imperfect–proxy for cognitive difficulty in cross-
linguistic comparisons (Levy, 2008; Goodkind and
Bicknell, 2018; Hollenstein et al., 2021; de Varda
and Marelli, 2022; Wilcox et al., 2023; Kuperman
et al., 2025).
Controlling for encoding efficiency and tokeniza-
tion substantially reduces cross-linguistic surprisal
gaps and narrows perplexity differences, indicating
that part of the observed entropy variation reflects
representation and sampling confounds (Arnett
et al., 2024; Rust et al., 2021; Foroutan et al., 2025;
Tsvetkov and Kipnis, 2024). However, perplexity
remains an imperfect proxy for downstream perfor-
mance: low perplexity can coexist with weak ro-
bustness, particularly in low-resource settings (Lui-
tel et al., 2025; Gurgurov et al., 2025; Zhuang and
Sun, 2025; Liu et al., 2022; Lourie et al., 2025).
Compression-based metrics provide architecture-
independent baselines by evaluating predictabil-
ity at fixed representational units. Bits-per-
character/byte (BPC) estimates cross-entropy per
character/byte, reducing reliance on subword to-
kenization and enabling comparisons that align
with LM perplexity and transfer (De Souza et al.,
2024; Tsvetkov and Kipnis, 2024). However, BPC
is encoding-sensitive: UTF-8 byte premiums and
script granularity distort comparisons even after
byte normalization (Arnett et al., 2024; Moon et al.,
2025; Foroutan et al., 2025; Deletang et al., 2024).
In sum, information-theoretic differences re-

-- 6 of 17 --

flect language encoding choices rather than in-
herent learnability, and

Chunk 17 · 1,990 chars

ve: UTF-8 byte premiums and
script granularity distort comparisons even after
byte normalization (Arnett et al., 2024; Moon et al.,
2025; Foroutan et al., 2025; Deletang et al., 2024).
In sum, information-theoretic differences re-

-- 6 of 17 --

flect language encoding choices rather than in-
herent learnability, and normalizing for density,
byte length, or morphemes reduces many cross-
linguistic gaps.
2.6 Typological Distance
Typological diversity can cause difficulty in
shared-parameter & -vocabulary settings.
Let’s now move from single-language difficulty
to cross-linguistic transfer, where patterns learned
in one language can improve performance in others.
First, languages differ in many ways; this di-
versity represents alternative solutions to similar
communicative constraints (Bickel, 2015; Comrie,
1989; Ponti et al., 2019). We can measure the
difference between a pair of languages via their
typological distance, which captures similarity in
grammar (syntax, morphology), lexicon (cognates,
word choice), and phonology, or via genealogical
relatedness, which reflects shared ancestry.
In human L2 acquisition, linguistic distance pre-
dicts attainment and learning difficulty (Chiswick
and Miller, 2005; Isphording and Otten, 2014;
Schepens et al., 2020). Similar results have been
suggested in models. Early multilingual mod-
els show that shared vocabularies bias represen-
tations toward related languages (Pires et al., 2019;
Conneau et al., 2020b), with mBERT organizing
languages along genealogical lines (Rama et al.,
2020).
At a finer level, WALS-based similarity (Dryer
and Haspelmath, 2013) predicts transfer quality be-
yond raw resource size (Lin et al., 2019), with fea-
tures like word order and head direction particularly
predictive (K et al., 2020; Blaschke et al., 2025).
Tokenization-based diagnostics like Subword Even-
ness (Pelloni et al., 2022) and information-theoretic
metrics like Information Parity (Tsvetkov and Kip-
nis, 2024) can predict

Chunk 18 · 1,993 chars

size (Lin et al., 2019), with fea-
tures like word order and head direction particularly
predictive (K et al., 2020; Blaschke et al., 2025).
Tokenization-based diagnostics like Subword Even-
ness (Pelloni et al., 2022) and information-theoretic
metrics like Information Parity (Tsvetkov and Kip-
nis, 2024) can predict cross-lingual transfer. Vo-
cabulary overlap can sometimes predict positive
transfer but sometimes be detrimental, depending
on the exact task (Limisiewicz et al., 2023). For
example Kallini et al. (2025) found that even vo-
cabulary overlap of semantically unrelated words
can be useful.
At larger scales, the curse of multilinguality
refers to declining per-language performance as
more languages share parameters (Conneau et al.,
2020a), with low-resource and typologically distant
languages suffering most (Lauscher et al., 2020).
Typological distance amplifies interference and ex-
acerbates vocabulary fragmentation. Controlled
studies show that adding related languages im-
proves low-resource performance, but may sur-
prisingly hurt performance on high-resource lan-
guages (Chang et al., 2024a). Gradient conflicts
are common when distant languages are trained
jointly (Wang et al., 2020).
In summary, shared-parameter training with sim-
ilar languages can help, but can also induce inter-
ference as typological diversity grows. Modular
approaches that allocate language-specific capac-
ity reduce conflict while preserving positive trans-
fer (Pfeiffer et al., 2022; Blevins et al., 2024).
2.7 Summarizing Linguistic Properties
Across features, many performance gaps arise from
mismatches between linguistic structure and mod-
eling choices rather than intrinsic language diffi-
culty. Tokenization and encoding can fragment
cues and lengthen sequences, sampling can create
unequal exposure, and shared-parameter training
can cause negative transfer when typological di-
versity exceeds capacity. These factors also con-
found evaluation: low perplexity does not

Chunk 19 · 1,999 chars

ther than intrinsic language diffi-
culty. Tokenization and encoding can fragment
cues and lengthen sequences, sampling can create
unequal exposure, and shared-parameter training
can cause negative transfer when typological di-
versity exceeds capacity. These factors also con-
found evaluation: low perplexity does not guaran-
tee robust downstream performance, especially in
low-resource settings. When segmentation, encod-
ing, and exposure are normalized, many apparent
cross-linguistic gaps shrink, showing that current
modeling paradigms, not linguistic diversity itself,
drive much of the disparity.
3 Design Implications
These findings motivate design implications for
tokenization, sampling, architecture, evaluation,
and corpus construction; we focus on interventions
most directly supported by the surveyed evidence.
3.1 Tokenization: From Frequency-Based
Segments to Linguistically Informed Units
Tokenization is one of the main design levers
through which cross-linguistic disparities are ei-
ther amplified or mitigated in multilingual lan-
guage models. Frequency-based subword algo-
rithms such as BPE and WordPiece often fragment
morphemes and disproportionately disadvantage
multi-byte scripts, inflating sequence length and
compute cost while obscuring linguistically mean-
ingful units (Park et al., 2021; Ali et al., 2024;
Land and Arnett, 2025; Petrov et al., 2023; Arnett
et al., 2024; Lundin et al., 2025). Across studies,

-- 7 of 17 --

segmentation quality explains a substantial share
of cross-linguistic performance differences, and
morphology-, script-, and encoding-aware meth-
ods improve performance and efficiency across
languages (Limisiewicz et al., 2024; Asgari et al.,
2025; Mager et al., 2022).
Assessing tokenization quality remains nontriv-
ial. Common diagnostics include compression,
sequence length, corpus token count, vocabulary-
balance measures, and related distributional met-
rics, but improvements on these measures do not
always translate directly

Chunk 20 · 1,989 chars

al., 2024; Asgari et al.,
2025; Mager et al., 2022).
Assessing tokenization quality remains nontriv-
ial. Common diagnostics include compression,
sequence length, corpus token count, vocabulary-
balance measures, and related distributional met-
rics, but improvements on these measures do not
always translate directly into better downstream
performance (Schmidt et al., 2024; Zouhar et al.,
2023a; Goldman et al., 2024; Dagan et al., 2024;
Gallé, 2019). Tokenization should therefore be
evaluated jointly in terms of efficiency, downstream
utility, and parity across languages.
The surveyed work points to three broad inter-
vention families. First, morphology-aware and
language-adaptive tokenization can better preserve
recurring morphemes and improve parity across
languages (Asgari et al., 2025; Foroutan et al.,
2025; Ahia et al., 2024). Second, alternative
byte encodings can reduce disparities that arise
when standard UTF-8 and shared-vocabulary BPE
allocate capacity unevenly across scripts or cre-
ate partial-byte artifacts (Limisiewicz et al., 2024;
Land and Arnett, 2025). Third, tokenizer-free and
character-level models can reduce script-specific
tokenization penalties by avoiding fixed subword
vocabularies altogether and reduce cross-lingual
performance gaps, though typically at the cost of
longer sequences and higher compute (Pagnoni
et al., 2025; Clark et al., 2022).
Implication 1: Treat tokenization as a first-class
multilingual design choice rather than a fixed pre-
processing step. Prefer morphology-aware, script-
aware, or language-adaptive tokenizers that better
preserve meaningful units and allocate vocabulary
capacity more evenly across languages. Consider
alternative byte encodings to reduce systematic
disadvantages introdcued by standard UTF-8 en-
coding. Tokenizer-free models can further reduce
cross-lingual disparities, at the cost of longer se-
quences and higher compute.
3.2 Data Sampling and Byte Normalization
Token-based sampling penalizes

Chunk 21 · 1,995 chars

ss languages. Consider
alternative byte encodings to reduce systematic
disadvantages introdcued by standard UTF-8 en-
coding. Tokenizer-free models can further reduce
cross-lingual disparities, at the cost of longer se-
quences and higher compute.
3.2 Data Sampling and Byte Normalization
Token-based sampling penalizes byte-heavy
scripts, whereas byte-normalized sampling
narrows gaps (Arnett et al., 2024; Wei et al.,
2021), motivating sampling strategies that target
semantic exposure rather than raw token counts
(e.g., UniMax, byte-premium scaling; Chung
et al., 2023; Chang et al., 2024b; He et al., 2025).
Implication 2: Pretraining should use byte-
normalized, information-normalized, or
morpheme-normalized sampling for equal
semantic coverage across languages. Data balanc-
ing should reflect linguistic diversity rather than
corpus availability, correcting for segmentation
bias, type proliferation, and script inefficiency.
3.3 Beyond One-Size-Fits-All Benchmarks
Current multilingual benchmarks often conflate
linguistic difficulty with tokenization artifacts
or dataset size. Perplexity is sensitive to tok-
enizer choice, whereas character-, morpheme-, and
byte-level metrics provide more robust compar-
isons (Tsvetkov and Kipnis, 2024; Kanjirangat
et al., 2025). Tokenizer-quality diagnostics and
standardized reporting help disentangle measure-
ment bias from true modeling capability (Chelom-
bitko et al., 2024; Bender and Friedman, 2018).
Cross-linguistic syntactic challenge suites like
CLAMS offer controlled tests of generalization and
reveal consistent gaps between monolingual and
multilingual models (Mueller et al., 2020). For mor-
phology, community benchmarks (e.g. SIGMOR-
PHON) provide fine-grained metrics that comple-
ment perplexity-based ones (Cotterell et al., 2017).
Probe-based evaluations show representational
disparities, such as weaker subject/object identi-
fication in case-rich languages when models are
trained on fixed word order (Papadimitriou et

Chunk 22 · 1,998 chars

ty benchmarks (e.g. SIGMOR-
PHON) provide fine-grained metrics that comple-
ment perplexity-based ones (Cotterell et al., 2017).
Probe-based evaluations show representational
disparities, such as weaker subject/object identi-
fication in case-rich languages when models are
trained on fixed word order (Papadimitriou et al.,
2021), motivating typology-aware competency as-
sessments.
Implication 3: Evaluation should use linguisti-
cally informed metrics and typology-aware probes
beyond subword perplexity. Benchmarks should
disaggregate performance by morphology, script,
and word order to avoid masking inequities.
3.4 Balanced Corpora and Pretraining
Pretraining corpus choice strongly shapes multi-
lingual performance. Web data overindexes En-
glish and underrepresents high-vitality languages,
correlating poorly with global populations (Dunn,
2020; Dunn and Adams, 2020; Mehmood et al.,
2017; Mor, 2025; Joshi et al., 2020; Khanna and
Li, 2025; Bella et al., 2023). Corpus composition
often tracks speaker counts over linguistic diversity,
while data statements support accountable multilin-
gual reporting (Bender and Friedman, 2018).

-- 8 of 17 --

Multilingual corpora favor Indo-European lan-
guages and underrepresent complex morphology,
minority scripts, and small populations. Balancing
must account not only for token counts but also for
linguistic density: information per token, morpho-
logical productivity, and rare-form distributions.
Resources such as UniMorph and high-coverage
dependency treebanks can support typology-aware
evaluation of coverage, even without direct training
supervision (Nivre et al., 2020). These findings mo-
tivate moving beyond a single monolithic model:
leveraging language similarity or tailoring compo-
nents to typologically related clusters can boost
learning for low-resource languages without forc-
ing uniform representations (Malkin et al., 2022).
Implication 4: Corpus design should explicitly
encode linguistic diversity by accounting for

Chunk 23 · 1,991 chars

single monolithic model:
leveraging language similarity or tailoring compo-
nents to typologically related clusters can boost
learning for low-resource languages without forc-
ing uniform representations (Malkin et al., 2022).
Implication 4: Corpus design should explicitly
encode linguistic diversity by accounting for rep-
resentational efficiency and linguistic density, en-
suring that languages with high morphological or
typological variation receive equivalent semantic
coverage, not merely equivalent token counts.
4 Conclusions and Future Work
Multilingual performance is shaped less by inher-
ent linguistic complexity than by design choices:
tokenization, data allocation, and interference-
aware training. Future work should explore
language-adaptive strategies: predicting data and
capacity needs per language, designing curricula
that prioritize transfer from related languages, and
developing architectures that dynamically allocate
resources across typologically distinct languages.
Truly low-resource and endangered languages re-
quire innovative approaches under scarcity.
Evaluation must also evolve: metrics should re-
flect cross-linguistic differences in task difficulty
while capturing fairness and accessibility. Aligning
model inductive biases with human learning can
guide more robust multilingual NLP.
By embracing linguistic diversity as a design
principle, we can build models that are more adapt-
able, equitable, and capable of supporting the full
spectrum of the world’s languages.
5 Limitations
Despite our analysis of multilingual performance,
several limitations warrant consideration. First,
our work focuses primarily on representation- and
architecture-driven factors (tokenization, encod-
ing, shared parameters) and does not fully capture
other potential sources of difficulty, such as prag-
matic, discourse-level, or sociolinguistic phenom-
ena, which may affect real-world usage.
Second, most of our empirical insights rely on
pretrained models and

Chunk 24 · 1,997 chars

- and
architecture-driven factors (tokenization, encod-
ing, shared parameters) and does not fully capture
other potential sources of difficulty, such as prag-
matic, discourse-level, or sociolinguistic phenom-
ena, which may affect real-world usage.
Second, most of our empirical insights rely on
pretrained models and standard evaluation datasets,
which may underrepresent truly low-resource or
endangered languages. Data sparsity, orthographic
variation, and non-standardized corpora in such lan-
guages could yield patterns not observed in higher-
resource languages.
Third, while we consider cross-linguistic typol-
ogy, our analysis is largely English-centric in ar-
chitecture and benchmark design, which may bias
conclusions about syntax, word order, and posi-
tional encoding effects.
Fourth, information-theoretic measures capture
correlations with morphology and orthography
rather than intrinsic learnability. Metrics for hi-
erarchical structure, discourse-level predictability,
or multimodal signals remain underexplored, leav-
ing important aspects of language modeling outside
our current framework.
Finally, despite our thorough literature survey,
it is possible that relevant works were overlooked.
We welcome pointers to such papers to keep this
survey up to date.
Addressing these limitations in future work will
be crucial for building truly language-adaptive, eq-
uitable, and robust multilingual models.
AI usage: The paper used AI assistance for
rephrasing, for finding additional relevant papers,
and occasionally for summarizing them.
References
Wafia Adouane, Jean-Philippe Bernardy, and Simon
Dobnik. 2019. Normalising non-standardised orthog-
raphy in Algerian code-switched user-generated data.
In Proceedings of the 5th Workshop on Noisy User-
generated Text (W-NUT 2019), pages 131–140, Hong
Kong, China. Association for Computational Linguis-
tics.
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin
Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, and
Noah A Smith.

Chunk 25 · 1,994 chars

in Algerian code-switched user-generated data.
In Proceedings of the 5th Workshop on Noisy User-
generated Text (W-NUT 2019), pages 131–140, Hong
Kong, China. Association for Computational Linguis-
tics.
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin
Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, and
Noah A Smith. 2024. Magnet: Improving the mul-
tilingual fairness of language models with adaptive
gradient-based tokenization. Advances in Neural In-
formation Processing Systems, 37:47790–47814.
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo
Kasai, David R Mortensen, Noah A Smith, and Yulia
Tsvetkov. 2023. Do all languages cost the same? tok-
enization in the era of commercial language models.
arXiv preprint arXiv:2305.13707.
Doyin Akindotuni. 2025. Resource asymmetry in mul-
tilingual nlp: A comprehensive review and critique.

-- 9 of 17 --

Journal of Computer and Communications, 13(7):14–
47.
Mehdi Ali, Michael Fromm, Klaudia Thellmann,
Richard Rutmann, Max Lübbering, Johannes Lev-
eling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper
Buschhoff, Charvi Jain, Alexander Weber, Lena Ju-
rkschat, Hammam Abdelwahab, Chelsea John, Pedro
Ortiz Suarez, Malte Ostendorff, Samuel Weinbach,
Rafet Sifa, and 2 others. 2024. Tokenizer choice
for LLM training: Negligible or crucial? In Find-
ings of the Association for Computational Linguis-
tics: NAACL 2024, pages 3907–3924, Mexico City,
Mexico. Association for Computational Linguistics.
Shanley Allen, Aslı Özyürek, Sotaro Kita, Amanda
Brown, Reyhan Furman, Tomoko Ishizuka, and Mi-
hoko Fujii. 2007. Language-specific and universal
influences in children’s syntactic packaging of man-
ner and path: A comparison of English, Japanese,
and Turkish. Cognition, 102(1):16–48.
Ben Ambridge, Evan Kidd, Caroline F. Rowland, and
Anna L. Theakston. 2015. The ubiquity of frequency
effects in first language acquisition. Journal of Child
Language, 42(2):239–273.
Catherine Arnett and Benjamin Bergen. 2025. Why do
language models perform

Chunk 26 · 1,998 chars

on of English, Japanese,
and Turkish. Cognition, 102(1):16–48.
Ben Ambridge, Evan Kidd, Caroline F. Rowland, and
Anna L. Theakston. 2015. The ubiquity of frequency
effects in first language acquisition. Journal of Child
Language, 42(2):239–273.
Catherine Arnett and Benjamin Bergen. 2025. Why do
language models perform worse for morphologically
complex languages? In Proceedings of the 31st Inter-
national Conference on Computational Linguistics,
pages 6607–6623, Abu Dhabi, UAE. Association for
Computational Linguistics.
Catherine Arnett, Tyler A Chang, and Benjamin Bergen.
2024. A bit of a problem: Measurement disparities
in dataset sizes across languages. In Proceedings of
the 3rd Annual Meeting of the Special Interest Group
on Under-resourced Languages@ LREC-COLING
2024, pages 1–9.
Ehsaneddin Asgari, Yassine El Kheir, and Mohammad
Ali Sadraei Javaheri. 2025. Morphbpe: A morpho-
aware tokenizer bridging linguistic complexity for
efficient llm training across morphologies. arXiv
preprint arXiv:2502.00894.
R. Harald Baayen. 2009. Corpus linguistics in mor-
phology: Morphological productivity. In Corpus
Linguistics: An International Handbook, volume 2,
pages 899–919. De Gruyter Mouton.
Lucas Bandarkar, Alan Ansell, and Trevor Cohn. 2026.
Large reasoning models struggle to transfer para-
metric knowledge across scripts. arXiv preprint
arXiv:2603.17070.
Gábor Bella, Paula Helm, Gertraud Koch, and Fausto
Giunchiglia. 2023. Towards bridging the digital lan-
guage divide. CoRR, abs/2307.13405.
Emily M. Bender and Batya Friedman. 2018. Data
statements for natural language processing: Toward
mitigating system bias and enabling better science.
Transactions of the Association for Computational
Linguistics, 6:587–604.
Christian Bentz, Dimitrios Alikaniotis, Michael
Cysouw, and Ramon Ferrer-i Cancho. 2017. The en-
tropy of words—learnability and expressivity across
more than 1000 languages. Entropy, 19(6):275.
Ruth A. Berman. 2014. Cross-linguistic comparisons in
child language

Chunk 27 · 1,997 chars

e Association for Computational
Linguistics, 6:587–604.
Christian Bentz, Dimitrios Alikaniotis, Michael
Cysouw, and Ramon Ferrer-i Cancho. 2017. The en-
tropy of words—learnability and expressivity across
more than 1000 languages. Entropy, 19(6):275.
Ruth A. Berman. 2014. Cross-linguistic comparisons in
child language research. Journal of Child Language,
41:26 – 37.
Balthasar Bickel. 2015. Distributional typology: Statis-
tical inquiries into the dynamics of linguistic diver-
sity. In The Oxford Handbook of Linguistic Analysis.
Oxford University Press.
Verena Blaschke, Masha Fedzechkina, and Maartje
Ter Hoeve. 2025. Analyzing the effect of linguis-
tic similarity on cross-lingual transfer: Tasks and
experimental setups matter. In Findings of the As-
sociation for Computational Linguistics: ACL 2025,
pages 8653–8684, Vienna, Austria. Association for
Computational Linguistics.
Terra Blevins, Tomasz Limisiewicz, Suchin Gururan-
gan, Margaret Li, Hila Gonen, Noah A. Smith, and
Luke Zettlemoyer. 2024. Breaking the curse of multi-
linguality with cross-lingual expert language models.
In Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing, EMNLP
2024, Miami, FL, USA, November 12-16, 2024, pages
10822–10837. Association for Computational Lin-
guistics.
Geert Booij. 2005. Compounding and derivation: Evi-
dence for construction morphology. In Wolfgang U.
Dressler, Dieter Kastovsky, Oskar E. Pfeiffer, and
Franz Rainer, editors, Morphology and its Demar-
cations, pages 109–132. John Benjamins Publishing
Company.
Kaj Bostrom and Greg Durrett. 2020. Byte pair encod-
ing is suboptimal for language model pretraining. In
Findings of the Association for Computational Lin-
guistics: EMNLP 2020, pages 4617–4624, Online.
Association for Computational Linguistics.
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and
Benjamin Bergen. 2024a. When is multilinguality
a curse? language modeling for 250 high- and low-
resource languages. In Proceedings of the

Chunk 28 · 1,982 chars

the Association for Computational Lin-
guistics: EMNLP 2020, pages 4617–4624, Online.
Association for Computational Linguistics.
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and
Benjamin Bergen. 2024a. When is multilinguality
a curse? language modeling for 250 high- and low-
resource languages. In Proceedings of the 2024 Con-
ference on Empirical Methods in Natural Language
Processing, pages 4074–4096.
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and
Benjamin K. Bergen. 2024b. Goldfish: Monolin-
gual language models for 350 languages. CoRR,
abs/2408.10441.
Ya-Ning Chang, JSH Taylor, Kathleen Rastle, and
Padraic Monaghan. 2020. The relationships between
oral language and reading instruction: Evidence from
a computational model of reading. Cognitive Psy-
chology, 123:101336.

-- 10 of 17 --

Iaroslav Chelombitko, Egor Safronov, and Aleksey
Komissarov. 2024. Qtok: A comprehensive frame-
work for evaluating multilingual tokenizer qual-
ity in large language models. arXiv preprint
arXiv:2410.12989.
Barry R Chiswick and Paul W Miller. 2005. Linguis-
tic distance: A quantitative measure of the distance
between english and other languages. Journal of mul-
tilingual and multicultural development, 26(1):1–11.
Hyung Won Chung, Xavier Garcia, Adam Roberts,
Yi Tay, Orhan Firat, Sharan Narang, and Noah Con-
stant. 2023. Unimax: Fairer and more effective
language sampling for large-scale multilingual pre-
training. In The Eleventh International Conference
on Learning Representations, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023. OpenReview.net.
Eve V Clark. 2017. Morphology in language acquisi-
tion. The handbook of morphology, pages 374–389.
Jonathan H. Clark, Dan Garrette, Iulia Turc, and John
Wieting. 2022. Canine: Pre-training an efficient
tokenization-free encoder for language representa-
tion. Transactions of the Association for Computa-
tional Linguistics, 10:73–91.
Bernard Comrie. 1989. Language Universals and Lin-
guistic Typology: Syntax and Morphology, 2

Chunk 29 · 1,993 chars

H. Clark, Dan Garrette, Iulia Turc, and John
Wieting. 2022. Canine: Pre-training an efficient
tokenization-free encoder for language representa-
tion. Transactions of the Association for Computa-
tional Linguistics, 10:73–91.
Bernard Comrie. 1989. Language Universals and Lin-
guistic Typology: Syntax and Morphology, 2 edition.
University of Chicago Press, Chicago, IL.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Édouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020a. Unsupervised
cross-lingual representation learning at scale. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8440–
8451.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettle-
moyer, and Veselin Stoyanov. 2020b. Emerging
cross-lingual structure in pretrained language mod-
els. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
6022–6034, Online. Association for Computational
Linguistics.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman,
Géraldine Walther, Ekaterina Vylomova, Patrick Xia,
Manaal Faruqui, Sandra Kübler, David Yarowsky,
Jason Eisner, and Mans Hulden. 2017. CoNLL–
SIGMORPHON 2017 shared task: Universal mor-
phological reinflection in 52 languages. In Proceed-
ings of the CoNLL SIGMORPHON 2017 Shared Task:
Universal Morphological Reinflection, pages 1–30.
Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and
Brian Roark. 2018. Are all languages equally hard
to language-model? In Proceedings of the 2018
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, Volume 2 (Short Papers),
pages 536–541.
Christophe Coupé, Yoon Mi Oh, Dan Dediu, and
François Pellegrino. 2019. Different languages, sim-
ilar encoding efficiency: Comparable information
rates across the human communicative niche. Sci-
ence Advances, 5(9):eaaw2594.
Michael A. Covington and Joe D.

Chunk 30 · 1,997 chars

anguage Technologies, Volume 2 (Short Papers),
pages 536–541.
Christophe Coupé, Yoon Mi Oh, Dan Dediu, and
François Pellegrino. 2019. Different languages, sim-
ilar encoding efficiency: Comparable information
rates across the human communicative niche. Sci-
ence Advances, 5(9):eaaw2594.
Michael A. Covington and Joe D. McFall. 2010. Cutting
the Gordian knot: The moving-average type–token
ratio (MATTR). Journal of Quantitative Linguistics,
17(2):94–100.
Gautier Dagan, Gabriel Synnaeve, and Baptiste Roziere.
2024. Getting the most out of your tokenizer for
pre-training and domain adaptation. In Forty-first
International Conference on Machine Learning.
John Dang, Shivalika Singh, Daniel D’souza, Arash
Ahmadian, Alejandro Salamanca, Madeline Smith,
Aidan Peppin, Sungjin Hong, Manoj Govindassamy,
Terrence Zhao, Sandra Kublik, Meor Amer, Viraat
Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom
Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-
Berliac, and 26 others. 2024. Aya expanse: Combin-
ing research breakthroughs for a new multilingual
frontier. ArXiv, abs/2412.04261.
Leandro De Souza, Thales Almeida, Roberto Lotufo,
and Rodrigo Frassetto Nogueira. 2024. Measuring
cross-lingual transfer in bytes. In Proceedings of
the 2024 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies (Volume 1: Long Pa-
pers), pages 7526–7537.
Andrea de Varda and Marco Marelli. 2022. The effects
of surprisal across languages: Results from native
and non-native reading. In Findings of the Associa-
tion for Computational Linguistics: AACL-IJCNLP
2022, pages 138–144. Association for Computational
Linguistics.
Mathieu Dehouck and Pascal Denis. 2018. A frame-
work for understanding the role of morphology in
Universal Dependency parsing. In Proceedings of the
2018 Conference on Empirical Methods in Natural
Language Processing, pages 2864–2870, Brussels,
Belgium. Association for Computational Linguistics.
Gregoire Deletang, Anian

Chunk 31 · 1,998 chars

hieu Dehouck and Pascal Denis. 2018. A frame-
work for understanding the role of morphology in
Universal Dependency parsing. In Proceedings of the
2018 Conference on Empirical Methods in Natural
Language Processing, pages 2864–2870, Brussels,
Belgium. Association for Computational Linguistics.
Gregoire Deletang, Anian Ruoss, Paul-Ambroise
Duquenne, Elliot Catt, Tim Genewein, Christo-
pher Mattern, Jordi Grau-Moya, Li Kevin Wenliang,
Matthew Aitchison, Laurent Orseau, Marcus Hutter,
and Joel Veness. 2024. Language modeling is com-
pression. In The Twelfth International Conference on
Learning Representations.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In NAACL-HLT 2019, pages 4171–4186.
Matthew S. Dryer and Martin Haspelmath. 2013. The
World Atlas of Language Structures Online. Max
Planck Institute for Evolutionary Anthropology,
Leipzig.

-- 11 of 17 --

Philipp Dufter, Martin Schmitt, and Hinrich Schütze.
2022. Position information in transformers: An
overview. Computational Linguistics, 48(3):733–
763.
Jonathan Dunn. 2020. Mapping languages: the corpus
of global language use. Language Resources and
Evaluation, 54(4):999–1018.
Jonathan Dunn and Benjamin Adams. 2020. Mapping
languages and demographics with georeferenced cor-
pora. arXiv preprint arXiv:2004.00809.
Nick C Ellis. 2022. Second language learning of mor-
phology. Journal of the European Second Language
Association, 6(1).
Maryia Fedzechkina, Elissa L Newport, and T Florian
Jaeger. 2017. Balancing effort and information trans-
mission during language acquisition: Evidence from
word order and case marking. Cognitive science,
41(2):416–446.
Preston Firestone, Shubham Ugare, Gagandeep Singh,
and Sasa Misailovic. 2025. UTF-8 plumbing: Byte-
level tokenizers unavoidably enable LLMs to gen-
erate ill-formed UTF-8. In Second Conference on
Language Modeling.
Negar Foroutan, Clara Meister, Deepanway Paul,

Chunk 32 · 1,998 chars

rder and case marking. Cognitive science,
41(2):416–446.
Preston Firestone, Shubham Ugare, Gagandeep Singh,
and Sasa Misailovic. 2025. UTF-8 plumbing: Byte-
level tokenizers unavoidably enable LLMs to gen-
erate ill-formed UTF-8. In Second Conference on
Language Modeling.
Negar Foroutan, Clara Meister, Deepanway Paul, Joel
Niklaus, and 1 others. 2025. Parity-aware byte-pair
encoding: Improving cross-lingual fairness in tok-
enization. arXiv preprint arXiv:2508.04796.
Matthias Gallé. 2019. Investigating the effectiveness of
BPE: The power of shorter sequences. In Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 1375–1381, Hong
Kong, China. Association for Computational Linguis-
tics.
Bar Gazit, Shaltiel Shmidman, Avi Shmidman, and Yu-
val Pinter. 2025. Splintering nonconcatenative lan-
guages for better tokenization. In Findings of the As-
sociation for Computational Linguistics: ACL 2025,
pages 22405–22417, Vienna, Austria. Association
for Computational Linguistics.
Daniela Gerz, Ivan Vuli´c, Edoardo Maria Ponti, Roi
Reichart, and Anna Korhonen. 2018. On the rela-
tion between linguistic typology and (limitations of)
multilingual language modeling. In Proceedings of
the 2018 Conference on Empirical Methods in Natu-
ral Language Processing, pages 316–327, Brussels,
Belgium. Association for Computational Linguistics.
Akash Ghosh, Debayan Datta, Sriparna Saha, and Chi-
rag Agarwal. 2025. A survey of multilingual reason-
ing in language models. In Findings of the Associ-
ation for Computational Linguistics: EMNLP 2025,
pages 8920–8936, Suzhou, China. Association for
Computational Linguistics.
Edward Gibson, Richard Futrell, Steven T. Piantadosi,
Isabelle Dautriche, Kyle Mahowald, Leon Bergen,
and Roger Levy. 2019. How efficiency shapes human
language. Trends in Cognitive Sciences, 23(5):389–
407.
Omer Goldman, Avi Caciularu, Matan

Chunk 33 · 1,996 chars

,
pages 8920–8936, Suzhou, China. Association for
Computational Linguistics.
Edward Gibson, Richard Futrell, Steven T. Piantadosi,
Isabelle Dautriche, Kyle Mahowald, Leon Bergen,
and Roger Levy. 2019. How efficiency shapes human
language. Trends in Cognitive Sciences, 23(5):389–
407.
Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao,
Idan Szpektor, and Reut Tsarfaty. 2024. Unpacking
tokenization: Evaluating text compression and its
correlation with model performance. In Findings of
the Association for Computational Linguistics: ACL
2024, pages 2274–2286, Bangkok, Thailand. Associ-
ation for Computational Linguistics.
Adam Goodkind and Klinton Bicknell. 2018. Predictive
power of word surprisal for reading times is a linear
function of language model quality. In Proceedings
of the 8th Workshop on Cognitive Modeling and Com-
putational Linguistics (CMCL 2018), pages 10–18,
Salt Lake City, Utah. Association for Computational
Linguistics.
Kyle Gorman and Yuval Pinter. 2025. Don’t touch my
diacritics. In Proceedings of the 2025 Conference
of the Nations of the Americas Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies (Volume 2: Short Papers), pages
285–291, Albuquerque, New Mexico. Association
for Computational Linguistics.
Yanzhu Guo, Guokan Shang, and Chloé Clavel. 2025.
Benchmarking linguistic diversity of large language
models. Transactions of the Association for Compu-
tational Linguistics.
Daniil Gurgurov, Ivan Vykopal, Josef van Genabith,
and Simon Ostermann. 2025. Small models, big im-
pact: Efficient corpus and graph-based adaptation of
small multilingual language models for low-resource
languages. arXiv preprint arXiv:2502.10140.
John Hale. 2001. A probabilistic Earley parser as a psy-
cholinguistic model. In Second Meeting of the North
American Chapter of the Association for Computa-
tional Linguistics.
David A. Haslett. 2025. Tokenization changes meaning
in large language models: Evidence from Chinese.
Computational

Chunk 34 · 1,998 chars

reprint arXiv:2502.10140.
John Hale. 2001. A probabilistic Earley parser as a psy-
cholinguistic model. In Second Meeting of the North
American Chapter of the Association for Computa-
tional Linguistics.
David A. Haslett. 2025. Tokenization changes meaning
in large language models: Evidence from Chinese.
Computational Linguistics, 51(3):785–814.
Martin Haspelmath and Andrea Sims. 2013. Under-
standing morphology. Routledge.
Yifei He, Alon Benhaim, Barun Patra, Praneetha Vad-
damanu, Sanchit Ahuja, Parul Chopra, Vishrav
Chaudhary, Han Zhao, and Xia Song. 2025. Scal-
ing laws for multilingual language models. In Find-
ings of the Association for Computational Linguis-
tics: ACL 2025, pages 4257–4273, Vienna, Austria.
Association for Computational Linguistics.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch,
Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes
Welbl, Aidan Clark, Tom Hennigan, Eric Noland,
Katie Millican, George van den Driessche, Bogdan

-- 12 of 17 --

Damoc, Aurelia Guy, Simon Osindero, Karen Si-
monyan, Erich Elsen, and 3 others. 2022. Training
compute-optimal large language models. In Proceed-
ings of the 36th International Conference on Neu-
ral Information Processing Systems, NIPS ’22, Red
Hook, NY, USA. Curran Associates Inc.
Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena
Jäger, and Lisa Beinborn. 2021. Multilingual lan-
guage models predict human reading behavior. In
Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 106–123, Online. Association for Computa-
tional Linguistics.
Linjieqiong Huang, Erik D. Reichle, and Xingshan
Li. 2024. Comparative analyses of the information
content of letters, characters, and inter-word spaces
across writing systems. Annals of the New York
Academy of Sciences, 1537(1):129–139.
Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran,
Silvia Severini, Masoud Jalili

Chunk 35 · 1,963 chars

.
Linjieqiong Huang, Erik D. Reichle, and Xingshan
Li. 2024. Comparative analyses of the information
content of letters, characters, and inter-word spaces
across writing systems. Annals of the New York
Academy of Sciences, 1537(1):129–139.
Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran,
Silvia Severini, Masoud Jalili Sabet, Nora Kass-
ner, Chunlan Ma, Helmut Schmid, André Martins,
François Yvon, and Hinrich Schütze. 2023. Glot500:
Scaling multilingual corpora and language models to
500 languages. In Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 1082–1117,
Toronto, Canada. Association for Computational Lin-
guistics.
Go Inoue, Bashar Alhafni, Nizar Habash, and Timothy
Baldwin. 2026. Do diacritics matter? evaluating
the impact of Arabic diacritics on tokenization and
LLM benchmarks. In Findings of the Association for
Computational Linguistics: EACL 2026, pages 426–
442, Rabat, Morocco. Association for Computational
Linguistics.
Ingo E Isphording and Sebastian Otten. 2014. Linguis-
tic barriers in the destination language acquisition
of immigrants. Journal of economic Behavior &
organization, 105:30–50.
T. Florian Jaeger. 2010. Redundancy and reduction:
Speakers manage syntactic information density. Cog-
nitive Psychology, 61:23–62.
Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park,
and Seungwon Shin. 2025. Improbable bigrams ex-
pose vulnerabilities of incomplete tokens in byte-
level tokenizers. In Proceedings of the 2025 Con-
ference on Empirical Methods in Natural Language
Processing, pages 18209–18216, Suzhou, China. As-
sociation for Computational Linguistics.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika
Bali, and Monojit Choudhury. 2020. The state and
fate of linguistic diversity and inclusion in the NLP
world. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
6282–6293, Online. Association for

Chunk 36 · 1,998 chars

putational Linguistics.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika
Bali, and Monojit Choudhury. 2020. The state and
fate of linguistic diversity and inclusion in the NLP
world. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
6282–6293, Online. Association for Computational
Linguistics.
Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan
Roth. 2020. Cross-lingual ability of multilingual bert:
An empirical study. In International Conference on
Learning Representations.
Julie Kallini, Dan Jurafsky, Christopher Potts, and Mar-
tijn Bartelds. 2025. False friends are not foes: Inves-
tigating vocabulary overlap in multilingual language
models. In Findings of the Association for Compu-
tational Linguistics: EMNLP 2025, pages 21138–
21154.
Vani Kanjirangat, Tanja Samardzic, Ljiljana Dolamic,
and Fabio Rinaldi. 2025. Tokenization and represen-
tation biases in multilingual models on dialectal NLP
tasks. In Proceedings of the 2025 Conference on
Empirical Methods in Natural Language Processing,
pages 23992–24010, Suzhou, China. Association for
Computational Linguistics.
Amir Hossein Kargaran, François Yvon, and Hinrich
Schütze. 2024. GlotScript: A resource and tool for
low resource writing system identification. In Pro-
ceedings of the 2024 Joint International Conference
on Computational Linguistics, Language Resources
and Evaluation (LREC-COLING 2024), pages 7774–
7784, Torino, Italia. ELRA and ICCL.
Kimmo Kettunen. 2014. Can type-token ratio be used to
show morphological complexity of languages? Jour-
nal of Quantitative Linguistics, 21(3):223–245.
Saurabh Khanna and Xinxu Li. 2025. Invisible lan-
guages of the LLM universe. CoRR, abs/2510.11557.
Stav Klein and Reut Tsarfaty. 2020. Getting the ##life
out of living: How adequate are word-pieces for mod-
elling complex morphology? In Proceedings of the
17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 204–209.
Alexander

Chunk 37 · 1,992 chars

es of the LLM universe. CoRR, abs/2510.11557.
Stav Klein and Reut Tsarfaty. 2020. Getting the ##life
out of living: How adequate are word-pieces for mod-
elling complex morphology? In Proceedings of the
17th SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and Morphology,
pages 204–209.
Alexander Koplenig, Sascha Wolfer, Jan Oliver Rüdi-
ger, and Peter Meyer. 2025. Human languages trade
off complexity against efficiency. PLOS Complex
Systems, 2(2):1–42.
Victor Kuperman, Sascha Schroeder, Cengiz Acartürk,
Niket Agrawal, Dominick Maia Alexandre,
Lena Sophia Bolliger, Jan Brasser, César Campos-
Rojas, Denis Drieghe, Dušica Filipovi´c Ður ¯devi´c,
Luiz Vinicius Gadelha de Freitas, Sofya Goldina,
Romualdo Ibáñez Orellana, Lena A. Jäger, Ómar I.
Jóhannesson, Anurag Khare, Nik Kharlamov, Hanne
B. S. Knudsen, Árni Kristjánsson, and 31 others.
2025. New data on text reading in english as a
second language: The wave 2 expansion of the
multilingual eye-movement corpus (meco). Studies
in Second Language Acquisition, 47:677 – 695.
Tatsuki Kuribayashi, Ryo Ueda, Ryo Yoshida, Yohei
Oseki, Ted Briscoe, and Timothy Baldwin. 2024.
Emergent word order universals from cognitively-
motivated language models. In Proceedings of the

-- 13 of 17 --

62nd Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
14522–14543.
Jialin Lai, Juan F Quinonez-Beltran, and R Malatesha
Joshi. 2024. A bigger picture of early literacy and
biliteracy acquisition in abugidas: Perspectives from
asian and african languages. Reading Research Quar-
terly, 59(3):499–513.
Sam Land and Catherine Arnett. 2025. Bpe stays on
script: Structured encoding for robust multilingual
pretokenization. arXiv preprint arXiv:2505.24689.
Anne Lauscher, Vinit Ravishankar, Ivan Vuli´c, and
Goran Glavaš. 2020. From zero to hero: On the
limitations of zero-shot language transfer with mul-
tilingual Transformers. In Proceedings of the 2020
Conference on Empirical

Chunk 38 · 1,996 chars

Structured encoding for robust multilingual
pretokenization. arXiv preprint arXiv:2505.24689.
Anne Lauscher, Vinit Ravishankar, Ivan Vuli´c, and
Goran Glavaš. 2020. From zero to hero: On the
limitations of zero-shot language transfer with mul-
tilingual Transformers. In Proceedings of the 2020
Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 4483–4499, On-
line. Association for Computational Linguistics.
Prahallad Lavanya, Prahallad Kishore, and Ganapa Thi-
raju Madhavi. 2005. A simple approach for building
transliteration editors for indian languages. Jour-
nal of Zhejiang University-SCIENCE A, 6(11):1354–
1361.
Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas
Bekman, M Saiful Bari, Stella Biderman, Hady Elsa-
har, Niklas Muennighoff, Jason Phang, Ofir Press,
Colin Raffel, Victor Sanh, Sheng Shen, Lintang
Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Lau-
nay, and Iz Beltagy. 2022. What language model to
train if you have one million GPU hours? In Find-
ings of the Association for Computational Linguistics:
EMNLP 2022, pages 765–782, Abu Dhabi, United
Arab Emirates. Association for Computational Lin-
guistics.
Daniel Lemire and Wojciech Muła. 2022. Transcoding
billions of unicode characters per second with simd
instructions. Software: Practice and Experience,
52(2):484–508.
Paul Lerner and François Yvon. 2025. Unlike
“likely”,“unlike” is unlikely: Bpe-based segmenta-
tion hurts morphological derivations in llms. In Pro-
ceedings of the 31st International Conference on
Computational Linguistics, pages 5181–5190.
Natalia Levshina. 2021. Cross-linguistic trade-offs and
causal relationships between cues to grammatical sub-
ject and object, and the problem of efficiency-related
explanations. Frontiers in Psychology, 12:648200.
Roger Levy. 2008. Expectation-based syntactic compre-
hension. Cognition, 106(3):1126–1177.
Roger Levy and T. Jaeger. 2006. Speakers optimize
information density through syntactic reduction. In
Advances in Neural

Chunk 39 · 1,995 chars

-
ject and object, and the problem of efficiency-related
explanations. Frontiers in Psychology, 12:648200.
Roger Levy. 2008. Expectation-based syntactic compre-
hension. Cognition, 106(3):1126–1177.
Roger Levy and T. Jaeger. 2006. Speakers optimize
information density through syntactic reduction. In
Advances in Neural Information Processing Systems,
volume 19. MIT Press.
Yuchen Lian, Arianna Bisazza, and Tessa Verhoef. 2023.
Communication drives the emergence of language
universals in neural agents: Evidence from the word-
order/case-marking trade-off. Transactions of the
Association for Computational Linguistics, 11:1033–
1047.
Tomasz Limisiewicz, Jiri Balhar, and David Marecek.
2023. Tokenization impacts multilingual language
modeling: Assessing vocabulary allocation and over-
lap across languages. In Findings of ACL 2023.
Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Ore-
vaoghene Ahia, and Luke Zettlemoyer. 2024. Myte:
Morphology-driven byte encoding for better and
fairer multilingual language modeling. In Proceed-
ings of the 62nd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-
pers), pages 15059–15076.
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li,
Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junx-
ian He, Zhisong Zhang, Xuezhe Ma, Antonios Anas-
tasopoulos, Patrick Littell, and Graham Neubig. 2019.
Choosing transfer languages for cross-lingual learn-
ing. In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pages
3125–3135.
Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu
Ma. 2022. Same pre-training loss, better downstream:
Implicit bias matters for language models. In Inter-
national Conference on Machine Learning.
Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya,
Andrea Madotto, Zhaojiang Lin, and Pascale Fung.
2021. On the importance of word order information
in cross-lingual sequence labeling. In Thirty-Fifth
AAAI Conference on Artificial Intelligence, AAAI
2021, Virtual

Chunk 40 · 1,998 chars

ls. In Inter-
national Conference on Machine Learning.
Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya,
Andrea Madotto, Zhaojiang Lin, and Pascale Fung.
2021. On the importance of word order information
in cross-lingual sequence labeling. In Thirty-Fifth
AAAI Conference on Artificial Intelligence, AAAI
2021, Virtual Event, February 2-9, 2021, pages
13461–13469. AAAI Press.
Simon P Liversedge, Denis Drieghe, Xin Li, Guoli Yan,
Xuejun Bai, and Jukka Hyönä. 2016. Universality in
eye movements and reading: A trilingual investiga-
tion. Cognition, 147:1–20.
Nicholas Lourie, Michael Y. Hu, and Kyunghyun Cho.
2025. Scaling laws are unreliable for downstream
tasks: A reality check. ArXiv.
Nishan Luitel, Nishant Bekoju, Anil Kumar Sah, and
1 others. 2025. Can perplexity predict finetuning
performance? an investigation of tokenization ef-
fects on sequential language models for nepali. In
Proceedings of the Fourth Workshop on Multilingual
Representation Learning.
Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza
Louzan, Victor Wei, David Adelani, and Cody Car-
roll. 2025. The token tax: Systematic bias in multilin-
gual tokenization. arXiv preprint arXiv:2509.05486.
Hanjia Lyu, Jiebo Luo, Jian Kang, and Allison Koe-
necke. 2025. Characterizing bias: Benchmarking
large language models in simplified versus traditional

-- 14 of 17 --

chinese. In Proceedings of the 2025 ACM Confer-
ence on Fairness, Accountability, and Transparency,
FAccT ’25, page 2815–2846, New York, NY, USA.
Association for Computing Machinery.
Manuel Mager, Arturo Oncevay, Elisabeth Maier, Katha-
rina von der Wense, and Thang Vu. 2022. Bpe vs.
morphological segmentation: A case study on ma-
chine translation of four polysynthetic languages. In
Findings of the Association for Computational Lin-
guistics: ACL 2022, pages 961–971.
Dan Malkin, Tomasz Limisiewicz, and Gabriel
Stanovsky. 2022. A balanced data approach for eval-
uating cross-lingual transfer: Mapping the linguistic
blood bank. In Proceedings of

Chunk 41 · 1,987 chars

chine translation of four polysynthetic languages. In
Findings of the Association for Computational Lin-
guistics: ACL 2022, pages 961–971.
Dan Malkin, Tomasz Limisiewicz, and Gabriel
Stanovsky. 2022. A balanced data approach for eval-
uating cross-lingual transfer: Mapping the linguistic
blood bank. In Proceedings of the 2022 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, pages 4903–4915, Seattle, United States.
Association for Computational Linguistics.
Philip M. McCarthy and Scott Jarvis. 2010. MTLD,
vocd-D, and HD-D: A validation study of sophis-
ticated approaches to lexical diversity assessment.
Behavior Research Methods, 42(2):381–392.
Muhammad Asif Mehmood, Hafiz Muhammad Shafiq,
and 1 others. 2017. Understanding regional context
of world wide web using common crawl corpus. In
2017 IEEE 13th Malaysia International Conference
on Communications (MICC), pages 1–6. IEEE.
Clara Meister and Ryan Cotterell. 2021. Language
model evaluation beyond perplexity. In Proceedings
of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 5328–5339.
Clara Meister, Tiago Pimentel, Patrick Haller, Lena
Jäger, Ryan Cotterell, and Roger Levy. 2021. Revisit-
ing the Uniform Information Density hypothesis. In
Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pages 963–
980, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Alessio Miaschi, Dominique Brunato, Felice
Dell’Orletta, and Giulia Venturi. 2021. What
makes my model perplexed? a linguistic investi-
gation on neural language models perplexity. In
Proceedings of Deep Learning Inside Out (DeeLIO):
The 2nd Workshop on Knowledge Extraction and
Integration for Deep Learning Architectures, pages
40–47.
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman,

Chunk 42 · 1,994 chars

ia Venturi. 2021. What
makes my model perplexed? a linguistic investi-
gation on neural language models perplexity. In
Proceedings of Deep Learning Inside Out (DeeLIO):
The 2nd Workshop on Knowledge Extraction and
Integration for Deep Learning Architectures, pages
40–47.
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian
Roark, and Jason Eisner. 2019. What kind of lan-
guage is hard to language-model? In Proceedings
of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 4975–4989.
Seonmin Moon, Tatsuya Hiraoka, and Naoaki Okazaki.
2025. Bit-level bpe: Below the byte boundary. arXiv
preprint arXiv:2506.07541.
Ibraheem Muhammad Moosa, Mahmud Elahi Akhter,
and Ashfia Binte Habib. 2023. Does transliteration
help multilingual language modeling? In Findings
of the Association for Computational Linguistics:
EACL 2023, pages 670–685, Dubrovnik, Croatia. As-
sociation for Computational Linguistics.
Niva Mor. 2025. It’s a global village (if you speak the
right language): On language models, digital sidelin-
ing, and participation. Wisconsin International Law
Journal, 42:329.
Aaron Mueller, Garrett Nicolai, Panayiota Petrou-
Zeniou, Natalia Talmina, and Tal Linzen. 2020.
Cross-linguistic syntactic evaluation of word predic-
tion models. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 5523–5539, Online. Association for Computa-
tional Linguistics.
Benjamin Muller, Antonios Anastasopoulos, Benoît
Sagot, and Djamé Seddah. 2021. When being un-
seen from mBERT is just the beginning: Handling
new languages with multilingual language models.
In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 448–462, Online. Association for Computa-
tional Linguistics.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-
ter, Jan Hajiˇc, Christopher D. Manning, Sampo
Pyysalo, Sebastian Schuster, Francis Tyers,

Chunk 43 · 1,989 chars

he North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 448–462, Online. Association for Computa-
tional Linguistics.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-
ter, Jan Hajiˇc, Christopher D. Manning, Sampo
Pyysalo, Sebastian Schuster, Francis Tyers, and
Daniel Zeman. 2020. Universal Dependencies v2:
An evergrowing multilingual treebank collection. In
Proceedings of the Twelfth Language Resources and
Evaluation Conference, pages 4034–4043, Marseille,
France. European Language Resources Association.
Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima
Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl
Eryani, Alexander Erdmann, and Nizar Habash. 2020.
CAMeL tools: An open source python toolkit for
Arabic natural language processing. In Proceedings
of the Twelfth Language Resources and Evaluation
Conference, pages 7022–7032, Marseille, France. Eu-
ropean Language Resources Association.
Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Ro-
driguez, John Nguyen, Benjamin Muller, Margaret
Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke
Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtz-
man, and Srini Iyer. 2025. Byte latent transformer:
Patches scale better than tokens. In Proceedings
of the 63rd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 9238–9258, Vienna, Austria. Association for
Computational Linguistics.
Isabel Papadimitriou, Ethan A. Chi, Richard Futrell, and
Kyle Mahowald. 2021. Deep subjecthood: Higher-
order grammatical features in multilingual BERT. In
Proceedings of the 16th Conference of the European
Chapter of the Association for Computational Lin-
guistics: Main Volume, pages 2522–2532, Online.
Association for Computational Linguistics.

-- 15 of 17 --

Hyunji Hayley Park, Katherine J. Zhang, Coleman Ha-
ley, Kenneth Steimel, Han Liu, and Lane Schwartz.
2021. Morphology matters: A multilingual language
modeling analysis. Transactions of the

Chunk 44 · 1,987 chars

r Computational Lin-
guistics: Main Volume, pages 2522–2532, Online.
Association for Computational Linguistics.

-- 15 of 17 --

Hyunji Hayley Park, Katherine J. Zhang, Coleman Ha-
ley, Kenneth Steimel, Han Liu, and Lane Schwartz.
2021. Morphology matters: A multilingual language
modeling analysis. Transactions of the Association
for Computational Linguistics, 9:261–276.
Isidro Parra. 2024. Morphological typology in bpe sub-
word productivity and language modeling. arXiv
preprint arXiv:2410.23656.
Olga Pelloni, Anastassia Shaitarova, and Tanja
Samardzic. 2022. Subword evenness (sue) as a pre-
dictor of cross-lingual transfer to low-resource lan-
guages. In Proceedings of the 2022 Conference on
Empirical Methods in Natural Language Processing,
EMNLP 2022, Abu Dhabi, United Arab Emirates, De-
cember 7-11, 2022, pages 7428–7445. Association
for Computational Linguistics.
Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and
Adel Bibi. 2023. Language model tokenizers intro-
duce unfairness between languages. In Advances in
Neural Information Processing Systems, volume 36,
pages 36963–36990.
Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James
Cross, Sebastian Riedel, and Mikel Artetxe. 2022.
Lifting the curse of multilinguality by pre-training
modular transformers. In Proceedings of the 2022
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, pages 3479–3495, Seattle,
United States. Association for Computational Lin-
guistics.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
How multilingual is multilingual bert? In Proceed-
ings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 4996–5001.
Edoardo Maria Ponti, Helen O’Horan, Yevgeni Berzak,
Ivan Vuli´c, Roi Reichart, Thierry Poibeau, Ekate-
rina Shutova, and Anna Korhonen. 2019. Modeling
language variation and universals: A survey on ty-
pological linguistics for natural language processing.
Computational

Chunk 45 · 1,975 chars

for Computational Linguistics, pages 4996–5001.
Edoardo Maria Ponti, Helen O’Horan, Yevgeni Berzak,
Ivan Vuli´c, Roi Reichart, Thierry Poibeau, Ekate-
rina Shutova, and Anna Korhonen. 2019. Modeling
language variation and universals: A survey on ty-
pological linguistics for natural language processing.
Computational Linguistics, 45(3):559–601.
Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train
short, test long: Attention with linear biases enables
input length extrapolation. In The Tenth International
Conference on Learning Representations, ICLR 2022,
Virtual Event, April 25-29, 2022. OpenReview.net.
Taraka Rama, Lisa Beinborn, and Steffen Eger. 2020.
Probing multilingual BERT for genetic and typo-
logical signals. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics,
pages 1214–1228, Barcelona, Spain (Online). Inter-
national Committee on Computational Linguistics.
Vinit Ravishankar and Anders Søgaard. 2021. The im-
pact of positional encodings on multilingual com-
pression. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing,
pages 763–777, Online and Punta Cana, Dominican
Republic. Association for Computational Linguistics.
Yuval Reif, Guy Kaplan, and Roy Schwartz. 2025. Vo-
cab diet: Reshaping the vocabulary of llms with vec-
tor arithmetic. arXiv preprint arXiv:2510.17001.
Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov,
Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, and
Keith Hall. 2020. Processing South Asian languages
written in the Latin script: the Dakshina dataset. In
Proceedings of the Twelfth Language Resources and
Evaluation Conference, pages 2413–2423, Marseille,
France. European Language Resources Association.
Phillip Rust, Jonas Pfeiffer, Ivan Vuli´c, Sebastian Ruder,
and Iryna Gurevych. 2021. How good is your tok-
enizer? on the monolingual performance of multilin-
gual language models. In Proceedings of the 59th
Annual Meeting of the Association for

Chunk 46 · 1,991 chars

s 2413–2423, Marseille,
France. European Language Resources Association.
Phillip Rust, Jonas Pfeiffer, Ivan Vuli´c, Sebastian Ruder,
and Iryna Gurevych. 2021. How good is your tok-
enizer? on the monolingual performance of multilin-
gual language models. In Proceedings of the 59th
Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Confer-
ence on Natural Language Processing (Volume 1:
Long Papers), pages 3118–3135.
Job Schepens, Roeland Van Hout, and T Florian Jaeger.
2020. Big data suggest strong constraints of linguis-
tic similarity on adult language learning. Cognition,
194:104056.
Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec
Alameddine, Omri Uzan, Yuval Pinter, and Chris
Tanner. 2024. Tokenization is more than compres-
sion. In Proceedings of the 2024 Conference on
Empirical Methods in Natural Language Processing,
pages 678–702, Miami, Florida, USA. Association
for Computational Linguistics.
Sascha Schroeder, Tuomo Häikiö, Ascensión Pagán,
Jonathan H Dickins, Jukka Hyönä, and Simon P Liv-
ersedge. 2022. Eye movements of children and adults
reading in three different orthographies. Journal of
Experimental Psychology: Learning, Memory, and
Cognition, 48(10):1518.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In ACL 2016, pages 1715–1725.
Philip HK Seymour, Mikko Aro, Jane M Erskine, and
Collaboration with COST Action A8 Network. 2003.
Foundation literacy acquisition in european orthogra-
phies. British Journal of psychology, 94(2):143–174.
Claude E. Shannon. 1948. A Mathematical Theory of
Communication. Bell System Technical Journal.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018.
Self-attention with relative position representations.
In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 464–468.
Kaius

Chunk 47 · 1,988 chars

nical Journal.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018.
Self-attention with relative position representations.
In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 464–468.
Kaius Sinnemäki. 2008. Complexity trade-offs in core
argument marking. Language complexity, pages 67–
88.
Dan Slobin. 1987. The crosslinguistic study of language
acquisition. The Modern Language Journal, 71:371.

-- 16 of 17 --

Nathaniel J. Smith and Roger Levy. 2013. The effect
of word predictability on reading time is logarithmic.
Cognition, 128(3):302–319.
Emma Strubell, Patrick Verga, Daniel Andor, David
Weiss, and Andrew McCallum. 2018. Linguistically-
informed self-attention for semantic role labeling.
In Proceedings of the 2018 Conference on Empiri-
cal Methods in Natural Language Processing, pages
5027–5038, Brussels, Belgium. Association for Com-
putational Linguistics.
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng
Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: En-
hanced transformer with rotary position embedding.
Neurocomputing, 568:127063.
Leonard Talmy. 2000. Toward a Cognitive Semantics,
Volume 2: Typology and Process in Concept Struc-
turing. MIT Press, Cambridge, MA.
Li Hai Tan, Ho-Ling Liu, Charles A Perfetti, John A
Spinks, Peter T Fox, and Jia-Hong Gao. 2001. The
neural system underlying chinese logograph reading.
Neuroimage, 13(5):836–846.
Alexander Tsvetkov and Alon Kipnis. 2024. Informa-
tion parity: Measuring and predicting the multilin-
gual capabilities of language models. In Findings
of the Association for Computational Linguistics:
EMNLP 2024, pages 7971–7989, Miami, Florida,
USA. Association for Computational Linguistics.
Menan Velayuthan and Kengatharaiyer Sarveswaran.
2025. Egalitarian language representation in lan-
guage models: It all begins with tokenizers. In
Proceedings of the 31st International Conference

Chunk 48 · 1,990 chars

for Computational Linguistics:
EMNLP 2024, pages 7971–7989, Miami, Florida,
USA. Association for Computational Linguistics.
Menan Velayuthan and Kengatharaiyer Sarveswaran.
2025. Egalitarian language representation in lan-
guage models: It all begins with tokenizers. In
Proceedings of the 31st International Conference on
Computational Linguistics, pages 5987–5996, Abu
Dhabi, UAE. Association for Computational Linguis-
tics.
Ludo Verhoeven and Charles Perfetti. 2022. Universals
in learning to read across languages and writing sys-
tems. Scientific Studies of Reading, 26(2):150–164.
Chenglong Wang, Haoyu Tang, Xiyuan Yang, Yueqi
Xie, Jina Suh, Sunayana Sitaram, Junming Huang,
Yu Xie, Pengjun Zhao, Zhaoya Gong, and 1 others.
2025. Uncovering inequalities in new knowledge
learning by large language models across different
languages. Proceedings of the National Academy of
Sciences, 122(51):e2514626122.
Zirui Wang, Zachary C. Lipton, and Yulia Tsvetkov.
2020. On negative interference in multilingual mod-
els: Findings and a meta-learning treatment. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 4438–4450, Online. Association for Computa-
tional Linguistics.
Junqiu Wei, Qun Liu, Yinpeng Guo, and Xin Jiang.
2021. Training multilingual pre-trained language
model with byte-level subwords. arXiv preprint
arXiv:2101.09469.
Ethan G. Wilcox, Tiago Pimentel, Clara Meister, Ryan
Cotterell, and Roger P. Levy. 2023. Testing the pre-
dictions of surprisal theory in 11 languages. Transac-
tions of the Association for Computational Linguis-
tics, 11:1451–1470.
Taeko Nakayama Wydell and Brian Butterworth. 1999.
A case study of an english-japanese bilingual with
monolingual dyslexia. Cognition, 70(3):273–305.
François Yergeau. 2003. UTF-8, a transformation for-
mat of ISO 10646. RFC 3629.
Raoyuan Zhao, Yihong Liu, Hinrich Schütze, and
Michael A Hedderich. 2025. A comprehensive eval-
uation of multilingual chain-of-thought

Chunk 49 · 1,564 chars

99.
A case study of an english-japanese bilingual with
monolingual dyslexia. Cognition, 70(3):273–305.
François Yergeau. 2003. UTF-8, a transformation for-
mat of ISO 10646. RFC 3629.
Raoyuan Zhao, Yihong Liu, Hinrich Schütze, and
Michael A Hedderich. 2025. A comprehensive eval-
uation of multilingual chain-of-thought reasoning:
Performance, consistency, and faithfulness across
languages. arXiv preprint arXiv:2510.09555.
Wei Zhuang and Yan Sun. 2025. Cute: A multilingual
dataset for enhancing cross-lingual knowledge trans-
fer in low-resource languages. In Proceedings of
the 31st International Conference on Computational
Linguistics.
Johannes C. Ziegler and Usha Goswami. 2005. Read-
ing acquisition, developmental dyslexia, and skilled
reading across languages: a psycholinguistic grain
size theory. Psychological bulletin, 131 1:3–29.
George K. Zipf. 1935. The Psycho-Biology of Language.
Houghton Mifflin.
Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du,
Mrinmaya Sachan, and Ryan Cotterell. 2023a. Tok-
enization and the noiseless channel. In Proceedings
of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 5184–5207, Toronto, Canada. Association for
Computational Linguistics.
Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du, Tim
Vieira, Mrinmaya Sachan, and Ryan Cotterell. 2023b.
A formal perspective on byte-pair encoding. In Find-
ings of the Association for Computational Linguis-
tics: ACL 2023, pages 598–614, Toronto, Canada.
Association for Computational Linguistics.

-- 17 of 17 --