Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Summary
This study investigates how tokenization disparities create infrastructure bias in large language models (LLMs), disproportionately affecting low-resource and non-Latin script languages. Using the FLORES-200 dataset, the authors evaluated tokenization efficiency across over 200 languages with the tiktoken library. Key metrics included Tokens Per Sentence (TPS), Characters Per Token (CPT), and Relative Tokenization Cost (RTC), benchmarked against English. Results revealed significant inefficiencies: non-Latin and morphologically complex languages often required 3–5 times more tokens than English, with some scripts needing up to 7 times more. These disparities translate into higher computational costs, reduced context utilization, and economic barriers for speakers of underrepresented languages. The study highlights systemic biases in current tokenization systems, which are optimized for high-resource languages. The authors call for linguistically informed tokenization strategies and adaptive vocabulary methods to promote computational equity and inclusivity in multilingual AI.
Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

1st Hailay Kidu Teklehaymanot, L3S Research Center, Leibniz University Hannover, Hannover, Germany, teklehaymanot@L3S.de
2nd Wolfgang Nejdl, L3S Research Center, Leibniz University Hannover, Hannover, Germany, nejdl@L3S.de

Abstract—Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often with 3–5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.
Index Terms—Multilingual Models, Tokenization, Infrastructure Bias, Subword System, Large Language Models

I. INTRODUCTION

Recent advances in large language models (LLMs) have transformed natural language processing [1], [2], yet these developments remain disproportionately concentrated on high-resource languages, particularly English, leaving the majority of the world's languages underrepresented in both research and technological deployment. While multilingual LLMs (mLLMs) theoretically enable cross-lingual knowledge transfer from high-resource to low-resource languages through shared linguistic representations [3], substantial performance disparities persist across linguistically diverse populations [4]. Recent analyses of code-switching datasets [5] reveal systemic biases toward English and a lack of sociolinguistic representativeness in data collection and annotation, reflecting broader disparities in multilingual modeling. The global NLP research landscape exhibits a significant bias toward high-resource languages, resulting in substantial performance gaps for underrepresented languages due to limited annotated data and inadequate computational support [6], [7]. As highlighted by [5], the lack of representativeness in English-dominant and unbalanced datasets reflects broader structural inequities in multilingual NLP. These disparities parallel the biases embedded in tokenization systems, underscoring how insufficiently representative data undermines fairness and inclusivity across language technologies.
Despite advances in transfer learning and transformer architectures that partially address these disparities through cross-lingual knowledge transfer [8], the extent and limitations of multilingual generalization remain under investigation. A critical but underexplored factor contributing to these disparities is tokenization, the fundamental preprocessing step that transforms raw text into subword units for model input. Tokenization algorithms, predominantly optimized for high-resource languages, may inadequately handle the morphological complexity and structural properties of low-resource languages, thereby exacerbating existing inequalities [9]. This study presents a systematic investigation of tokenization disparities across over 200 languages using the FLORES-200 benchmark. By analyzing state-of-the-art tokenizers, we quantify variations in cross-lingual efficiency and identify tokenization as a key driver of performance inequities in multilingual NLP. Our findings lay the groundwork for developing more equitable tokenization strategies that enhance fairness and inclusivity across diverse linguistic contexts.

II. RELATED WORK

A. Background

Tokenization constitutes a fundamental preprocessing step in natural language processing systems [10], [11], transforming raw text into discrete units for model consumption. Contemporary multilingual language models predominantly employ subword tokenization strategies to address vocabulary constraints while maintaining representational efficiency across linguistically diverse corpora.

(arXiv:2510.12389v1 [cs.CL], 14 Oct 2025)

Subword Tokenization

Subword tokenization decomposes words into semantically meaningful subunits, effectively addressing the open vocabulary problem in neural language models [12].
This approach enables efficient processing of both frequent and rare lexical items by representing common words directly while decomposing infrequent terms into familiar subword components [13]. Operating at an intermediate granularity between character-level and word-level representations, subword methods capture morphological regularities particularly beneficial for morphologically rich languages [9]. Predominant subword algorithms include Byte Pair Encoding (BPE) [14], which iteratively merges frequent symbol pairs; WordPiece [15], employing probabilistic merging strategies; and SentencePiece [16], treating input as raw Unicode sequences without explicit word boundaries. While transformer-based multilingual models rely extensively on these approaches, vocabulary construction often reflects high-resource language dominance, potentially introducing systematic biases for underrepresented languages [11].

Tokenization Efficiency Disparities

Recent investigations have revealed substantial tokenization disparities across languages, with significant implications for computational equity. [17] demonstrated efficiency variations up to 13-fold between English and other languages across 108 languages, highlighting how existing tokenization schemes systematically favor high-resource languages. [18] and [19] further documented tokenization length variations reaching 15-fold for semantically equivalent texts, particularly affecting non-Latin scripts and morphologically complex languages. These disparities carry practical consequences beyond performance metrics, directly impacting computational costs and accessibility.
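The iterative pair merging that defines BPE can be sketched in a few lines of Python. This is a toy illustration on an assumed miniature corpus (symbol tuples with frequencies), not the vocabulary-construction code of any production tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: "hug" x10, "pug" x5, "hugs" x5
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("h", "u", "g", "s"): 5}
pair = most_frequent_pair(corpus)   # ("u", "g"), seen 20 times
corpus = merge(corpus, pair)
```

Repeating these two steps until a vocabulary budget is reached is the core of BPE training; the key point for the disparities above is that pairs frequent in the training corpus (dominated by high-resource languages) get merged first.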
Token-based pricing models in commercial language services result in disproportionately higher usage costs for speakers of underrepresented languages, creating economic barriers to AI access. Proposed solutions include multilingually fair subword algorithms [19] and adaptive gradient-based approaches such as MAGNET [20], designed to minimize over-segmentation while accommodating diverse morphological and orthographic characteristics.

Multilingual Vocabulary Allocation

Multilingual models face vocabulary allocation challenges, distributing limited vocabulary capacity across multiple languages [21]. Inductive biases encoded in tokenizers reflect training corpus distributions, often favoring high-resource languages and resulting in suboptimal representation quality for underrepresented scripts [22], [23]. Poor vocabulary allocation leads to excessive subword fragmentation, particularly detrimental for word-level tasks requiring accurate morphological segmentation [13].

III. METHODOLOGY

A. Dataset

This study employs the FLORES-200 dataset [24], providing standardized parallel text across 200 languages representing diverse linguistic families, scripts, and morphological structures. We utilize the devtest split (1,012 sentences per language) to ensure cross-linguistic consistency and eliminate potential data contamination effects.

B. Tokenization Model

We employ the cl100k_base tokenizer via the tiktoken library, representing production-grade Byte Pair Encoding implementations deployed in state-of-the-art large language models, including GPT-3.5 and GPT-4.
While computationally efficient, this tokenizer's training data exhibits high-resource language bias, potentially introducing systematic inefficiencies for underrepresented languages.

IV. EXPERIMENTAL PROCEDURE

For this study, tokenization experiments are conducted using the tiktoken library, which provides access to OpenAI's production-grade Byte Pair Encoding (BPE) tokenizers. The tiktoken library implements highly efficient tokenization routines specifically designed for transformer-based language models, including those deployed in OpenAI's GPT-3.5 and GPT-4 architectures. In particular, we employ the cl100k_base encoding, a subword vocabulary widely adopted across OpenAI models that supports a broad range of languages and scripts, though it is primarily optimized for high-resource languages. The experimental procedure consists of the following steps:

a) Preprocessing and Normalization: All text samples from the FLORES-200 devtest set are first normalized using Unicode Normalization Form C (NFC). This normalization step ensures consistency in character representation across languages, which is particularly critical for scripts that allow multiple valid Unicode encodings of the same grapheme cluster. Such normalization is essential to eliminate artifacts that could distort tokenization behavior, especially in non-Latin and morphologically complex languages.

b) Tokenization with tiktoken: Following normalization, each sentence is tokenized using the cl100k_base tokenizer provided by tiktoken.
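The NFC normalization step can be reproduced with Python's standard unicodedata module. The helper name below is illustrative, not from the paper's code:

```python
import unicodedata

def normalize_sentences(sentences):
    """Apply Unicode NFC so equivalent grapheme sequences share one code-point form."""
    return [unicodedata.normalize("NFC", s) for s in sentences]

# "é" written as base letter + combining acute accent (two code points)
# versus the precomposed character (one code point):
decomposed = "cafe\u0301"
precomposed = "caf\u00e9"
normalized = normalize_sentences([decomposed])
```

Without this step, the two encodings of the same word could tokenize differently and quietly skew per-language statistics.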
This tokenizer employs a data-driven subword segmentation strategy optimized for compression and computational efficiency. However, due to its construction based primarily on high-resource language corpora, it may introduce biases when applied to underrepresented languages that diverge significantly in morphological or orthographic structure.

c) Tokenization Statistics Extraction: After tokenization, we compute and record several key statistics for each sentence, including:
- the total number of tokens generated;
- the total number of characters processed;
- per-language aggregated metrics, such as Tokens Per Sentence (TPS), the average number of tokens per sentence, and Characters Per Token (CPT), the average number of characters represented by each token.

d) Cross-Linguistic Efficiency Analysis: To quantify cross-linguistic disparities, we calculate the Relative Tokenization Cost (RTC) for each language. This metric compares each language's TPS value to that of English, which serves as the reference baseline. The RTC provides a normalized indicator of tokenization efficiency, allowing us to evaluate how much more or less efficiently each language is tokenized relative to English. This experimental setup enables a systematic, language-agnostic evaluation of tokenization efficiency across all 200 FLORES languages. By employing a consistent preprocessing pipeline and a single tokenizer configuration, we aim to isolate and quantify the extent of tokenization inefficiencies that may arise from applying widely used subword algorithms to linguistically diverse languages in state-of-the-art multilingual models.
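The per-language statistics pass in step c) can be sketched as below. The function name is illustrative; with the tiktoken package installed, encode would come from tiktoken.get_encoding("cl100k_base").encode, but the sketch accepts any callable tokenizer so it runs without that dependency:

```python
def corpus_stats(sentences, encode):
    """Return (TPS, CPT) for one language's sentences, given a tokenizer's encode()."""
    token_counts = [len(encode(s)) for s in sentences]
    char_counts = [len(s) for s in sentences]
    tps = sum(token_counts) / len(sentences)    # Tokens Per Sentence
    cpt = sum(char_counts) / sum(token_counts)  # Characters Per Token
    return tps, cpt

# A trivial whitespace "tokenizer" stands in for tiktoken in this sketch:
tps, cpt = corpus_stats(["the cat sat", "a dog"], str.split)
```

Running the same function over every language's devtest sentences, with the tokenizer held fixed, yields the per-language TPS and CPT tables the analysis is built on.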
V. TOKENIZER EVALUATION METRICS

To systematically assess tokenization efficiency across diverse languages, we adopt several quantitative metrics that capture both absolute and relative characteristics of tokenized outputs. These metrics allow for cross-linguistic comparisons that highlight disparities introduced by subword segmentation strategies in multilingual language models.

A. Tokens Per Sentence (TPS)

The Tokens Per Sentence (TPS) metric computes the average number of tokens generated per sentence after tokenization for each language. Formally, for a language L with N sentences:

\[ \mathrm{TPS}(L) = \frac{1}{N} \sum_{i=1}^{N} T_i \]

where T_i denotes the number of tokens in sentence i. This metric reflects the granularity of subword segmentation and provides a language-specific view of tokenization density.

B. Characters Per Token (CPT)

The Characters Per Token (CPT) metric evaluates the average number of characters represented by each token:

\[ \mathrm{CPT}(L) = \frac{\sum_{i=1}^{N} C_i}{\sum_{i=1}^{N} T_i} \]

where C_i is the character count of sentence i, and T_i is its corresponding token count. CPT indicates how efficiently the tokenizer compresses character sequences into tokens. Lower CPT values typically reflect finer-grained segmentations, while higher values may suggest more compact tokenization.

C. Relative Tokenization Cost (RTC)

To directly quantify cross-linguistic disparities, we define the Relative Tokenization Cost (RTC) using English as the baseline reference language. For any language L, RTC is computed as:

\[ \mathrm{RTC}(L) = \frac{\mathrm{TPS}(L)}{\mathrm{TPS}(\text{English})} \]

An RTC value greater than 1 indicates that language L requires more tokens to represent an equivalent sentence compared to English, signaling potential inefficiencies in tokenization for that language. Conversely, an RTC value less than 1 suggests that language L is tokenized more compactly than English.
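The RTC ratio is a one-line computation; the sketch below applies it to the script-level TPS values reported in the paper's results (Fig. 2). The FLORES-style language codes are used here only as illustrative dictionary keys:

```python
def rtc(tps_by_lang, baseline="eng_Latn"):
    """Relative Tokenization Cost: each language's TPS divided by the English TPS."""
    base = tps_by_lang[baseline]
    return {lang: tps / base for lang, tps in tps_by_lang.items()}

# Script-level TPS values taken from the paper's Fig. 2:
tps = {"eng_Latn": 50.2, "mya_Mymr": 357.2, "zho_Hans": 56.8}
costs = rtc(tps)
# Myanmar script pays roughly a 7-fold token premium over English.
```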
D. Aggregate Efficiency Comparison

We conduct aggregate efficiency comparisons across the entire FLORES-200 [25] language set to quantify global disparities and assess the extent to which low-resource languages are disproportionately affected. This comprehensive analysis facilitates the identification of systematic biases in existing subword tokenization schemes and quantifies linguistic inclusivity across languages with varying resource availability, orthographic systems, and morphological complexity.

VI. RESULTS AND EVALUATION

Our comprehensive evaluation encompasses 200+ languages representing diverse script families, including Latin, Arabic, Devanagari, Ethiopic, Cyrillic, and numerous others. The multilingual corpus exhibits substantial imbalance, with Latin-script languages contributing the largest proportion of content, followed by Arabic-script, Devanagari, and Ethiopic language families.

Script Distribution Analysis

Figure 1 illustrates the distribution of languages across script families, revealing significant concentration in Latin-based writing systems. This imbalance reflects historical and contemporary patterns of digital language representation, with implications for tokenization algorithm development and deployment. Scripts employed by single languages, including Armenian (Armn), Georgian (Geor), and Japanese (Jpan), represent particularly vulnerable cases where tokenization inefficiencies may disproportionately impact linguistic accessibility.
Cross-Linguistic Tokenization Efficiency

Our empirical analysis reveals substantial disparities in tokenization efficiency across languages and script families, as demonstrated through multiple complementary analytical perspectives.

Fig. 1. Comparison of scripts used by multiple languages. The bar chart shows the number of languages using each script, with Latin (Latn) being the most widely adopted. An annotation box lists scripts used by only a single language, such as Armenian (Armn), Georgian (Geor), and Japanese (Jpan). This figure highlights the distribution and concentration of script usage across languages.

Tokens Per Sentence (TPS) Analysis

Figure 2 quantifies the extreme range of tokenization burdens across writing systems. Myanmar script requires the highest token density (357.2 TPS), followed by Ol Chiki (342.1 TPS) and Oriya (334.0 TPS), representing nearly 7-fold higher tokenization costs compared to the most efficient scripts. Latin script achieves optimal efficiency (50.2 TPS), with Han Traditional (56.1 TPS) and Han Simplified (56.8 TPS) also demonstrating relatively compact tokenization. The language-wide mean of 90.3 TPS serves as a reference point, revealing that numerous scripts require 3–4 times more tokens than this average, creating substantial computational inequalities. TPS measurements demonstrate significant variation across languages, with morphologically complex languages exhibiting substantially higher token densities. Languages employing agglutinative morphological processes require extensive subword fragmentation, resulting in elevated TPS values that directly impact computational resource requirements and context window utilization.
Characters Per Token (CPT) Patterns

Figure 3 illustrates dramatic script-dependent variations in compression efficiency. Latin script achieves the highest compression efficiency (2.61 CPT), followed by Cyrillic (1.58 CPT) and Greek (1.14 CPT). Non-Latin scripts consistently demonstrate lower efficiency: Arabic (1.28 CPT), Devanagari (0.99 CPT), and numerous Asian scripts falling below 1.0 CPT. Notably, Tibetan (0.49 CPT), Oriya (0.40 CPT), and Ol Chiki (0.41 CPT) exhibit severe tokenization inefficiencies, indicating excessive subword fragmentation for these writing systems.

Fig. 2. Tokenization Efficiency Across Writing Scripts (Tokens Per Sentence - TPS). Cross-script comparison quantifies extreme tokenization burden variations, with the Myanmar script requiring the highest token density (357.2 TPS) and the Latin script achieving optimal efficiency (50.2 TPS). The language-wide mean of 90.3 TPS serves as a reference, revealing that numerous scripts require 3–4 times more tokens than average, with some scripts demanding nearly 7-fold higher computational costs for equivalent semantic content.

Characters Per Token (CPT) Analysis by Language Family

Figure 4 demonstrates substantial variation in CPT values across language families, ranging from 0.55 (Kannada) to 2.85 (Creole languages). Creole languages exhibit the highest compression efficiency (2.85 CPT), followed by Isolate languages (2.75 CPT) and Austronesian languages (2.64 CPT).

Fig. 3. Average Characters Per Token (CPT) by Writing Script. Script-based analysis reveals dramatic efficiency variations, with Latin script achieving optimal compression (2.61 CPT) compared to numerous Asian scripts falling below 1.0 CPT. Non-Latin scripts consistently demonstrate lower efficiency, with Tibetan (0.49 CPT), Oriya (0.40 CPT), and Ol Chiki (0.41 CPT) exhibiting severe tokenization inefficiencies. This pattern reflects the Latin-script optimization inherent in contemporary tokenization algorithms.
Conversely, Kannada (0.55 CPT), Dravidian (0.63 CPT), and Tai-Kadai (0.79 CPT) demonstrate substantially lower compression ratios, indicating excessive subword fragmentation that severely impacts computational efficiency for these language families.

Fig. 4. Average Characters Per Token (CPT) by Language Family. The analysis reveals substantial variation across language families, with Creole languages demonstrating the highest tokenization efficiency (2.85 CPT) and Kannada exhibiting the lowest efficiency (0.55 CPT). Higher CPT values indicate more efficient character-to-token compression, while lower values reflect excessive subword fragmentation. Notable patterns include the superior performance of Creole, Isolate, and Austronesian families compared to morphologically complex families such as Dravidian and Tai-Kadai.

Relative Tokenization Cost (RTC) Disparities

RTC measurements quantify the magnitude of cross-linguistic tokenization inequalities relative to English baseline performance. Our results demonstrate RTC values ranging from below 1.0 for closely related Indo-European languages to values exceeding 4.0 for morphologically rich and non-Latin script languages. These disparities translate directly into differential computational costs and accessibility barriers for speakers of underrepresented languages, with some languages requiring up to four times the computational resources for equivalent semantic processing.
Aggregate Efficiency Comparison (AEC)

The aggregate analysis confirms systematic tokenization biases favoring high-resource, Latin-script languages. Low-resource languages, particularly those employing unique scripts or complex morphological systems, consistently demonstrate reduced efficiency across all evaluation metrics. This pattern reflects the training data composition of contemporary tokenizers and highlights fundamental accessibility challenges in current multilingual AI systems.

VII. IMPACT ON MODEL PERFORMANCE AND RESOURCE UTILIZATION

These tokenization disparities carry significant practical implications for multilingual language model deployment:

a) Computational Resource Inequality: Languages with elevated RTC values require disproportionate computational resources for equivalent semantic processing, creating systematic disadvantages for underrepresented language communities.

b) Context Window Limitations: Higher token densities reduce effective context utilization for complex languages, potentially degrading model performance on tasks requiring extended contextual understanding.

c) Economic Accessibility Barriers: Token-based pricing models in commercial language services amplify these disparities, resulting in substantially higher usage costs for speakers of inefficiently tokenized languages.

d) Performance Degradation: Excessive subword fragmentation may impair semantic representation quality, particularly for word-level tasks requiring accurate morphological analysis.

These findings underscore the critical need for developing language-aware tokenization strategies that address systematic inequalities in multilingual AI system accessibility and performance across linguistically diverse populations.
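The resource and context-window effects can be made concrete with a back-of-the-envelope calculation. The 8,192-token context window below is an assumed example value, not a figure from the paper; the TPS numbers are the script-level results from Fig. 2:

```python
CONTEXT_TOKENS = 8192  # assumed example window size, not from the paper

def effective_sentences(context_tokens, tps):
    """How many average-length sentences of a language fit in the context window."""
    return int(context_tokens // tps)

eng = effective_sentences(CONTEXT_TOKENS, 50.2)   # Latin script, 50.2 TPS (Fig. 2)
mya = effective_sentences(CONTEXT_TOKENS, 357.2)  # Myanmar script, 357.2 TPS (Fig. 2)
# Under token-based pricing, the same gap reappears as a ~7x cost ratio
# for semantically equivalent content.
```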
VIII. DISCUSSIONS

This study frames its findings as evidence of both systematic algorithmic and infrastructural biases rather than mere technical limitations. It reveals how existing tokenization algorithms inherently favor high-resource languages, while infrastructural disparities in data availability and tool support further exacerbate the gap. The analysis connects these technical disparities to broader issues of social and linguistic equity, emphasizing how such design choices can reinforce economic and accessibility barriers. Overall, the study underscores the failure of current multilingual systems to achieve inclusive and equitable language representation.

IX. CONCLUSIONS

This study documents up to 7-fold tokenization efficiency disparities, connects these technical inequalities to real-world barriers, calls for language-aware tokenization strategies, and emphasizes the need for equity over efficiency in multilingual AI.

LIMITATIONS

The study acknowledges its single-tokenizer and single-dataset constraints, notes its focus on efficiency rather than downstream performance, recognizes potential oversimplification within script families, and provides clear directions for future research.

ACKNOWLEDGMENT

This research was supported by the German Academic Exchange Service (DAAD) through the Hilde Domin Programme (funding no. 57615863). The authors gratefully acknowledge this support.

REFERENCES

[1] J. Armengol-Estapé, C. P. Carrino, C. Rodriguez-Penagos, O. de Gibert Bonet, C. Armentano-Oller, A. Gonzalez-Agirre, M. Melero, and M. Villegas, "Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan," arXiv preprint arXiv:2107.07903, 2021.
[2] A. Muñoz Ortiz, "Dependency parsing as sequence labeling for low-resource languages," Universidad Nacional de Educación a Distancia (España), Escuela Técnica Superior de Ingenieros en Informática, Technical Report, 2021.
[3] Y. Xu, L. Hu, J. Zhao, Z. Qiu, Y. Ye, and H. Gu, "A survey on multilingual large language models: Corpora, alignment, and bias," arXiv preprint arXiv:2404.00929, 2024.
[4] S. Mutuvi, E. Boros, A. Doucet, G. Lejeune, A. Jatowt, and M. Odeo, "Analyzing the impact of tokenization on multilingual epidemic surveillance in low-resource languages," in Document Analysis and Recognition - ICDAR 2023, G. A. Fink, R. Jain, K. Kise, and R. Zanibbi, Eds. Cham: Springer Nature Switzerland, 2023, pp. 17–32.
[5] A. S. Doğruöz, S. Sitaram, and Z. X. Yong, "Representativeness as a forgotten lesson for multilingual and code-switched data collection and preparation," in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 5751–5767. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.382/
[6] A. Magueresse, V. Carles, and E. Heetderks, "Low-resource languages: A review of past work and future challenges," arXiv preprint arXiv:2006.07264, 2020.
[7] Y. Zhu, B. Heinzerling, I. Vulić, M. Strube, R. Reichart, and A. Korhonen, "On the importance of subword information for morphological tasks in truly low-resource languages," arXiv preprint arXiv:1909.12375, 2019.
[8] C. Theodoropoulos and M.-F. Moens, "An information extraction study: Take in mind the tokenization!" in Conference of the European Society for Fuzzy Logic and Technology. Springer, 2023, pp. 593–606.
[9] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot et al., "Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP," arXiv preprint arXiv:2112.10508, 2021.
[10] M. Hassler and G. Fliedl, "Text preparation through extended tokenization," WIT Transactions on Information and Communication Technologies, vol. 37, 2006.
[11] C. Toraman, E. H. Yilmaz, F. Şahinuç, and O. Ozcelik, "Impact of tokenization on language models: An analysis for Turkish," ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 22, no. 4, pp. 1–21, 2023.
[12] C. Park, Y. Yang, K. Park, and H. Lim, "Decoding strategies for improving low-resource machine translation," Electronics, vol. 9, no. 10, p. 1562, 2020.
[13] T. Limisiewicz, J. Balhar, and D. Mareček, "Tokenization impacts multilingual language modeling: Assessing vocabulary allocation and overlap across languages," arXiv preprint arXiv:2305.17179, 2023.
[14] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), K. Erk and N. A. Smith, Eds. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725. [Online]. Available: https://aclanthology.org/P16-1162
[15] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, "Fast WordPiece tokenization," arXiv preprint arXiv:2012.15524, 2020.
Chunk 16 · 1,995 chars
, pp. 1715–1725. [Online]. Available: https://aclanthology.org/P16-1162 [15] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, “Fast wordpiece tokenization,” arXiv preprint arXiv:2012.15524, 2020. [16] T. Kudo and J. Richardson, “Sentencepiece: A simple and language inde- pendent subword tokenizer and detokenizer for neural text processing,” arXiv preprint arXiv:1808.06226, 2018. [17] M. Asprovska and N. Hunter, “The tokenization problem: Understanding generative ai’s computational language bias,” Ubiquity Proceedings, vol. 4, no. 1, 2024. [18] O. Ahia, S. Kumar, H. Gonen, J. Kasai, D. Mortensen, N. Smith, and Y. Tsvetkov, “Do all languages cost the same? tokenization in the era of commercial language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 9904–9923. [Online]. Available: https://aclanthology.org/2023.emnlp-main.614/ [19] A. Petrov, E. La Malfa, P. Torr, and A. Bibi, “Language model tokenizers introduce unfairness between languages,” Advances in neural information processing systems, vol. 36, pp. 36 963–36 990, 2023. [20] O. Ahia, S. Kumar, H. Gonen, V. Hoffman, T. Limisiewicz, Y. Tsvetkov, and N. A. Smith, “Magnet: Improving the multilingual fairness of lan- guage models with adaptive gradient-based tokenization,” arXiv preprint arXiv:2407.08818, 2024. [21] B. Zheng, L. Dong, S. Huang, S. Singhal, W. Che, T. Liu, X. Song, and F. Wei, “Allocating large vocabulary capacity for cross-lingual language model pre-training,” arXiv preprint arXiv:2109.07306, 2021. [22] P. Vyas, A. Kuznetsova, and D. S. Williamson, “Optimally encoding inductive biases into the transformer improves end-to-end speech trans- lation.” in Interspeech, 2021, pp. 2287–2291. [23] F. Remy, P. Delobelle, H. Avetisyan, A. Khabibullina, M. de Lhoneux, and T. Demeester, “Trans-tokenization and cross-lingual vocabulary transfers:
Chunk 17 · 1,302 chars
etsova, and D. S. Williamson, “Optimally encoding inductive biases into the transformer improves end-to-end speech trans- lation.” in Interspeech, 2021, pp. 2287–2291. [23] F. Remy, P. Delobelle, H. Avetisyan, A. Khabibullina, M. de Lhoneux, and T. Demeester, “Trans-tokenization and cross-lingual vocabulary transfers: Language adaptation of llms for low-resource nlp,” arXiv preprint arXiv:2408.04303, 2024. [24] NLLB Team, M. R. Costa-juss`a, J. Cross, O. C¸ elebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzm´an, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang, “Scaling neural machine translation to 200 languages,” Nature, vol. 630, no. 8018, pp. 841–846, 2024. [Online]. Available: https://doi.org/10.1038/s41586-024-07335-x [25] M. R. Costa-Juss`a, J. Cross, O. C¸ elebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022. -- 6 of 6 --