BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition
Summary
This paper introduces BBPE16, a UTF-16-based byte-level byte-pair encoding (BBPE) tokenizer designed to improve multilingual speech recognition (ASR). Traditional UTF-8-based BBPE is language-agnostic and supports full Unicode coverage but suffers from variable-length encoding, which increases token sequence lengths for non-Latin scripts like Chinese, Japanese, and Korean (CJK). This leads to higher computational and memory costs. BBPE16 uses UTF-16, which provides a uniform 2-byte representation for most modern scripts, including CJK, enabling more efficient tokenization. Experiments across monolingual, bilingual, and trilingual ASR settings show that BBPE16 achieves comparable or better accuracy than UTF-8-based BBPE while significantly improving cross-lingual token sharing. For Chinese, it reduces token counts by up to 10.4% and decoding iterations by up to 10.3%, leading to faster training and inference. The tokenizer is compatible with existing systems and offers practical benefits for multilingual ASR, especially for non-Latin scripts.
BBPE16: UTF-16-BASED BYTE-LEVEL BYTE-PAIR ENCODING FOR IMPROVED MULTILINGUAL SPEECH RECOGNITION
Hyunsik Kim∗, Haeri Kim∗, Munhak Lee, Kyungmin Lee
Samsung Research
{hyunsik777.kim, haeri.kim, mun-hak.lee, k.m.lee}@samsung.com
∗Equal contribution.
ABSTRACT
Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.
Index Terms— Byte-level BPE, UTF-16, tokenization, speech recognition, multilingual ASR
1. INTRODUCTION
As automatic speech recognition (ASR) has advanced, attention has shifted from single-language systems to multilingual ASR. The goal is not only higher accuracy [1] but also broader language coverage, ranging from dozens [2] to, in some cases, hundreds of languages [3, 4]. To achieve this breadth, the model must have an output vocabulary capable of representing a wide variety of writing systems; therefore, tokenization becomes a primary design consideration.

While character- or subword-level byte-pair encoding (BPE) [5] is often sufficient for monolingual ASR, multilingual setups commonly adopt byte-level BPE (BBPE) to ensure full Unicode coverage and to avoid language-specific preprocessing. Contemporary BBPE implementations are predominantly UTF-8-based [6], which brings practical advantages such as ASCII compatibility and robustness. However, UTF-8 also introduces friction in cross-script settings. Its variable-length encoding yields uneven token boundaries across languages and, for non-Latin scripts such as Chinese, Japanese, and Korean (CJK), inflates sequence length. This lengthening increases computation and memory use, which in turn raises model and runtime complexity. These effects make efficient, scalable multilingual ASR more challenging, which motivates alternative BBPE designs that better align with cross-lingual settings.

In real-world deployments, multilingual ASR systems are initially pretrained on massive multilingual corpora and then undergo continual learning, i.e., incremental adaptation to new domains, dialects, or underrepresented languages while preserving prior knowledge, to maintain or improve performance in specific target settings. In such settings, models must incorporate new data without catastrophic forgetting of previously learned languages. Against this backdrop, continual learning for multilingual ASR has become an active area of research [7, 8]. These continual-learning scenarios impose strong requirements on tokenizer design: the tokenizer should facilitate efficient cross-lingual sharing and maintain stability as new languages or domains are incrementally added.
In this work, we propose BBPE16, a UTF-16-based BBPE tokenizer that addresses the limitations of UTF-8-based BBPE while maintaining compatibility with existing infrastructure. UTF-16 provides a uniform 2-byte representation for the vast majority of characters in the Basic Multilingual Plane (BMP), which includes most modern scripts, including CJK characters. Importantly, in our trilingual setup BBPE16 exhibits markedly superior cross-lingual token-sharing capability, generating tokens shared among English, Korean, and Chinese, whereas BBPE yields none.

Our key contributions are:
• A novel UTF-16-based BBPE tokenizer that achieves comparable ASR performance with significantly improved token efficiency and cross-lingual token sharing
• Up to 10.4% token reduction for Chinese in a multilingual continual-learning scenario, alongside reductions in decoding iterations, yielding practical training and inference speedups

2. BACKGROUND

2.1. Byte-Level BPE (BBPE)

Modern Transformer [9]-based models (e.g., BERT [10], GPT-3 [11]) are trained on hundreds of billions of tokens and contain tens of billions of parameters. At this scale, word-level vocabularies suffer from severe sparsity and vocabulary-size explosion problems. Subword methods such as BPE [5] and SentencePiece [12] reduce sparsity while still capturing a wide range of morphological variations.

However, classic BPE still requires language-specific preprocessing, such as word tokenizers and whitespace handling. When processing multilingual corpora or corpora that contain a substantial number of low-frequency characters, maintaining language-specific preprocessing rules is cumbersome. BBPE first converts the input text into a byte sequence and then applies BPE on that sequence.
This makes the tokenizer language-agnostic: as long as the text is encoded in UTF-8, any Unicode character can be processed by the same pipeline. BBPE has several advantages over conventional BPE: in multilingual settings, it distributes token usage more uniformly across the corpus and yields a larger set of tokens shared among languages [13]. Moreover, because BBPE's vocabulary encompasses all possible byte values, it inherently eliminates out-of-vocabulary tokens.

2.2. Limitations of UTF-8-based BBPE

While UTF-8 has become the de facto standard for text encoding due to its backward compatibility with ASCII and error resilience, it introduces several challenges for BBPE.

Variable-length encoding: Characters require 1-4 bytes depending on their Unicode code point, leading to inconsistent tokenization patterns.
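As a concrete illustration of this variable-length behavior, a minimal sketch (illustrative only, not a prescribed implementation) of how UTF-8 byte sequences, the input to BBPE, grow fastest for CJK text:

```python
# Minimal illustration: UTF-8 maps each character to 1-4 bytes, so the
# byte sequences BBPE merges over inflate most for CJK scripts.
for text in ["hello", "한국어", "中文"]:
    raw = text.encode("utf-8")  # the raw byte sequence BBPE starts from
    print(f"{text!r}: {len(text)} chars -> {len(raw)} UTF-8 bytes")

# 'hello': 5 chars -> 5 UTF-8 bytes
# '한국어': 3 chars -> 9 UTF-8 bytes  (3 bytes per Hangul syllable)
# '中文': 2 chars -> 6 UTF-8 bytes    (3 bytes per CJK ideograph)
```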
To mitigate the discrepancy that arises when constructing BBPE for English–Chinese bilingual data, caused by the variable-length nature of the encoding, [14] introduced a length penalty and an alphabet penalty.

Susceptibility to decoding errors: While UTF-8 enables strict validation by rejecting invalid byte sequences, this also makes it more susceptible to decoding errors. To address these decoding errors, several studies have applied a dynamic-programming algorithm [6, 13], while another line of work explored a solution based on vector quantization [15].

Although prior research has explored different approaches to mitigate the limitations of UTF-8-based BBPE, this work introduces a novel tokenizer that addresses these shortcomings through a simple modification.

3. BBPE16: UTF-16-BASED TOKENIZATION

3.1. Motivation for a UTF-16-based BBPE

UTF-16 uses 16-bit (2-byte) code units as its fundamental unit. Characters in the BMP, which spans U+0000-U+FFFF, are represented with a single code unit (2 bytes). The BMP includes most modern scripts, such as (1) Latin characters, (2) CJK unified ideographs, (3) Hangul syllables, and (4) major scripts like Arabic, Hebrew, and Devanagari. This results in a language-agnostic and uniform 2-byte representation for each character, whereas UTF-8 employs a variable-length byte encoding per character.

Table 1. Comparison of UTF-8 and UTF-16 (little-endian) encodings for the Korean Hangul syllable '한' (U+D55C)

Encoding   Byte 1          Byte 2          Byte 3
UTF-8      11101101 (ED)   10010101 (95)   10011100 (9C)
UTF-16     01011100 (5C)   11010101 (D5)   -
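The byte values in Table 1 can be checked directly with Python's built-in codecs; a minimal check (the explicit little-endian codec emits no BOM):

```python
# Check of Table 1: byte values of '한' (U+D55C).
ch = "한"
print(ch.encode("utf-8").hex(" "))      # ed 95 9c  (three bytes, prefix bits included)
print(ch.encode("utf-16-le").hex(" "))  # 5c d5     (one 2-byte BMP code unit, no BOM)
```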
Table 1 compares UTF-8 and UTF-16 encoding for non-Latin scripts, focusing on CJK characters, with the Korean Hangul syllable '한' shown as a representative example. In this study, we assume the use of UTF-16 encoding in little-endian format and thus we ignore the Byte Order Mark (BOM). In Table 1, the leading bits of each UTF-8 byte (e.g., 1110) are prefix bits prescribed by the UTF-8 encoding specification, while the remaining bits are data bits. In UTF-8, multiple bits are used as a length-indicating prefix, while in UTF-16 all bits encode character data. Because UTF-16 eliminates the overhead of length-prefix bits, a UTF-16-based tokenization scheme in turn allows the vocabulary to be constructed more compactly and efficiently than UTF-8-based BBPE.

3.2. BBPE16: UTF-16-based BBPE

We propose BBPE16, a BBPE tokenizer that operates on UTF-16 rather than the conventional UTF-8 encoding. BBPE16 follows the standard BPE merge algorithm, but all merges are performed on the byte sequences derived from UTF-16 code units. The processing pipeline of BBPE16 is outlined below:

1. Encoding: Convert the input text from UTF-8 to UTF-16 (little-endian)
2. Byte extraction: Obtain the raw UTF-16 bytes and discard the BOM
3. BPE training: Learn merge rules over the UTF-16 byte sequences
4. Tokenization: Apply the learned merges to new UTF-16-encoded texts
5. Decoding: Reconstruct the UTF-16 text from tokens and convert back to UTF-8 text

As described in Section 3.1, we adopt UTF-16 little-endian encoding, and therefore the BOM is discarded as part of the preprocessing step. Input and output remain UTF-8 encoded, ensuring compatibility with existing systems. Only the internal tokenization operates on UTF-16, making BBPE16 a drop-in replacement for UTF-8-based BBPE. A minimal sketch of this pipeline appears below.
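A minimal, self-contained sketch of the five steps above, assuming a greedy-merge BPE trainer with token ids 0-255 reserved for raw byte values (illustrative only; no particular BPE backend is prescribed):

```python
# BBPE16 pipeline sketch: merges are learned and applied over UTF-16-LE bytes.
from collections import Counter

def to_utf16le_bytes(text: str) -> list[int]:
    # Steps 1-2: UTF-8 text -> UTF-16-LE bytes; this codec emits no BOM.
    return list(text.encode("utf-16-le"))

def merge_pair(seq, pair, new_id):
    # Replace every occurrence of the adjacent pair with the merged id.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bbpe16(corpus, num_merges):
    # Step 3: greedily learn merge rules over UTF-16-LE byte sequences.
    seqs = [to_utf16le_bytes(t) for t in corpus]
    merges, next_id = [], 256  # ids 0-255 are the raw byte alphabet
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append((best, next_id))
        seqs = [merge_pair(s, best, next_id) for s in seqs]
        next_id += 1
    return merges

def tokenize(text, merges):
    # Step 4: apply the learned merges to new UTF-16-LE byte sequences.
    seq = to_utf16le_bytes(text)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq

def detokenize(tokens, merges):
    # Step 5: expand merges back to raw bytes, decode UTF-16-LE, and return
    # an ordinary string (UTF-8 on output).
    table = {new_id: pair for pair, new_id in merges}
    def expand(t):
        if t in table:
            a, b = table[t]
            return expand(a) + expand(b)
        return [t]
    return bytes(b for t in tokens for b in expand(t)).decode("utf-16-le")
```

In this sketch, `detokenize(tokenize("한국어", merges), merges)` round-trips the input exactly; a practical decoder would additionally need to handle hypotheses that end in the middle of a 2-byte code unit.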
4. EXPERIMENTAL SETTINGS

4.1. Datasets

We evaluated BBPE16 across multiple language settings using LibriSpeech [16] for English, KsponSpeech [17] for Korean, and AISHELL-1 [18] for Chinese. For the continual-learning experiments we used Wall Street Journal (WSJ) [19] for English, Zeroth-Korean [20] for Korean, and the Common Voice [21] Chinese dataset for Chinese.

For the LibriSpeech corpus, we used speed perturbation (factors 0.9×, 1.0×, and 1.1×), generating 843,723 total training utterances from the original 281,241. Speed perturbation was applied only to English monolingual training, not to bilingual or trilingual settings. Text was normalized using the ESPnet [22] pipeline with only essential tokens and apostrophes retained. For KsponSpeech, pronunciation notation was used, with English letters converted to Korean phonetics. For Zeroth-Korean, we randomly split the training data into 90% for training and 10% for development. For Common Voice Chinese, embedded English letters were uppercased. All datasets were filtered to exclude utterances over 30 seconds or with empty normalized text; this removed 496 utterances from KsponSpeech and 9 from WSJ.

4.2. Model Architecture and Inference Settings

We employed ESPnet [22] to train attention-based encoder-decoder (AED) models with E-Branchformer [23] encoders. The encoder consists of 17 blocks with 512-dimensional outputs, 1,024 linear units, and 8 attention heads with relative positional encoding. The decoder uses 6 Transformer [9] blocks with 2,048 linear units and 8 attention heads. All models used identical architectures, differing only in the tokenization strategy.
All models except the continual-learning model described in Section 5.3 were trained for 80 epochs. The continual-learning model, which starts from the trilingual model, was trained for 30 epochs. Decoding used beam search with a beam size of 4.

4.3. Tokenizer Configurations

We compared three tokenization approaches:
• BPE: standard character-level BPE
• BBPE: byte-level BPE on UTF-8 encoding
• BBPE16: our proposed UTF-16-based BBPE

5. RESULTS AND ANALYSIS

5.1. Monolingual and Bilingual Scenarios

5.1.1. ASR Performance

To verify that BBPE16 introduces no degradation in ASR performance, we trained and evaluated monolingual English, monolingual Korean, and bilingual (English & Korean) ASR models using a variety of tokenizers, and compared the resulting recognition accuracies.

Table 2. WER (%) comparison across tokenizers in monolingual and bilingual scenarios

Language              Vocab Size   Test Set     BPE    BBPE   BBPE16
English               1000         test-clean   2.1    2.2    2.1
                                   test-other   4.8    4.7    4.6
Korean                3000         Eval-clean   18.5   18.7   18.6
                                   Eval-other   21.5   21.8   22.0
Bilingual (En & Ko)   5000         test-clean   2.5    2.7    2.6
                                   test-other   5.8    6.1    6.0
                                   Eval-clean   19.0   18.9   19.1
                                   Eval-other   22.1   22.6   22.2

Table 2 presents word error rate (WER) comparisons across different settings. In both monolingual and bilingual settings, BBPE16 shows no significant performance difference compared with BPE and BBPE on any test set.

5.2. Trilingual Scenario

To understand the cross-lingual capabilities and the tokenizer efficiency of BBPE16, we compared trilingual tokenizers trained on combined datasets (LibriSpeech, KsponSpeech, and AISHELL-1). We used a vocabulary size of 7,000 for trilingual tokenizers.
5.2.1. Shared Token Analysis

We analyzed token-sharing patterns across English, Korean, and Chinese with trilingual tokenizers. Table 3 shows the number of common tokens shared between language pairs for BBPE and BBPE16.

Table 3. Number of shared tokens across language pairs in trilingual tokenizers

Language Pair     BBPE   BBPE16
English-Korean    0      42
Korean-Chinese    95     573
Chinese-English   0      55
All 3 Languages   0      42

The results demonstrate BBPE16's remarkable advantage in creating cross-lingual token compatibility. Notably, BBPE yields no shared tokens for the English-Korean pair, the Chinese-English pair, and the three-language set combined, whereas BBPE16 produces at least 42 shared tokens in each of those three cases.¹ BBPE16 shows substantial improvements in Korean-Chinese token sharing (95→573) and notable gains in the other language pairs, demonstrating its effectiveness for multilingual tokenization.

¹The byte sequences " ˙G ¯AW" and " ˙G ¯AL" are examples of cross-lingual shared tokens, appearing in English, Korean, and Chinese.
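The counts in Table 3 can in principle be reproduced by intersecting the sets of token ids each language actually uses; a minimal sketch, where `tokenize` and `corpora` are hypothetical stand-ins for the trained trilingual tokenizer and the per-language data:

```python
# Sketch: shared-token counts as set intersections over the token ids
# each language's corpus actually uses. `tokenize`/`corpora` are stand-ins.
def used_tokens(corpus, tokenize):
    return {tok for utt in corpus for tok in tokenize(utt)}

def shared_token_counts(corpora, tokenize):
    used = {lang: used_tokens(c, tokenize) for lang, c in corpora.items()}
    en, ko, zh = used["en"], used["ko"], used["zh"]
    return {
        "English-Korean": len(en & ko),
        "Korean-Chinese": len(ko & zh),
        "Chinese-English": len(zh & en),
        "All 3 Languages": len(en & ko & zh),
    }
```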
5.2.2. Token Count Analysis

Using the same trilingual tokenizer, we evaluated compression efficiency across individual languages. Table 4 presents detailed token statistics.

Table 4. Average token counts per utterance for each language using trilingual tokenizers, with the last column showing the relative reduction of BBPE16 compared to BBPE

Language   BPE    BBPE   BBPE16   BBPE16 vs BBPE
English    76.5   45.4   45.2     -0.4%
Korean     23.5   16.5   16.3     -1.2%
Chinese    22.3   19.5   18.6     -4.6%

This analysis reveals the advantages of BBPE16, showing slight improvement for English (-0.4%), modest improvement for Korean (-1.2%), and significant gains for Chinese (-4.6%) compared to BBPE. Notably, BPE allocates most of its limited merged-token budget to covering the entire Korean and Chinese character ranges, which leaves relatively few tokens available for English. As a result, BPE produces a large number of tokens for English text while still remaining competitive for Korean and Chinese. The efficiency of BBPE16 directly leads to faster training and lower memory usage.

5.2.3. Vocabulary Coverage Analysis

Beyond token efficiency, vocabulary coverage (the proportion of the total vocabulary utilized by a given language) is a critical metric for tokenizer effectiveness. Higher vocabulary coverage reduces wasted token slots, enabling more efficient token-space usage and better utilization of the model's embedding parameters in multilingual settings. Table 5 shows the percentage of vocabulary utilized by each trilingual tokenizer across languages.

Table 5. Vocabulary coverage (%) across trilingual tokenizers and languages

Language   BPE    BBPE   BBPE16
English    4.5    44.1   48.1
Korean     35.0   39.9   43.0
Chinese    60.5   14.7   17.8

BBPE16 consistently achieves higher vocabulary coverage across all languages, with improvements of 4.0 percentage points for English, 3.1 for Korean, and 3.1 for Chinese compared to BBPE. The superior vocabulary coverage of BBPE16 indicates that its token inventory is used more efficiently across all three languages.

In the trilingual experiments, the English vocabulary-usage proportion under BPE is markedly lower than that of Korean and Chinese. This is because English contains far fewer distinct characters, whereas the token budget must accommodate the far larger and more diverse character sets of Korean and Chinese.
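Under this definition, coverage is the fraction of the vocabulary that a language touches; a minimal sketch reusing the hypothetical `used_tokens` helper from above:

```python
# Sketch: vocabulary coverage (%) for one language under a shared vocabulary.
def coverage(corpus, tokenize, vocab_size):
    return 100.0 * len(used_tokens(corpus, tokenize)) / vocab_size

# Note: summing coverage over languages counts shared tokens once per
# language, so the aggregate can exceed 100% under strong sharing.
```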
As discussed in Section 5.2.1, BBPE16's cross-lingual sharing leads the aggregated per-language coverage to exceed 100%. In contrast, many pre-allocated byte tokens in BBPE are never selected during training, so its aggregated coverage remains below 100%.

5.3. Continual-Learning Scenario

We evaluated the effectiveness of BBPE16 not only in the trilingual setting but also in a continual-learning scenario using trilingual tokenizers. These experiments demonstrate that BBPE16 offers clear advantages when incorporating new data.

We further evaluated BBPE16 in a continual-learning setup by fine-tuning the trilingual model on the base corpus (LibriSpeech, KsponSpeech, and AISHELL-1) plus three additional datasets: WSJ (English), Zeroth-Korean (Korean), and Common Voice Chinese (Chinese). For this scenario, we excluded the standard BPE tokenizer from the experiments because its out-of-vocabulary handling makes it unsuitable for this particular setting. From this point on, Zeroth-Korean is referred to as Zeroth, and Common Voice Chinese as CVC.

5.3.1. ASR Performance

The ASR performance comparison in the continual-learning scenario is presented in Table 6. The values in Table 6 are WER for the English and Korean datasets, and character error rate (CER) for the Chinese dataset.

Table 6. Performance comparison between BBPE and BBPE16 in trilingual and continual-learning scenarios, with WER (%) for English and Korean and CER (%) for Chinese

                          Trilingual        Continual-Learning
Dataset                   BBPE    BBPE16    BBPE    BBPE16
Base datasets
LibriSpeech test-clean    2.7     2.6       2.6     2.5
LibriSpeech test-other    6.1     6.1       5.8     5.8
KsponSpeech Eval-clean    18.7    19.0      18.7    18.7
KsponSpeech Eval-other    22.2    22.4      22.0    21.9
AISHELL-1                 5.9     5.7       5.6     5.6
Additional datasets
WSJ                       10.7    10.8      4.8     4.2
Zeroth                    76.0    47.7      7.6     7.5
CVC                       245.7   273.9     15.6    15.6

Continual learning substantially improves performance on all additional datasets over the trilingual baseline. Overall, BBPE16 delivers performance comparable to that of BBPE, matching BBPE on CVC while offering modest improvements on WSJ and Zeroth.
5.3.2. Token Count Analysis

Table 7 shows the token-count comparison for each additional dataset in the continual-learning scenario. BBPE16 demonstrates consistent advantages across all languages: no change for English, a slight improvement for the Korean dataset, and a substantial advantage for the Chinese dataset, achieving a 10.4% token reduction on CVC compared to BBPE. This significant reduction on the Chinese dataset leads to computational savings during continual learning for Chinese.

Table 7. Average token counts per utterance for each additional dataset in the continual-learning scenario, with the last column showing the relative reduction of BBPE16 compared to BBPE

Dataset   BBPE   BBPE16   BBPE16 vs BBPE
WSJ       28.7   28.7     0.0%
Zeroth    37.0   36.7     -0.8%
CVC       28.9   25.9     -10.4%

5.3.3. Inference Efficiency

In Table 8, we report the average number of decoding iterations per utterance, defined as the number of decoding steps required to emit the end-of-sequence token under beam search. Consistent with the training token counts, BBPE16 reduces decoding iterations across all languages: modest improvements for English (-0.4%) and Korean (-0.8%), and substantial efficiency gains for Chinese (-10.3%). This demonstrates that BBPE16 offers consistent efficiency improvements across all languages, with particularly significant gains for Chinese during both training and inference.

Table 8. Average decoding iterations per utterance for each test set in the continual-learning scenario, with the last column showing the relative reduction of BBPE16 compared to BBPE

Test Set   BBPE   BBPE16   BBPE16 vs BBPE
WSJ        27.3   27.2     -0.4%
Zeroth     36.8   36.5     -0.8%
CVC        27.3   24.5     -10.3%
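For reference, the relative reductions in Tables 7 and 8 follow directly from the reported averages:

```python
# Relative reduction of BBPE16 vs. BBPE from the reported per-utterance averages.
def reduction(bbpe, bbpe16):
    return 100.0 * (bbpe16 - bbpe) / bbpe

print(round(reduction(28.9, 25.9), 1))  # -10.4  (CVC token counts, Table 7)
print(round(reduction(27.3, 24.5), 1))  # -10.3  (CVC decoding iterations, Table 8)
```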
6. DISCUSSION

In many practical scenarios, the training corpus is dominated by large-scale English data, while languages such as CJK are represented with relatively smaller amounts of data. Subsequently, it is common to augment the training set with additional data to improve performance in non-English languages. In such cases, BBPE16 offers advantages over conventional BBPE by achieving more efficient tokenization through cross-lingual token sharing and 2-byte encoding.

Furthermore, BBPE16 retains compatibility with existing BBPE, since it differs only in replacing UTF-8 with UTF-16 for text encoding. Beyond the scope of speech recognition, this property also makes BBPE16 applicable to broader language-modeling tasks, including large language models.

7. CONCLUSION

We present BBPE16, a UTF-16-based BBPE tokenizer that overcomes the key limitations of BBPE in multilingual settings. By exploiting UTF-16's uniform code-unit representation, BBPE16 reduces token counts by up to 10.4% in a multilingual continual-learning scenario, alleviating the efficiency bottlenecks of multilingual ASR.
Rather than sacrificing English token efficiency, BBPE16 improves tokenization across all languages, yielding better cross-lingual token sharing and lower computational cost. As multilingual ASR becomes increasingly essential, BBPE16 offers a practical, high-efficiency solution that especially benefits non-Latin scripts like CJK while maintaining competitive or superior performance across diverse multilingual scenarios.

8. REFERENCES

[1] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, "Multilingual speech recognition with a single end-to-end model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4904–4908.

[2] V. Pratap, A. Sriram, P. Tomasello, A. Hannun, V. Liptchinsky, G. Synnaeve, and R. Collobert, "Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters," in Interspeech 2020, 2020, pp. 4751–4755.

[3] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, Z. Meng, K. Hu, A. Rosenberg, R. Prabhavalkar, D. S. Park, P. Haghani, J. Riesa, G. Perng, H. Soltau, T. Strohman, B. Ramabhadran, T. Sainath, P. Moreno, C.-C. Chiu, J. Schalkwyk, F. Beaufays, and Y. Wu, "Google USM: Scaling automatic speech recognition beyond 100 languages," arXiv preprint arXiv:2303.01037, 2023.

[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, "Scaling speech technology to 1,000+ languages," J. Mach. Learn. Res., vol. 25, no. 1, Jan. 2024.

[5] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725.
[6] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, "Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5621–5625.

[7] L. Della Libera, P. Mousavi, S. Zaiem, C. Subakan, and M. Ravanelli, "CL-MASR: A continual learning benchmark for multilingual ASR," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 32, pp. 4931–4944, Oct. 2024.

[8] C. Y. Kwok, J. Q. Yip, and E. S. Chng, "Continual learning optimizations for auto-regressive decoder of multilingual ASR systems," in Interspeech 2024, 2024, pp. 1225–1229.

[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017.

[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 1877–1901.

[12] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 66–71.

[13] C. Wang, K. Cho, and J. Gu, "Neural machine translation with byte-level subwords," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 9154–9160, Apr. 2020.

[14] L. Deng, R. Hsiao, and A. Ghoshal, "Bilingual end-to-end ASR with byte-level subwords," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6417–6421.

[15] R. Hsiao, L. Deng, E. McDermott, R. Travadi, and X. Zhuang, "Optimizing byte-level representation for end-to-end ASR," in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 462–467.

[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[17] J.-U. Bang, S. Yun, S.-H. Kim, M.-Y. Choi, M.-K. Lee, Y.-J. Kim, D.-H. Kim, J. Park, Y.-J. Lee, and S.-H. Kim, "KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition," Applied Sciences, vol. 10, no. 19, 2020.
[18] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1–5.

[19] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, 1992.

[20] L. Jo and W. Lee, "Zeroth-Korean," 2019. [Online]. Available: https://www.openslr.org/40/

[21] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222.

[22] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Interspeech 2018, 2018, pp. 2207–2211.

[23] K. Kim, F. Wu, Y. Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, "E-Branchformer: Branchformer with enhanced merging for speech recognition," in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 84–91.