BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition
Summary
This paper introduces BBPE16, a UTF-16-based byte-level byte-pair encoding (BBPE) tokenizer designed to improve multilingual speech recognition (ASR). Traditional UTF-8-based BBPE is language-agnostic and supports full Unicode coverage but suffers from variable-length encoding, which increases token sequence lengths for non-Latin scripts like Chinese, Japanese, and Korean (CJK). This leads to higher computational and memory costs. BBPE16 uses UTF-16, which provides a uniform 2-byte representation for most modern scripts, including CJK, enabling more efficient tokenization. Experiments across monolingual, bilingual, and trilingual ASR settings show that BBPE16 achieves comparable or better accuracy than UTF-8-based BBPE while significantly improving cross-lingual token sharing. For Chinese, it reduces token counts by up to 10.4% and decoding iterations by up to 10.3%, leading to faster training and inference. The tokenizer is compatible with existing systems and offers practical benefits for multilingual ASR, especially for non-Latin scripts.
BBPE16: UTF-16-BASED BYTE-LEVEL BYTE-PAIR ENCODING FOR IMPROVED MULTILINGUAL SPEECH RECOGNITION
Hyunsik Kim∗, Haeri Kim∗, Munhak Lee, Kyungmin Lee
Samsung Research
{hyunsik777.kim, haeri.kim, mun-hak.lee, k.m.lee}@samsung.com
∗Equal contribution.
ABSTRACT
Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.
Index Terms— Byte-level BPE, UTF-16, tokenization, speech recognition, multilingual ASR
1. INTRODUCTION
As automatic speech recognition (ASR) has advanced, attention has shifted from single-language systems to multilingual ASR. The goal is not only higher accuracy [1] but also broader language coverage, ranging from dozens [2] to, in some cases, hundreds of languages [3, 4]. To achieve this breadth, the model must have an output vocabulary capable of representing a wide variety of writing systems; therefore, tokenization becomes a primary design consideration.

While character- or subword-level byte-pair encoding (BPE) [5] is often sufficient for monolingual ASR, multilingual setups commonly adopt byte-level BPE (BBPE) to ensure full Unicode coverage and to avoid language-specific preprocessing. Contemporary BBPE implementations are predominantly UTF-8-based [6], which brings practical advantages such as ASCII compatibility and robustness. However, UTF-8 also introduces friction in cross-script settings. Its variable-length encoding yields uneven token boundaries across languages and, for non-Latin scripts such as Chinese, Japanese, and Korean (CJK), inflates sequence length. This lengthening increases computation and memory use, which in turn raises model and runtime complexity. These effects make efficient, scalable multilingual ASR more challenging, which motivates alternative BBPE designs that better align with cross-lingual settings.

In real-world deployments, multilingual ASR systems are initially pretrained on massive multilingual corpora and then undergo continual learning, i.e., incremental adaptation to new domains, dialects, or underrepresented languages while preserving prior knowledge, to maintain or improve performance in specific target settings. In such settings, models must incorporate new data without catastrophic forgetting of previously learned languages. Against this backdrop, continual learning for multilingual ASR has become an active area of research [7, 8]. These continual-learning scenarios impose strong requirements on tokenizer design: the tokenizer should facilitate efficient cross-lingual sharing and maintain stability as new languages or domains are incrementally added.
In this work, we propose BBPE16, a UTF-16-based BBPE tokenizer that addresses the limitations of UTF-8-based BBPE while maintaining compatibility with existing infrastructure. UTF-16 provides a uniform 2-byte representation for the vast majority of characters in the Basic Multilingual Plane (BMP), which includes most modern scripts, including CJK characters. Importantly, in our trilingual setup BBPE16 exhibits markedly superior cross-lingual token-sharing capability, generating tokens shared among English, Korean, and Chinese, whereas BBPE yields none.

Our key contributions are:
• A novel UTF-16-based BBPE tokenizer that achieves comparable ASR performance with significantly improved token efficiency and cross-lingual token sharing
• Up to 10.4% token reduction for Chinese in a multilingual continual-learning scenario, alongside reductions in decoding iterations, yielding practical training and inference speedups

2. BACKGROUND

2.1. Byte-Level BPE (BBPE)

Modern Transformer [9]-based models (e.g., BERT [10], GPT-3 [11]) are trained on hundreds of billions of tokens and contain tens of billions of parameters. At this scale, word-level vocabularies suffer from severe sparsity and vocabulary-size explosion problems. Subword methods such as BPE [5] and SentencePiece [12] reduce sparsity while still capturing a wide range of morphological variations.

However, classic BPE still requires language-specific preprocessing, such as word tokenizers and whitespace handling. When processing multilingual corpora or corpora that contain a substantial number of low-frequency characters, maintaining language-specific preprocessing rules is cumbersome. BBPE first converts the input text into a byte sequence and then applies BPE on that sequence.
This makes the tokenizer language-agnostic: as long as the text is encoded in UTF-8, any Unicode character can be processed by the same pipeline. BBPE has several advantages over conventional BPE: in multilingual settings, it distributes token usage more uniformly across the corpus and yields a larger set of tokens shared among languages [13]. Moreover, because BBPE's vocabulary encompasses all possible byte values, it inherently eliminates out-of-vocabulary tokens.

2.2. Limitations of UTF-8-based BBPE

While UTF-8 has become the de facto standard for text encoding due to its backward compatibility with ASCII and error resilience, it introduces several challenges for BBPE.

Variable-length encoding: Characters require 1-4 bytes depending on their Unicode code point, leading to inconsistent tokenization patterns.
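As a concrete illustration of this variable-length behavior, a minimal sketch (illustrative only, not a prescribed implementation) of how UTF-8 byte sequences, the input to BBPE, grow fastest for CJK text:

```python
# Minimal illustration: UTF-8 maps each character to 1-4 bytes, so the
# byte sequences BBPE merges over inflate most for CJK scripts.
for text in ["hello", "한국어", "中文"]:
    raw = text.encode("utf-8")  # the raw byte sequence BBPE starts from
    print(f"{text!r}: {len(text)} chars -> {len(raw)} UTF-8 bytes")

# 'hello': 5 chars -> 5 UTF-8 bytes
# '한국어': 3 chars -> 9 UTF-8 bytes  (3 bytes per Hangul syllable)
# '中文': 2 chars -> 6 UTF-8 bytes    (3 bytes per CJK ideograph)
```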
To mitigate the discrepancy that arises when constructing BBPE for English–Chinese bilingual data, caused by the variable-length nature of the encoding, [14] introduced a length penalty and an alphabet penalty.

Susceptibility to decoding errors: While UTF-8 enables strict validation by rejecting invalid byte sequences, this also makes it more susceptible to decoding errors. To address these decoding errors, several studies have applied a dynamic-programming algorithm [6, 13], while another line of work explored a solution based on vector quantization [15].

Although prior research has explored different approaches to mitigate the limitations of UTF-8-based BBPE, this work introduces a novel tokenizer that addresses these shortcomings through a simple modification.

3. BBPE16: UTF-16-BASED TOKENIZATION

3.1. Motivation for a UTF-16-based BBPE

UTF-16 uses 16-bit (2-byte) code units as its fundamental unit. Characters in the BMP, which spans U+0000-U+FFFF, are represented with a single code unit (2 bytes). The BMP includes most modern scripts, such as (1) Latin characters, (2) CJK unified ideographs, (3) Hangul syllables, and (4) major scripts like Arabic, Hebrew, and Devanagari. This results in a language-agnostic and uniform 2-byte representation for each character, whereas UTF-8 employs a variable-length byte encoding per character.

Table 1. Comparison of UTF-8 and UTF-16 (little-endian) encodings for the Korean Hangul syllable '한' (U+D55C)

Encoding   Byte 1          Byte 2          Byte 3
UTF-8      11101101 (ED)   10010101 (95)   10011100 (9C)
UTF-16     01011100 (5C)   11010101 (D5)   -
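The byte values in Table 1 can be checked directly with Python's built-in codecs; a minimal check (the explicit little-endian codec emits no BOM):

```python
# Check of Table 1: byte values of '한' (U+D55C).
ch = "한"
print(ch.encode("utf-8").hex(" "))      # ed 95 9c  (three bytes, prefix bits included)
print(ch.encode("utf-16-le").hex(" "))  # 5c d5     (one 2-byte BMP code unit, no BOM)
```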
Table 1 compares UTF-8 and UTF-16 encoding for non-Latin scripts, focusing on CJK characters, with the Korean Hangul syllable '한' shown as a representative example. In this study, we assume the use of UTF-16 encoding in little-endian format and thus we ignore the Byte Order Mark (BOM). In Table 1, the leading bits of each UTF-8 byte (e.g., 1110) are prefix bits prescribed by the UTF-8 encoding specification, while the remaining bits are data bits. In UTF-8, multiple bits are used as a length-indicating prefix, while in UTF-16 all bits encode character data. Because UTF-16 eliminates the overhead of length-prefix bits, a UTF-16-based tokenization scheme in turn allows the vocabulary to be constructed more compactly and efficiently than UTF-8-based BBPE.

3.2. BBPE16: UTF-16-based BBPE

We propose BBPE16, a BBPE tokenizer that operates on UTF-16 rather than the conventional UTF-8 encoding. BBPE16 follows the standard BPE merge algorithm, but all merges are performed on the byte sequences derived from UTF-16 code units. The processing pipeline of BBPE16 is outlined below:

1. Encoding: Convert the input text from UTF-8 to UTF-16 (little-endian)
2. Byte extraction: Obtain the raw UTF-16 bytes and discard the BOM
3. BPE training: Learn merge rules over the UTF-16 byte sequences
4. Tokenization: Apply the learned merges to new UTF-16-encoded texts
5. Decoding: Reconstruct the UTF-16 text from tokens and convert back to UTF-8 text

As described in Section 3.1, we adopt UTF-16 little-endian encoding, and therefore the BOM is discarded as part of the preprocessing step. Input and output remain UTF-8 encoded, ensuring compatibility with existing systems. Only the internal tokenization operates on UTF-16, making BBPE16 a drop-in replacement for UTF-8-based BBPE. A minimal sketch of this pipeline appears below.
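A minimal, self-contained sketch of the five steps above, assuming a greedy-merge BPE trainer with token ids 0-255 reserved for raw byte values (illustrative only; no particular BPE backend is prescribed):

```python
# BBPE16 pipeline sketch: merges are learned and applied over UTF-16-LE bytes.
from collections import Counter

def to_utf16le_bytes(text: str) -> list[int]:
    # Steps 1-2: UTF-8 text -> UTF-16-LE bytes; this codec emits no BOM.
    return list(text.encode("utf-16-le"))

def merge_pair(seq, pair, new_id):
    # Replace every occurrence of the adjacent pair with the merged id.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bbpe16(corpus, num_merges):
    # Step 3: greedily learn merge rules over UTF-16-LE byte sequences.
    seqs = [to_utf16le_bytes(t) for t in corpus]
    merges, next_id = [], 256  # ids 0-255 are the raw byte alphabet
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append((best, next_id))
        seqs = [merge_pair(s, best, next_id) for s in seqs]
        next_id += 1
    return merges

def tokenize(text, merges):
    # Step 4: apply the learned merges to new UTF-16-LE byte sequences.
    seq = to_utf16le_bytes(text)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq

def detokenize(tokens, merges):
    # Step 5: expand merges back to raw bytes, decode UTF-16-LE, and return
    # an ordinary string (UTF-8 on output).
    table = {new_id: pair for pair, new_id in merges}
    def expand(t):
        if t in table:
            a, b = table[t]
            return expand(a) + expand(b)
        return [t]
    return bytes(b for t in tokens for b in expand(t)).decode("utf-16-le")
```

In this sketch, `detokenize(tokenize("한국어", merges), merges)` round-trips the input exactly; a practical decoder would additionally need to handle hypotheses that end in the middle of a 2-byte code unit.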
4. EXPERIMENTAL SETTINGS

4.1. Datasets

We evaluated BBPE16 across multiple language settings using LibriSpeech [16] for English, KsponSpeech [17] for Korean, and AISHELL-1 [18] for Chinese. For the continual-learning experiments we used Wall Street Journal (WSJ) [19] for English, Zeroth-Korean [20] for Korean, and the Common Voice [21] Chinese dataset for Chinese.

For the LibriSpeech corpus, we used speed perturbation (factors 0.9×, 1.0×, and 1.1×), generating 843,723 total training utterances from the original 281,241. Speed perturbation was applied only to English monolingual training, not to bilingual or trilingual settings. Text was normalized using the ESPnet [22] pipeline with only essential tokens and apostrophes retained. For KsponSpeech, pronunciation notation was used, with English letters converted to Korean phonetics. For Zeroth-Korean, we randomly split the training data into 90% for training and 10% for development. For Common Voice Chinese, embedded English letters were uppercased. All datasets were filtered to exclude utterances over 30 seconds or with empty normalized text; this removed 496 utterances from KsponSpeech and 9 from WSJ.

4.2. Model Architecture and Inference Settings

We employed ESPnet [22] to train attention-based encoder-decoder (AED) models with E-Branchformer [23] encoders. The encoder consists of 17 blocks with 512-dimensional outputs, 1,024 linear units, and 8 attention heads with relative positional encoding. The decoder uses 6 Transformer [9] blocks with 2,048 linear units and 8 attention heads. All models used identical architectures, differing only in the tokenization strategy.
All models except the continual-learning model described in Section 5.3 were trained for 80 epochs. The continual-learning model, which starts from the trilingual model, was trained for 30 epochs. Decoding used beam search with a beam size of 4.

4.3. Tokenizer Configurations

We compared three tokenization approaches:
• BPE: standard character-level BPE
• BBPE: byte-level BPE on UTF-8 encoding
• BBPE16: our proposed UTF-16-based BBPE

5. RESULTS AND ANALYSIS

5.1. Monolingual and Bilingual Scenarios

5.1.1. ASR Performance

To verify that BBPE16 introduces no degradation in ASR performance, we trained and evaluated monolingual English, monolingual Korean, and bilingual (English & Korean) ASR models using a variety of tokenizers, and compared the resulting recognition accuracies.

Table 2. WER (%) comparison across tokenizers in monolingual and bilingual scenarios

Language              Vocab Size   Test Set     BPE    BBPE   BBPE16
English               1000         test-clean   2.1    2.2    2.1
                                   test-other   4.8    4.7    4.6
Korean                3000         Eval-clean   18.5   18.7   18.6
                                   Eval-other   21.5   21.8   22.0
Bilingual (En & Ko)   5000         test-clean   2.5    2.7    2.6
                                   test-other   5.8    6.1    6.0
                                   Eval-clean   19.0   18.9   19.1
                                   Eval-other   22.1   22.6   22.2

Table 2 presents word error rate (WER) comparisons across different settings. In both monolingual and bilingual settings, BBPE16 shows no significant performance difference compared with BPE and BBPE on any test set.

5.2. Trilingual Scenario

To understand the cross-lingual capabilities and the tokenizer efficiency of BBPE16, we compared trilingual tokenizers trained on combined datasets (LibriSpeech, KsponSpeech, and AISHELL-1). We used a vocabulary size of 7,000 for trilingual tokenizers.
5.2.1. Shared Token Analysis

We analyzed token-sharing patterns across English, Korean, and Chinese with trilingual tokenizers. Table 3 shows the number of common tokens shared between language pairs for BBPE and BBPE16.

Table 3. Number of shared tokens across language pairs in trilingual tokenizers

Language Pair     BBPE   BBPE16
English-Korean    0      42
Korean-Chinese    95     573
Chinese-English   0      55
All 3 Languages   0      42

The results demonstrate BBPE16's remarkable advantage in creating cross-lingual token compatibility. Notably, BBPE yields no shared tokens for the English-Korean pair, the Chinese-English pair, and the three-language set combined, whereas BBPE16 produces at least 42 shared tokens in each of those three cases.¹ BBPE16 shows substantial improvements in Korean-Chinese token sharing (95→573) and notable gains in the other language pairs, demonstrating its effectiveness for multilingual tokenization.

¹The byte sequences " ˙G ¯AW" and " ˙G ¯AL" are examples of cross-lingual shared tokens, appearing in English, Korean, and Chinese.
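The counts in Table 3 can in principle be reproduced by intersecting the sets of token ids each language actually uses; a minimal sketch, where `tokenize` and `corpora` are hypothetical stand-ins for the trained trilingual tokenizer and the per-language data:

```python
# Sketch: shared-token counts as set intersections over the token ids
# each language's corpus actually uses. `tokenize`/`corpora` are stand-ins.
def used_tokens(corpus, tokenize):
    return {tok for utt in corpus for tok in tokenize(utt)}

def shared_token_counts(corpora, tokenize):
    used = {lang: used_tokens(c, tokenize) for lang, c in corpora.items()}
    en, ko, zh = used["en"], used["ko"], used["zh"]
    return {
        "English-Korean": len(en & ko),
        "Korean-Chinese": len(ko & zh),
        "Chinese-English": len(zh & en),
        "All 3 Languages": len(en & ko & zh),
    }
```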
5.2.2. Token Count Analysis

Using the same trilingual tokenizer, we evaluated compression efficiency across individual languages. Table 4 presents detailed token statistics.

Table 4. Average token counts per utterance for each language using trilingual tokenizers, with the last column showing the relative reduction of BBPE16 compared to BBPE

Language   BPE    BBPE   BBPE16   BBPE16 vs BBPE
English    76.5   45.4   45.2     -0.4%
Korean     23.5   16.5   16.3     -1.2%
Chinese    22.3   19.5   18.6     -4.6%

This analysis reveals the advantages of BBPE16, showing slight improvement for English (-0.4%), modest improvement for Korean (-1.2%), and significant gains for Chinese (-4.6%) compared to BBPE. Notably, BPE allocates most of its limited merged-token budget to covering the entire Korean and Chinese character ranges, which leaves relatively few tokens available for English. As a result, BPE produces a large number of tokens for English text while still remaining competitive for Korean and Chinese. The efficiency of BBPE16 directly leads to faster training and lower memory usage.

5.2.3. Vocabulary Coverage Analysis

Beyond token efficiency, vocabulary coverage (the proportion of the total vocabulary utilized by a given language) is a critical metric for tokenizer effectiveness. Higher vocabulary coverage reduces wasted token slots, enabling more efficient token-space usage and better utilization of the model's embedding parameters in multilingual settings. Table 5 shows the percentage of vocabulary utilized by each trilingual tokenizer across languages.

Table 5. Vocabulary coverage (%) across trilingual tokenizers and languages

Language   BPE    BBPE   BBPE16
English    4.5    44.1   48.1
Korean     35.0   39.9   43.0
Chinese    60.5   14.7   17.8

BBPE16 consistently achieves higher vocabulary coverage across all languages, with improvements of 4.0 percentage points for English, 3.1 for Korean, and 3.1 for Chinese compared to BBPE. The superior vocabulary coverage of BBPE16 indicates that its token inventory is used more efficiently across all three languages.

In the trilingual experiments, the English vocabulary-usage proportion under BPE is markedly lower than that of Korean and Chinese. This is because English contains far fewer distinct characters, whereas the token budget must accommodate the far larger and more diverse character sets of Korean and Chinese.
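Under this definition, coverage is the fraction of the vocabulary that a language touches; a minimal sketch reusing the hypothetical `used_tokens` helper from above:

```python
# Sketch: vocabulary coverage (%) for one language under a shared vocabulary.
def coverage(corpus, tokenize, vocab_size):
    return 100.0 * len(used_tokens(corpus, tokenize)) / vocab_size

# Note: summing coverage over languages counts shared tokens once per
# language, so the aggregate can exceed 100% under strong sharing.
```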
As discussed in Section 5.2.1, BBPE16's cross-lingual sharing leads the aggregated per-language coverage to exceed 100%. In contrast, many pre-allocated byte tokens in BBPE are never selected during training, so its aggregated coverage remains below 100%.

5.3. Continual-Learning Scenario

We evaluated the effectiveness of BBPE16 not only in the trilingual setting but also in a continual-learning scenario using trilingual tokenizers. These experiments demonstrate that BBPE16 offers clear advantages when incorporating new data.

We further evaluated BBPE16 in a continual-learning setup by fine-tuning the trilingual model on the base corpus (LibriSpeech, KsponSpeech, and AISHELL-1) plus three additional datasets: WSJ (English), Zeroth-Korean (Korean), and Common Voice Chinese (Chinese). For this scenario, we excluded the standard BPE tokenizer from the experiments because its out-of-vocabulary handling makes it unsuitable for this particular setting. From this point on, Zeroth-Korean is referred to as Zeroth, and Common Voice Chinese as CVC.

5.3.1. ASR Performance

The ASR performance comparison in the continual-learning scenario is presented in Table 6. The values in Table 6 are WER for the English and Korean datasets, and character error rate (CER) for the Chinese dataset.

Table 6. Performance comparison between BBPE and BBPE16 in trilingual and continual-learning scenarios, with WER (%) for English and Korean and CER (%) for Chinese

                          Trilingual        Continual-Learning
Dataset                   BBPE    BBPE16    BBPE    BBPE16
Base datasets
LibriSpeech test-clean    2.7     2.6       2.6     2.5
LibriSpeech test-other    6.1     6.1       5.8     5.8
KsponSpeech Eval-clean    18.7    19.0      18.7    18.7
KsponSpeech Eval-other    22.2    22.4      22.0    21.9
AISHELL-1                 5.9     5.7       5.6     5.6
Additional datasets
WSJ                       10.7    10.8      4.8     4.2
Zeroth                    76.0    47.7      7.6     7.5
CVC                       245.7   273.9     15.6    15.6

Continual learning substantially improves performance on all additional datasets over the trilingual baseline. Overall, BBPE16 delivers performance comparable to that of BBPE, matching BBPE on CVC while offering modest improvements on WSJ and Zeroth.
5.3.2. Token Count Analysis

Table 7 shows the token-count comparison for each additional dataset in the continual-learning scenario. BBPE16 demonstrates consistent advantages across all languages: no change for English, a slight improvement for the Korean dataset, and a substantial advantage for the Chinese dataset, achieving a 10.4% token reduction on CVC compared to BBPE. This significant reduction on the Chinese dataset leads to computational savings during continual learning for Chinese.

Table 7. Average token counts per utterance for each additional dataset in the continual-learning scenario, with the last column showing the relative reduction of BBPE16 compared to BBPE

Dataset   BBPE   BBPE16   BBPE16 vs BBPE
WSJ       28.7   28.7     0.0%
Zeroth    37.0   36.7     -0.8%
CVC       28.9   25.9     -10.4%

5.3.3. Inference Efficiency

In Table 8, we report the average number of decoding iterations per utterance, defined as the number of decoding steps required to emit the end-of-sequence token under beam search. Consistent with the training token counts, BBPE16 reduces decoding iterations across all languages: modest improvements for English (-0.4%) and Korean (-0.8%), and substantial efficiency gains for Chinese (-10.3%). This demonstrates that BBPE16 offers consistent efficiency improvements across all languages, with particularly significant gains for Chinese during both training and inference.

Table 8. Average decoding iterations per utterance for each test set in the continual-learning scenario, with the last column showing the relative reduction of BBPE16 compared to BBPE

Test Set   BBPE   BBPE16   BBPE16 vs BBPE
WSJ        27.3   27.2     -0.4%
Zeroth     36.8   36.5     -0.8%
CVC        27.3   24.5     -10.3%
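For reference, the relative reductions in Tables 7 and 8 follow directly from the reported averages:

```python
# Relative reduction of BBPE16 vs. BBPE from the reported per-utterance averages.
def reduction(bbpe, bbpe16):
    return 100.0 * (bbpe16 - bbpe) / bbpe

print(round(reduction(28.9, 25.9), 1))  # -10.4  (CVC token counts, Table 7)
print(round(reduction(27.3, 24.5), 1))  # -10.3  (CVC decoding iterations, Table 8)
```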
6. DISCUSSION

In many practical scenarios, the training corpus is dominated by large-scale English data, while languages such as CJK are represented with relatively smaller amounts of data. Subsequently, it is common to augment the training set with additional data to improve performance in non-English languages. In such cases, BBPE16 offers advantages over conventional BBPE by achieving more efficient tokenization through cross-lingual token sharing and 2-byte encoding.

Furthermore, BBPE16 retains compatibility with existing BBPE, since it differs only in replacing UTF-8 with UTF-16 for text encoding. Beyond the scope of speech recognition, this property also makes BBPE16 applicable to broader language-modeling tasks, including large language models.

7. CONCLUSION

We present BBPE16, a UTF-16-based BBPE tokenizer that overcomes the key limitations of BBPE in multilingual settings. By exploiting UTF-16's uniform code-unit representation, BBPE16 reduces token counts by up to 10.4% in a multilingual continual-learning scenario, alleviating the efficiency bottlenecks of multilingual ASR.
Rather than sacrificing English token efficiency, BBPE16 improves tokenization across all languages, yielding better cross-lingual token sharing and lower computational cost. As multilingual ASR becomes increasingly essential, BBPE16 offers a practical, high-efficiency solution that especially benefits non-Latin scripts like CJK while maintaining competitive or superior performance across diverse multilingual scenarios.

8. REFERENCES

[1] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, "Multilingual speech recognition with a single end-to-end model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4904–4908.

[2] V. Pratap, A. Sriram, P. Tomasello, A. Hannun, V. Liptchinsky, G. Synnaeve, and R. Collobert, "Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters," in Interspeech 2020, 2020, pp. 4751–4755.

[3] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, Z. Meng, K. Hu, A. Rosenberg, R. Prabhavalkar, D. S. Park, P. Haghani, J. Riesa, G. Perng, H. Soltau, T. Strohman, B. Ramabhadran, T. Sainath, P. Moreno, C.-C. Chiu, J. Schalkwyk, F. Beaufays, and Y. Wu, "Google USM: Scaling automatic speech recognition beyond 100 languages," arXiv preprint arXiv:2303.01037, 2023.

[4] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, "Scaling speech technology to 1,000+ languages," J. Mach. Learn. Res., vol. 25, no. 1, Jan. 2024.

[5] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725.
[6] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, "Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5621–5625.

[7] L. Della Libera, P. Mousavi, S. Zaiem, C. Subakan, and M. Ravanelli, "CL-MASR: A continual learning benchmark for multilingual ASR," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 32, pp. 4931–4944, Oct. 2024.

[8] C. Y. Kwok, J. Q. Yip, and E. S. Chng, "Continual learning optimizations for auto-regressive decoder of multilingual ASR systems," in Interspeech 2024, 2024, pp. 1225–1229.

[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017.

[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 1877–1901.

[12] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 66–71.

[13] C. Wang, K. Cho, and J. Gu, "Neural machine translation with byte-level subwords," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 9154–9160, Apr. 2020.

[14] L. Deng, R. Hsiao, and A. Ghoshal, "Bilingual end-to-end ASR with byte-level subwords," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6417–6421.

[15] R. Hsiao, L. Deng, E. McDermott, R. Travadi, and X. Zhuang, "Optimizing byte-level representation for end-to-end ASR," in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 462–467.

[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[17] J.-U. Bang, S. Yun, S.-H. Kim, M.-Y. Choi, M.-K. Lee, Y.-J. Kim, D.-H. Kim, J. Park, Y.-J. Lee, and S.-H. Kim, "KsponSpeech: Korean spontaneous speech corpus for automatic speech recognition," Applied Sciences, vol. 10, no. 19, 2020.
[18] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1–5.

[19] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, 1992.

[20] L. Jo and W. Lee, "Zeroth-Korean," 2019. [Online]. Available: https://www.openslr.org/40/

[21] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France: European Language Resources Association, May 2020, pp. 4218–4222.

[22] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Interspeech 2018, 2018, pp. 2207–2211.

[23] K. Kim, F. Wu, Y. Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, "E-Branchformer: Branchformer with enhanced merging for speech recognition," in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 84–91.