Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
Summary
This paper addresses the challenge of building speech large language models (SLLMs) for multitask understanding in low-resource languages, using Thai as a case study. Existing SLLMs, which combine speech encoders, adapters, and large language models (LLMs), perform well in high-resource languages but struggle in low-resource settings due to three main issues: suboptimal speech encoders, computationally expensive ASR-based alignment, and a lack of multitask understanding data. To overcome these, the authors introduce XLSR-Thai, a self-supervised speech encoder trained on 36,000 hours of Thai speech, improving ASR performance and supporting multitask understanding. They also propose U-Align, a more efficient and effective speech–text alignment method that directly aligns speech representations with text embeddings using a DTW loss, outperforming ASR-based alignment in both accuracy and computational cost. Additionally, the Thai-SUP pipeline generates Thai spoken language understanding data from high-resource English datasets via LLM-based augmentation, translation, and TTS, yielding the first Thai spoken language understanding dataset, with over 1,000 hours covering intent classification, named entity recognition, and speech rephrasing. Experiments show that these methods significantly enhance multitask understanding in Thai SLLMs. The authors open-source XLSR-Thai and Thai-SUP to support future research.
TOWARDS BUILDING SPEECH LARGE LANGUAGE MODELS FOR MULTITASK UNDERSTANDING IN LOW-RESOURCE LANGUAGES

Mingchen Shao1, Bingshen Mu1, Chengyou Wang1, Hai Li2, Ying Yan2, Zhonghua Fu1, Lei Xie1,*
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 iQIYI, Inc., China

ABSTRACT

Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness degrades substantially in low-resource languages such as Thai. This limitation arises from three factors: (1) existing commonly used speech encoders, like the Whisper family, underperform in low-resource languages and lack support for broader spoken language understanding tasks; (2) the ASR-based alignment paradigm requires training the entire SLLM, leading to high computational cost; (3) paired speech–text data in low-resource languages is scarce. To overcome these challenges in the low-resource language Thai, we introduce XLSR-Thai, the first self-supervised learning (SSL) speech encoder for Thai. It is obtained by continuously training the typical SSL XLSR model on 36,000 hours of Thai speech data. Furthermore, we propose U-Align, a speech–text alignment method that is more resource-efficient and multitask-effective than typical ASR-based alignment. Finally, we present Thai-SUP, a pipeline for generating Thai spoken language understanding data from high-resource languages, yielding the first Thai spoken language understanding dataset of over 1,000 hours. Multiple experiments demonstrate the effectiveness of our methods in building a Thai multitask understanding SLLM. We open-source XLSR-Thai and Thai-SUP to facilitate future research.1

Index Terms— XLSR-Thai, U-Align, Thai-SUP

1. INTRODUCTION

Large language models (LLMs) have demonstrated exceptional capabilities in numerous natural
language processing tasks, including text understanding, generation, and reasoning [1, 2, 3]. This capability has promoted considerable development in speech LLMs (SLLMs), which extend LLMs to process speech input directly. In particular, SLLMs have shown notable success in diverse spoken language understanding tasks [4, 5, 6], including automatic speech recognition (ASR), intent classification (IC), named entity recognition (NER), and speech rephrasing (SR) [7, 8, 9].

To construct SLLMs, one approach discretizes speech into tokens and trains the model with the standard next-token prediction objective [10, 11, 12]. A more widely adopted and empirically validated paradigm leverages a pretrained speech encoder to extract continuous speech representations, which are mapped to the LLM embedding space via an adapter [13, 14, 15]. Building on these designs, existing SLLMs have demonstrated remarkable performance across multiple spoken language understanding tasks in high-resource languages like English and Chinese. However, the performance of SLLMs remains substantially constrained in low-resource languages like Thai. To address this limitation, the research question can be summarized as: How to build SLLMs that achieve strong performance on multitask understanding in low-resource languages?

* Corresponding author.
1 https://huggingface.co/datasets/mcshao/Thai-understanding

As the core component of SLLMs for processing speech input, the speech encoder plays a vital role in capturing rich acoustic and linguistic information. Existing SLLMs typically use self-supervised learning (SSL) encoders or supervised ASR encoders, with the Whisper
[16] family being a popular choice. Although trained on large-scale multilingual speech data, their performance remains suboptimal in low-resource languages [17]. Moreover, since the Whisper family is limited to tasks such as ASR, speech translation, and voice activity detection, it imposes potential constraints on developing SLLMs for multitask understanding.

The adapter aligns the speech embeddings produced by the speech encoder with the text embedding space of the LLM, playing a crucial role in enabling the LLM to understand speech. Existing SLLMs typically begin by optimizing only the adapter on ASR tasks within the entire SLLM framework to establish speech–text alignment, and then leverage spoken language understanding data to extend the SLLMs' multitask understanding capabilities [7, 13, 18]. However, since this ASR-based alignment requires training the entire SLLM to fit the ASR objective, it incurs a high computational cost, and the alignment process is restricted to the ASR target rather than establishing universal speech–text alignment.

The scarcity of multitask spoken language understanding data in low-resource languages is another critical factor limiting the performance of current SLLMs. Unlike ASR corpora, which only require utterance-level transcriptions, multitask understanding datasets must additionally provide task-specific supervision, such as intent labels, named entity labels, and paraphrase pairs. Since annotating speech data in such languages is costly, leveraging unlabeled data through self-supervised learning and transferring paired data from high-resource languages represent practical approaches.

In this work, we
propose a comprehensive solution for developing multitask understanding SLLMs in a low-resource language, and take Thai as a representative case. For Thai, existing speech encoders such as the Zipformer proposed in EThai-ASR [17] or monsoon-whisper-medium-gigaspeech2 2 are built on limited Thai ASR annotations, and thus remain insufficient to support multitask understanding. Furthermore, the spoken language understanding data required for building SLLMs is entirely absent in Thai.

2 https://huggingface.co/scb10x/monsoon-whisper-medium-gigaspeech2

arXiv:2509.14804v1 [cs.SD] 18 Sep 2025

Fig. 1: The architecture of U-Align. Stage 1: use the DTW-loss to align adapted speech representations with textual embeddings of transcriptions, without involving the LLM; Stage 2: initialize the adapter from Stage 1 and condition the frozen LLM with task-specific prompts and speech representations. In contrast, ASR-based alignment optimizes only the adapter on ASR tasks within the entire SLLM.

To leverage large amounts of unlabeled data and enhance the multitask capability of the speech encoder, we introduce XLSR-Thai, an SSL speech encoder obtained by continuously training the typical SSL XLSR model [19] on 36,000 hours of Thai unlabeled speech. Meanwhile, we propose U-Align, a more resource-efficient and multitask-effective universal
speech–text alignment approach. Different from ASR-based alignment, which indirectly achieves speech–text alignment by optimizing the entire SLLM through the ASR task, U-Align directly aligns the adapted speech representations with the textual embeddings of the corresponding transcriptions, without involving the LLM, making the speech inputs fed into the LLM more similar to the corresponding text embeddings. With this method, the LLM can interpret speech as naturally as it does text, achieving a more resource-efficient and multitask-effective universal speech–text alignment. In addition, we propose the Thai-SUP pipeline, which generates low-resource Thai spoken language understanding data from high-resource English text understanding corpora. This is achieved through LLM-based data augmentation and translation, followed by text-to-speech (TTS) synthesis. Based on this pipeline, we produce the first open-source Thai spoken language understanding dataset, comprising over 1,000 hours of data across the IC, NER, and SR tasks. Experimental results demonstrate that XLSR-Thai improves ASR performance and boosts multitask understanding, while U-Align achieves higher accuracy across IC, NER, SR, and ASR at lower computational cost than ASR-based alignment.

In summary, we propose a language-agnostic and transferable solution for building multitask understanding SLLMs in low-resource languages, which integrates effective encoder training, universal speech–text alignment, and data generation strategies. Specifically, for Thai, our contributions can be outlined as follows:

• XLSR-Thai: the first open-source Thai SSL speech encoder, providing a strong
foundation for multitask understanding by extracting comprehensive speech representations.
• U-Align: a resource-efficient and multitask-effective universal speech–text alignment method that directly narrows the gap between speech representations and their corresponding text embeddings.
• Thai-SUP: a pipeline that generates low-resource spoken language understanding data from high-resource text data with LLM-based augmentation, translation, and TTS, yielding the first open-source Thai spoken language understanding dataset of over 1,000 hours across the IC, NER, and SR tasks.

2. PROPOSED METHODS

To develop SLLMs with strong multitask understanding capability in low-resource languages, we propose a comprehensive solution and take Thai as a representative case. To extract rich speech representations and support multitask requirements, we continue pretraining a multilingual SSL XLSR model on readily available unlabeled speech. We further introduce U-Align, a universal speech–text alignment method that is both more resource-efficient and more effective for multitask learning. In addition, we design the Thai-SUP pipeline, which leverages LLM-based data augmentation and translation combined with TTS to transfer abundant high-resource text understanding data into low-resource spoken language understanding supervision. This approach addresses the key challenges in building low-resource language SLLMs, namely insufficient encoder capacity, suboptimal speech–text alignment, and data scarcity.

2.1. XLSR-Thai

While speech encoders trained on ASR tasks tend to capture primarily semantic information, we first introduce the SSL speech encoder for Thai, XLSR-Thai,
specifically designed to acquire both linguistic and paralinguistic cues essential for multitask understanding. Although the original XLSR model provides general speech representations from multilingual pretraining, it has seen only a few dozen hours of Thai data, leading to weak Thai-specific learning.

To address this, we develop XLSR-Thai by continuously pretraining the XLSR model on a large-scale corpus of 16,000 hours of open-source Thai speech and 20,000 hours of in-house unlabeled Thai speech. This extensive pretraining yields more robust and generalizable Thai speech representations, allowing XLSR-Thai to capture both linguistic structures and essential paralinguistic cues, making it more effective for multitask understanding.

2.2. U-Align

2.2.1. Model architecture

We adopt XLSR-Thai as the speech encoder to capture both semantic and paralinguistic information. To bridge the speech and text modalities, we use a LayerNorm, a CNN subsampler, and a projection MLP as the modality adapter. For the LLM decoder, we use the frozen Typhoon2-LLaMa2-3B model [20], generating text conditioned on task prompts and adapted speech embeddings.

Fig. 2: Thai-SUP pipeline. Thai-SUP generates low-resource Thai spoken language understanding data from high-resource English text corpora using LLM-based data augmentation, translation, and TTS.
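The modality adapter described in 2.2.1 (LayerNorm, then a CNN subsampler, then a projection MLP) can be sketched as a forward pass in NumPy. All dimensions, kernel sizes, and strides below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each frame over its feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv_subsample(x, w, stride=2):
    # 1-D convolution over time with stride > 1 to shorten the sequence,
    # followed by ReLU. x: (T, d_in), w: (kernel, d_in, d_out).
    k = w.shape[0]
    frames = [np.einsum("kd,kde->e", x[t:t + k], w)
              for t in range(0, x.shape[0] - k + 1, stride)]
    return np.maximum(np.stack(frames), 0.0)

def adapter(speech_feats, conv_w, w1, w2):
    # LayerNorm -> CNN subsampler -> projection MLP into the LLM space.
    h = layer_norm(speech_feats)
    h = conv_subsample(h, conv_w)
    return np.maximum(h @ w1, 0.0) @ w2

# Toy shapes: 20 encoder frames of dim 8, projected to a hypothetical LLM dim of 24.
feats = rng.standard_normal((20, 8))
out = adapter(feats,
              rng.standard_normal((3, 8, 16)),   # kernel 3, stride 2
              rng.standard_normal((16, 32)),
              rng.standard_normal((32, 24)))
```

The subsampling halves the frame rate, which both shortens the LLM's input sequence and brings speech closer to text-like sequence lengths.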
2.2.2. Universal speech-text alignment
Traditional ASR-based alignment methods fine-tune the entire SLLM to optimize for ASR objectives, leading to high computational costs and ASR-specific optimization. We propose U-Align, which directly aligns the adapted speech representations with the corresponding transcription representations in the LLM embedding space. This approach ensures that the speech inputs received by the LLM are more similar to text embeddings, facilitating a more natural interpretation of speech and enabling universal, multitask-effective speech–text alignment. Additionally, because the alignment stage does not involve the LLM, the computational cost is significantly reduced. To handle the length mismatch between speech and text, we align adapted speech embeddings $H = \{h_i\}_{i=1}^{I}$ to frozen LLM text embeddings $E = \{e_j\}_{j=1}^{J}$ using a cosine-distance DTW objective. Let $C_{ij} = 1 - \frac{\langle h_i, e_j \rangle}{\|h_i\|\,\|e_j\|}$. The DTW-loss is calculated as

$$\mathcal{L}_{\text{DTW}} = \frac{1}{|\pi^{\star}|} \min_{\pi \in \mathcal{P}} \sum_{(i,j) \in \pi} C_{ij}, \qquad (1)$$

where $\mathcal{P}$ is the set of monotonic warping paths and $\pi^{\star}$ is the optimal path. Normalizing by $|\pi^{\star}|$ avoids sequence-length bias.
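For concreteness, Eq. (1) admits a short dynamic-programming sketch. The snippet below implements the hard (non-differentiable) DTW recursion with the cosine cost in NumPy; the paper's training presumably uses a differentiable relaxation for backpropagation, and all names here are ours:

```python
import numpy as np

def dtw_alignment_loss(H, E):
    """Hard-DTW version of Eq. (1): cosine-distance DTW between speech
    embeddings H (I x d) and text embeddings E (J x d), with the total
    cost normalized by the optimal path length."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    C = 1.0 - Hn @ En.T                       # C[i, j] = 1 - cos(h_i, e_j)
    I, J = C.shape
    D = np.full((I + 1, J + 1), np.inf)       # accumulated path cost
    L = np.zeros((I + 1, J + 1), dtype=int)   # cells on the best path
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            # Monotonic predecessors: diagonal, speech step, text step.
            cost, (pi, pj) = min((D[i-1, j-1], (i-1, j-1)),
                                 (D[i-1, j],   (i-1, j)),
                                 (D[i,   j-1], (i,   j-1)))
            D[i, j] = C[i-1, j-1] + cost
            L[i, j] = L[pi, pj] + 1
    return D[I, J] / L[I, J]                  # normalize: avoids length bias
```

Identical sequences yield zero loss along the diagonal path, while mismatched sequences accumulate positive cosine distance.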
In Stage 2, the frozen LLM receives task-specific prompts and speech embeddings, and the SLLM is fine-tuned on spoken language understanding data to support multitask understanding. A key feature of U-Align is that it aligns speech embeddings directly with the corresponding transcription embeddings, enabling the LLM to interpret speech as naturally as it does text. This alignment can be achieved with various constraint functions, such as DTW-loss or CTC-loss. Our experiments show that DTW-loss outperforms CTC-loss, and thus we adopt DTW-loss in this work.
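The Stage 2 conditioning described above amounts to prefixing the frozen LLM's input with prompt embeddings and adapted speech embeddings, and supervising only the label positions with cross-entropy. A minimal NumPy sketch of that input layout, with toy shapes and names that are ours rather than the authors':

```python
import numpy as np

def build_stage2_inputs(prompt_emb, speech_emb, label_emb):
    """Lay out one Stage-2 training sequence for the frozen LLM:
    [task prompt][adapted speech][label tokens], plus a loss mask that
    applies cross-entropy only on the label positions."""
    inputs = np.concatenate([prompt_emb, speech_emb, label_emb], axis=0)
    loss_mask = np.concatenate([
        np.zeros(len(prompt_emb) + len(speech_emb), dtype=bool),  # no loss
        np.ones(len(label_emb), dtype=bool),                      # CE loss here
    ])
    return inputs, loss_mask

# Toy example with embedding size 8.
prompt = np.random.randn(5, 8)   # embedded task prompt, e.g. an NER instruction
speech = np.random.randn(40, 8)  # adapter output for one utterance
label = np.random.randn(6, 8)    # embedded target tokens
x, mask = build_stage2_inputs(prompt, speech, label)
```

Because only the adapter was trained in Stage 1 and the LLM stays frozen, this layout is what makes the alignment reusable across tasks: only the prompt and label segments change per task.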
2.3. Thai-SUP
To address the scarcity of spoken language understanding data
in low-resource languages, we build the Thai-SUP pipeline, shown in Figure 2, which transfers supervision from high-resource text understanding corpora to low-resource spoken language understanding datasets. The pipeline applies LLM-based augmentation to diversify texts, translates the augmented texts into the target language, performs colloquialization and quality filtering to ensure text-to-speech (TTS) suitability, and finally synthesizes audio via TTS, thereby constructing large-scale paired speech–text supervision for spoken language understanding.

As for Thai, we start from open-source English text understanding datasets: SNIPS for IC and WikiANN / CoNLL-2003 for NER. Each original example is augmented via DeepSeek-v3, generating ten synthetic variants per instance. These candidates are then filtered with Gemini-2.5-flash to remove examples that are unsuitable for downstream speech tasks. The remaining English examples are translated into colloquial, spoken-style Thai and rendered into speech using a Thai fine-tuned LLaSa model [21] to produce high-quality speech–text pairs. For the SR task, we use DeepSeek-v3 to mine and select ASR speech–text pairs that lend themselves to paraphrasing, and apply Gemini-2.5-flash to generate rewritten labels.

Table 1: CER (%) performance of XLSR-Thai. "Giga2 Test" indicates the GigaSpeech2 test set; "CV Test" denotes the Common Voice test set.

Model                 #Params  Giga2 Test  CV Test
Conformer-giga2       150M     16.36       6.12
Whisper-medium-giga2  769M     14.15       6.92
XLSR-AED              450M     17.72       5.73
XLSR-Thai-AED         450M     14.88       4.80
XLSR-CTC              300M     16.74       5.06
XLSR-Thai-CTC         300M     13.91       3.97
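The Thai-SUP stages described in this section (collect, augment, translate/filter, synthesize) can be sketched as a skeleton pipeline. The stand-in functions below only mark where the real services would be called (DeepSeek-v3 for augmentation and SR mining, Gemini-2.5-flash for filtering and rewriting, a Thai fine-tuned LLaSa TTS model); every name and behavior here is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str             # utterance text (English, then Thai after translation)
    label: str            # task supervision, e.g. an intent or entity tags
    audio_path: str = ""  # filled in by the TTS stage

def augment(ex: Example, n: int = 10) -> list:
    # Stand-in for DeepSeek-v3 paraphrasing: n synthetic variants per instance.
    return [Example(f"{ex.text} (variant {i})", ex.label) for i in range(n)]

def suitable_for_speech(ex: Example) -> bool:
    # Stand-in for Gemini-2.5-flash filtering of examples unsuited to TTS.
    return bool(ex.text.strip())

def translate_to_thai(ex: Example) -> Example:
    # Stand-in for colloquial, spoken-style translation into Thai.
    return Example(f"[th] {ex.text}", ex.label)

def synthesize(ex: Example) -> Example:
    # Stand-in for TTS synthesis; here we only attach a synthetic path.
    ex.audio_path = f"tts/{abs(hash(ex.text)) % 10**8:08d}.wav"
    return ex

def thai_sup(seed: list) -> list:
    out = []
    for ex in seed:
        for var in augment(ex):
            if suitable_for_speech(var):
                out.append(synthesize(translate_to_thai(var)))
    return out
```

One seed example thus fans out into up to ten filtered, translated, and synthesized speech–text pairs, which is how the pipeline scales modest English corpora into 1,000+ hours of Thai supervision.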
All synthesized data yields more than 250 hours for the SR task, 648 hours for NER, and 175 hours for IC.

3. EXPERIMENTS

3.1. Experimental setup

We continue pretraining XLSR on 16,000 hours of public Thai data, including GigaSpeech2 [22] and MSR-86K [23], and 20,000 hours of in-house unlabeled Thai speech to obtain XLSR-Thai. To verify encoder gains, we fine-tune ASR models on GigaSpeech2, MSR-86K, and Common Voice [24] using either XLSR-Thai or the original XLSR, and report character error rate (CER). To assess U-Align's effectiveness and efficiency, we compare it with conventional ASR-based alignment under identical model settings on the same datasets. For multitask training, we first run U-Align's alignment stage on a subset of 2,000 hours drawn from GigaSpeech2, MSR-86K, and Common Voice, then perform multitask fine-tuning by adding Thai-SUP to elicit multitask understanding. We report CER for ASR, classification accuracy (ACC) for NER and IC, and an automatic 1–5 rating for SR computed by Gemini-2.5-Flash.

3.2. Evaluation of XLSR-Thai's effectiveness

To validate the advancement of the XLSR-Thai encoder, we conducted experiments on both the ASR single task and multitask understanding. For the ASR single task, we fine-tuned the SSL encoder using two approaches: (i) a CTC approach, where the SSL encoder and CTC layer are fully fine-tuned, and (ii) an AED approach, where the SSL encoder is frozen and used as a feature extractor for a Conformer-encoder, Transformer-decoder AED model. In addition, we trained a same-size AED Conformer-giga2 model on the same data. As shown in Table 1, our XLSR-Thai outperforms the original XLSR model under both fine-tuning methods.
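The CER metric reported in Table 1 is edit distance over reference length; a minimal sketch (not the authors' exact scoring script):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between the
    hypothesis and the reference, divided by the reference length."""
    r, h = reference, hypothesis
    # d[j] = edit distance between r[:i] and h[:j], updated row by row.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            substitution = prev_diag + (r[i - 1] != h[j - 1])
            prev_diag, d[j] = d[j], min(d[j] + 1,      # deletion
                                        d[j - 1] + 1,  # insertion
                                        substitution)
    return d[len(h)] / max(len(r), 1)
```

Since written Thai does not separate words with spaces, character-level error is the natural ASR metric here, rather than word error rate.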
Additionally, when compared with the Conformer-giga2 model, XLSR-Thai-AED shows significant improvements, indicating that our SSL model yields better speech representations. Furthermore, when compared with the open-source monsoon-whisper-medium-gigaspeech2, XLSR-Thai also demonstrates higher potential.

In multitask understanding, as shown in Table 2, using XLSR-Thai consistently leads to better results than using Whisper as the encoder, for both the ASR-based alignment and U-Align approaches. This highlights that XLSR-Thai is more effective for supporting multitask understanding in SLLM construction.

Table 2: Multitask Thai spoken language understanding results. Evaluation metrics: ACC (%) ↑ for IC; ACC (%) ↑ for NER (NER-ALL for overall, NER-PER for person, NER-ORG for organization, NER-LOC for location, NER-OTH for other entity types); LLM-score (1–5) ↑ for SR; CER (%) ↓ for ASR. Directly-MT trains multitask understanding without pre-alignment.

Model                            IC     NER-ALL  NER-PER  NER-LOC  NER-ORG  NER-OTH  SR    ASR
Whisper + ASR-based Alignment    77.15  37.86    35.61    40.83    38.29    83.27    2.66  14.43
Whisper + U-Align (DTW)          81.24  42.52    43.55    47.28    40.09    87.17    2.91  14.08
XLSR-Thai + Directly-MT          82.26  39.53    41.56    40.90    39.01    88.28    2.71  14.83
XLSR-Thai + ASR-based Alignment  81.71  43.23    47.88    46.43    41.89    87.91    2.89  13.81
XLSR-Thai + U-Align (CTC)        86.98  51.07    48.77    52.31    45.43    87.69    3.10  13.51
XLSR-Thai + U-Align (DTW)        89.68  53.77    53.92    54.43    48.09    90.91    3.02  13.32

Fig. 3: t-SNE visualization of text embeddings, ASR-based embeddings, and U-Align embeddings.

3.3. Validation of U-Align's universal speech-text alignment

To verify that U-Align provides
multitask-effective universal speech–text alignment, we conduct multitask understanding experiments with the following systems:
• XLSR-Thai + ASR-based Alignment: first trains modality alignment for one epoch on 2,000 hours of ASR data using ASR-based alignment, then adds one epoch of multitask training with Thai-SUP, adopting XLSR-Thai as the speech encoder.
• XLSR-Thai + Directly-MT: directly trains multitask capability on ASR data combined with Thai-SUP for two epochs, without a separate alignment stage.
• XLSR-Thai + U-Align: follows our proposed two-stage method, training one epoch of alignment with U-Align before adding Thai-SUP for multitask understanding training.
• XLSR-Thai + U-Align (CTC): replaces the DTW-loss in the alignment stage with CTC-loss.
• Whisper + ASR-based Alignment: replaces the encoder in XLSR-Thai + ASR-based Alignment with monsoon-whisper-medium-gigaspeech2.
• Whisper + U-Align: uses the monsoon-whisper-medium-gigaspeech2 encoder and applies U-Align for training.

The experimental results are shown in Table 2. Comparing XLSR-Thai + ASR-based Alignment, XLSR-Thai + Directly-MT, and XLSR-Thai + U-Align, we observe that performing speech–text alignment before multitask understanding training yields better performance than direct multitask understanding training. Moreover, U-Align achieves superior results over ASR-based alignment, indicating that it provides a more universal and multitask-effective alignment. The comparison between Whisper + ASR-based Alignment and Whisper + U-Align also demonstrates that U-Align consistently improves alignment across different encoders, confirming the robustness of our method.
Fig. 4: Comparison of CER (%) performance and compute cost (×10^7 TFLOPs) for ASR-based alignment and U-Align.

3.4. Effectiveness and efficiency of U-Align

We validate U-Align's effectiveness and efficiency on the ASR task. The baseline trains the SLLM on ASR data with ASR-based alignment, while our method uses the same data in two stages: Stage 1 learns modality alignment with U-Align, and Stage 2 fine-tunes on ASR. We measure effectiveness by comparing the performance achieved by the models under the same computational cost, and efficiency by comparing the computational cost required to achieve the same performance. The experimental results, shown in Fig. 4, demonstrate that U-Align's curve lies consistently below that of ASR-based alignment, i.e., it reaches lower CER at the same compute budget, indicating that U-Align is both more efficient and more effective than ASR-based alignment.

3.5. Ablation study and visualization

As shown in Table 2, U-Align (CTC) performs slightly worse than U-Align (DTW) but still holds a significant advantage over ASR-based alignment, showing that our method is not limited to DTW-loss: any loss function that constrains speech representations toward their corresponding text embeddings can be applied, and it consistently outperforms conventional ASR-based alignment. Fig. 3 shows t-SNE projections of speech and transcription embeddings. The U-Align embeddings (green) cluster notably closer to the text embeddings (blue) than the ASR-based embeddings (red), which are more dispersed. This demonstrates that U-Align aligns speech representations more closely with text, supporting its effectiveness for multitask understanding.

4. CONCLUSION

In this work, we propose a
comprehensive solution for building multitask understanding SLLMs for low-resource languages. We leverage easily accessible unlabeled data to continuously pretrain XLSR, introduce U-Align to achieve more resource-efficient and multitask-effective speech–text alignment, and develop the Thai-SUP pipeline to transfer high-resource text understanding data into low-resource spoken language understanding data. Our methods are demonstrated through experiments on Thai, and this approach can be extended to any low-resource language.

5. REFERENCES

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al., "GPT-4 Technical Report," arXiv preprint arXiv:2303.08774, 2023.
[2] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv preprint arXiv:2307.09288, 2023.
[3] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al., "Qwen2 Technical Report," arXiv preprint arXiv:2407.10671, 2024.
[4] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al., "Qwen2-Audio Technical Report," arXiv preprint arXiv:2407.10759, 2024.
[5] Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al., "Kimi-Audio Technical Report," arXiv preprint arXiv:2504.18425, 2025.
[6] Tianpeng Li,
Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, et al., "Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction," arXiv preprint arXiv:2502.17239, 2025.
[7] Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, and Zhiyong Wu, "Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving," arXiv preprint arXiv:2505.18644, 2025.
[8] Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, et al., "Voxtral," arXiv preprint arXiv:2507.13264, 2025.
[9] Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, and Helen Meng, "Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs," arXiv preprint arXiv:2508.17863, 2025.
[10] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang, "GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot," arXiv preprint arXiv:2412.02612, 2024.
[11] Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma, "Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM," arXiv preprint arXiv:2411.00774, 2024.
[12] Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, and Hung-yi Lee, "TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling," arXiv preprint arXiv:2504.07053, 2025.
[13] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang, "SALMONN: Towards Generic
Hearing Abilities for Large Language Models," in Proc. ICLR, 2024.
[14] Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, et al., "Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets," in Proc. ISCSLP, 2024, pp. 26–30.
[15] Bingshen Mu, Yiwen Shao, Kun Wei, Dong Yu, and Lei Xie, "Efficient Scaling for LLM-based ASR," arXiv preprint arXiv:2508.04096, 2025.
[16] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, "Robust Speech Recognition via Large-Scale Weak Supervision," in Proc. ICML, 2023, pp. 28492–28518.
[17] Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, and Lei Xie, "Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR," arXiv preprint arXiv:2505.22063, 2025.
[18] Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, and Lei Xie, "OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia," arXiv preprint arXiv:2501.13306, 2025.
[19] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, et al., "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale," in Proc. Interspeech, 2022, pp. 2278–2282.
[20] Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, et al., "Typhoon 2: A Family of Open Text
and Multimodal Thai Large Language Models," arXiv preprint arXiv:2412.13702, 2024.
[21] Tianlun Zuo, Jingbin Hu, Yuke Li, Xinfa Zhu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, and Lei Xie, "XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation," arXiv preprint arXiv:2508.07302, 2025.
[22] Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, et al., "GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement," arXiv preprint arXiv:2406.11546, 2024.
[23] Song Li, Yongbin You, Xuezhi Wang, Zhengkun Tian, Ke Ding, and Guanglu Wan, "MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research," arXiv preprint arXiv:2406.18301, 2024.
[24] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber, "Common Voice: A Massively-Multilingual Speech Corpus," in Proc. LREC, 2020, pp. 4218–4222.