SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia
Summary
SeaLLMs-Audio is the first large audio-language model (LALM) tailored for multiple Southeast Asian languages—Indonesian, Thai, and Vietnamese—alongside English and Chinese. Trained on a large-scale multilingual audio corpus, it supports diverse audio-centric tasks such as speech recognition, translation, captioning, and question answering. The model is multimodal, accepting audio-only, text-only, or combined audio-text inputs, and multilingual, covering five languages. To evaluate its performance, the authors introduce SeaBench-Audio, a benchmark of 14 tasks spanning audio-plus-text and audio-only inputs. The training data, 1.58 million conversations across multiple task types, was curated through a comprehensive pipeline that aggregated and processed public and private datasets, including GigaSpeech, Common Voice, and AudioCaps. The model architecture combines the audio encoder of Qwen2-Audio-7B with the multilingual LLM Qwen2.5-7B-Instruct. Evaluations by both human judges and an LLM-as-a-judge framework show that SeaLLMs-Audio achieves competitive performance compared with other LALMs, particularly in language quality for Southeast Asian languages.
Chaoqun Liu, Mahani Aljunied, Guizhen Chen, Hou Pong Chan, Weiwen Xu, Yu Rong, Wenxuan Zhang*
DAMO Academy, Alibaba Group
wxzhang@sutd.edu.sg
https://damo-nlp-sg.github.io/SeaLLMs-Audio/

Abstract

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages—Indonesian (id), Thai (th), and Vietnamese (vi)—alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports five languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, and audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

* Wenxuan Zhang is the corresponding author. SeaLLMs-Audio is publicly available at https://github.com/DAMO-NLP-SG/SeaLLMs-Audio.
1 Introduction

Large audio-language models (LALMs) (Chu et al., 2023, 2024; He et al., 2024; Pipatanakul et al., 2024; Held et al., 2024) have shown impressive capabilities in understanding the rich information contained in audio signals. However, most existing LALMs support only one or two languages, most typically English, leaving multilingual and low-resource regions under-represented.

In Southeast Asia (SEA), significant progress has been made in developing multilingual large language models (LLMs), such as the SeaLLMs (Nguyen et al., 2024; Zhang et al., 2025; Zhao et al., 2025), Sailor (Dou et al., 2024, 2025), and SEA-LION (https://huggingface.co/aisingapore/collections) series. Despite their multilingual reach, these models operate solely in the textual modality and lack the ability to process audio inputs—an essential component of natural human communication.

Beyond the absence of LALMs tailored for Southeast Asian languages, progress is further constrained by the lack of comprehensive and rigorous evaluation frameworks. Existing benchmarks, such as SeaEval (Wang et al., 2024), SeaExam, and SeaBench (Liu et al., 2025), focus primarily on textual evaluation within SEA contexts. Meanwhile, audio-related benchmarks remain limited to specific tasks like automatic speech recognition (ASR) (Wang et al., 2025), without providing a holistic assessment of audio understanding and voice-based interaction. This lack of broad, multimodal benchmarks continues to impede the advancement of audio-language modeling in SEA languages.
To bridge the above gaps, we introduce SeaLLMs-Audio (Southeast Asian Large Language Models with audio capabilities), a large audio-language model designed specifically for Southeast Asia. SeaLLMs-Audio is trained using data from a comprehensive curation pipeline that aggregates, organizes, and synthesizes multimodal resources across SEA languages, as illustrated in Figure 1. The curated dataset spans diverse tasks, including automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), speech summarization (SS), audio question answering (AQA), and multimodal reasoning.

[Figure 1: Illustration of the data curation process for SeaLLMs-Audio, with worked examples: LLM-based punctuation and spacing restoration for Thai ASR transcripts; LLM translation of transcripts to build S2TT pairs ("Translate the audio into English."); translation of English audio captions into Indonesian for AC; text-to-speech conversion of text QA pairs into spoken questions; MLLM-based summarization of trimmed long audio for SS; and MLLM-generated question-answer pairs for SQA.]
Furthermore, to facilitate standardized evaluation, we present SeaBench-Audio, a manually curated benchmark for assessing LALMs in Southeast Asian languages. SeaBench-Audio encompasses multiple open-ended task categories that reflect real-world, multimodal language understanding scenarios. To facilitate consistent and scalable evaluation, we adopt an LLM-as-a-judge framework with task-specific prompt templates, achieving high agreement with human annotations. Experimental results on SeaBench-Audio demonstrate that SeaLLMs-Audio delivers robust and competitive performance across a wide range of audio-language tasks.

Our key contributions are as follows.

• We present SeaLLMs-Audio, a large-scale audio-language model specifically designed for Southeast Asian contexts.
• We develop SeaBench-Audio, a comprehensive benchmark dedicated to evaluating LALMs within the SEA region.
• Our experimental analyses indicate that SeaLLMs-Audio achieves strong performance on the SeaBench-Audio benchmark.

2 SeaLLMs-Audio

In this section, we describe the training data curation pipeline, followed by the model architecture.

2.1 Comprehensive Data Curation Pipeline

The training dataset for SeaLLMs-Audio contains 1.58M conversations for multiple tasks, including 7% multi-turn dialogues that better reflect real-world interactive scenarios. The tasks can be roughly classified into the following categories: automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), question answering (QA), speech summarization (SS), audio question answering (AQA), chat, math, factoid QA (fact), and other tasks (mixed).

The training dataset was curated from multiple data sources, including public datasets and private data. Public datasets include GigaSpeech (Chen et al., 2021), GigaSpeech2 (Yang et al., 2025), Common Voice (Ardila et al., 2020), AudioCaps (Kim et al., 2019), VoiceAssistant-400K (Xie and Wu, 2024), YODAS2 (Li et al., 2024), and the Multitask National Speech Corpus (He et al., 2024). As these datasets span multiple sources with disparate formats (e.g., different audio encodings, annotation schemas, and text structures), they cannot be directly used for end-to-end training. We therefore perform comprehensive preprocessing to unify the data.
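For concreteness, the sketch below shows one plausible unified record after preprocessing: a conversation whose user turn pairs an audio clip with a text instruction. The field names and path are illustrative assumptions, since the paper does not publish its exact storage schema.

```python
# Hypothetical unified training record after preprocessing (the paper does
# not publish its exact schema; field names and the path are illustrative).
example_record = {
    "task": "ASR",
    "lang": "th",
    "conversations": [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_path": "clips/th_000001.wav"},
                {"type": "text", "text": "Transcribe the audio."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "<normalized transcript>"}],
        },
    ],
}
```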
Figure 1 shows the overall data curation pipeline with some examples. The following paragraphs describe the construction process for each task.

ASR. For ASR datasets such as GigaSpeech, we normalize transcripts to improve readability. For example, we transform "AND LOOK AT THE PERCENTAGE OF REPORTS <PERIOD>" into "And look at the percentage of reports." For GigaSpeech2, which includes Thai, Indonesian, and Vietnamese but whose text contains no punctuation, we employ a selected LLM to restore punctuation and spacing, producing more reader-friendly text for each language. As LLMs may introduce errors, we discard samples whose outputs are inconsistent with the original transcripts.
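A minimal sketch of this restore-then-verify filter, assuming an arbitrary caller-supplied llm_call function (the paper does not disclose which LLM it selected):

```python
import re

def strip_punct(s: str) -> str:
    """Remove whitespace and punctuation so only word characters remain."""
    return re.sub(r"[\W_]+", "", s)

def restore_punctuation(transcript: str, lang: str, llm_call) -> str | None:
    """Ask an LLM to add punctuation/spacing, then keep the sample only if
    the restored text still matches the original transcript once
    punctuation and whitespace are stripped; otherwise return None."""
    prompt = (
        f"Add punctuation and spacing to this {lang} ASR transcript. "
        f"Do not add, remove, or change any words.\n\n{transcript}"
    )
    restored = llm_call(prompt)  # any chat LLM; the paper's choice is unnamed
    return restored if strip_punct(restored) == strip_punct(transcript) else None
```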
S2TT. Given the absence of open-source S2TT datasets for SEA languages, we construct such data by leveraging ASR corpora in different languages. More specifically, since each unit of ASR data comprises speech audio plus its same-language text transcription, we translate the transcription into multiple target languages. This yields pairs of speech audio in one language and translated text in another.

AC. AudioCaps provides captions exclusively in English. To accommodate Southeast Asian languages, we translate these captions into the respective target languages.
QA. This set is curated to obtain audio questions with text answers. To do this, we make use of existing question-answer pairs in text format. The answers are kept unchanged in text form, while the text questions are converted into audio with text-to-speech (TTS) models. No translation is involved. After manually assessing the quality of TTS outputs from several models, we selected Google Text-to-Speech (https://cloud.google.com/text-to-speech).
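As an illustration, the snippet below converts one text question into a WAV file with the Google Cloud Text-to-Speech client library; leaving the voice at the service default is an assumption on our part, since the paper does not list the voices it used.

```python
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

def question_to_audio(question: str, language_code: str, out_path: str) -> None:
    """Synthesize a text question into 16-bit WAV speech with Google TTS
    (requires Google Cloud credentials to be configured)."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=question),
        voice=texttospeech.VoiceSelectionParams(language_code=language_code),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# Example: question_to_audio("Làm thế nào để nấu một bữa ăn nhanh?", "vi-VN", "q.wav")
```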
SS. To curate the SS dataset, we sample a piece of speech audio from the YODAS2 dataset and ask Gemini-2.0-Flash to summarize it in a specified language.

AQA. To create natural questions and answers about a piece of audio, we first sample a clip from the YODAS2 dataset, which contains audio from YouTube videos. We then prompt Gemini-2.0-Flash to generate a question about the audio and provide the corresponding answer, as sketched below.
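A sketch of this generation step, assuming the google-generativeai SDK; the prompt wording is our own, as the paper's exact prompts are not published.

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

def make_qa_pair(audio_path: str, lang: str) -> str:
    """Upload a sampled clip and ask Gemini for a question-answer pair
    about its content (illustrative prompt, not the paper's)."""
    audio_file = genai.upload_file(audio_path)
    prompt = (
        f"Listen to the audio and write one natural question about its "
        f"content in {lang}, followed by the correct answer."
    )
    return model.generate_content([audio_file, prompt]).text
```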
chat. To create voice chat data, we make use of existing text conversation data and convert the user input into audio with Google TTS. As Google TTS offers only a few voice types per language, we also synthesize part of the data with gpt-4o-mini-tts to improve voice diversity.

math and fact. For math and factoid QA instruction data, we likewise convert the prompts into speech with Google TTS.

The distribution of the training data by language and by task type is shown in Figure 2. By language: zh 20.5%, th 18.4%, id 18.1%, vi 17.9%, en 17.5%, sg 7.6%. By task type: chat 42.1%, ASR 22.6%, AC 14.3%, SQA 6.2%, SS 3.2%, fact 3.2%, mixed 2.8%, QA 2.7%, math 1.8%, S2TT 1.1%.

[Figure 2: Training data distribution across (a) languages and (b) task types.]
2.2 Model Architecture

SeaLLMs-Audio builds upon Qwen2-Audio-7B (Chu et al., 2024) and Qwen2.5-7B-Instruct (Qwen et al., 2025); the architecture is shown in Figure 3. We replace the LLM module in Qwen2-Audio-7B with Qwen2.5-7B-Instruct, thereby harnessing the advantages of both models: the Qwen2-Audio-7B audio encoder encodes features of both speech and non-speech audio effectively, while Qwen2.5-7B-Instruct contributes strong multilingual capabilities. Due to the hidden embedding mismatch, the audio adapter is newly initialized.

[Figure 3: Architecture of SeaLLMs-Audio: audio passes through the audio encoder and an adapter, text through the embedding layer, and both feed the LLM, which is trained with next-token prediction.]
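A conceptual sketch of this model surgery with Hugging Face Transformers. The attribute names (language_model, multi_modal_projector) follow our reading of the public Qwen2-Audio implementation, and the single linear adapter layer is our assumption about what "newly initialized" entails.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, Qwen2AudioForConditionalGeneration

# Load the donor audio model and the replacement LLM.
base = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Swap in the new LLM while keeping the pretrained audio encoder.
base.language_model = llm

# The adapter projects audio-encoder states into the LLM embedding space;
# with mismatched hidden embeddings it cannot be reused, so re-initialize it.
audio_dim = base.config.audio_config.d_model
text_dim = llm.config.hidden_size
base.multi_modal_projector.linear = nn.Linear(audio_dim, text_dim)
```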
After that, we perform full-parameter fine-tuning on our newly curated large-scale, multi-task audio dataset. Given paired data $(a, x)$, with $a$ denoting the audio sequence and $x$ the (optional) corresponding text sequence, the training objective is to maximize the likelihood of the next text token, formulated as

$$P_\theta(x_t \mid x_{<t}, a), \tag{1}$$

conditioned on the audio representations and the preceding text tokens $x_{<t}$, where $\theta$ denotes the trainable parameters of the LALM. We train the model on the dataset for one epoch, which took 6 days on 32 A800 GPUs.
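Equation (1) is the standard next-token objective; in practice it is minimized as a cross-entropy loss computed over text positions only. A minimal sketch, assuming the usual convention of labeling audio and prompt positions with -100 so they are ignored:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of Eq. (1): predict token x_t from x_<t and the
    audio; positions labeled -100 (audio/prompt tokens) do not contribute."""
    shift_logits = logits[:, :-1, :]  # prediction for position t+1
    shift_labels = labels[:, 1:]      # gold token at position t+1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```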
3 SeaBench-Audio

Due to the absence of standard audio benchmarks for evaluating audio LLMs in SEA languages, we manually create a benchmark called SeaBench-Audio. It comprises 14 tasks: 1) tasks with both audio and text inputs: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Summarization (SS), Speech Question Answering (SQA), Customer Service (CS), Safety, Audio Captioning (AC), Audio Question Answering (AQA), Speaker Identifiers (SKI), and Speech Emotion Recognition (SER); 2) tasks with only audio inputs: Life, Medical (MED), Math, and Fact. The task descriptions and annotation criteria are summarized in Table 3 in Appendix A. An overview of the tasks is provided in Figure 4(a).

For each language, we engage a professional native linguist to annotate 10 questions per task. One exception is the S2TT task, which requires translation between two languages: for id/th/vi, we construct two versions, native audio to English text (S2TT_XE) and English audio to native text (S2TT_EX), each with 10 questions; for English, we omit this task to prevent redundancy. Consequently, there are 150 questions for each SEA language and 130 for English, yielding 580 questions in total. For every question, a linguist supplies a reference answer to facilitate scoring. The benchmark underwent multiple rounds of careful review to ensure quality.

[Figure 4: (a) Overview of the tasks in SeaBench-Audio, grouped into audio-plus-text and audio-only inputs. (b) Evaluation pipeline with the LLM-as-a-judge framework: the LALM's response, the reference, the rubrics, and a prompt template are assembled into an evaluation prompt for Gemini, which returns an analysis and a score.]
For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality. However, human evaluations are expensive and time-consuming. To facilitate automatic evaluation, we therefore employ an LLM-as-a-judge framework. We choose Gemini-2.5-Flash (Gemini) (Comanici et al., 2025) for its audio understanding capabilities and its good balance of cost and performance. As shown in Figure 4(b), the procedure to evaluate an LALM is: 1) generate responses for each instance with the LALM; 2) construct an evaluation prompt from the text instruction (optional), reference answer, response, rubrics, and the template; 3) prompt Gemini with the audio and the evaluation prompt; 4) extract the score from the final response. The prompt template for LLM-as-a-judge is shown in Figure 8 in Appendix A. For each task, we additionally engage linguists to develop a task-specific evaluation rubric on a scale of 1 to 5. We hypothesize that tasks exhibit distinct characteristics, and a dedicated rubric more accurately captures the nuances of each task.
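A minimal sketch of steps 2-4, again assuming the google-generativeai SDK. Here template is the Figure 8 prompt with {question}, {reference}, {answer}, and {rule} placeholders, and the regex mirrors the required "Rating: [[n]]" output format:

```python
import re
import google.generativeai as genai

judge = genai.GenerativeModel("gemini-2.5-flash")

def judge_response(audio_path, question, reference, answer, rubric, template):
    """Build the evaluation prompt, send it to Gemini with the audio,
    and extract the 1-5 score from the 'Rating: [[n]]' pattern."""
    prompt = template.format(
        question=question, reference=reference, answer=answer, rule=rubric
    )
    audio = genai.upload_file(audio_path)
    reply = judge.generate_content([audio, prompt]).text
    match = re.search(r"\[\[([1-5])\]\]", reply)
    return int(match.group(1)) if match else None
```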
4 Experiments

We compare the performance of SeaLLMs-Audio with relevant LALMs of similar size:

Qwen2-Audio-7B-Instruct (Qwen2-Audio). The latest version of the Qwen-Audio series (Chu et al., 2024), which mainly focuses on English and Chinese. It shares the same base audio encoder as SeaLLMs-Audio.

Qwen2.5-Omni-7B (Qwen2.5-Omni). A multimodal model that perceives diverse modalities, including text, audio, image, and video (Xu et al., 2025). It also adopts the audio encoder from Qwen2-Audio. In this work, we only compare its performance with audio input.
MERaLiON-AudioLLM-Whisper-SEA-LION (MERaLiON). This model was trained to understand Singlish, using 62 million multimodal instruction samples (He et al., 2025). Since it was trained with both audio and text instructions, we add the text instruction "Please follow the instruction in the speech." for tasks with no text input.

MERaLiON-2-10B (MERaLiON-2). A concurrent work with SeaLLMs-Audio. Compared with MERaLiON, it supports more languages, including English, Chinese, Indonesian, Thai, and Vietnamese; it thus has a similar motivation to SeaLLMs-Audio (He et al., 2025). As with MERaLiON, we add the text instruction "Please follow the instruction in the speech." for tasks without text input.

All of these LALMs accept audio with text as input. To assess the quality of LLM-as-a-judge, we also conducted human evaluations; the judging criteria are shown in Table 4 in Appendix A. We engage the benchmark annotators to perform these evaluations, as they are familiar with the benchmark.

4.1 Main Results

Figure 5 shows the human evaluation results. Evaluators assessed both overall performance and language quality, the latter referring to the correctness of language usage in responses. Language quality was rated on a 1-5 scale, where 5 indicates entirely correct language devoid of code-switching. SeaLLMs-Audio achieves the best language quality for the three SEA languages.

[Figure 5: Performance of the models on SeaBench-Audio as assessed by human evaluators: (a) average scores for overall performance and (b) average scores for output language quality. Each response is rated on a 1-5 scale, with 5 indicating the highest quality. Human evaluations were performed blind, without disclosure of the generating model.]
The LLM-as-a-judge evaluation results are shown in Figure 6. From these results, SeaLLMs-Audio attains the strongest performance on id/th/vi, irrespective of whether evaluation is conducted by human annotators or by Gemini. Additional observations include: (1) MERaLiON-2 surpasses MERaLiON, which is expected given that MERaLiON-2 is the newer iteration; and (2) Qwen2.5-Omni outperforms Qwen2-Audio across the three languages, consistent with its more recent release. This alignment with prior expectations supports the validity of the LLM-as-a-judge framework.

[Figure 6: Average scores of the models on SeaBench-Audio as assessed by Gemini-2.5-Flash.]

4.2 Analysis

To further understand the capabilities of SeaLLMs-Audio and SeaBench-Audio, we conduct further analysis of the results.
How does SeaLLMs-Audio perform on each task? In addition to assessing average performance across languages, we further examine model outcomes by task. Table 1 presents average scores for each model across all evaluated tasks. Since the English subset excludes the S2TT task, averages for that setting are computed only over id, th, and vi. MERaLiON-2 consistently achieves the strongest results in audio comprehension tasks—specifically ASR, S2TT, and SER—which we attribute to its substantially larger and more diverse training corpus (He et al., 2025). Conversely, SeaLLMs-Audio attains state-of-the-art performance in selected categories, including fact, life, MED, and math. We ascribe SeaLLMs-Audio's advantages to the extensive scope and heterogeneity of its training data, encompassing both varied task types and multimodal input formats.

| Task | SeaLLMs-Audio (H) | Qwen2-Audio (H) | Qwen2.5-Omni (H) | MERaLiON (H) | MERaLiON-2 (H) | SeaLLMs-Audio (G) | Qwen2-Audio (G) | Qwen2.5-Omni (G) | MERaLiON (G) | MERaLiON-2 (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| AA | **2.4** | 2.1 | **2.4** | 2.2 | **2.4** | 2.3 | 2.0 | 2.3 | 2.0 | **2.4** |
| AQA | 3.2 | 2.3 | 2.8 | 3.2 | **3.4** | 3.5 | 2.6 | 2.8 | 3.3 | **3.7** |
| ASR | 3.9 | 2.2 | 3.3 | 2.9 | **4.2** | 3.7 | 1.7 | 2.8 | 2.8 | **4.4** |
| CS | 3.3 | 2.7 | 2.9 | 3.2 | **3.6** | 3.2 | 2.5 | 3.3 | 3.3 | **4.0** |
| MED | **3.6** | 1.8 | 3.3 | 1.2 | 1.6 | **3.5** | 1.5 | 2.8 | 1.0 | 1.6 |
| S2TT_EX | 2.9 | 2.1 | 2.5 | 3.6 | **3.7** | 3.5 | 2.4 | 2.9 | 3.8 | **3.9** |
| S2TT_XE | 2.6 | 1.2 | 3.2 | 2.3 | **3.6** | 2.3 | 1.2 | 2.6 | 2.1 | **3.3** |
| SER | 2.8 | 2.4 | 2.7 | 2.3 | **3.2** | 3.1 | 2.0 | 3.2 | 2.2 | **3.5** |
| SKI | 2.2 | 2.6 | 1.9 | **3.3** | 2.8 | 2.5 | 2.9 | 2.3 | **3.3** | 2.7 |
| SQA | 4.1 | 2.2 | 4.0 | 3.7 | **4.3** | 4.2 | 2.5 | 4.2 | 3.4 | **4.3** |
| SS | 3.2 | 2.3 | 3.3 | 3.2 | **4.1** | 3.4 | 1.7 | 3.3 | 3.4 | **4.4** |
| fact | **3.1** | 1.6 | 2.9 | 1.9 | 2.0 | **3.1** | 1.8 | 3.0 | 1.3 | 1.6 |
| life | **3.3** | 1.9 | 3.1 | 1.2 | 1.3 | **3.7** | 1.7 | 3.1 | 1.0 | 1.2 |
| math | **3.6** | 1.3 | 2.6 | 1.7 | 2.3 | **4.0** | 1.2 | 3.0 | 1.5 | 3.5 |
| safety | 2.0 | 2.2 | 2.1 | 1.6 | **2.5** | 2.1 | 1.6 | **2.7** | 1.6 | 2.4 |

Table 1: Average scores for each task across the three SEA languages, judged by humans (H) and by Gemini-2.5-Flash (G). The highest score for each task under each judge is in bold.

How is LLM-as-a-judge consistent with human judges? We observed that scores from human judgments and LLM-as-a-judge evaluations are not perfectly aligned. To evaluate their correlation, we calculate the Pearson correlation coefficient between them.
As shown in Figure 7, LLM-as-a-judge and human judges have an average correlation coefficient of 0.8, indicating high correlation between the scores assigned by humans and by the LLM judge.

| | SeaLLMs-Audio | Qwen2-Audio | Qwen2.5-Omni | MERaLiON | MERaLiON-2 | Avg |
|---|---|---|---|---|---|---|
| en | 0.77 | 0.87 | 0.76 | 0.92 | 0.88 | 0.84 |
| id | 0.84 | 0.78 | 0.60 | 0.89 | 0.84 | 0.79 |
| th | 0.80 | 0.71 | 0.69 | 0.85 | 0.75 | 0.76 |
| vi | 0.86 | 0.81 | 0.68 | 0.88 | 0.77 | 0.80 |
| Avg | 0.82 | 0.79 | 0.68 | 0.89 | 0.81 | 0.80 |

Figure 7: The Pearson correlation coefficients between human judgements and LLM judgements, by model and language.

We also calculate the agreement between human judges and the LLM judge when comparing the responses of two models; both computations are sketched below. As shown in Table 2, they have an average agreement of 69% with ties and 93% without ties, which is even higher than the result reported for MT-Bench (Zheng et al., 2023). Such high agreement between humans and the LLM judge shows the reliability of SeaBench-Audio.

| Setup | en | id | th | vi | Avg |
|---|---|---|---|---|---|
| w/ tie (R=33%) | 68% | 71% | 68% | 70% | 69% |
| w/o tie (R=50%) | 92% | 95% | 92% | 94% | 93% |

Table 2: Agreement between human judges and the LLM judge. We convert single-answer grading into pairwise comparison results to calculate agreement. "w/ tie" includes tie and non-tie scores; "w/o tie" includes only non-tie scores. "R=" indicates the agreement between two random judges.
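Both quantities are straightforward to compute from per-question scores. A sketch, assuming human[q][m] and llm[q][m] map question q and model m to a 1-5 score:

```python
from itertools import combinations
import numpy as np

def pearson(human_scores, llm_scores) -> float:
    """Pearson correlation between paired human and LLM score lists."""
    return float(np.corrcoef(human_scores, llm_scores)[0, 1])

def pairwise_agreement(human, llm, keep_ties=True) -> float:
    """Convert single-answer grades into pairwise verdicts: for every
    question and model pair, check whether the human and the LLM prefer
    the same model (ties count as verdicts only when keep_ties=True)."""
    hits = total = 0
    for q in human:
        for m1, m2 in combinations(human[q], 2):
            h = np.sign(human[q][m1] - human[q][m2])
            g = np.sign(llm[q][m1] - llm[q][m2])
            if not keep_ties and (h == 0 or g == 0):
                continue  # the "w/o tie" setup drops tie verdicts
            hits += int(h == g)
            total += 1
    return hits / total
```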
5 Conclusion

In this study, we introduce a large audio-language model specifically designed for Southeast Asian languages, named SeaLLMs-Audio. Trained on an extensive multilingual audio corpus, SeaLLMs-Audio exhibits robust audio understanding and generation capabilities across Indonesian, Thai, and Vietnamese. To systematically assess LALMs within this region, we construct the SeaBench-Audio benchmark, encompassing multiple-choice and open-ended questions spanning 14 distinct tasks. Experimental outcomes highlight the strong performance of SeaLLMs-Audio on the proposed benchmark. We anticipate that SeaLLMs-Audio and SeaBench-Audio will promote further research on LALMs for Southeast Asia and stimulate broader efforts toward supporting low-resource languages.

6 Limitation

Due to limitations in manpower and computational resources, we confined SeaLLMs-Audio and SeaBench-Audio to three selected Southeast Asian languages. Nevertheless, the proposed methodology can be easily extended to a broader range of languages. Although SeaLLMs-Audio demonstrates strong performance, instances of language mixing still occur, a behavior commonly observed in other LALMs. We anticipate that this issue can be mitigated through reinforcement learning, which we identify as a promising direction for future work.

Acknowledgments

We would like to express our special thanks to our professional native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi, and Tara Devina Putri, who helped build, evaluate, and fact-check our SeaBench-Audio dataset, as well as evaluating our models across different aspects. We sincerely appreciate the valuable suggestions from Hao Zhang (DAMO Academy, Alibaba Group) on improving SeaLLMs-Audio.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. Common Voice: A Massively-Multilingual Speech Corpus. arXiv preprint. ArXiv:1912.06670 [cs].
Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio. In Interspeech 2021, pages 3670–3674. ArXiv:2106.06909 [cs].

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. 2024. Qwen2-Audio Technical Report. arXiv preprint. ArXiv:2407.10759 [cs, eess].

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. arXiv preprint. ArXiv:2311.07919 [eess].

Gheorghe Comanici, Eric Bieber, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint. ArXiv:2507.06261 [cs].

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open Language Models for South-East Asia. arXiv preprint. ArXiv:2404.03608 [cs].

Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin. 2025. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs. arXiv preprint. ArXiv:2502.12982 [cs].
Yingxu He, Zhuohan Liu, Geyu Lin, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, and AiTi Aw. 2025. MERaLiON-AudioLLM: Advancing Speech and Language Understanding for Singapore. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 22–30, Vienna, Austria. Association for Computational Linguistics.

Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, and Ai Ti Aw. 2024. MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models. arXiv preprint. ArXiv:2412.09818 [cs].

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, and Diyi Yang. 2024. Distilling an End-to-End Voice Assistant Without Instruction Training Data. arXiv preprint. ArXiv:2410.02678 [cs].

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. AudioCaps: Generating Captions for Audios in The Wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, Minneapolis, Minnesota. Association for Computational Linguistics.

Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, and Shinji Watanabe. 2024. YODAS: Youtube-Oriented Dataset for Audio and Speech. arXiv preprint. ArXiv:2406.00899 [cs].
Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, and Lidong Bing. 2025. SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6119–6136, Albuquerque, New Mexico. Association for Computational Linguistics.

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2024. SeaLLMs - Large Language Models for Southeast Asia. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 294–304, Bangkok, Thailand. Association for Computational Linguistics.

Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, and Kasima Tharnpipitchai. 2024. Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models. arXiv preprint. ArXiv:2412.13702 [cs].
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2025. Qwen2.5 Technical Report. arXiv preprint. ArXiv:2412.15115 [cs].

Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, AiTi Aw, and Nancy Chen. 2024. SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 370–390, Mexico City, Mexico. Association for Computational Linguistics.

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F. Chen. 2025. AudioBench: A Universal Benchmark for Audio Large Language Models. arXiv preprint. ArXiv:2406.16020 [cs].

Zhifei Xie and Changqiao Wu. 2024. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming. arXiv preprint. ArXiv:2408.16725 [cs, eess].

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv preprint. ArXiv:2503.20215 [cs].
Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, and Xie Chen. 2025. GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement. arXiv preprint. ArXiv:2406.11546 [eess].

Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. 2025. SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), pages 96–105, Albuquerque, New Mexico. Association for Computational Linguistics.

Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, and Wenxuan Zhang. 2025. Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers. arXiv preprint. ArXiv:2503.00865 [cs].

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint. ArXiv:2306.05685 [cs].

A Appendix

The SeaBench-Audio tasks are described below; each entry notes whether its units include a text instruction alongside the audio.

ASR (text instruction: Yes). Automatic Speech Recognition. For this task, each input consists of an audio file and a text instruction. Some clips feature regional accents, and every language in this category includes at least one example of informal speech. The set also contains multi-sentence audio clips. Each text instruction is uniquely phrased.
SS (text instruction: Yes). Speech Summarization. Each unit includes a text instruction to convert an audio file into a shorter text. The instruction may limit output length by word count, number of sentences, or by specifying a short format (e.g., headline or title).

SER (text instruction: Yes). Speaker Emotion and Sentiment Recognition. This category includes 5 emotion detection units and 5 sentiment detection units. Each unit's text instruction provides options from which the model must choose the correct answer. Every audio clip contains speech whose lexical content does not reflect the speaker's emotion or sentiment. This setup compels the model to rely on paralinguistic features to determine the correct answer.

S2TT_EX, S2TT_XE (text instruction: Yes). Speech-to-Text Translation. There are two speech translation categories in this task: English-to-local (EX) and local-to-English (XE). This category includes units from specialized domains (e.g., courtroom, military, medical, addressing royalty) across languages. Translation challenges—such as idioms, informal varieties, appropriate first- and second-person pronouns, and polysemy—are represented in these units.

SQA (text instruction: Yes). Speech Question Answering. This task evaluates a model's ability to extract or infer specific information from an audio file. Each unit's text instruction is crafted to elicit a precise answer for reliable evaluation. The targeted information may extend beyond named entities present in the audio.

AC (text instruction: Yes). Audio Captioning. Each audio unit includes non-speech sounds from both animate and inanimate sources, as well as natural or man-made scenes. The accompanying text prompts are crafted to elicit detailed descriptions of the entire audio content.
AQA (text instruction: Yes). Audio Question Answering. The audio input for AQA consists of non-speech snippets, including event sounds or sequences of related sounds from human activities or nature. The text instructions test a model's abilities in object/event detection, sound tracking (e.g., duration), segmentation, and sequential analysis. Some prompts include multiple-choice options (e.g., "... takes place outdoors or indoors?").

SKI (text instruction: Yes). Speaker Identifiers. This category focuses on non-linguistic aspects of audio (e.g., number of speakers, turn-taking, speaking rate) and speaker-related tasks such as gender, relative age, and accent or regional variety identification.

Life (text instruction: No). Life. The audio files in this set come from text questions posted on popular forums and social media in the respective SEA-language countries. Questions are carefully chosen to cover diverse topics, suit model prompting, and remain manageable for evaluation. Each question is recorded with a unique voice. There are no separate text instructions, as the audio explicitly contains the questions.

CS (text instruction: Yes). Customer Service. This set includes 10 customer care scenarios. Each audio clip features either a single customer or a dialogue between a customer and a CS officer. The scripts simulate real calls—for example, clarifying product or price information, checking delivery status, requesting refunds, or reporting faulty products. The instructions include 6 units with answer options and 4 open-ended units. These units are designed to assess fine-grained CS knowledge and were curated in consultation with a customer care professional.
MED (text instruction: No). Medical Patient Question. Each unit contains an audio recording of a patient describing their condition and requesting medical advice. These recordings are human-voiced readings of questions sourced mainly from publicly available hospital websites, with some from open-source Q&As. This category evaluates a model's ability to respond within a specific medical domain. References are provided by doctors from the respective hospitals or are human-verified.

Safety (text instruction: Yes). Safety. Containing an audio file and a text instruction, each unit in this category is designed to provoke an unsafe response from models. The set includes both country-specific safety violations and universally unsafe topics. Each unit is curated so that a safe, desired response requires analyzing the audio content—the text instruction alone does not reveal anything unsafe. This ensures that any model rejection stems from understanding the speech/audio elements.

Math (text instruction: No). Math. There are no written instructions in this set; each audio clip contains a complete math question. The questions span various math topics for grades 7–12.

Fact (text instruction: No). Fact. Audio clips in this set pose explicit factual questions across diverse topics. No text instructions accompany the audio files. Subjects include history, economics, medicine, technology, and more.

Table 3: Verbalizers for the evaluation datasets. "Text instruction: Yes/No" indicates whether each unit pairs the audio with a written instruction.
Please act as an impartial judge and evaluate the quality of the text response provided by an AI assistant to the user question. The user question may be in text or in audio form. An audio .wav file for the question is also given, which the AI assistant must analyze in order to give a correct response. Begin your evaluation process by first analyzing the content of the corresponding audio file and comparing the assistant's answer against the reference answer. The audio content is a key component of the evaluation and please use it to identify any inaccuracies, contextual misunderstandings, and language choice issues in the assistant's response. An Evaluation Scoring rubric is provided alongside each assistant's answer and must be strictly adhered to, with the ratings assigned on a sequential, first-match basis. Be as objective as possible and your explanation should not exceed two paragraphs. After providing your explanation in English, you must rate the response on a scale of 1 to 5 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{question}
[Start of Reference Answer]
{reference}
[End of Reference Answer]
[Start of Assistant's Answer]
{answer}
[End of Assistant's Answer]
[Start of Evaluation Scoring Guide]
{rule}
[End of Evaluation Scoring Guide]
Figure 8: The prompt template for LLM-as-a-judge.
| Score | Criteria |
|---|---|
| 1 | The response is largely inaccurate, irrelevant, or incomplete, with poor language quality. It does not effectively address the question or may be incoherent. |
| 2 | The response contains significant inaccuracies or is missing key details. It may be unclear or poorly structured, and the language quality could be improved. |
| 3 | The response is generally accurate but may contain noticeable errors or omissions. It addresses the question with moderate clarity and completeness, but could be better structured or more detailed. |
| 4 | The response is mostly accurate and relevant, with a few minor errors or omissions. It is clear and well-structured but could benefit from slight improvements in detail or language quality. |
| 5 | The response is accurate, relevant, coherent, and complete, with excellent language quality. It answers the question thoroughly, clearly, and correctly, with no significant errors. |

Table 4: General scoring criteria for assessing the overall quality of responses by human judges. For each score, we provide guidelines to promote consistency and reliability in human evaluations.