SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia
Summary
SeaLLMs-Audio is the first large audio-language model (LALM) tailored for multiple Southeast Asian languages—Indonesian, Thai, and Vietnamese—alongside English and Chinese. Trained on a large-scale multilingual audio corpus, it supports diverse audio-centric tasks such as speech recognition, translation, captioning, and question answering. The model is multimodal, accepting audio-only, text-only, or combined audio-text inputs, and multilingual, covering five languages. To evaluate its performance, the authors introduce SeaBench-Audio, a benchmark of 14 tasks spanning audio-plus-text and audio-only inputs. The training data, 1.58 million conversations across multiple task types, was curated through a comprehensive pipeline that aggregated and processed public and private datasets, including GigaSpeech, Common Voice, and AudioCaps. The model architecture combines the audio encoder of Qwen2-Audio-7B with the multilingual LLM Qwen2.5-7B-Instruct. Evaluations by both human judges and an LLM-as-a-judge framework show that SeaLLMs-Audio achieves competitive performance compared with other LALMs, particularly in language quality for Southeast Asian languages.
Chaoqun Liu, Mahani Aljunied, Guizhen Chen, Hou Pong Chan, Weiwen Xu, Yu Rong, Wenxuan Zhang*
DAMO Academy, Alibaba Group
wxzhang@sutd.edu.sg
https://damo-nlp-sg.github.io/SeaLLMs-Audio/

Abstract

We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages—Indonesian (id), Thai (th), and Vietnamese (vi)—alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports five languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, and audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

* Wenxuan Zhang is the corresponding author. SeaLLMs-Audio is publicly available at https://github.com/DAMO-NLP-SG/SeaLLMs-Audio.
1 Introduction

Large audio-language models (LALMs) (Chu et al., 2023, 2024; He et al., 2024; Pipatanakul et al., 2024; Held et al., 2024) have shown impressive capabilities in understanding the rich information contained in audio signals. However, most existing LALMs support only one or two languages, most typically English, leaving multilingual and low-resource regions under-represented.

In Southeast Asia (SEA), significant progress has been made in developing multilingual large language models (LLMs), such as the SeaLLMs (Nguyen et al., 2024; Zhang et al., 2025; Zhao et al., 2025), Sailor (Dou et al., 2024, 2025), and SEA-LION (https://huggingface.co/aisingapore/collections) series. Despite their multilingual reach, these models operate solely in the textual modality and lack the ability to process audio inputs—an essential component of natural human communication.

Beyond the absence of LALMs tailored for Southeast Asian languages, progress is further constrained by the lack of comprehensive and rigorous evaluation frameworks. Existing benchmarks, such as SeaEval (Wang et al., 2024), SeaExam, and SeaBench (Liu et al., 2025), focus primarily on textual evaluation within SEA contexts. Meanwhile, audio-related benchmarks remain limited to specific tasks like automatic speech recognition (ASR) (Wang et al., 2025), without providing a holistic assessment of audio understanding and voice-based interaction. This lack of broad, multimodal benchmarks continues to impede the advancement of audio-language modeling in SEA languages.
To bridge the above gaps, we introduce SeaLLMs-Audio (Southeast Asian Large Language Models with audio capabilities), a large audio-language model designed specifically for Southeast Asia. SeaLLMs-Audio is trained using data from a comprehensive curation pipeline that aggregates, organizes, and synthesizes multimodal resources across SEA languages, as illustrated in Figure 1. The curated dataset spans diverse tasks, including automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), speech summarization (SS), audio question answering (AQA), and multimodal reasoning.

[Figure 1: Illustration of the data curation process for SeaLLMs-Audio, with worked examples: LLM-based punctuation and spacing restoration for Thai ASR transcripts; LLM translation of transcripts to build S2TT pairs ("Translate the audio into English."); translation of English audio captions into Indonesian for AC; text-to-speech conversion of text QA pairs into spoken questions; MLLM-based summarization of trimmed long audio for SS; and MLLM-generated question-answer pairs for SQA.]
Furthermore, to facilitate standardized evaluation, we present SeaBench-Audio, a manually curated benchmark for assessing LALMs in Southeast Asian languages. SeaBench-Audio encompasses multiple open-ended task categories that reflect real-world, multimodal language understanding scenarios. To facilitate consistent and scalable evaluation, we adopt an LLM-as-a-judge framework with task-specific prompt templates, achieving high agreement with human annotations. Experimental results on SeaBench-Audio demonstrate that SeaLLMs-Audio delivers robust and competitive performance across a wide range of audio-language tasks.

Our key contributions are as follows.

• We present SeaLLMs-Audio, a large-scale audio-language model specifically designed for Southeast Asian contexts.
• We develop SeaBench-Audio, a comprehensive benchmark dedicated to evaluating LALMs within the SEA region.
• Our experimental analyses indicate that SeaLLMs-Audio achieves strong performance on the SeaBench-Audio benchmark.

2 SeaLLMs-Audio

In this section, we describe the training data curation pipeline, followed by the model architecture.

2.1 Comprehensive Data Curation Pipeline

The training dataset for SeaLLMs-Audio contains 1.58M conversations for multiple tasks, including 7% multi-turn dialogues that better reflect real-world interactive scenarios. The tasks can be roughly classified into the following categories: automatic speech recognition (ASR), audio captioning (AC), speech-to-text translation (S2TT), question answering (QA), speech summarization (SS), audio question answering (AQA), chat, math, factoid QA (fact), and other tasks (mixed).

The training dataset was curated from multiple data sources, including public datasets and private data. Public datasets include GigaSpeech (Chen et al., 2021), GigaSpeech2 (Yang et al., 2025), Common Voice (Ardila et al., 2020), AudioCaps (Kim et al., 2019), VoiceAssistant-400K (Xie and Wu, 2024), YODAS2 (Li et al., 2024), and the Multitask National Speech Corpus (He et al., 2024). As these datasets span multiple sources with disparate formats (e.g., different audio encodings, annotation schemas, and text structures), they cannot be directly used for end-to-end training. We therefore perform comprehensive preprocessing to unify the data.
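For concreteness, the sketch below shows one plausible unified record after preprocessing: a conversation whose user turn pairs an audio clip with a text instruction. The field names and path are illustrative assumptions, since the paper does not publish its exact storage schema.

```python
# Hypothetical unified training record after preprocessing (the paper does
# not publish its exact schema; field names and the path are illustrative).
example_record = {
    "task": "ASR",
    "lang": "th",
    "conversations": [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_path": "clips/th_000001.wav"},
                {"type": "text", "text": "Transcribe the audio."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "<normalized transcript>"}],
        },
    ],
}
```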
Figure 1 shows the overall data curation pipeline with some examples. The following paragraphs describe the construction process for each task.

ASR. For ASR datasets such as GigaSpeech, we normalize transcripts to improve readability. For example, we transform "AND LOOK AT THE PERCENTAGE OF REPORTS <PERIOD>" into "And look at the percentage of reports." For GigaSpeech2, which includes Thai, Indonesian, and Vietnamese but whose text contains no punctuation, we employ a selected LLM to restore punctuation and spacing, producing more reader-friendly text for each language. As LLMs may introduce errors, we discard samples whose outputs are inconsistent with the original transcripts.
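A minimal sketch of this restore-then-verify filter, assuming an arbitrary caller-supplied llm_call function (the paper does not disclose which LLM it selected):

```python
import re

def strip_punct(s: str) -> str:
    """Remove whitespace and punctuation so only word characters remain."""
    return re.sub(r"[\W_]+", "", s)

def restore_punctuation(transcript: str, lang: str, llm_call) -> str | None:
    """Ask an LLM to add punctuation/spacing, then keep the sample only if
    the restored text still matches the original transcript once
    punctuation and whitespace are stripped; otherwise return None."""
    prompt = (
        f"Add punctuation and spacing to this {lang} ASR transcript. "
        f"Do not add, remove, or change any words.\n\n{transcript}"
    )
    restored = llm_call(prompt)  # any chat LLM; the paper's choice is unnamed
    return restored if strip_punct(restored) == strip_punct(transcript) else None
```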
S2TT. Given the absence of open-source S2TT datasets for SEA languages, we construct such data by leveraging ASR corpora in different languages. More specifically, since each unit of ASR data comprises speech audio plus its same-language text transcription, we translate the transcription into multiple target languages. This yields pairs of speech audio in one language and translated text in another.

AC. AudioCaps provides captions exclusively in English. To accommodate Southeast Asian languages, we translate these captions into the respective target languages.
QA. This set is curated to obtain audio questions with text answers. To do this, we make use of existing question-answer pairs in text format. The answers are kept unchanged in text form, while the text questions are converted into audio with text-to-speech (TTS) models. No translation is involved. After manually assessing the quality of TTS outputs from several models, we selected Google Text-to-Speech (https://cloud.google.com/text-to-speech).
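As an illustration, the snippet below converts one text question into a WAV file with the Google Cloud Text-to-Speech client library; leaving the voice at the service default is an assumption on our part, since the paper does not list the voices it used.

```python
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

def question_to_audio(question: str, language_code: str, out_path: str) -> None:
    """Synthesize a text question into 16-bit WAV speech with Google TTS
    (requires Google Cloud credentials to be configured)."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=question),
        voice=texttospeech.VoiceSelectionParams(language_code=language_code),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# Example: question_to_audio("Làm thế nào để nấu một bữa ăn nhanh?", "vi-VN", "q.wav")
```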
SS. To curate the SS dataset, we sample a piece of speech audio from the YODAS2 dataset and ask Gemini-2.0-Flash to summarize it in a specified language.

AQA. To create natural questions and answers about a piece of audio, we first sample a clip from the YODAS2 dataset, which contains audio from YouTube videos. We then prompt Gemini-2.0-Flash to generate a question about the audio and provide the corresponding answer, as sketched below.
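A sketch of this generation step, assuming the google-generativeai SDK; the prompt wording is our own, as the paper's exact prompts are not published.

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

def make_qa_pair(audio_path: str, lang: str) -> str:
    """Upload a sampled clip and ask Gemini for a question-answer pair
    about its content (illustrative prompt, not the paper's)."""
    audio_file = genai.upload_file(audio_path)
    prompt = (
        f"Listen to the audio and write one natural question about its "
        f"content in {lang}, followed by the correct answer."
    )
    return model.generate_content([audio_file, prompt]).text
```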
chat. To create voice chat data, we make use of existing text conversation data and convert the user input into audio with Google TTS. As Google TTS offers only a few voice types per language, we also synthesize part of the data with gpt-4o-mini-tts to improve voice diversity.

math and fact. For math and factoid QA instruction data, we likewise convert the prompts into speech with Google TTS.

The distribution of the training data by language and by task type is shown in Figure 2. By language: zh 20.5%, th 18.4%, id 18.1%, vi 17.9%, en 17.5%, sg 7.6%. By task type: chat 42.1%, ASR 22.6%, AC 14.3%, SQA 6.2%, SS 3.2%, fact 3.2%, mixed 2.8%, QA 2.7%, math 1.8%, S2TT 1.1%.

[Figure 2: Training data distribution across (a) languages and (b) task types.]
2.2 Model Architecture

SeaLLMs-Audio builds upon Qwen2-Audio-7B (Chu et al., 2024) and Qwen2.5-7B-Instruct (Qwen et al., 2025); the architecture is shown in Figure 3. We replace the LLM module in Qwen2-Audio-7B with Qwen2.5-7B-Instruct, thereby harnessing the advantages of both models: the Qwen2-Audio-7B audio encoder encodes features of both speech and non-speech audio effectively, while Qwen2.5-7B-Instruct contributes strong multilingual capabilities. Due to the hidden embedding mismatch, the audio adapter is newly initialized.

[Figure 3: Architecture of SeaLLMs-Audio: audio passes through the audio encoder and an adapter, text through the embedding layer, and both feed the LLM, which is trained with next-token prediction.]
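A conceptual sketch of this model surgery with Hugging Face Transformers. The attribute names (language_model, multi_modal_projector) follow our reading of the public Qwen2-Audio implementation, and the single linear adapter layer is our assumption about what "newly initialized" entails.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, Qwen2AudioForConditionalGeneration

# Load the donor audio model and the replacement LLM.
base = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Swap in the new LLM while keeping the pretrained audio encoder.
base.language_model = llm

# The adapter projects audio-encoder states into the LLM embedding space;
# with mismatched hidden embeddings it cannot be reused, so re-initialize it.
audio_dim = base.config.audio_config.d_model
text_dim = llm.config.hidden_size
base.multi_modal_projector.linear = nn.Linear(audio_dim, text_dim)
```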
After that, we perform full-parameter fine-tuning on our newly curated large-scale, multi-task audio dataset. Given paired data $(a, x)$, with $a$ denoting the audio sequence and $x$ the (optional) corresponding text sequence, the training objective is to maximize the likelihood of the next text token, formulated as

$$P_\theta(x_t \mid x_{<t}, a), \tag{1}$$

conditioned on the audio representations and the preceding text tokens $x_{<t}$, where $\theta$ denotes the trainable parameters of the LALM. We train the model on the dataset for one epoch, which took 6 days on 32 A800 GPUs.
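Equation (1) is the standard next-token objective; in practice it is minimized as a cross-entropy loss computed over text positions only. A minimal sketch, assuming the usual convention of labeling audio and prompt positions with -100 so they are ignored:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of Eq. (1): predict token x_t from x_<t and the
    audio; positions labeled -100 (audio/prompt tokens) do not contribute."""
    shift_logits = logits[:, :-1, :]  # prediction for position t+1
    shift_labels = labels[:, 1:]      # gold token at position t+1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```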
3 SeaBench-Audio

Due to the absence of standard audio benchmarks for evaluating audio LLMs in SEA languages, we manually create a benchmark called SeaBench-Audio. It comprises 14 tasks: 1) tasks with both audio and text inputs: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Summarization (SS), Speech Question Answering (SQA), Customer Service (CS), Safety, Audio Captioning (AC), Audio Question Answering (AQA), Speaker Identifiers (SKI), and Speech Emotion Recognition (SER); 2) tasks with only audio inputs: Life, Medical (MED), Math, and Fact. The task descriptions and annotation criteria are summarized in Table 3 in Appendix A. An overview of the tasks is provided in Figure 4(a).

For each language, we engage a professional native linguist to annotate 10 questions per task. One exception is the S2TT task, which requires translation between two languages: for id/th/vi, we construct two versions, native audio to English text (S2TT_XE) and English audio to native text (S2TT_EX), each with 10 questions; for English, we omit this task to prevent redundancy. Consequently, there are 150 questions for each SEA language and 130 for English, yielding 580 questions in total. For every question, a linguist supplies a reference answer to facilitate scoring. The benchmark underwent multiple rounds of careful review to ensure quality.

[Figure 4: (a) Overview of the tasks in SeaBench-Audio, grouped into audio-plus-text and audio-only inputs. (b) Evaluation pipeline with the LLM-as-a-judge framework: the LALM's response, the reference, the rubrics, and a prompt template are assembled into an evaluation prompt for Gemini, which returns an analysis and a score.]
For evaluation, qualified native speakers rated each response on a scale of 1 to 5, with 5 representing the highest quality. However, human evaluations are expensive and time-consuming. To facilitate automatic evaluation, we therefore employ an LLM-as-a-judge framework. We choose Gemini-2.5-Flash (Gemini) (Comanici et al., 2025) for its audio understanding capabilities and its good balance of cost and performance. As shown in Figure 4(b), the procedure to evaluate an LALM is: 1) generate responses for each instance with the LALM; 2) construct an evaluation prompt from the text instruction (optional), reference answer, response, rubrics, and the template; 3) prompt Gemini with the audio and the evaluation prompt; 4) extract the score from the final response. The prompt template for LLM-as-a-judge is shown in Figure 8 in Appendix A. For each task, we additionally engage linguists to develop a task-specific evaluation rubric on a scale of 1 to 5. We hypothesize that tasks exhibit distinct characteristics, and a dedicated rubric more accurately captures the nuances of each task.
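A minimal sketch of steps 2-4, again assuming the google-generativeai SDK. Here template is the Figure 8 prompt with {question}, {reference}, {answer}, and {rule} placeholders, and the regex mirrors the required "Rating: [[n]]" output format:

```python
import re
import google.generativeai as genai

judge = genai.GenerativeModel("gemini-2.5-flash")

def judge_response(audio_path, question, reference, answer, rubric, template):
    """Build the evaluation prompt, send it to Gemini with the audio,
    and extract the 1-5 score from the 'Rating: [[n]]' pattern."""
    prompt = template.format(
        question=question, reference=reference, answer=answer, rule=rubric
    )
    audio = genai.upload_file(audio_path)
    reply = judge.generate_content([audio, prompt]).text
    match = re.search(r"\[\[([1-5])\]\]", reply)
    return int(match.group(1)) if match else None
```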
4 Experiments

We compare the performance of SeaLLMs-Audio with relevant LALMs of similar size:

Qwen2-Audio-7B-Instruct (Qwen2-Audio). The latest version of the Qwen-Audio series (Chu et al., 2024), which mainly focuses on English and Chinese. It shares the same base audio encoder as SeaLLMs-Audio.

Qwen2.5-Omni-7B (Qwen2.5-Omni). A multimodal model that perceives diverse modalities, including text, audio, image, and video (Xu et al., 2025). It also adopts the audio encoder from Qwen2-Audio. In this work, we only compare its performance with audio input.
MERaLiON-AudioLLM-Whisper-SEA-LION (MERaLiON). This model was trained to understand Singlish, using 62 million multimodal instruction samples (He et al., 2025). Since it was trained with both audio and text instructions, we add the text instruction "Please follow the instruction in the speech." for tasks with no text input.

MERaLiON-2-10B (MERaLiON-2). A concurrent work with SeaLLMs-Audio. Compared with MERaLiON, it supports more languages, including English, Chinese, Indonesian, Thai, and Vietnamese; it thus has a similar motivation to SeaLLMs-Audio (He et al., 2025). As with MERaLiON, we add the text instruction "Please follow the instruction in the speech." for tasks without text input.

All of these LALMs accept audio with text as input. To assess the quality of LLM-as-a-judge, we also conducted human evaluations; the judging criteria are shown in Table 4 in Appendix A. We engage the benchmark annotators to perform these evaluations, as they are familiar with the benchmark.

4.1 Main Results

Figure 5 shows the human evaluation results. Evaluators assessed both overall performance and language quality, the latter referring to the correctness of language usage in responses. Language quality was rated on a 1-5 scale, where 5 indicates entirely correct language devoid of code-switching. SeaLLMs-Audio achieves the best language quality for the three SEA languages.

[Figure 5: Performance of the models on SeaBench-Audio as assessed by human evaluators: (a) average scores for overall performance and (b) average scores for output language quality. Each response is rated on a 1-5 scale, with 5 indicating the highest quality. Human evaluations were performed blind, without disclosure of the generating model.]
The LLM-as-a-judge evaluation results are shown in Figure 6. From these results, SeaLLMs-Audio attains the strongest performance on id/th/vi, irrespective of whether evaluation is conducted by human annotators or by Gemini. Additional observations include: (1) MERaLiON-2 surpasses MERaLiON, which is expected given that MERaLiON-2 is the newer iteration; and (2) Qwen2.5-Omni outperforms Qwen2-Audio across the three languages, consistent with its more recent release. This alignment with prior expectations supports the validity of the LLM-as-a-judge framework.

[Figure 6: Average scores of the models on SeaBench-Audio as assessed by Gemini-2.5-Flash.]

4.2 Analysis

To further understand the capabilities of SeaLLMs-Audio and SeaBench-Audio, we conduct further analysis of the results.
How does SeaLLMs-Audio perform on each task? In addition to assessing average performance across languages, we further examine model outcomes by task. Table 1 presents average scores for each model across all evaluated tasks. Since the English subset excludes the S2TT task, averages for that setting are computed only over id, th, and vi. MERaLiON-2 consistently achieves the strongest results in audio comprehension tasks—specifically ASR, S2TT, and SER—which we attribute to its substantially larger and more diverse training corpus (He et al., 2025). Conversely, SeaLLMs-Audio attains state-of-the-art performance in selected categories, including fact, life, MED, and math. We ascribe SeaLLMs-Audio's advantages to the extensive scope and heterogeneity of its training data, encompassing both varied task types and multimodal input formats.

| Task | SeaLLMs-Audio (H) | Qwen2-Audio (H) | Qwen2.5-Omni (H) | MERaLiON (H) | MERaLiON-2 (H) | SeaLLMs-Audio (G) | Qwen2-Audio (G) | Qwen2.5-Omni (G) | MERaLiON (G) | MERaLiON-2 (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| AA | **2.4** | 2.1 | **2.4** | 2.2 | **2.4** | 2.3 | 2.0 | 2.3 | 2.0 | **2.4** |
| AQA | 3.2 | 2.3 | 2.8 | 3.2 | **3.4** | 3.5 | 2.6 | 2.8 | 3.3 | **3.7** |
| ASR | 3.9 | 2.2 | 3.3 | 2.9 | **4.2** | 3.7 | 1.7 | 2.8 | 2.8 | **4.4** |
| CS | 3.3 | 2.7 | 2.9 | 3.2 | **3.6** | 3.2 | 2.5 | 3.3 | 3.3 | **4.0** |
| MED | **3.6** | 1.8 | 3.3 | 1.2 | 1.6 | **3.5** | 1.5 | 2.8 | 1.0 | 1.6 |
| S2TT_EX | 2.9 | 2.1 | 2.5 | 3.6 | **3.7** | 3.5 | 2.4 | 2.9 | 3.8 | **3.9** |
| S2TT_XE | 2.6 | 1.2 | 3.2 | 2.3 | **3.6** | 2.3 | 1.2 | 2.6 | 2.1 | **3.3** |
| SER | 2.8 | 2.4 | 2.7 | 2.3 | **3.2** | 3.1 | 2.0 | 3.2 | 2.2 | **3.5** |
| SKI | 2.2 | 2.6 | 1.9 | **3.3** | 2.8 | 2.5 | 2.9 | 2.3 | **3.3** | 2.7 |
| SQA | 4.1 | 2.2 | 4.0 | 3.7 | **4.3** | 4.2 | 2.5 | 4.2 | 3.4 | **4.3** |
| SS | 3.2 | 2.3 | 3.3 | 3.2 | **4.1** | 3.4 | 1.7 | 3.3 | 3.4 | **4.4** |
| fact | **3.1** | 1.6 | 2.9 | 1.9 | 2.0 | **3.1** | 1.8 | 3.0 | 1.3 | 1.6 |
| life | **3.3** | 1.9 | 3.1 | 1.2 | 1.3 | **3.7** | 1.7 | 3.1 | 1.0 | 1.2 |
| math | **3.6** | 1.3 | 2.6 | 1.7 | 2.3 | **4.0** | 1.2 | 3.0 | 1.5 | 3.5 |
| safety | 2.0 | 2.2 | 2.1 | 1.6 | **2.5** | 2.1 | 1.6 | **2.7** | 1.6 | 2.4 |

Table 1: Average scores for each task across the three SEA languages, judged by humans (H) and by Gemini-2.5-Flash (G). The highest score for each task under each judge is in bold.

How is LLM-as-a-judge consistent with human judges? We observed that scores from human judgments and LLM-as-a-judge evaluations are not perfectly aligned. To evaluate their correlation, we calculate the Pearson correlation coefficient between them.
As shown in Figure 7, LLM-as-a-judge and human judges have an average correlation coefficient of 0.8, indicating high correlation between the scores assigned by humans and by the LLM judge.

| | SeaLLMs-Audio | Qwen2-Audio | Qwen2.5-Omni | MERaLiON | MERaLiON-2 | Avg |
|---|---|---|---|---|---|---|
| en | 0.77 | 0.87 | 0.76 | 0.92 | 0.88 | 0.84 |
| id | 0.84 | 0.78 | 0.60 | 0.89 | 0.84 | 0.79 |
| th | 0.80 | 0.71 | 0.69 | 0.85 | 0.75 | 0.76 |
| vi | 0.86 | 0.81 | 0.68 | 0.88 | 0.77 | 0.80 |
| Avg | 0.82 | 0.79 | 0.68 | 0.89 | 0.81 | 0.80 |

Figure 7: The Pearson correlation coefficients between human judgements and LLM judgements, by model and language.

We also calculate the agreement between human judges and the LLM judge when comparing the responses of two models; both computations are sketched below. As shown in Table 2, they have an average agreement of 69% with ties and 93% without ties, which is even higher than the result reported for MT-Bench (Zheng et al., 2023). Such high agreement between humans and the LLM judge shows the reliability of SeaBench-Audio.

| Setup | en | id | th | vi | Avg |
|---|---|---|---|---|---|
| w/ tie (R=33%) | 68% | 71% | 68% | 70% | 69% |
| w/o tie (R=50%) | 92% | 95% | 92% | 94% | 93% |

Table 2: Agreement between human judges and the LLM judge. We convert single-answer grading into pairwise comparison results to calculate agreement. "w/ tie" includes tie and non-tie scores; "w/o tie" includes only non-tie scores. "R=" indicates the agreement between two random judges.
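Both quantities are straightforward to compute from per-question scores. A sketch, assuming human[q][m] and llm[q][m] map question q and model m to a 1-5 score:

```python
from itertools import combinations
import numpy as np

def pearson(human_scores, llm_scores) -> float:
    """Pearson correlation between paired human and LLM score lists."""
    return float(np.corrcoef(human_scores, llm_scores)[0, 1])

def pairwise_agreement(human, llm, keep_ties=True) -> float:
    """Convert single-answer grades into pairwise verdicts: for every
    question and model pair, check whether the human and the LLM prefer
    the same model (ties count as verdicts only when keep_ties=True)."""
    hits = total = 0
    for q in human:
        for m1, m2 in combinations(human[q], 2):
            h = np.sign(human[q][m1] - human[q][m2])
            g = np.sign(llm[q][m1] - llm[q][m2])
            if not keep_ties and (h == 0 or g == 0):
                continue  # the "w/o tie" setup drops tie verdicts
            hits += int(h == g)
            total += 1
    return hits / total
```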
5 Conclusion

In this study, we introduce a large audio-language model specifically designed for Southeast Asian languages, named SeaLLMs-Audio. Trained on an extensive multilingual audio corpus, SeaLLMs-Audio exhibits robust audio understanding and generation capabilities across Indonesian, Thai, and Vietnamese. To systematically assess LALMs within this region, we construct the SeaBench-Audio benchmark, encompassing multiple-choice and open-ended questions spanning 14 distinct tasks. Experimental outcomes highlight the strong performance of SeaLLMs-Audio on the proposed benchmark. We anticipate that SeaLLMs-Audio and SeaBench-Audio will promote further research on LALMs for Southeast Asia and stimulate broader efforts toward supporting low-resource languages.

6 Limitation

Due to limitations in manpower and computational resources, we confined SeaLLMs-Audio and SeaBench-Audio to three selected Southeast Asian languages. Nevertheless, the proposed methodology can be easily extended to a broader range of languages. Although SeaLLMs-Audio demonstrates strong performance, instances of language mixing still occur, a behavior commonly observed in other LALMs. We anticipate that this issue can be mitigated through reinforcement learning, which we identify as a promising direction for future work.

Acknowledgments

We would like to express our special thanks to our professional native linguists, Tantong Champaiboon, Nguyen Ngoc Yen Nhi, and Tara Devina Putri, who helped build, evaluate, and fact-check our SeaBench-Audio dataset, as well as evaluating our models across different aspects. We sincerely appreciate the valuable suggestions from Hao Zhang (DAMO Academy, Alibaba Group) on improving SeaLLMs-Audio.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. Common Voice: A Massively-Multilingual Speech Corpus. arXiv preprint. ArXiv:1912.06670 [cs].
Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, and Zhiyong Yan. 2021. GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio. In Interspeech 2021, pages 3670–3674. ArXiv:2106.06909 [cs].

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. 2024. Qwen2-Audio Technical Report. arXiv preprint. ArXiv:2407.10759 [cs, eess].

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. arXiv preprint. ArXiv:2311.07919 [eess].

Gheorghe Comanici, Eric Bieber, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint. ArXiv:2507.06261 [cs].

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open Language Models for South-East Asia. arXiv preprint. ArXiv:2404.03608 [cs].

Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin. 2025. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs. arXiv preprint. ArXiv:2502.12982 [cs].
Yingxu He, Zhuohan Liu, Geyu Lin, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, and AiTi Aw. 2025. MERaLiON-AudioLLM: Advancing Speech and Language Understanding for Singapore. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 22–30, Vienna, Austria. Association for Computational Linguistics.

Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, and Ai Ti Aw. 2024. MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models. arXiv preprint. ArXiv:2412.09818 [cs].

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, and Diyi Yang. 2024. Distilling an End-to-End Voice Assistant Without Instruction Training Data. arXiv preprint. ArXiv:2410.02678 [cs].

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. AudioCaps: Generating Captions for Audios in The Wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, Minneapolis, Minnesota. Association for Computational Linguistics.

Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, and Shinji Watanabe. 2024. YODAS: Youtube-Oriented Dataset for Audio and Speech. arXiv preprint. ArXiv:2406.00899 [cs].
Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, and Lidong Bing. 2025. SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6119–6136, Albuquerque, New Mexico. Association for Computational Linguistics.

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2024. SeaLLMs - Large Language Models for Southeast Asia. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 294–304, Bangkok, Thailand. Association for Computational Linguistics.

Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, and Kasima Tharnpipitchai. 2024. Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models. arXiv preprint. ArXiv:2412.13702 [cs].
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2025. Qwen2.5 Technical Report. arXiv preprint. ArXiv:2412.15115 [cs].

Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, AiTi Aw, and Nancy Chen. 2024. SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 370–390, Mexico City, Mexico. Association for Computational Linguistics.

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F. Chen. 2025. AudioBench: A Universal Benchmark for Audio Large Language Models. arXiv preprint. ArXiv:2406.16020 [cs].

Zhifei Xie and Changqiao Wu. 2024. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming. arXiv preprint. ArXiv:2408.16725 [cs, eess].

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv preprint. ArXiv:2503.20215 [cs].
Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, and Xie Chen. 2025. GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement. arXiv preprint. ArXiv:2406.11546 [eess].

Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. 2025. SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), pages 96–105, Albuquerque, New Mexico. Association for Computational Linguistics.

Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, and Wenxuan Zhang. 2025. Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers. arXiv preprint. ArXiv:2503.00865 [cs].

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint. ArXiv:2306.05685 [cs].

A Appendix

The SeaBench-Audio tasks are described below; each entry notes whether its units include a text instruction alongside the audio.

ASR (text instruction: Yes). Automatic Speech Recognition. For this task, each input consists of an audio file and a text instruction. Some clips feature regional accents, and every language in this category includes at least one example of informal speech. The set also contains multi-sentence audio clips. Each text instruction is uniquely phrased.
SS (text instruction: Yes). Speech Summarization. Each unit includes a text instruction to convert an audio file into a shorter text. The instruction may limit output length by word count, number of sentences, or by specifying a short format (e.g., headline or title).

SER (text instruction: Yes). Speaker Emotion and Sentiment Recognition. This category includes 5 emotion detection units and 5 sentiment detection units. Each unit's text instruction provides options from which the model must choose the correct answer. Every audio clip contains speech whose lexical content does not reflect the speaker's emotion or sentiment. This setup compels the model to rely on paralinguistic features to determine the correct answer.

S2TT_EX, S2TT_XE (text instruction: Yes). Speech-to-Text Translation. There are two speech translation categories in this task: English-to-local (EX) and local-to-English (XE). This category includes units from specialized domains (e.g., courtroom, military, medical, addressing royalty) across languages. Translation challenges—such as idioms, informal varieties, appropriate first- and second-person pronouns, and polysemy—are represented in these units.

SQA (text instruction: Yes). Speech Question Answering. This task evaluates a model's ability to extract or infer specific information from an audio file. Each unit's text instruction is crafted to elicit a precise answer for reliable evaluation. The targeted information may extend beyond named entities present in the audio.

AC (text instruction: Yes). Audio Captioning. Each audio unit includes non-speech sounds from both animate and inanimate sources, as well as natural or man-made scenes. The accompanying text prompts are crafted to elicit detailed descriptions of the entire audio content.
AQA (text instruction: Yes). Audio Question Answering. The audio input for AQA consists of non-speech snippets, including event sounds or sequences of related sounds from human activities or nature. The text instructions test a model's abilities in object/event detection, sound tracking (e.g., duration), segmentation, and sequential analysis. Some prompts include multiple-choice options (e.g., "... takes place outdoors or indoors?").

SKI (text instruction: Yes). Speaker Identifiers. This category focuses on non-linguistic aspects of audio (e.g., number of speakers, turn-taking, speaking rate) and speaker-related tasks such as gender, relative age, and accent or regional variety identification.

Life (text instruction: No). Life. The audio files in this set come from text questions posted on popular forums and social media in the respective SEA-language countries. Questions are carefully chosen to cover diverse topics, suit model prompting, and remain manageable for evaluation. Each question is recorded with a unique voice. There are no separate text instructions, as the audio explicitly contains the questions.

CS (text instruction: Yes). Customer Service. This set includes 10 customer care scenarios. Each audio clip features either a single customer or a dialogue between a customer and a CS officer. The scripts simulate real calls—for example, clarifying product or price information, checking delivery status, requesting refunds, or reporting faulty products. The instructions include 6 units with answer options and 4 open-ended units. These units are designed to assess fine-grained CS knowledge and were curated in consultation with a customer care professional.
MED (text instruction: No). Medical Patient Question. Each unit contains an audio recording of a patient describing their condition and requesting medical advice. These recordings are human-voiced readings of questions sourced mainly from publicly available hospital websites, with some from open-source Q&As. This category evaluates a model's ability to respond within a specific medical domain. References are provided by doctors from the respective hospitals or are human-verified.

Safety (text instruction: Yes). Safety. Containing an audio file and a text instruction, each unit in this category is designed to provoke an unsafe response from models. The set includes both country-specific safety violations and universally unsafe topics. Each unit is curated so that a safe, desired response requires analyzing the audio content—the text instruction alone does not reveal anything unsafe. This ensures that any model rejection stems from understanding the speech/audio elements.

Math (text instruction: No). Math. There are no written instructions in this set; each audio clip contains a complete math question. The questions span various math topics for grades 7–12.

Fact (text instruction: No). Fact. Audio clips in this set pose explicit factual questions across diverse topics. No text instructions accompany the audio files. Subjects include history, economics, medicine, technology, and more.

Table 3: Verbalizers for the evaluation datasets. "Text instruction: Yes/No" indicates whether each unit pairs the audio with a written instruction.
Please act as an impartial judge and evaluate the quality of the text response provided by an AI assistant to the user question. The user question may be in text or in audio form. An audio .wav file for the question is also given, which the AI assistant must analyze in order to give a correct response. Begin your evaluation process by first analyzing the content of the corresponding audio file and comparing the assistant's answer against the reference answer. The audio content is a key component of the evaluation and please use it to identify any inaccuracies, contextual misunderstandings, and language choice issues in the assistant's response. An Evaluation Scoring rubric is provided alongside each assistant's answer and must be strictly adhered to, with the ratings assigned on a sequential, first-match basis. Be as objective as possible and your explanation should not exceed two paragraphs. After providing your explanation in English, you must rate the response on a scale of 1 to 5 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{question}
[Start of Reference Answer]
{reference}
[End of Reference Answer]
[Start of Assistant's Answer]
{answer}
[End of Assistant's Answer]
[Start of Evaluation Scoring Guide]
{rule}
[End of Evaluation Scoring Guide]
Figure 8: The prompt template for LLM-as-a-judge.
| Score | Criteria |
|---|---|
| 1 | The response is largely inaccurate, irrelevant, or incomplete, with poor language quality. It does not effectively address the question or may be incoherent. |
| 2 | The response contains significant inaccuracies or is missing key details. It may be unclear or poorly structured, and the language quality could be improved. |
| 3 | The response is generally accurate but may contain noticeable errors or omissions. It addresses the question with moderate clarity and completeness, but could be better structured or more detailed. |
| 4 | The response is mostly accurate and relevant, with a few minor errors or omissions. It is clear and well-structured but could benefit from slight improvements in detail or language quality. |
| 5 | The response is accurate, relevant, coherent, and complete, with excellent language quality. It answers the question thoroughly, clearly, and correctly, with no significant errors. |

Table 4: General scoring criteria for assessing the overall quality of responses by human judges. For each score, we provide guidelines to promote consistency and reliability in human evaluations.