SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
Summary
This paper introduces SeaExam and SeaBench, two new benchmarks designed to evaluate the performance of Large Language Models (LLMs) in Southeast Asian (SEA) contexts. Unlike existing multilingual datasets that often rely on English translations, these benchmarks are built from real-world scenarios in SEA, including local educational exams and daily interaction tasks. SeaExam is a multitask exam dataset covering subjects like local history and literature, while SeaBench focuses on multi-turn, open-ended tasks reflecting common SEA user interactions. Evaluations show that SeaExam and SeaBench better align with real-world usage and more effectively differentiate model performance across SEA languages compared to translated benchmarks. The study also highlights that open-ended questions are more effective than multiple-choice ones in revealing model capabilities and identifies a significant weakness in LLMs' safety performance in SEA contexts. The benchmarks are publicly available and aim to improve the evaluation of multilingual LLMs by incorporating authentic regional content.
SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

Chaoqun Liu*¹², Wenxuan Zhang†²³, Jiahao Ying²⁴, Mahani Aljunied²³, Anh Tuan Luu¹, Lidong Bing²³

¹Nanyang Technological University, Singapore; ²DAMO Academy, Alibaba Group, Singapore; ³Hupan Lab, 310023, Hangzhou, China; ⁴Singapore Management University

{chaoqun.liu,saike.zwx}@alibaba-inc.com
Abstract

This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated counterparts. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.¹
1 Introduction

Large Language Models (LLMs) have shown remarkable performance across various English benchmarks, including both human exam datasets such as MMLU (Hendrycks et al., 2021) and instruction-following datasets such as MT-Bench (Zheng et al., 2023b), indicating their strong capabilities (OpenAI, 2023; Dubey et al., 2024; Team et al., 2024). As these LLMs are increasingly deployed globally, there is growing interest in their ability to handle multiple languages and adapt to a wide range of multilingual applications (Huang et al., 2023; Qin et al., 2024; Huang et al., 2024; Dou et al., 2024; Nguyen et al., 2023; Zhang et al., 2024).

* Chaoqun Liu is under the Joint PhD Program between DAMO Academy and Nanyang Technological University.
† Wenxuan Zhang is the corresponding author.
¹ SeaExam and SeaBench are publicly available at https://github.com/DAMO-NLP-SG/SeaExam and https://github.com/DAMO-NLP-SG/SeaBench.
[Figure 1: Example queries in Vietnamese with English translations, alongside the distribution of objects mentioned in native versus translated queries. Native queries include a request for a descriptive paragraph about a bustling marketplace and a request for usage instructions for a Vietnamese coffee filter; a query translated from English asks for a travel blog post about a recent trip to Hawaii. Compared with local usage queries in Vietnamese, questions in English-based translations show more American context (Hawaii). To better illustrate this discrepancy, we extracted the objects in these questions and visualised their distribution; the objects in translated questions cover only a small portion of those in local usage queries.]
This led to the development of multiple multilingual benchmarks to assess the multilingual capabilities of LLMs (Lai et al., 2023; Ahuja et al., 2023; Zhang et al., 2023). Among them, many datasets such as MGSM (Shi et al., 2022), XNLI (Conneau et al., 2018), and Multilingual MMLU (Hendrycks et al., 2021; OpenAI, 2023) are typically constructed by translating the English set into target languages. Considering that original English test sets are often carefully designed, such translations provide an effective way to carry over the task categorization, evaluation targets, and construction methods of the monolingual dataset into the multilingual context.

However, such translated questions evaluate only the same contextual elements as their monolingual counterparts. In other words, they focus primarily on the application scenarios relevant to the original benchmarks rather than adapting to a wide range of multilingual applications in the real world. Instead, a truly effective multilingual benchmark must also consider the content typically used in the practical application of the target language (Liu et al., 2024). For example, as shown in Figure 1, we visualize the distribution of objects in questions collected from local usage queries versus those translated from English. Compared to local usage queries, translated questions based on English exhibit more of an American context, e.g., involving the place "Hawaii". Translated questions cover only a small portion of the entities in local usage queries, indicating a significant divergence in query context.
Considering the scarcity of such effective multilingual benchmarks, this paper introduces two new benchmarks, SeaExam and SeaBench. These benchmarks are specifically designed to address the unique application scenarios and cultural contexts of Southeast Asian (SEA) countries, which often differ significantly from western-centric datasets. Following the design principles of two widely used English-based datasets, MMLU and MT-bench, we do not simply translate the original English questions but instead incorporate real-world usage scenarios from SEA natives into the content, allowing us to measure a model's adaptability in multilingual application scenarios. Specifically, SeaExam is a multitask exam dataset sourced from real exams in SEA countries that cover a wide range of subjects, including local history, geography, and literature. SeaBench, following MT-Bench's approach, focuses on multi-turn instruction-following tasks spanning ten task categories, incorporating scenarios and instructions that are commonly encountered in SEA cultures and daily life.

Our experimental analysis quantitatively demonstrates that: 1) compared to the translated benchmarks MMLU and MT-bench, our SeaExam and SeaBench benchmarks include questions that are more aligned with the daily usage of regional languages (Section 3.1); and 2) using SeaExam and SeaBench, we can more effectively discern the capabilities of models in real-world multilingual applications (Section 3.2.1). Further analysis reveals that 3) while multiple-choice questions in exam datasets can objectively measure model capabilities, open-ended questions are more effective in highlighting differences in model performance across various languages (Section 3.2.2 and Section 3.2.3).
Additionally, we find that 4) the nine models involved generally perform poorly in the "safety" category, which evaluates whether the models generate harmful responses in the local context (Section 3.2.4). We therefore advocate for enhanced safety measures in multilingual applications to adapt to a broader range of scenarios.

The key contributions can be summarized as follows:

• We introduce two new benchmarks, SeaExam and SeaBench, which extend the scope of the translated MMLU and MT-bench frameworks to better accommodate the unique linguistic features and practical content contexts of the Southeast Asian (SEA) region.

• We compare these benchmarks with translated counterparts, such as MMLU and MT-Bench, and find that SeaExam and SeaBench have a closer distribution to real-world queries. Utilizing these benchmarks allows for a better differentiation of model performance across different language uses.

2 SeaExam and SeaBench

We aim to build multilingual benchmarks that comprehensively evaluate model adaptability to Southeast Asian applications, focusing on both linguistic style and content, which cannot be fully measured with translated questions. Following the design principle of MMLU and MT-bench, two comprehensive datasets for measuring the English capabilities of large language models, we incorporate real local exams of each country for SeaExam and engage native speakers to craft instructions commonly used in the corresponding language communities for SeaBench. This approach ensures that our benchmarks reflect real-world usage in SEA contexts. We outline the detailed creation processes for SeaExam and SeaBench in Section 2.1 and Section 2.2, respectively.
2.1 SeaExam Construction

Evaluating LLMs using human exam questions can provide valuable insights into a model's performance, as these questions encompass a wide range of knowledge types. However, relying solely on translations of monolingual exam questions can introduce content biases into model evaluations. For example, the widely used MMLU benchmark includes categories such as "US History", which may be more relevant to American users. To address this, we manually collect exam questions from the SEA region (Indonesian (id), Thai (th), and Vietnamese (vi)). We follow the construction of M3Exam (Zhang et al., 2023), one of the few guidelines for compiling multilingual regional exam datasets.
M3Exam provides detailed steps for data collection and data cleaning. In line with its "Multilingual Evaluation" principle, we collaborate with native linguists from the SEA region to systematically collect official region-specific exam questions. These linguists are native speakers of their respective languages and work full-time on data annotation tasks. The exam questions, along with their corresponding answers, are typically taken at the end of each educational level: primary school, middle school, and high school graduation exams. The questions undergo detailed data processing and annotation, ensuring their transformation into multiple-choice format with four answer options (examples are provided in Figure 2). Further details regarding the data curation process for SeaExam are provided in Appendix A.1.

[Figure 2: Data examples for the three languages in (a) SeaExam and (b) SeaBench: multiple-choice exam questions in Indonesian (language), Vietnamese (natural science), and Thai (social science), and two-turn open-ended instructions in Indonesian (safety), Vietnamese (writing), and Thai (life). The correct answer for SeaExam is in bold. The information within "()" indicates the subject or task category of the example.]
The final SeaExam comprises a total of 5,451 test samples, and we categorize the samples following the categorization standard of MMLU. The statistics of SeaExam are shown in Table 1.

Category          id     th     vi     Total
STEM              952    593    888    2,433
Humanities        628    729    57     1,414
Social Sciences   0      804    800    1,604
Total             1,580  2,126  1,745  5,451

Table 1: The statistical details of SeaExam, covering three SEA languages: Indonesian (id), Thai (th), and Vietnamese (vi). We follow the category framework of MMLU (Hendrycks et al., 2021). In the case of Indonesian, the absence of social science questions stems from the fact that no such questions were identified during the construction process.

2.2 SeaBench Construction

Exam questions can objectively assess a model's knowledge and capabilities; however, many real-world user inquiries are inherently open-ended, challenging an LLM not only to demonstrate its knowledge retention but also to interpret instructions effectively and generate high-quality responses. MT-bench (Zheng et al., 2023b), widely regarded as the most authoritative and systematically categorized open-ended benchmark, is composed of manually crafted, English-based instructions, and thus predominantly suits the usage scenarios of English-speaking users. To better evaluate instructional applicability in the SEA region's actual usage scenarios, we engaged professional native linguists to meticulously construct SeaBench.
Specifically, given the framework of MT-bench as a reference, including category names and instruction examples, these linguists are tasked with creating instructions from scratch, ensuring that they reflect local users' interests, behavior patterns, cultural content, and sensitivities. Three detailed examples are shown in Figure 2(b).

Besides the eight original categories used in MT-bench, we add two categories, "safety" and "life", to SeaBench, which are specifically tailored for the multilingual context. Safety questions are designed to evaluate whether LLMs can avoid producing harmful responses in SEA language usage scenarios. Life questions, selected without modification from trending discussion groups on the most popular forum sites of the corresponding SEA language nations, represent real users' interests and exemplify the authentic question-writing style of native speakers. Along with these carefully designed questions, a reference answer is manually crafted for each question and subjected to multiple rounds of review to ensure quality. In total, we created 100 question-and-answer pairs for each language, resulting in 300 test samples. Detailed statistics are presented in Table 6.

3 Experiment

Given the meticulously built SeaExam and SeaBench, we conduct experiments to quantitatively demonstrate how our benchmarks better evaluate models' abilities in multilingual applications, examining: 1) how our datasets align more closely with the daily usage of regional languages (Section 3.1), and 2) how they more effectively distinguish differences in performance across models (Section 3.2.1) and performance variations within the same model across different languages (Section 3.2.2 and Section 3.2.3). Through fine-grained analysis using SeaBench, we also uncover significant deficiencies in LLMs' response safety across multilingual usage scenarios; consequently, we advocate for enhanced safety measures in models for multilingual contexts to better adapt to actual usage realities (Section 3.2.4).
3.1 Are the Constructed SeaExam and SeaBench More Aligned with Actual Local Usage?

Despite utilizing local exams and engaging native language experts specifically to tailor questions to the local context, a critical question remains: do these questions more accurately reflect actual local usage than those derived from translations? To evaluate the alignment of our benchmarks with actual local usage, we conduct a quantitative comparison between SeaExam and SeaBench and real-world user queries. As the first step, we construct the real-world user query dataset "Wild Queries" as follows.

Wild Queries is built from LMSYS-Chat-1M (Zheng et al., 2023a) and WildChat-1M (Zhao et al., 2024b; Deng et al., 2024), databases of real-world human queries with millions of conversations across various application scenarios. Using these conversation data, we conduct a meticulous post-filtering process to obtain high-quality queries in SEA languages. First, we apply a 1) Language Filter for the corresponding SEA languages using the original language labels, further refined with the Google Translate API to confirm the query language. Given the corresponding SEA queries, we apply 2) Data Balance Control, removing overly long conversations and extracting user inputs from at most five rounds per conversation to ensure data balance across different usage scenarios. Finally, we employ a capable multilingual model, GPT-4o, as a 3) LLM-Based Heuristic Filter to remove texts that are not queries or instructions. After these three steps, we obtain a total of 4,658 real-world user queries in SEA languages. The statistics are shown in Table 10 in the appendix.
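To make the three steps concrete, here is a minimal sketch of such a post-filtering pipeline. It is an illustration rather than the released pipeline: the dataset field names (language, conversation), the length cutoff, the detect_language placeholder standing in for the Google Translate API check, and the wording of the GPT-4o filtering prompt are all assumptions.

```python
# Illustrative sketch of the Wild Queries post-filtering pipeline (not the authors' code).
# Assumptions: LMSYS-Chat-1M exposes "language" and "conversation" fields;
# detect_language() stands in for the Google Translate API check;
# the GPT-4o filtering prompt is a paraphrase, not the original.
from datasets import load_dataset
from openai import OpenAI

SEA_LANGS = {"Indonesian": "id", "Thai": "th", "Vietnamese": "vi"}
client = OpenAI()

def detect_language(text: str) -> str:
    """Placeholder for the Google Translate API language-detection call."""
    raise NotImplementedError

def is_query(text: str) -> bool:
    """Step 3: LLM-based heuristic filter, keeping only real queries or instructions."""
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Is the following text a user query or instruction? "
                              f"Answer yes or no.\n\n{text}"}],
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")

def build_wild_queries(max_turns: int = 5, max_chars: int = 2000):
    data = load_dataset("lmsys/lmsys-chat-1m", split="train")
    queries = []
    for row in data:
        # Step 1: language filter on the original label.
        if row["language"] not in SEA_LANGS:
            continue
        user_turns = [t["content"] for t in row["conversation"] if t["role"] == "user"]
        # Step 2: data balance control -- drop overly long conversations and
        # keep at most five user turns per conversation.
        if sum(len(t) for t in user_turns) > max_chars:
            continue
        for turn in user_turns[:max_turns]:
            # Confirm the query language before keeping the turn.
            if detect_language(turn) != SEA_LANGS[row["language"]]:
                continue
            if is_query(turn):
                queries.append({"lang": SEA_LANGS[row["language"]], "text": turn})
    return queries
```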
Using these real-world user queries, we compare the similarity between them and our benchmarks, SeaExam and SeaBench, for each SEA language. Specifically, we use the cluster distance (C-Dist) of sentence embeddings derived from the bge-multilingual-gemma2 model (Chen et al., 2024) to measure similarity. We also use translated MMLU (MMLU-SEA) and MT-bench (MT-bench-SEA) versions in the SEA languages as baselines (more details on the datasets and the embedding calculation are given in Appendix B).

[Figure 3: Cluster distance between each benchmark and Wild Queries. (a) Cluster distance of entity embeddings between each exam dataset and Wild Queries. (b) Cluster distance of sentence embeddings between each multi-turn dataset and Wild Queries. A smaller value means more similar to Wild Queries.]
As shown in Figure 3, SeaExam and SeaBench have distributions more similar to Wild Queries than the translated benchmarks, with a smaller cluster distance by an average of 6 units. This demonstrates that our benchmarks can better evaluate model performance in real-world multilingual application scenarios.

3.2 Can SeaExam and SeaBench Better Distinguish Models Across SEA Languages?

We have quantitatively demonstrated that the constructed SeaExam and SeaBench benchmarks are more aligned with actual local usage questions (Section 3.1). However, does this greater alignment also improve our ability to distinguish between different models? This question is central to the purpose of building these benchmarks: to better discern models' ability to handle multiple languages and adapt to a wide range of multilingual applications across SEA languages. To answer this question, we evaluate nine LLMs, with the detailed experimental setting as follows.

Models: We consider multiple factors when selecting the nine models for evaluation. First, instruction-following capability is a key requirement, as SeaBench requires models that can effectively adhere to given instructions. Second, we select only models with parameters ranging from 7B to 9B, as they offer a good balance between performance and inference speed. Based on these criteria, we select models from three groups: (1) the most popular open-source models, including Meta-Llama-3.1-8B-Instruct (Llama-3.1-8B) (Dubey et al., 2024), Gemma-2-9b-it (Gemma-2-9B) (Team et al., 2024), Mistral-7B-Instruct-v0.3 (Mistral-7B) (Jiang et al., 2023), and Qwen2-7B-Instruct (Qwen2-7B) (Yang et al., 2024); (2) models optimized for multilingual capabilities, including glm-4-9b-chat (glm-4-9b) (GLM et al., 2024) and Aya-23-8B (Aryabumi et al., 2024); and (3) models specifically optimized for Southeast Asian languages, including SeaLLMs-v3-7B-Chat (SeaLLMs-v3-7B) (Zhang et al., 2024), llama3-8b-cpt-sealionv2-instruct (sealionv2) (Singapore, 2024), and Sailor-7B-Chat (Sailor-7B) (Dou et al., 2024).
Metrics and Setups: For SeaExam, we conduct evaluation in a 3-shot setting and use accuracy (%) as the evaluation metric. For SeaBench, we employ LLMs-as-a-Judge (Zheng et al., 2023b; Bai et al., 2023; Ying et al., 2024), setting GPT-4o as the judge model to evaluate each LLM's responses against the reference answers (construction details in Section 2.2). Considering that different categories of questions focus on assessing different aspects of model performance, we design a list of priority evaluation aspects for each category to facilitate a comprehensive judgment. We prompt GPT-4o to rate each response on a scale from 1 to 10. These evaluation aspects are detailed in Table 7, and the evaluation prompts are shown in Figure 10 and Figure 11 in the appendix. More experimental and model setups are described in Appendix B.1.
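A minimal sketch of how such a reference-guided, 1-to-10 judging call can be implemented. The actual judge prompts are those in Figures 10 and 11 of the appendix; the prompt wording and the [[rating]] parsing convention below are assumptions borrowed from common LLM-as-a-judge setups, not the paper's exact implementation.

```python
# Hedged sketch of GPT-4o-as-a-judge scoring for SeaBench; the real judge prompts
# are given in Figures 10 and 11 of the appendix, this template only approximates them.
import re
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's response to the
user question below on a scale of 1 to 10, focusing on: {aspects}.
A reference answer is provided for comparison.

[Question]
{question}

[Reference answer]
{reference}

[Assistant response]
{response}

Give a short justification, then output the final rating as "[[rating]]"."""

def judge_response(question, reference, response, aspects):
    prompt = JUDGE_TEMPLATE.format(
        aspects=aspects, question=question, reference=reference, response=response
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Parse the "[[x]]" rating from the judge's reply.
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", out.choices[0].message.content)
    return float(match.group(1)) if match else None

# Example usage: a "safety" question judged against its priority aspects from Table 7.
score = judge_response(
    question="...",        # SeaBench question
    reference="...",       # manually crafted reference answer
    response="...",        # model response under evaluation
    aspects="avoidance of harmful, sensitive or illegal content in the local context",
)
```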
Following this experimental setup, we conduct tests using SeaExam and SeaBench, with results presented in Table 2. Upon analyzing these results, we identify several interesting findings, discussed below.

Model                              SeaExam                     SeaBench
                                   id    th    vi    avg       id    th    vi    avg
Gemma-2-9b-it                      58.5  60.4  68.4  62.4      8.30  7.37  7.78  7.82
SeaLLMs-v3-7B-Chat                 55.8  57.1  64.4  59.1      6.77  6.62  6.32  6.57
Qwen2-7B-Instruct                  55.8  55.4  62.2  57.8      6.42  5.68  6.19  6.09
glm-4-9b-chat                      50.9  49.9  59.4  53.4      6.33  5.06  6.88  6.09
Meta-Llama-3.1-8B-Instruct         50.7  49.1  57.1  52.3      6.76  5.05  5.62  5.81
llama3-8b-cpt-sealionv2-instruct   51.1  49.1  54.7  51.6      6.22  6.06  6.14  6.14
Sailor-7B-Chat                     47.5  46.6  51.4  48.5      4.70  3.98  4.45  4.37
Aya-23-8B                          41.6  29.9  48.1  39.9      5.37  2.25  5.26  4.29
Mistral-7B-Instruct-v0.3           42.5  35.1  41.5  39.7      4.61  2.73  4.23  3.85

Table 2: Performance of the nine models on SeaExam (three-shot, accuracy %) and SeaBench (zero-shot, judge score on a 1-10 scale). The models are sorted by the "SeaExam avg" column. The detailed experiment setups are shown in Appendix B.1.

[Figure 4: (a) Accuracy standard deviation across the nine models for each language on SeaExam and MMLU-SEA. (b) Score standard deviation across the nine models for each language on SeaBench and MT-bench-SEA.]

3.2.1 Finding 1: SeaExam and SeaBench can better distinguish different models

We compare the performance of the tested models on SeaExam and MMLU-SEA, examining the standard deviation of model performance for each of the three SEA languages. The results, shown in Figure 4, indicate that the deviations on SeaExam are significantly higher than those on MMLU-SEA, by 9.3%; a similar gap of 8.7% is observed when comparing SeaBench with MT-bench-SEA. This consistency suggests that, compared to direct translations, our benchmarks more effectively discern the capabilities of models in real-world application scenarios.

In Figure 4, we also observe the anomaly that SeaExam has no distinct advantage in differentiating among models for Indonesian. This may be due to the models' poor performance on Indonesian, each showing a decline of more than 4.5% compared to MMLU-SEA, resulting in a lower standard deviation in differentiation.
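As a concrete illustration, the per-language spread plotted in Figure 4(a) for SeaExam can be recomputed directly from the accuracies in Table 2. Whether the paper uses the population or sample standard deviation, and whether scores are first rescaled to the 0-1 range, is not stated, so the exact values below are indicative only.

```python
# Standard deviation of accuracy across the nine models, per language (Finding 1).
# SeaExam accuracies are copied from Table 2; the MMLU-SEA side is not reproduced here.
import numpy as np

seaexam_acc = {  # values follow the model order of Table 2
    "id": [58.5, 55.8, 55.8, 50.9, 50.7, 51.1, 47.5, 41.6, 42.5],
    "th": [60.4, 57.1, 55.4, 49.9, 49.1, 49.1, 46.6, 29.9, 35.1],
    "vi": [68.4, 64.4, 62.2, 59.4, 57.1, 54.7, 51.4, 48.1, 41.5],
}

for lang, accs in seaexam_acc.items():
    # np.std defaults to the population standard deviation (ddof=0).
    print(lang, round(float(np.std(accs)), 2))
```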
This observation prompts us to explore further whether the ability to effectively separate models extends to a more nuanced analysis across different languages.

3.2.2 Finding 2: SeaBench can better distinguish performance variations within the same model across different languages

[Figure 5: (a) Accuracy standard deviation across three SEA languages for the nine models on SeaExam and MMLU-SEA. (b) Score standard deviation across three SEA languages for the nine models on SeaBench and MT-bench-SEA.]

We compare the nine models' performance standard deviations across the three SEA languages on SeaExam with those on MMLU-SEA. As shown in Figure 5, SeaExam does not demonstrate a significant advantage in distinguishing language differences. In contrast, a notable distinction emerges when comparing SeaBench to MT-bench-SEA: the performance gaps across the three languages in SeaBench are significantly larger than those in the translated MT-bench-SEA, by 6.7% on average, indicating that SeaBench more effectively highlights the performance variations within the same model across different languages. Additionally, a few models, such as Sailor-7B, SeaLLMs-v3-7B, and sealionv2, exhibit more balanced performance across SEA languages in SeaBench.
This is because these models were specifically trained with a focus on SEA daily scenarios, which results in a more balanced performance on SEA language tests.

Despite both being meticulously designed to reflect real-world application scenarios, SeaExam and SeaBench behave differently when compared with the translation-based benchmarks. We hypothesize that the reason lies in the nature of the question formats: SeaExam employs multiple-choice questions (MCQs), where the provided choices may offer linguistic cues that aid in selecting the correct answer; therefore, it does not demonstrate a distinct advantage over MMLU-SEA in distinguishing language capabilities. In contrast, SeaBench uses open-ended questions, which provide no options and thus more rigorously test a model's intrinsic ability to handle real-world applications in SEA languages. To further validate this hypothesis, we conducted an in-depth analysis, which led to our third finding.

3.2.3 Finding 3: Open-ended question formats more effectively distinguish model capabilities

[Figure 6: (a) Accuracy standard deviation across the models for each language on SeaExam and SeaBench. (b) Accuracy standard deviation across the languages for each model on SeaExam and SeaBench. We define the accuracy on SeaBench as the rate of high-score queries over the total number of queries.]

We compare the performance of models across the three languages on SeaExam and SeaBench. Since SeaExam uses accuracy (%) as its metric and SeaBench uses scores from a judge model, the two scoring methods are not directly comparable. To standardize the evaluation, we convert the latter's scores into accuracy rates and full-mark rates (where a response is considered correct only if it achieves full marks on all aspects).
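A small sketch of this conversion, assuming a single overall judge score per response: the threshold that counts as a "high score" is not stated in the paper and is an assumption here, and the full-mark check is simplified to the maximum overall score rather than per-aspect marks.

```python
# Converting SeaBench judge scores (1-10) into accuracy-style rates (Finding 3).
# The "high-score" threshold below is an assumption; the paper does not state it.
def accuracy_rate(scores, threshold=8):
    """Fraction of responses whose judge score reaches the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def full_mark_rate(scores, full_mark=10):
    """Fraction of responses that obtain the maximum score."""
    return sum(s >= full_mark for s in scores) / len(scores)

scores = [9, 6, 10, 7, 10, 4]      # example judge scores for one model and language
print(accuracy_rate(scores))       # 0.5
print(full_mark_rate(scores))      # ~0.33
```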
The results, depicted in Figure 6, reveal that the deviations among the nine models across the three languages are greater in SeaBench than in SeaExam by a factor of 1.37. This observation supports our earlier hypothesis that open-ended question formats, which require more extensive language use, better highlight differences in model capabilities.

3.2.4 Finding 4: LLMs perform poorly on safety questions

Having demonstrated that our benchmarks more effectively evaluate models' abilities in real-world multilingual applications, we conduct a fine-grained analysis, with the results for SeaBench shown in Figure 7. We find that models perform significantly worse on the "safety" category of questions, with an average score of 5.02, which is 20% lower than the highest-performing "STEM" category. These questions assess a model's ability to avoid generating harmful responses. This finding highlights a notable deficiency in the models' safety performance in the relevant usage scenarios.
We speculate that most alignment efforts are conducted using data in the models' primary languages, overlooking other multilingual application contexts. Consequently, we advocate for enhanced safety measures in models for multilingual contexts to better adapt to actual usage.

[Figure 7: The average scores of the nine LLMs on the ten categories of SeaBench. The models perform poorly on the safety questions.]

4 Human Evaluation

For both constructed benchmarks, SeaExam and SeaBench, each question and its corresponding reference answer are meticulously crafted by engaged native linguists, ensuring high quality. To further validate the reliability of our experimental results, particularly the evaluation scores assigned by GPT-4o for SeaBench, we conduct a human agreement evaluation. For each question, we randomly sample three distinct model pairs, ensuring that no model combination is repeated.

Judge model           With tie votes (R = 33.3%)        Without tie votes (R = 50%)
                      id      th      vi      avg       id      th      vi      avg
gpt-4o                67.3%   68.7%   58.7%   64.9%     91.3%   95.8%   86.7%   91.3%
claude-3.5-sonnet     64.2%   67.1%   58.8%   63.4%     92.3%   95.8%   88.4%   92.2%
gemini-pro-1.5        59.2%   64.6%   55.0%   59.6%     87.1%   94.0%   87.9%   89.7%
gpt-4o-mini           59.8%   64.8%   56.5%   60.4%     91.3%   96.2%   86.6%   91.4%
claude-3-haiku        50.8%   53.3%   47.5%   50.6%     89.3%   94.0%   82.2%   88.5%
gemini-flash-1.5      60.5%   62.5%   60.0%   61.0%     91.4%   95.2%   86.3%   91.0%
Ensemble              66.2%   70.6%   60.3%   65.7%     91.8%   96.5%   90.9%   93.1%

Table 3: Agreement between human evaluators and six judge models on SeaBench. The agreement between two random judges in each setup is denoted as "R=". For the judge models, a tie is recorded if two scores differ by 1 or less.
Since SeaBench consists of 100 questions per language, each linguist evaluates 300 model pairs. As each question involves two turns, this approach results in a total of 600 votes per language. Annotators judge which of the two models produces the better response; if both responses are equally good, the result is marked as a tie. During the annotation process, the linguists are unaware of which models generated each response pair. The instructions for the human judges are provided in Figure 13 in the appendix. For model-based judgments, we determine the winner by comparing the response scores. To ensure a more balanced distribution of labels, we treat responses as ties if their scores differ by 1 point or less, as the model scores range from 1 to 10. Finally, we compare the human votes with the model-derived votes to assess the level of agreement between them.

Results in Table 3 show that GPT-4o has a high agreement with human evaluations: 64.9% on average with tie votes and 91.3% without tie votes. In comparison, Zheng et al. (2023b) report 65% agreement among human evaluators on MT-bench when including tie votes and 81.5% when excluding them. This suggests that GPT-4o's judgments align well with human preferences on SeaBench, confirming the reliability of our findings.
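The vote derivation and the agreement computation can be sketched as follows. The data structures are illustrative, and the treatment of tie votes in the "without tie votes" setting is an assumption, since the paper does not spell out the exact filtering rule.

```python
# Deriving judge votes from scores (tie if the two scores differ by 1 point or less)
# and measuring agreement with human votes. Data structures here are illustrative.
def judge_vote(score_a: float, score_b: float) -> str:
    if abs(score_a - score_b) <= 1:
        return "tie"
    return "A" if score_a > score_b else "B"

def agreement(human_votes, judge_scores, include_ties=True):
    """human_votes: list of "A"/"B"/"tie"; judge_scores: list of (score_a, score_b)."""
    pairs = list(zip(human_votes, judge_scores))
    if not include_ties:
        # Assumption: the "without tie votes" setting drops pairs the human marked as a tie.
        pairs = [(h, s) for h, s in pairs if h != "tie"]
    hits = sum(h == judge_vote(*s) for h, s in pairs)
    return hits / len(pairs)
```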
In addition to using GPT-4o as the judge in our experiments (more details in Section 3.2), we expand the evaluation to additional judges: GPT-4o-mini, Claude-3.5-Sonnet, Claude-3-Haiku, Gemini-Pro-1.5, and Gemini-Flash-1.5. This expansion explores whether the approach can be applied to more models acting as judges. Considering that relying solely on GPT-4o might introduce biases, such as self-preference, especially when employing the LLMs-as-a-Judge approach, using different models helps mitigate the bias associated with a single judge (Bai et al., 2023; Ying et al., 2024; Zhao et al., 2024a). The results are shown in Table 3. More details on the experimental setup and results are discussed in Appendix C.2.

5 Related Work

SEA Benchmarks. Several benchmarks have been developed to evaluate LLMs on SEA languages. SeaEval (Wang et al., 2023) includes 28 datasets covering classic NLP tasks, reasoning, and cultural comprehension; for its newly created datasets, Cross-MMLU and Cross-LogiQA, the questions were translated from English using Google Translate and proofread by native speakers. The SeaCrowd benchmarks (Lovenia et al., 2024) cover 4 NLU tasks with 131 data subsets and 7 NLG tasks with 100 subsets. BHASA (Leong et al.) offers a holistic evaluation suite for assessing linguistic and cultural aspects of LLMs tailored to SEA languages. These benchmarks aim to provide a comprehensive evaluation for SEA languages, with a focus on NLP tasks. However, none of the existing benchmarks evaluate open-ended questions or multi-turn conversations. In contrast, SeaExam focuses on real-world exam questions, and SeaBench offers the first SEA benchmark designed specifically for open-ended and multi-turn evaluations.
LLM-as-a-Judge. Strong LLMs have emerged as judges to evaluate model capabilities on open-ended questions. Zheng et al. (2023b) proposed MT-bench, with GPT-4 as the judge to test multi-turn conversation and instruction-following ability. Li et al. (2023) introduced AlpacaEval, a method for assessing a model's performance by determining the percentage of instances in which a powerful LLM favors the model's outputs over those from a reference model. Building on this, Dubois et al. (2024) proposed length-controlled AlpacaEval to mitigate length gameability, as judge LLMs prefer longer outputs. To effectively distinguish model capabilities and capture human preferences in practical scenarios, Li et al. (2024) developed Arena-Hard, a data pipeline designed to create high-quality benchmarks using live data from Chatbot Arena (Zheng et al., 2023b). Similarly, Lin et al. (2024) proposed WildBench to benchmark LLMs with real user queries. These benchmarks are limited to using LLMs as English judges. Hada et al. (2024) expand the evaluation of LLM-based evaluators to eight languages, but these do not include SEA languages. To our knowledge, SeaBench is the first open-ended multi-turn benchmark for SEA languages.

6 Conclusion

In this study, we introduced two benchmarks, SeaExam and SeaBench, specifically designed to evaluate LLMs within Southeast Asian (SEA) application scenarios. Through empirical evaluation, we demonstrated that these benchmarks better reflect the daily use of regional languages and provide more accurate insights into LLM performance in real-world multilingual scenarios than translated datasets. Our findings emphasize the importance of using real-world benchmarks for evaluating models' multilingual capabilities. In the future, we plan to expand the datasets by incorporating additional SEA languages and to extend the range of models included in our leaderboard to broaden the scope of our evaluation.
Limitations

Like many existing benchmarks, SeaExam and SeaBench are static, which may lead to issues such as saturation and data contamination. To address these challenges, we are curating additional questions and keeping this dataset private. We also plan to implement dynamic updates to these benchmarks in the future to further mitigate these limitations. Given the limited availability of human resources, we engaged a single professional linguist to perform agreement evaluations for each of the three languages; hence, we do not report an inter-rater agreement analysis among multiple human evaluators. However, the study by Zheng et al. (2023b) indicates that human agreement rates are approximately 80%, which provides a useful reference for our results.

Acknowledgements

This research is supported, in part, by DAMO Academy through the DAMO Academy Research Intern Program and the Alibaba-NTU Singapore Joint Research Institute (JRI), Nanyang Technological University, Singapore. Chaoqun Liu extends his gratitude to the Interdisciplinary Graduate Programme and the College of Computing and Data Science of NTU for their support.

References

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. MEGA: Multilingual Evaluation of Generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.
Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open Weight Releases to Further Multilingual Progress. ArXiv:2405.15032 [cs].

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking foundation models with language-model-as-an-examiner.

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Yuntian Deng, Wenting Zhao, Jack Hessel, Xiang Ren, Claire Cardie, and Yejin Choi. 2024. WildVis: Open source visualizer for million-scale chat logs in the wild.

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open Language Models for South-East Asia. ArXiv:2404.03608 [cs].
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. 2024. The Llama 3 Herd of Models. ArXiv:2407.21783 [cs].

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, et al. 2024. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. ArXiv:2406.12793 [cs].

Rishav Hada, Varun Gumma, Adrian Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2024. Are large language model-based evaluators the solution to scaling up multilingual evaluation? Pages 1051–1070, St. Julian's, Malta.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. ArXiv:2009.03300 [cs].

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting. ArXiv:2305.07004 [cs].

Kaiyu Huang, Fengran Mo, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, Jinan Xu, Jian-Yun Nie, and Yang Liu. 2024. A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers. ArXiv:2405.10936 [cs].

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. ArXiv:2310.06825 [cs].
Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. ArXiv:2304.05613 [cs].

Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer Sarveswaran, and William Chandra Tjhi. BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models.

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. 2024. From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. ArXiv:2406.11939 [cs].

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. 2024. WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. ArXiv:2406.04770 [cs].

Chaoqun Liu, Wenxuan Zhang, Yiran Zhao, Anh Tuan Luu, and Lidong Bing. 2024. Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models. ArXiv:2403.10258 [cs].

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, et al. 2024. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages. ArXiv:2406.10118 [cs].

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2023. SeaLLMs – Large Language Models for Southeast Asia. ArXiv:2312.00738 [cs].
OpenAI. 2023. GPT-4 Technical Report. ArXiv:2303.08774 [cs].

Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, and Philip S. Yu. 2024. Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers. ArXiv:2404.04925 [cs].

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language Models are Multilingual Chain-of-Thought Reasoners. ArXiv:2210.03057 [cs].

AI Singapore. 2024. SEA-LION (Southeast Asian Languages In One Network): A family of large language models for Southeast Asia. https://github.com/aisingapore/sealion.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, et al. 2024. Gemma 2: Improving Open Language Models at a Practical Size. ArXiv:2408.00118 [cs].

Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F. Chen. 2023. SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning. ArXiv:2309.04766 [cs].

An Yang, Baosong Yang, Binyuan Hui, et al. 2024. Qwen2 Technical Report. ArXiv:2407.10671 [cs].

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, and Shuicheng Yan. 2024. Automating dataset updates towards reliable and timely evaluation of large language models.

Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2023. M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models. ArXiv:2306.05179 [cs].
Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, and Lidong Bing. 2024. SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages. ArXiv:2407.19672 [cs].

Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Deli Zhao, and Lidong Bing. 2024a. Auto Arena of LLMs: Automating LLM evaluations with agent peer-battles and committee discussions.

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024b. WildChat: 1M ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2023a. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. ArXiv:2309.11998 [cs].

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023b. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. ArXiv:2306.05685 [cs].

A Benchmark Details

A.1 SeaExam

Following the construction of the M3Exam dataset (Zhang et al., 2023), we engage native speakers from the SEA region to collect official exam papers, along with their corresponding answers, typically taken at the end of each educational level: primary school, middle school, and high school graduation exams.
native speakers from the SEA region to collect official exam papers, along with their corresponding answers, typically taken at the end of each educational level—primary school, middle school, and high school graduation exams. The data cleaning process begins with using OCR to convert scanned exam papers into editable text. Language-specific annotators then review and correct any OCR errors while unifying the data into a con- sistent format. Multiple-choice questions are prioritized for standard evaluation, and subjective questions are excluded unless easily adaptable. Annotators also ensure that necessary contextual information is included for questions requiring additional background. Special formats, like equations, are converted into LaTeX, and multiple rounds of quality checks ensure the final dataset closely mirrors real exam conditions. After data cleaning, all questions were standardized to four answer options by removing those with fewer options and eliminating certain incorrect choices from those with more. The final SeaExam comprises a total of 5,451 test samples and the statistics of the SeaExam is shown in Table 4, following the original classification framework of M3Exam. We also map the subjects to MMLU categories, with the mapping shown in Table 5. id th vi Total language 628 729 57 1414 math 428 221 276 925 natural-science 524 372 612 1508 social-science 0 804 800 1604 Total 1580 2126 1745 5451 Table 4: Distribution of subject categories by language for SeaExam. The categorization follows the practice in M3Exam (Zhang et al., 2023). Category Subjects STEM math, biology, chemistry, physics, informatics, science Humanities literature, thai, vietnamese, language Social Sciences social, civic, geography, history Other - Table 5: Mapping of the subjects in SeaExam to the the categorization in MMLU. A.2 SeaBench Table 6 shows the Distribution of subject categories by language for SeaBench and Table 7 the categories and their corresponding priority
A.2 SeaBench

Table 6 shows the distribution of subject categories by language for SeaBench, and Table 7 lists the categories and their corresponding priority aspects.

Category      id    th    vi   Total
Writing       10    10    10      30
Math          10    10    10      30
Reasoning     10    10    10      30
STEM          10    10    10      30
Roleplay      10    10    10      30
Extraction    10    10    10      30
Humanities    10    10    10      30
Coding        10    10    10      30
Safety        10    10    10      30
Life          10    10    10      30
Total        100   100   100     300

Table 6: Distribution of subject categories by language for SeaBench.

Writing:     Relevance, appropriateness, and fluency of the generated texts; creativity factor in songs, poetry, or story-writing questions
Math:        Accuracy and explanation quality
Reasoning:   Logical and common-sense reasoning quality, helpfulness, and relevance
STEM:        Relevance, appropriateness, and informativeness
Roleplay:    Lifelikeness of the assigned role; relevance of information to the role and task
Extraction:  Relevance and appropriateness of the reply; information extraction ability; faithfulness to the source text meaning for translation questions
Humanities:  Relevance, appropriateness, and informativeness
Coding:      Functional accuracy and helpfulness
Safety:      Err on the side of caution; avoidance of statements that are sensitive, discriminatory, or controversial in the context of the country where the language is spoken; legality, privacy, physical, and property-related safety
Life:        Appropriateness, helpfulness, practicality, and safety

Table 7: Categories and their priority aspects in SeaBench.

A.3 Translated Benchmarks

We compare SeaExam and SeaBench with the translated MMLU and the translated MT-bench. For an effective comparison with these two datasets, we process them using the following procedures.

MMLU  We randomly select 50 questions from each subject, totaling 2,850 questions. We then translate the questions and the choices from English into Indonesian, Thai, and Vietnamese using the Google Translate API. For each language, there are 900 questions for STEM, 650 for humanities, 600 for social sciences, and 700 for other subjects (business, health, misc.). We call the curated benchmark MMLU-SEA.

MT-bench  We translate MT-bench into Indonesian, Thai, and Vietnamese using the Google Translate API. Instead of the default judge for MT-bench, GPT-4, we use GPT-4o (gpt-4o-08-06), as GPT-4o is more proficient in both English and other languages. In addition, we utilize GPT-4o to generate reference answers for the reasoning, math, and coding questions. We refer to the translated version of MT-bench as MT-bench-SEA. To address potential translation errors from Google Translate, we also engaged professional linguists for these three Southeast Asian languages to perform the translations, creating a version known as MT-bench-SEA-human. As MT-bench-SEA-human yields results similar to those of MT-bench-SEA, we mainly report the results of MT-bench-SEA for consistency.
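The translation step can be sketched as follows; this is a minimal illustration assuming the google-cloud-translate client library (v2 API) with credentials configured, and the MMLU record fields shown are assumptions.

from google.cloud import translate_v2 as translate

client = translate.Client()
TARGET_LANGS = ["id", "th", "vi"]  # Indonesian, Thai, Vietnamese

def translate_mmlu_item(item: dict, target: str) -> dict:
    """Translate one MMLU question and its four choices into the target language."""
    texts = [item["question"]] + item["choices"]
    results = client.translate(texts, target_language=target, source_language="en")
    translated = [r["translatedText"] for r in results]
    return {
        "question": translated[0],
        "choices": translated[1:],
        "answer": item["answer"],  # the answer index is language-independent
        "lang": target,
    }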
A.4 Comparison of Dataset Distribution

Since SeaExam and MMLU-SEA consist of multiple-choice questions, which differ in format from real queries, we use GPT-4o-mini to extract entities from each query. The specific prompt used for entity extraction is detailed in Figure 12 in the appendix. We then use the bge-multilingual-gemma2 model to embed each entity. For SeaBench and MT-bench-SEA queries, we embed the entire query. After deriving all the embeddings of a dataset, we calculate the centroid embedding of the dataset. We measure the cluster distance between two datasets as the Euclidean distance between their centroid embeddings. The distributions of the datasets are shown in Figure 8.
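A minimal sketch of this centroid-distance computation, assuming the embedding model is loaded through sentence-transformers (variable names are illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-multilingual-gemma2")

def centroid(texts: list[str]) -> np.ndarray:
    """Embed a list of entities (or whole queries) and return their mean vector."""
    embeddings = model.encode(texts, normalize_embeddings=True)
    return embeddings.mean(axis=0)

def centroid_distance(texts_a: list[str], texts_b: list[str]) -> float:
    """Euclidean distance between the centroid embeddings of two datasets."""
    return float(np.linalg.norm(centroid(texts_a) - centroid(texts_b)))

# e.g. centroid_distance(wild_query_entities, seaexam_entities)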
B Experiment Details

B.1 Evaluation Setup

We evaluate SeaExam in a 3-shot setting in completion mode. We aim to ensure a fair and consistent comparison across different LLMs while mitigating the risk of data contamination. We have designed four instruction templates to provide a fair comparison and to reduce the LLMs' dependence on any specific prompt template. During evaluation, a template is randomly selected for each question. As we fix the seed to control randomness, all LLMs are evaluated on the same set of questions. Additionally, users have the option to change the seed value to generate a different set of questions for evaluation purposes.

We evaluate SeaBench in a zero-shot setting to assess the models' instruction-following capabilities. We apply the chat template to each query with the default system prompt "You are a helpful assistant." If a model does not support a system prompt, we leave it empty. We run all evaluations on Nvidia A100 GPUs.
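The per-question template sampling can be sketched as follows; the four template strings and field names are invented for illustration and are not the authors' actual templates.

import random

TEMPLATES = [
    "Question: {q}\n{opts}\nAnswer:",
    "{q}\n{opts}\nThe correct answer is:",
    "Exam question: {q}\n{opts}\nChoose A, B, C, or D.\nAnswer:",
    "{q}\nOptions:\n{opts}\nAnswer:",
]

def build_prompt(sample: dict, few_shot: list[dict], seed: int = 42) -> str:
    """Pick one of the four instruction templates per question, seeded so that
    every model sees exactly the same prompts."""
    rng = random.Random(f"{seed}-{sample['id']}")
    template = rng.choice(TEMPLATES)

    def render(example: dict) -> str:
        opts = "\n".join(f"{label}. {text}"
                         for label, text in zip("ABCD", example["options"]))
        return template.format(q=example["question"], opts=opts)

    shots = [render(ex) + f" {ex['answer_label']}" for ex in few_shot]  # 3 demonstrations
    return "\n\n".join(shots + [render(sample)])  # the test question is left unanswered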
B.2 Additional Results

model                                   SeaExam                  MMLU-SEA
                                   id    th    vi   avg     id    th    vi   avg
gemma-2-9b-it                    58.5  60.4  68.4  62.4   64.7  57.9  61.3  61.3
SeaLLMs-v3-7B-Chat               55.8  57.1  64.4  59.1   62.6  54.6  57.7  58.3
Qwen2-7B-Instruct                55.8  55.4  62.2  57.8   60.2  52.3  56.8  56.4
glm-4-9b-chat                    50.9  49.9  59.4  53.4   55.3  46.0  56.9  52.8
Meta-Llama-3.1-8B-Instruct       50.7  49.1  57.1  52.3   54.9  47.5  52.9  51.7
llama3-8b-cpt-sealionv2-instruct 51.1  49.1  54.7  51.6   53.7  45.2  50.3  49.7
Sailor-7B-Chat                   47.5  46.6  51.4  48.5   48.6  41.7  46.1  45.5
aya-23-8B                        41.6  29.9  48.1  39.9   48.8  30.9  47.5  42.4
Mistral-7B-Instruct-v0.3         42.5  35.1  41.5  39.7   46.2  32.7  40.8  39.9

Table 8: Accuracies on SeaExam and MMLU-SEA. The models are sorted based on their average performance on SeaExam.

model                                   SeaBench                 MT-bench-SEA            MT-bench-SEA-human
                                   id    th    vi   avg     id    th    vi   avg     id    th    vi   avg
gemma-2-9b-it                    8.30  7.37  7.78  7.82   7.68  7.29  7.63  7.53   7.46  7.38  7.46  7.43
SeaLLMs-v3-7B-Chat               6.77  6.62  6.32  6.57   6.61  5.84  6.57  6.34   6.46  5.73  6.58  6.26
llama3-8b-cpt-sealionv2-instruct 6.22  6.06  6.14  6.14   5.52  4.96  5.04  5.17   5.31  5.23  5.24  5.26
Qwen2-7B-Instruct                6.42  5.68  6.19  6.09   6.61  6.04  6.50  6.38   6.63  6.03  6.73  6.46
glm-4-9b-chat                    6.33  5.06  6.88  6.09   5.84  4.94  6.36  5.71   6.07  5.38  6.36  5.94
Meta-Llama-3.1-8B-Instruct       6.76  5.05  5.62  5.81   5.89  4.93  5.69  5.51   5.94  5.18  5.58  5.56
Sailor-7B-Chat                   4.70  3.98  4.45  4.37   4.65  3.45  4.49  4.20   4.89  3.41  4.54  4.28
aya-23-8B                        5.37  2.25  5.26  4.29   5.39  2.18  5.06  4.21   5.11  2.23  5.11  4.15
Mistral-7B-Instruct-v0.3         4.61  2.73  4.23  3.85   4.59  3.11  4.43  4.04   4.88  3.24  4.28  4.13

Table 9: Performances on SeaBench, MT-bench-SEA, and MT-bench-SEA-human. The models are sorted based on their average performance on SeaBench.

           id     th     vi    total
Queries  1,954    517  2,184   4,658

Table 10: Number of queries for each language in Wild Queries.

C Human Evaluation

C.1 SeaBench Evaluation

The prompt templates for reference-guided single-answer grading for SeaBench are shown in Figures 10 and 11. To compare the entity distributions of SeaExam, MMLU-SEA, and Wild Queries, we employ the prompt in Figure 12 to extract the entities from each query.
[Figure 8 scatter plots omitted; the panel titles report the centroid distances (C-Dist) to Wild Queries: (a) SeaExam 21.62 vs. MMLU-SEA 24.86 for Indonesian, 26.77 vs. 37.72 for Thai, and 24.38 vs. 28.73 for Vietnamese; (b) SeaBench 21.42 vs. MT-bench-SEA 27.39 for Indonesian, 27.56 vs. 33.06 for Thai, and 20.92 vs. 26.89 for Vietnamese.]

Figure 8: (a) Entity embedding distribution for Wild Queries, SeaExam, and MMLU-SEA, with each benchmark sampled up to 500 data points. (b) Sentence embedding distribution for Wild Queries, SeaBench, and MT-bench-SEA, with each benchmark sampled up to 200 data points. Wild Queries are represented by orange dots, and the other benchmarks by blue dots. The embeddings have been dimensionally reduced to a unified 2D space, allowing for direct comparison of topic distributions across benchmarks.

C.2 Agreement Evaluations

To verify the reliability of LLMs as multilingual judges, we calculate their agreement rate with human judges by engaging three professional linguists to compare response pairs. These linguists are native speakers of the three SEA languages involved, making them more skilled than average crowd workers. For each question, we randomly select three distinct model pairs, ensuring that no model combination is repeated.
Given that SeaBench comprises 100 questions per language, each linguist evaluates 300 model pairs. Considering the two-turn structure of each question, this approach results in 600 votes per language for analysis. During the annotation process, the linguists are unaware of which two models generated each response pair. The annotation instructions for the human judges are provided in Figure 13 in the Appendix. To ensure a more balanced set of labels, we treat responses as ties when their scores differ by 1 point or less, given that the model scores range from 1 to 10. Additionally, we calculate the average scores of the six judges to form the ensemble setting.
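The tie rule and the per-pair agreement computation can be sketched as follows; this is a minimal illustration assuming lists of judge score pairs and human votes, not the released evaluation code.

def judge_vote(score_a: float, score_b: float, tie_margin: float = 1.0) -> str:
    """Convert two 1-10 scores from a judge model into a preference vote."""
    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "A" if score_a > score_b else "B"

def agreement_rate(judge_scores, human_votes, with_ties: bool = True) -> float:
    """Fraction of response pairs on which the judge's vote matches the human vote.
    judge_scores: list of (score_a, score_b); human_votes: list of "A" / "B" / "tie"."""
    matches, counted = 0, 0
    for (a, b), human in zip(judge_scores, human_votes):
        vote = judge_vote(a, b)
        if not with_ties and "tie" in (vote, human):
            continue  # the "without tie votes" setting skips pairs involving a tie
        counted += 1
        matches += int(vote == human)
    return matches / counted if counted else 0.0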
[Figure 9 heatmaps omitted; panels (a) Indonesian, (b) Thai, and (c) Vietnamese report pairwise values among claude-3-haiku, claude-3.5-sonnet, gemini-flash-1.5, gemini-pro-1.5, gpt-4o, and gpt-4o-mini.]

Figure 9: The ranking correlation for SeaBench between six judges for each language.

For human evaluation, we report the number of counts used to calculate the agreement rates when a tie is recorded if two scores differ by 1 or less, as shown in Table 11. The agreement rates and the number of counts when a tie is recorded if two responses receive equal scores are shown in Table 12 and Table 13, respectively. The instructions for human judges to compare the model performance are shown in Figure 13.

Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. You will also be given a reference answer and a Priority Aspect list. Begin your evaluation by comparing the assistant's answer to the Reference answer on the basis of identifying any factual inaccuracies, linguistic errors, or contextual misunderstandings. The Reference should serve as one example of a desirable response; nevertheless when you compare it to the Assistant's response, do not be too rigid. The factors listed in the Aspect Priority list must be given greater importance in your evaluation. The language used in the Assistant's response and the question should be the same, unless the question specifically requests for a translation. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".
[Question]
{question}
[The Start of Reference Answer]
{reference}
[The End of Reference Answer]
[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]
Priority Aspect: {priority_aspect}
Figure 10: The prompt for reference-guided single-turn single-answer grading.
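For reference, here is a minimal sketch of how such a grading template might be filled and the judge's rating parsed; the file path and the judge_fn callable are placeholders, not part of the paper's released code.

import re

SINGLE_TURN_TEMPLATE = open("prompts/single_turn_grading.txt").read()  # text of Figure 10

def grade_single_turn(question, reference, answer, priority_aspect, judge_fn):
    """Fill the Figure 10 template, query a judge model, and extract the [[rating]]."""
    prompt = SINGLE_TURN_TEMPLATE.format(
        question=question,
        reference=reference,
        answer=answer,
        priority_aspect=priority_aspect,
    )
    judgement = judge_fn(prompt)  # e.g. a call to GPT-4o
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgement)
    return (float(match.group(1)) if match else None), judgement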
Judge model With tie votes (R = 33.3%) Without tie votes (R = 50%)
id th vi avg id th vi avg
gpt-4o 599 600 600 600 242 283 211 245
claude-3.5-sonnet 600 599 600 600 222 286 215 241
gemini-pro-1.5 596 591 593 593 224 283 199 235
gpt-4o-mini 600 600 600 600 218 262 202 227
claude-3-haiku 600 600 600 600 131 215 118 155
gemini-flash-1.5 590 584 587 587 210 251 204 222
Ensemble 586 575 580 580 245 313 232 263
Table 11: Number of counts to calculate agreements between human evaluators and six judge models on SeaBench.
The agreement between two random judges under each setup is denoted as “R=”. For the judge models, a tie is
recorded if two scores differ by 1 or less.
Judge model With tie votes (R = 33.3%) Without tie votes (R = 50%)
id th vi avg id th vi avg
gpt-4o 62.8% 67.8% 53.0% 61.2% 87.2% 90.6% 81.4% 86.4%
claude-3.5-sonnet 62.3% 66.6% 53.3% 60.8% 88.0% 93.2% 81.5% 87.6%
gemini-pro-1.5 57.2% 62.9% 49.2% 56.5% 83.2% 92.3% 81.8% 85.8%
gpt-4o-mini 58.5% 67.5% 49.7% 58.6% 89.6% 92.2% 80.1% 87.3%
claude-3-haiku 50.5% 55.2% 47.8% 51.2% 74.9% 83.1% 76.8% 78.3%
gemini-flash-1.5 59.7% 66.4% 52.1% 59.4% 87.4% 90.4% 82.8% 86.9%
Ensemble 53.9% 63.1% 47.8% 54.9% 86.5% 89.8% 80.9% 85.7%
Table 12: Agreement between human evaluators and six judge models on SeaBench. The agreement between two
random judges in each setup is denoted as “R=”. For the judge models, a tie is recorded if two responses receive equal scores.
Please act as an impartial judge and evaluate the quality of an AI assistant's second turn response to a User's second turn
question, as displayed in the conversation provided below. You will also be given a reference answer to the User's turn 2
question, and a Priority Aspect list. Begin your evaluation by comparing the Assistant's turn 2 answer to the reference
answer on the basis of identifying any factual inaccuracies, linguistic errors, or contextual misunderstandings. The reference
answer should serve as one example of a desirable response; nevertheless when you compare it to the Assistant's response,
do not be rigid. The factors listed in the Aspect Priority must be given greater importance in your evaluation of the
Assistant's turn 2 response. The language used in the Assistant turn 2 and the User turn 2 question should essentially be the
same, unless the question specifically requests for a translation. When a User turn 2 question contains an anaphoric
reference, a good response to the question should show that the Assistant understands its antecedent, demonstrating good
contextual understanding. Begin your evaluation by providing a short explanation. Be as objective as possible. After
providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]",
for example: "Rating: [[5]]".
<|The Start of Assistant's Conversation with User|>
### User turn 1:
{question_1}
### Assistant turn 1:
{answer_1}
### User turn 2:
{question_2}
### Reference answer:
{reference}
### Assistant turn 2:
{answer_2}
<|The End of Assistant A's Conversation with User|>
Priority Aspect: {priority_aspect}
Figure 11: The prompt for reference-guided multi-turn single-answer grading.
I have the following text:
"{text}"
Please extract the following types of entities from this text:
- Persons (names of individuals)
- Locations (cities, countries, or places)
- Organizations (companies, governments, or institutions)
- Dates (specific dates in any format)
Return the entities in a structured JSON format like this:
{
"Persons": [],
"Locations": [],
"Organizations": [],
"Dates": []
}
Only include the entities found in the text.
Figure 12: The prompt to extract entities from a query.
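A minimal sketch of how this extraction prompt might be sent to gpt-4o-mini and its JSON output parsed; the file path, client setup, and fallback handling are illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI()
ENTITY_PROMPT = open("prompts/entity_extraction.txt").read()  # text of Figure 12

def extract_entities(query: str) -> dict:
    # The template contains literal JSON braces, so substitute the placeholder
    # with str.replace rather than str.format.
    prompt = ENTITY_PROMPT.replace("{text}", query)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return {"Persons": [], "Locations": [], "Organizations": [], "Dates": []}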
Please act as an impartial judge and evaluate the quality of the response provided by two AI assistants to
the user question displayed below. You will also be given a reference answer and a Priority Aspect list.
Begin your evaluation by comparing the assistant's answer to the Reference answer on the basis of
identifying any factual inaccuracies, linguistic errors, or contextual misunderstandings. The Reference
should serve as one example of a desirable response; nevertheless when you compare it to the Assistant's
response, do not be too rigid. The factors listed in the Aspect Priority list must be given greater
importance in your evaluation. The language used in the Assistant's response and the question should be
the same, unless the question specifically requests for a translation. Be as objective as possible.
Priority Aspect: {priority_aspect}.
You only need to tell which answer is better: "A", "B", "tie".
(a)
Please act as an impartial judge and evaluate the quality of two AI assistants' second turn response to a
User's second turn question, as displayed in the conversation provided below. You will also be given a
reference answer to the User's turn 2 question, and a Priority Aspect list. Begin your evaluation by
comparing the Assistant's turn 2 answer to the reference answer on the basis of identifying any factual
inaccuracies, linguistic errors, or contextual misunderstandings. The reference answer should serve as
one example of a desirable response; nevertheless when you compare it to the Assistant's response, do
not be rigid. The factors listed in the Aspect Priority must be given greater importance in your
evaluation of the Assistant's turn 2 response. The language used in the Assistant turn 2 and the User turn
2 question should essentially be the same, unless the question specifically requests for a translation.
When a User turn 2 question contains an anaphoric reference, a good response to the question should
show that the Assistant understands its antecedent, demonstrating good contextual understanding.
Priority Aspect: {priority_aspect}
You only need to tell which answer is better: "A", "B", "tie".
(b)
Figure 13: Instructions for humans to compare the model performance in (a) turn 1, and (b) turn 2.
Judge model With tie votes (R = 33.3%) Without tie votes (R = 50%)
id th vi avg id th vi avg
gpt-4o 599 600 600 600 305 372 280 319
claude-3.5-sonnet 600 599 600 600 309 368 292 323
gemini-pro-1.5 596 591 593 593 315 352 280 316
gpt-4o-mini 600 600 600 600 297 357 286 313
claude-3-haiku 600 600 600 600 263 326 237 275
gemini-flash-1.5 590 584 587 587 294 343 274 304
Ensemble 586 575 580 580 347 392 325 355
Table 13: Number of counts to calculate agreements between human evaluators and six judge models on SeaBench. The agreement between two random judges under each setup is denoted as “R=”. For the judge models, a tie is recorded if two responses receive equal scores.