BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Summary
The paper introduces BURMESE-SAN, the first comprehensive benchmark for evaluating large language models (LLMs) in Burmese across three core NLP competencies: understanding, reasoning, and generation. It includes seven subtasks—Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation—covering 3,920 samples. The dataset was curated through a rigorous, native-speaker-driven process to ensure linguistic authenticity, cultural relevance, and high quality. The authors evaluate both open-source and commercial LLMs, finding that Burmese performance is more influenced by architectural design, language representation, and instruction tuning than model scale. Regional fine-tuning on Southeast Asian languages and newer model generations significantly improve results. BURMESE-SAN is released as a public leaderboard to support ongoing research in low-resource language NLP.
BURMESE-SAN: Burmese NLP Benchmark
for Evaluating Large Language Models
Thura Aung1,∗, Jann Railey Montalan2,3, Jian Gang Ngui2,3, Peerat Limkonchotiwat2,3
1King Mongkut’s Institute of Technology Ladkrabang
2AI Singapore, 3National University of Singapore
66011606@kmitl.ac.th
{railey, jiangangngui, peerat.limkonchotiwat}@aisingapore.org
Abstract
We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asian regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages: https://leaderboard.sea-lion.ai/detailed/MY.
Keywords: Burmese NLP, Large Language Models, Benchmark, Low-Resource Language
1. Introduction
Recent advances in Large Language Model (LLM) development have improved the overall capabilities of Natural Language Processing (NLP). With large amounts of data and high computing power, LLMs have seen widespread adoption due to their ability to provide scalable, practical, and diverse capabilities (Guo et al., 2023; Hadi et al., 2023). In particular, studies (Liang et al., 2023; Qin et al., 2025; Chang et al., 2024) demonstrate the effectiveness of LLMs across various languages and tasks, using benchmarks to evaluate and compare model performance.

HELM (Liang et al., 2023) proposed formulating a holistic benchmark to evaluate the robustness of LLMs across a diversity of tasks in English. SEACrowd (Lovenia et al., 2024) and SEA-HELM (Susanto et al., 2025), designed as holistic benchmarks, evaluate LLMs' linguistic capabilities in several Southeast Asian (SEA) languages, and both use human-crafted and machine-generated corpora. The fact that some LLMs can perform well on these benchmarks demonstrates that LLMs can generalize from English to SEA languages.

However, most existing LLM benchmarks focus on English, with relatively few evaluating non-English languages (Son et al., 2024). Furthermore, even multilingual benchmarks (Liang et al., 2023; Susanto et al., 2025) do not include Burmese, because the necessary datasets are not yet available. Despite being the official language of Myanmar, with 43-45 million speakers in total (Ethnologue, 2019), Burmese remains an under-resourced language for both models and benchmarks (Dou et al., 2025).

Burmese stands out among the major SEA languages for its extensive system of case marking and its rich morphological structure (Jenny, 2021).
It is a tonal language with an analytic grammatical structure (Okell, 2023). Burmese features four distinct tones, with each syllable carrying a tone that plays a crucial role in distinguishing meaning between otherwise identical syllables. The language follows a subject-object-verb (SOV) word order and uses a complex system of honorifics and politeness markers that reflect social hierarchy and relationships (Republic of the Union of Myanmar, Myanmar Language Commission, 2013; Khin Aye, 2024). Burmese is also a diglossic language, with formal and informal varieties that can differ significantly in vocabulary and structure, adding another layer of complexity for language modeling and evaluation. Therefore, given the challenges of the language and the limited availability of Burmese corpora, dedicated resources are needed to create a high-quality benchmark that identifies the gaps and challenges in current LLMs.

* Work conducted during a Research Internship at AI Singapore, National University of Singapore.
Figure 1: BURMESE-SAN Benchmark (Left) and Dataset Curation Process for the benchmark (Right). BURMESE-SAN is a benchmark that holistically evaluates LLM performance across a wide range of Burmese language tasks. The evaluation is based on native Burmese text, with prompts written in formal Burmese to ensure clarity and grammatical correctness. [The figure groups the seven tasks (QA, SA, TD, CR, NLI, AS, MT) under the NL Understanding, NL Reasoning, and NL Generation competencies; traces each dataset source (natively sourced, natively translated, English-sourced, English-adapted) through the curation pipeline of random sampling and filtering by text quality, human translation, native re-translation, text normalization, binary agreement rating, native revision, and label verification; and shows an example Burmese prompt for Abstractive Summarization that asks the model to summarize an article as a one- or two-sentence paragraph, answering only in the format "Summary: $SUMMARY".]
To tackle these challenges, we propose BURMESE-SAN1, the first holistic Burmese benchmark for evaluating LLMs across seven NLP tasks covering natural language understanding, reasoning, and generation. In particular, as shown in Figure 1, we have seven subtasks: sentiment analysis (SA), toxicity detection (TD), question answering (QA), causal reasoning (CR), natural language inference (NLI), abstractive summarization (AS), and machine translation (MT), with a total of 3,920 samples. BURMESE-SAN was meticulously built in collaboration with native speakers, who were involved at every stage of the pipeline, from task selection, prompt translation, and data validation to final quality assurance. This high-effort, human-centered approach ensures linguistic authenticity, cultural relevance, and evaluative rigor, setting a new standard for Burmese benchmarks.

1 The word San means Standard in Burmese.

Proposed Studies. Based on BURMESE-SAN, we investigate the following research questions:

RQ1 Comparison of Commercial and Open-Source Models: How do commercial LLMs compare with open-weight models on Burmese language tasks?
RQ2 Effect of Model Scale: Does increasing model size lead to improved performance on Burmese NLP tasks?
RQ3 Effect of Southeast Asian (SEA) Fine-Tuning: Does fine-tuning on Southeast Asian languages improve model performance on Burmese?
RQ4 Effect of Model Quantization: How does model quantization affect performance compared to full-precision models?
RQ5 Temporal Progress in Burmese Language Capability: Have LLMs demonstrated consistent improvements in Burmese performance across model generations?

Contributions. We summarize the contributions of our work as follows:
• We propose BURMESE-SAN, a holistic Burmese benchmark for evaluating LLMs across seven NLP tasks. Our benchmark is formulated and edited by humans, resulting in high-quality samples.
• We present a reproducible dataset development process, covering sourcing, task design, and quality checks: https://github.com/aisingapore/SEA-HELM.
• We conduct extensive experiments on instruction-tuned and reasoning LLMs of different sizes, revealing systematic gaps in Burmese understanding, reasoning, and generation. The results show that SEA-tuned models can substantially improve performance.

2. Related Works

2.1. Burmese NLP

Myanmar is a linguistically diverse country with around 100 ethnic languages and dialects, belonging to four major language families: Sino-Tibetan, Austro-Asiatic, Tai–Kadai, and Indo-European (Ethnologue, 2024). Although Standard Burmese is the official language, several dialects are spoken, including Beik, Dawei, and Rakhine (Oo et al., 2023).

Burmese (or Bamar), also known as Standard Burmese, is the native language of the Bamar majority and is also spoken by related groups such as the Mon. It belongs to the Sino-Tibetan (specifically Tibeto-Burman) language family. Burmese has been significantly influenced by Pali, the liturgical language of Theravada Buddhism, as well as by Mon and English, and over time many foreign words entered Burmese as loanwords. After the end of British colonial rule, the government replaced many English terms by creating new Burmese equivalents (MULTICSD Project Team, 2025). Although Standard Burmese is the official language of Myanmar, widely used by the majority population and more resource-rich than other ethnic languages, it is still considered under-resourced for NLP: it lacks sufficient high-quality data, tools, and benchmarks for effective language processing. Nevertheless, due to its official status, widespread use, and relative resource availability, we focus on Standard Burmese for benchmarking in this work.

The design of BURMESE-SAN focuses on linguistic authenticity and representativeness in vocabulary, sentence structure, and grammar. This enables a comprehensive evaluation of Natural Language Processing (NLP) capabilities in Burmese, encompassing Natural Language Understanding (NLU), Natural Language Reasoning (NLR), and Natural Language Generation (NLG). Regarding sentence structure, BURMESE-SAN is designed to reflect natural and native use of Standard Burmese, and we carefully prepared the evaluation data to assess how well LLMs handle real-world Burmese language use (see Section 3). The benchmark uses commonly spoken vocabulary, focusing on words and phrases frequently used in daily conversation, and it naturally includes loanwords, mainly from Pali and English, that are now a regular part of the modern Burmese lexicon.
2.2. Dataset and Benchmark Overview

Previous efforts to evaluate Burmese language models have been mostly scattered and uncoordinated. Researchers have developed datasets for Burmese NLP tasks, including text classification, sequence tagging, and machine translation (Thu et al., 2023; San et al., 2024; Zaw et al., 2022; Hlaing et al., 2022; Aung et al., 2024; Kyaw et al., 2024). However, there is still no unified, comprehensive benchmark to systematically evaluate LLMs across different linguistic aspects of the Burmese language.

Recent Southeast Asian benchmarks largely exclude Burmese or rely on machine-translated data. SeaExam and SeaBench (Liu et al., 2025) introduce localized exam-style and open-ended queries in Vietnamese, Thai, and Indonesian, but do not include Burmese. SailCompass (Guo et al., 2024b) evaluates Southeast Asian language understanding without any Burmese components. SeaEval (Wang et al., 2024) provides a multilingual and multicultural benchmark comprising 29 datasets; however, Burmese is excluded, and most of the data are machine-translated or derived from general-purpose multilingual prompts. BHASA covers Indonesian, Tamil, Thai, and Vietnamese (Leong et al., 2023), whereas IndoNLU focuses solely on Indonesian NLU tasks (Wilie et al., 2020), lacking evaluations of generative or reasoning capabilities.
Unlike these existing benchmarks, BURMESE-SAN provides a comprehensive, multi-task evaluation suite for Burmese NLP, covering NLU, NLR, and NLG tasks. All datasets were carefully curated and adapted with native-speaker verification, providing linguistic authenticity, high-quality labels, and cultural relevance. This makes BURMESE-SAN the first holistic benchmark specifically designed for Burmese.

3. Task and Dataset Curation

As shown in Table 2, BURMESE-SAN covers tasks across Natural Language Understanding (NLU), Natural Language Reasoning (NLR), and Natural Language Generation (NLG) with balanced class distributions, totaling 3,920 samples. We discuss how we formulate BURMESE-SAN, starting from task and dataset selection in Sections 3.1 and 3.2, to ensure that our benchmark is holistic. We then describe the dataset adaptation and annotation process in Section 3.3, including how we ensure the quality of our benchmark. More information about the tasks and datasets, along with task descriptions, dataset information, and quality, is discussed in Appendices A, B, and C.

3.1. Task Selection

Similar to well-established benchmarks in the SEA languages (Montalan et al., 2025; Liu et al., 2025), we require tasks that are widely used to evaluate the robustness of LLMs and that reflect real-world scenarios. The tasks should consist of classical NLP tasks, such as classification; tasks widely adopted for LLMs, including question answering (QA), summarization (AS), and machine translation (MT); and reasoning and understanding tasks, including natural language inference (NLI). Therefore, BURMESE-SAN brings comprehensive LLM evaluation to Burmese with seven tasks across NLU (sentiment analysis, question answering, and toxicity detection), NLG (abstractive summarization and machine translation), and NLR (causal reasoning and natural language inference), as shown in Table 1. This ensures that our benchmark covers all aspects of LLM evaluation.
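Since the same seven task codes recur throughout the paper and leaderboard, a small mapping of tasks to competencies, mirroring Table 1, can anchor the structure (the dictionary layout itself is just an illustrative convenience, not an artifact of the benchmark):

```
# Competency -> task code -> full task name, as laid out in Table 1.
BURMESE_SAN_TASKS = {
    "NLU": {"QA": "Question Answering",
            "SA": "Sentiment Analysis",
            "TD": "Toxicity Detection"},
    "NLR": {"CR": "Causal Reasoning",
            "NLI": "Natural Language Inference"},
    "NLG": {"AS": "Abstractive Summarization",
            "MT": "Machine Translation"},
}
assert sum(len(tasks) for tasks in BURMESE_SAN_TASKS.values()) == 7
```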
| Comp. | Task | Dataset | Target1 | Language | Source2 | Adaptation | Our contribution |
|---|---|---|---|---|---|---|---|
| NLU | QA | Belebele (Bandarkar et al., 2024) | S (4) | Burmese | English-adapted | Native re-translation | ✓ |
| NLU | SA | GKLMIP-mya (Jiang et al., 2021) | L (3) | Burmese | Natively-sourced | Normalization3 | ✓ |
| NLU | TD | myHateSpeech (Kyaw et al., 2024) | L (9) | Burmese | Natively-sourced | No adaptation | |
| NLR | CR | Balanced COPA (Kavumba et al., 2019) | L (2) | English | English-sourced | Native translation | ✓ |
| NLR | NLI | myXNLI (Htet and Dras, 2025) | L (3) | Burmese | Natively-translated | No adaptation | |
| NLG | AS | XL-Sum (Hasan et al., 2021) | Sum | Burmese | English-adapted | Native re-translation | ✓ |
| NLG | MT | FLORES+ (Team, 2024a) | Trans | Burmese | English-adapted | Native re-translation | ✓ |

1 Number of options shown in parentheses. L: Label; S: Span; Sum: Summary; Trans: Translate.
2 English-adapted: previously translated from English, then further adapted or corrected by native speakers. English-sourced: originally written in English. Natively-sourced: originally created in Burmese by native speakers. Natively-translated: translated by native speakers.
3 Normalization: spelling was standardized to follow consistent orthography and modern usage norms, including Unicode normalization and correction of common typos.

Table 1: Dataset information for each task in BURMESE-SAN. A checkmark in the Our contribution column indicates direct contributions to the adaptation process.

| Competency | Task | Label | # samples |
|---|---|---|---|
| NLU | QA | Span | 120 |
| NLU | SA | Positive | 200 |
| | | Negative | 200 |
| | | Neutral | 200 |
| NLU | TD | Clean | 200 |
| | | Hate | 200 |
| NLR | CR | Cause | 200 |
| | | Effect | 200 |
| NLR | NLI | Contradiction | 200 |
| | | Entailment | 200 |
| | | Neutral | 200 |
| NLG | AS | Summary | 100 |
| NLG | MT | MYA → ENG | 850 |
| | | ENG → MYA | 850 |
| Total | | | 3,920 |

Table 2: Class distribution per task in the BURMESE-SAN benchmark dataset.

3.2. Dataset Selection

To build BURMESE-SAN, we collected open-source datasets with clear sources, focusing on those that reflect authentic Burmese language use in domains such as social media, news, and forums. For tasks like TD and NLI, high-quality datasets are already available that do not require adaptation: these datasets are either formulated in Burmese (monolingual datasets) or carefully edited by previous works to fix translation issues (Kyaw et al., 2024; Htet and Dras, 2025).

However, for other tasks, existing datasets required translation or editing to meet our goals of creating high-quality and culturally relevant datasets. Therefore, we adapted English and partially translated datasets, including FLORES+ (Team, 2024a), Belebele (Bandarkar et al., 2024), GKLMIP-mya (Jiang et al., 2021), XL-Sum (Hasan et al., 2021), and Balanced COPA (Kavumba et al., 2019), to create BURMESE-SAN. While these datasets are well-known and widely used in LLM evaluation benchmarks (Lovenia et al., 2024; Susanto et al., 2025), they were not readily available in Burmese or were inadequately translated, so we applied normalization and native re-translation to ensure their suitability for inclusion in BURMESE-SAN. Table 1 also summarizes the usage, adaptations, and our contributions, highlighting the comprehensive, native-speaker-verified nature of BURMESE-SAN, which distinguishes it from prior benchmarks (Vayani et al., 2025; Lovenia et al., 2024).
3.3. Dataset Adaptation and Annotation

As discussed above, although some Burmese datasets are available, their data format and quality need refinement to make them natural in Burmese, culturally suitable, and reliable for evaluation. As shown in Figure 1 (Right), we designed a four-step process (random sampling and filtering; translation; text normalization and label verification; and native revision) to combine Burmese and non-Burmese datasets into a single benchmark. To ensure dataset quality, we employed native Burmese speakers (ages 18–25), primarily university students enrolled in international programs, as annotators.

Random Sampling and Filtering. For each task, we first sampled data and restricted text length to 20–3,000 characters, ensuring balanced class distributions to reduce bias. All NLU and NLR tasks were limited to 32 tokens per output, whereas generation tasks used longer outputs, with 256 tokens for MT and 512 tokens for AS, to accommodate richer content generation. Low-quality items, such as duplicates, unclear text, or unnatural writing, were removed by native speakers to maintain the high quality of the Burmese examples. For the natively sourced datasets (SA, TD) and the natively translated NLI (Htet and Dras, 2025), we applied task-specific filtering: for the SA and TD tasks, we removed sentences with contextually ambiguous expressions that could obscure labels; for the NLI task, we ensured premise–hypothesis pairs accurately reflected the intended relation.
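A minimal sketch of this sampling step, assuming examples are plain dicts with text and label fields (the schema, per-class quota, and seed are illustrative, not the authors' actual pipeline):

```
import random
from collections import defaultdict

def sample_balanced(examples, per_class=200, min_len=20, max_len=3000, seed=0):
    """Keep texts of 20-3,000 characters, then draw an equal number per class."""
    by_label = defaultdict(list)
    for ex in examples:
        if min_len <= len(ex["text"]) <= max_len:
            by_label[ex["label"]].append(ex)
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"too few candidates for label {label!r}")
        sampled.extend(rng.sample(pool, per_class))
    return sampled
```

The filtered pool is then reviewed by native speakers; the length and balance constraints only bound what reaches them.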
Translation and Binary Criteria Rating. For tasks originally in English (QA, CR, AS, MT), the data was translated into Burmese by our annotators, ensuring high-quality data. The English-sourced dataset (CR) was translated from English to Burmese by our native-speaker annotators, whereas the English-adapted datasets (QA, AS, and MT) were re-translated due to the low quality of the existing translations. After translation, our native-speaker annotators applied a binary rating (agree/disagree) to evaluate dataset quality against multiple criteria: Completeness, Fluency, and Sensibility assess information retention, grammaticality, and contextual logic for all tasks; Faithfulness, used for translation, evaluates meaning preservation; and Relevance and Coherence, used for summarization, measure content focus and structural flow. The initial overall mean joint agreement across all tasks and criteria is 99.11%, indicating a consistently high level of annotator alignment throughout the evaluation. Any sample that did not receive full agreement from the annotators was revised by them to improve its quality. This process efficiently filtered out poor-quality translations without requiring full re-annotation, ensuring that only accurate and fluent Burmese text was included in the benchmark.

Text Normalization and Label Verification. In the next step, normalization was applied to improve consistency for the natively sourced datasets (SA and TD). This process primarily served to resolve ambiguities and measure inter-annotator agreement, preserving dataset integrity while improving reliability for downstream evaluation. For Toxicity Detection (TD), distorted spellings were preserved even though they are spelling errors, as they reflect real social media usage and intentional attempts to bypass detection. For Sentiment Analysis (SA), typos, spelling mistakes, and character-order issues were corrected so that errors would not affect classification. Moreover, code-switched (English-Burmese) samples were preserved in both tasks, as code-switching is a natural part of Myanmar social media.
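The paper does not name its normalization tooling; as a rough sketch under that caveat, standard Unicode normalization plus task-specific cleanup might look like the following (the typo map is an empty placeholder, since the actual orthography rules are not published in this excerpt):

```
import unicodedata

# Placeholder: the pipeline's actual spelling and character-order
# corrections are language-specific and not listed in the paper.
TYPO_MAP: dict[str, str] = {}

def normalize_burmese(text: str) -> str:
    # NFC collapses canonically equivalent Myanmar codepoint sequences
    # (e.g., composed vs. decomposed vowel signs) into a single form.
    text = unicodedata.normalize("NFC", text)
    for wrong, right in TYPO_MAP.items():
        text = text.replace(wrong, right)
    # Collapse whitespace runs left over from scraping and copy-paste.
    return " ".join(text.split())
```

For TD, such a function would be applied without the typo corrections, since distorted spellings are deliberately kept.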
For the gold label of each sample, given the subjective nature of these tasks, native speakers verified and, when necessary, adjusted the labels. We retained the original labels as far as possible, since our goal was to ensure label consistency and quality rather than to redefine the task. We found that agreement between the original and verified labels was very high: for Sentiment Analysis (SA), Cohen's Kappa = 0.948, Krippendorff's Alpha = 0.948, and total agreement = 96.5%; for Toxicity Detection (TD), Cohen's Kappa = 0.985, Krippendorff's Alpha = 0.985, and total agreement = 99.25%.
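These statistics can be reproduced with standard tooling; a sketch using scikit-learn for Cohen's Kappa plus raw percent agreement (the label lists are hypothetical):

```
from sklearn.metrics import cohen_kappa_score

def label_agreement(original, verified):
    """Agreement between original dataset labels and verified labels."""
    kappa = cohen_kappa_score(original, verified)
    total = sum(o == v for o, v in zip(original, verified)) / len(original)
    return kappa, total

# Hypothetical SA labels before and after native-speaker verification.
orig = ["positive", "neutral", "negative", "neutral"]
ver = ["positive", "neutral", "negative", "positive"]
kappa, total = label_agreement(orig, ver)
print(f"Cohen's Kappa = {kappa:.3f}, total agreement = {total:.1%}")
```

Krippendorff's Alpha can be computed analogously, for example with the krippendorff package after encoding the labels numerically.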
Native Revision. Finally, native speakers conducted a final review of all text. This step focused on refining linguistic naturalness, cultural appropriateness, and readability, including minor adjustments to grammar, punctuation, and style. The goal was not to alter the meaning, but to ensure that the dataset reflects authentic Burmese language use suitable for downstream tasks.

4. Experimental Setup

4.1. Prompt Templates

For BURMESE-SAN, we designed task-specific prompt templates entirely in Burmese. These templates were carefully aligned with the principles of prompt design established in SEA-HELM (Susanto et al., 2025) to ensure consistency between tasks; in particular, we translated the task prompts from SEA-HELM into native Burmese. The SEA-HELM prompt design maintains a clear separation between instruction and content, and its formal prompts are unambiguous and standardized, thereby reducing confounding factors that might influence model performance due to differences in prompt style rather than genuine understanding of the Burmese text.
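Concretely, zero-shot prompt assembly just substitutes each evaluation instance into its fixed Burmese template. A sketch using the summarization template from Figure 1, paraphrased in English here for readability (the benchmark itself uses the formal Burmese original):

```
# English paraphrase of the Figure 1 summarization template; {text} is
# replaced with the article of each evaluation instance. The original
# template additionally wraps the article in triple backticks.
AS_TEMPLATE = (
    "Summarize the following Burmese article as a paragraph of one or two "
    "sentences. Answer using only the format below:\n"
    "Summary: $SUMMARY\n"
    "Replace $SUMMARY with the summary.\n"
    "Article:\n{text}"
)

def build_prompt(article: str) -> str:
    # Instruction and content stay separated, per SEA-HELM's design.
    return AS_TEMPLATE.format(text=article)
```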
4.2. Evaluation Setup

Tasks. Each task in BURMESE-SAN, as described in Section 3, is designed to thoroughly test how well LLMs understand and use the Burmese language. A task includes a set of test and example cases, each with an input, reference answers, metadata, and the correct label.

Metrics. We adopt multiple evaluation metrics to assess model performance across tasks. For each metric, the model output is treated as the prediction, while the corresponding instance label serves as the reference. For NLU and NLR tasks, we report the accuracy score. For machine translation, we use MetricX-24 (Juraska et al., 2024). For abstractive summarization, we report ROUGE-L F1 using the multilingual ROUGE implementation from XL-Sum (Hasan et al., 2021). All metric scores are normalized to a common [0,100] scale following the SEA-HELM normalization process, which accounts for differences in metric ranges, task difficulty, and random baselines (e.g., MetricX originally ranges over [0,25] but is rescaled to [0,100]). Each model is evaluated across eight independent runs without greedy decoding (i.e., not at temperature = 0) (Miller, 2024), and we report the mean normalized score to ensure stable and reliable performance estimates.
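A sketch of what this rescaling and aggregation might look like; the authoritative formulas live in the SEA-HELM codebase, so the function below is illustrative only:

```
import statistics

def normalize_score(raw, low, high, random_baseline=0.0, lower_is_better=False):
    """Map a raw metric onto [0, 100], crediting only skill above chance."""
    if lower_is_better:          # e.g. MetricX-24 error scores in [0, 25]
        raw = high + low - raw   # flip so that higher is better
    span = high - random_baseline
    return max(0.0, 100.0 * (raw - random_baseline) / span)

# Three-way classification: chance accuracy is 1/3, so 0.33 maps to ~0.
print(normalize_score(0.80, 0.0, 1.0, random_baseline=1 / 3))

# Aggregate eight sampled runs into the mean ± stdev shown in Tables 3-5
# (hypothetical numbers).
runs = [41.9, 42.3, 42.1, 42.6, 42.4, 42.2, 42.0, 42.5]
print(f"{statistics.mean(runs):.2f} ± {statistics.stdev(runs):.2f}")
```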
4.3. Models

Model Selection. We evaluated a set of LLMs across diverse families, parameter scales, and access types. The selection includes (i) instruction-tuned models such as Qwen 3 (Yang and et al., 2025; Bai and et al., 2025), the Llama-3/3.1/3.3/4 series (Grattafiori and et al., 2024), Gemma 2/3 (Team, 2025c), SEA-LION (Ng et al., 2025), DeepSeek-V3 and V3.1 (671B MoE) (Team, 2025a), OLMo (Groeneveld et al., 2024), Mistral (Jiang et al., 2023), Sailor2-Chat (Dou et al., 2025), Babel (Zhao et al., 2025), and the Command-R series; (ii) reasoning-focused models such as DeepSeek-V3.1 Thinking and DeepSeek-R1 (DeepSeek-AI, 2025), Qwen3-Thinking (Yang and et al., 2025), QwQ (Team, 2025f), GPT-OSS (Team, 2025d), OLMo-3-Think (Olmo, 2025), and SEA-LION v3.5 R (Ng et al., 2025); and (iii) commercial models including Google's Gemini-2/2.5 (Flash and Pro) (Team, 2025b), OpenAI's GPT-4o (Team, 2024b), GPT-4.1 (OpenAI, 2024), and GPT-5 (Team, 2025), and Anthropic's Claude family (Sonnet 4 and Opus 4.1) (Anthropic, 2025). This diverse collection ensures representative coverage of both open-source and proprietary ecosystems and enables analysis of scaling trends from small to large models.

Inference Details. During evaluation, input prompts are constructed by combining evaluation instances with their corresponding prompt templates. The default evaluation setting in BURMESE-SAN for instruction-tuned models is zero-shot prompting, where prompts do not include any in-context input-label examples. For decoding parameters, model-specific default configurations are used when available; for any unspecified parameters, we apply vLLM default settings. Given the input prompts and decoding parameters, the model generates outputs for evaluation.
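A minimal sketch of this inference loop with vLLM (the model identifier and sampling values are placeholders; in the actual setup they come from each model's own default configuration):

```
from vllm import LLM, SamplingParams

# Placeholder checkpoint; substitute any evaluated open-weight model.
llm = LLM(model="google/gemma-3-27b-it")

# Sampling (rather than greedy decoding) is what makes the eight
# independent runs differ; unspecified values fall back to vLLM defaults.
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

prompts = ["<formal Burmese task prompt for one evaluation instance>"]
for request_output in llm.generate(prompts, params):
    print(request_output.outputs[0].text)
```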
| Model | MY | CR | NLI | QA | SA | TD | AS | MT |
|---|---|---|---|---|---|---|---|---|
| Small Models (< 14B) | | | | | | | | |
| MERaLiON 2 (10B) | 10.66 ± 0.16 | 0.00 ± 0.00 | 12.61 ± 0.31 | 0.00 ± 0.00 | 0.07 ± 0.13 | 14.79 ± 0.09 | 19.49 ± 0.27 | 35.33 ± 0.15 |
| Olmo 2 1124 (13B) | 1.84 ± 0.09 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.79 ± 0.03 | 0.00 ± 0.00 | 1.04 ± 0.03 |
| Olmo 2 1124 (7B) | 2.18 ± 0.12 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.80 ± 0.04 | 0.00 ± 0.00 | 6.44 ± 0.10 |
| Olmo 3 (7B) | 3.41 ± 0.11 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 3.66 ± 0.05 | 14.00 ± 0.14 | 1.11 ± 0.02 |
| Tulu 3 (8B) | 10.95 ± 0.14 | 0.00 ± 0.00 | 1.35 ± 0.13 | 5.89 ± 0.64 | 0.00 ± 0.00 | 13.70 ± 0.06 | 21.48 ± 0.17 | 37.48 ± 0.10 |
| SEA-LION v3-Gemma-2 (9B) | 15.40 ± 0.22 | 0.00 ± 0.00 | 25.10 ± 0.37 | 4.52 ± 1.35 | 0.00 ± 0.00 | 20.64 ± 0.11 | 14.84 ± 0.29 | 53.10 ± 0.15 |
| SEA-LION v3-Llama (8B) | 17.88 ± 0.26 | 4.45 ± 1.29 | 17.01 ± 0.70 | 37.41 ± 0.81 | 22.19 ± 0.97 | 12.79 ± 0.26 | 5.61 ± 0.27 | 13.92 ± 0.12 |
| SEA-LION v4-Apertus (8B) | 16.68 ± 0.22 | 0.00 ± 0.00 | 0.00 ± 0.00 | 23.44 ± 0.97 | 19.05 ± 0.43 | 15.71 ± 0.08 | 5.47 ± 0.13 | 55.46 ± 0.16 |
| SEA-LION v4-Gemma-3-VL (4B) | 26.24 ± 0.13 | 0.00 ± 0.00 | 34.39 ± 0.30 | 47.78 ± 0.00 | 43.11 ± 0.15 | 27.90 ± 0.07 | 17.93 ± 0.10 | 62.56 ± 0.12 |
| SEA-LION v4-Qwen-3-VL (4B) | 23.31 ± 0.17 | 28.80 ± 0.41 | 21.26 ± 0.21 | 42.81 ± 0.25 | 31.54 ± 0.14 | 20.37 ± 0.10 | 0.63 ± 0.03 | 24.90 ± 0.13 |
| SEA-LION v4-Qwen-3-VL (8B) | 30.67 ± 0.19 | 38.77 ± 0.41 | 23.28 ± 0.25 | 58.22 ± 0.60 | 36.21 ± 0.27 | 31.76 ± 0.11 | 15.49 ± 0.21 | 51.70 ± 0.20 |
| Qwen 2.5 (7B) | 8.04 ± 0.15 | 0.00 ± 0.00 | 5.97 ± 0.82 | 0.00 ± 0.00 | 0.00 ± 0.00 | 8.77 ± 0.15 | 19.35 ± 0.18 | 11.22 ± 0.11 |
| Qwen 3 VL (4B) | 18.61 ± 0.13 | 0.00 ± 0.00 | 29.18 ± 0.48 | 40.59 ± 0.41 | 35.60 ± 0.13 | 21.34 ± 0.09 | 21.48 ± 0.09 | 40.17 ± 0.12 |
| Qwen 3 VL (8B) | 26.26 ± 0.20 | 30.70 ± 0.65 | 3.16 ± 0.14 | 40.04 ± 0.73 | 33.83 ± 0.33 | 25.73 ± 0.13 | 20.11 ± 0.15 | 47.90 ± 0.13 |
| Qwen 3 VL (4B) | 20.42 ± 0.16 | 22.77 ± 0.57 | 21.53 ± 0.32 | 44.44 ± 0.38 | 36.70 ± 0.13 | 20.01 ± 0.13 | 0.55 ± 0.00 | 34.34 ± 0.11 |
| Qwen 3 VL (8B) | 30.25 ± 0.22 | 36.85 ± 0.59 | 32.05 ± 0.46 | 53.59 ± 0.70 | 36.41 ± 0.26 | 32.60 ± 0.17 | 16.51 ± 0.23 | 49.72 ± 0.09 |
| Babel (9B) | 8.01 ± 0.18 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 22.39 ± 0.63 | 8.64 ± 0.08 | 18.33 ± 0.33 | 17.51 ± 0.22 |
| SeaLLMs V3 (7B) | 7.13 ± 0.16 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 8.47 ± 0.08 | 17.68 ± 0.31 | 18.87 ± 0.15 |
| Aya Expanse (8B) | 3.03 ± 0.13 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.48 ± 0.04 | 0.00 ± 0.00 | 2.82 ± 0.06 |
| Command R7B 12-2024 (7B) | 3.19 ± 0.13 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 4.36 ± 0.07 | 10.06 ± 0.28 | 9.73 ± 0.14 |
| Gemma 2 (9B) | 9.63 ± 0.17 | 0.00 ± 0.00 | 9.16 ± 0.63 | 0.00 ± 0.00 | 0.00 ± 0.00 | 13.02 ± 0.13 | 14.08 ± 0.28 | 35.64 ± 0.13 |
| Gemma 3 VL (12B) | 42.46 ± 0.15 | 44.93 ± 0.77 | 26.22 ± 0.16 | 67.11 ± 0.41 | 48.96 ± 0.22 | 40.97 ± 0.15 | 18.77 ± 0.26 | 70.96 ± 0.09 |
| Gemma 3 VL (4B) | 20.56 ± 0.19 | 0.00 ± 0.00 | 22.12 ± 0.31 | 35.74 ± 0.38 | 39.18 ± 0.16 | 22.25 ± 0.09 | 19.35 ± 0.12 | 50.90 ± 0.14 |
| Llama 3 (8B) | 3.20 ± 0.06 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 4.89 ± 0.03 | 0.00 ± 0.00 | 22.95 ± 0.15 |
| Llama 3.1 (8B) | 8.45 ± 0.19 | 0.00 ± 0.00 | 0.00 ± 0.00 | 7.48 ± 1.02 | 0.00 ± 0.00 | 10.72 ± 0.09 | 26.73 ± 0.27 | 20.69 ± 0.14 |
| Ministral 2410 (8B) | 3.90 ± 0.13 | 0.00 ± 0.00 | 0.29 ± 0.23 | 0.00 ± 0.00 | 0.05 ± 0.10 | 6.61 ± 0.07 | 4.22 ± 0.19 | 27.35 ± 0.17 |
| Sailor2 (8B) | 11.65 ± 0.13 | 0.00 ± 0.00 | 43.74 ± 0.79 | 0.26 ± 0.27 | 0.00 ± 0.00 | 19.30 ± 0.14 | 0.00 ± 0.00 | 48.76 ± 0.14 |
| Apertus (8B) | 9.30 ± 0.21 | 0.00 ± 0.00 | 0.25 ± 0.22 | 3.78 ± 1.27 | 2.92 ± 0.74 | 12.13 ± 0.10 | 13.22 ± 0.27 | 40.68 ± 0.16 |
| Medium Models (14B–32B) | | | | | | | | |
| Olmo 2 0325 (32B) | 4.38 ± 0.15 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 3.91 ± 0.05 | 0.00 ± 0.00 | 14.73 ± 0.12 |
| SEA-LION v4-Gemma-3-VL (27B) | 47.18 ± 0.15 | 57.15 ± 0.47 | 52.92 ± 0.18 | 69.11 ± 0.41 | 47.56 ± 0.26 | 48.95 ± 0.11 | 11.43 ± 0.23 | 77.83 ± 0.10 |
| SEA-LION v4-Qwen-3-VL (32B) | 49.56 ± 0.14 | 66.38 ± 0.20 | 57.13 ± 0.20 | 81.52 ± 0.22 | 40.49 ± 0.13 | 51.70 ± 0.08 | 23.56 ± 0.08 | 64.00 ± 0.19 |
| Qwen 2.5 (14B) | 13.59 ± 0.19 | 0.00 ± 0.00 | 4.36 ± 0.68 | 33.96 ± 0.68 | 0.00 ± 0.00 | 14.25 ± 0.13 | 21.50 ± 0.11 | 32.46 ± 0.10 |
| Qwen 2.5 (32B) | 26.99 ± 0.17 | 13.68 ± 0.64 | 5.60 ± 0.47 | 43.04 ± 0.71 | 32.36 ± 0.33 | 22.78 ± 0.14 | 24.05 ± 0.12 | 39.36 ± 0.15 |
| Qwen 3 VL (14B) | 32.46 ± 0.13 | 31.75 ± 0.52 | 17.80 ± 0.30 | 61.26 ± 0.63 | 36.96 ± 0.23 | 31.52 ± 0.12 | 22.76 ± 0.12 | 51.87 ± 0.16 |
| Qwen 3 VL (32B) | 44.60 ± 0.19 | 60.68 ± 0.45 | 52.07 ± 0.34 | 74.48 ± 0.73 | 41.37 ± 0.31 | 45.33 ± 0.12 | 25.31 ± 0.18 | 44.71 ± 0.18 |
| Qwen 3 VL (32B) | 40.89 ± 0.19 | 49.48 ± 0.48 | 43.49 ± 0.28 | 74.04 ± 0.27 | 47.17 ± 0.17 | 40.26 ± 0.12 | 10.77 ± 0.23 | 56.03 ± 0.11 |
| Aya Expanse (32B) | 6.44 ± 0.15 | 0.25 ± 0.24 | 0.25 ± 0.22 | 0.00 ± 0.00 | 0.00 ± 0.00 | 5.67 ± 0.08 | 0.00 ± 0.00 | 20.66 ± 0.18 |
| Command R 08-2024 (32B) | 4.87 ± 0.15 | 0.00 ± 0.00 | 1.37 ± 0.69 | 1.07 ± 0.61 | 0.00 ± 0.00 | 6.30 ± 0.13 | 16.97 ± 0.15 | 9.75 ± 0.11 |
| Gemma 2 (27B) | 23.97 ± 0.24 | 25.78 ± 1.05 | 32.10 ± 0.27 | 31.30 ± 1.50 | 21.46 ± 0.87 | 29.59 ± 0.21 | 21.41 ± 0.19 | 50.32 ± 0.12 |
| Gemma 3 VL (27B) | 48.14 ± 0.17 | 57.08 ± 0.51 | 53.82 ± 0.16 | 69.85 ± 0.31 | 47.13 ± 0.24 | 49.96 ± 0.12 | 13.16 ± 0.25 | 79.44 ± 0.07 |
| phi-4 (14B) | 6.45 ± 0.18 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 17.77 ± 0.69 | 6.65 ± 0.09 | 10.79 ± 0.39 | 16.19 ± 0.12 |
| Mistral Small 3.1 2503 (24B) | 2.22 ± 0.11 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 2.38 ± 0.05 | 3.48 ± 0.09 | 6.34 ± 0.14 |
| Sailor2 (20B) | 8.55 ± 0.11 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 11.66 ± 0.04 | 0.00 ± 0.00 | 52.86 ± 0.15 |
| Large Models (> 32B) | | | | | | | | |
| Tulu 3 (70B) | 35.11 ± 0.17 | 56.78 ± 0.51 | 43.75 ± 0.43 | 66.85 ± 0.68 | 34.50 ± 0.40 | 43.89 ± 0.14 | 25.97 ± 0.24 | 66.60 ± 0.09 |
| SEA-LION v3-Llama (70B) | 38.21 ± 0.28 | 54.30 ± 0.91 | 44.63 ± 0.47 | 73.30 ± 0.70 | 0.00 ± 0.00 | 43.98 ± 0.20 | 25.26 ± 0.16 | 63.28 ± 0.16 |
| Qwen 3 A3B (30B MoE) | 25.62 ± 0.12 | 0.00 ± 0.00 | 44.07 ± 0.47 | 69.04 ± 0.44 | 31.12 ± 0.29 | 21.90 ± 0.09 | 1.27 ± 0.12 | 34.84 ± 0.12 |
| Qwen 2.5 (72B) | 27.54 ± 0.20 | 5.58 ± 0.22 | 12.41 ± 0.25 | 48.63 ± 0.45 | 39.85 ± 0.22 | 23.51 ± 0.10 | 21.68 ± 0.27 | 46.32 ± 0.14 |
| Qwen 3 A22B (235B MoE) | 54.29 ± 0.16 | 71.32 ± 0.23 | 57.42 ± 0.19 | 76.63 ± 0.07 | 50.32 ± 0.18 | 56.51 ± 0.08 | 23.36 ± 0.14 | 78.40 ± 0.06 |
| Qwen 3 Next (80B MoE) | 44.88 ± 0.16 | 59.27 ± 0.60 | 46.77 ± 0.22 | 74.89 ± 0.43 | 41.88 ± 0.20 | 46.01 ± 0.12 | 21.66 ± 0.08 | 58.58 ± 0.15 |
| Babel (83B) | 9.87 ± 0.23 | 0.00 ± 0.00 | 0.40 ± 0.32 | 29.89 ± 2.08 | 8.11 ± 1.02 | 7.72 ± 0.10 | 16.56 ± 0.24 | 9.61 ± 0.15 |
| ERNIE 4.5 (21B MoE) | 17.70 ± 0.15 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 19.67 ± 0.06 | 10.37 ± 0.19 | 72.24 ± 0.18 |
| ERNIE 4.5 (300B MoE) | 54.68 ± 0.16 | 68.82 ± 0.22 | 40.39 ± 0.15 | 78.52 ± 0.32 | 49.81 ± 0.16 | 54.27 ± 0.08 | 20.84 ± 0.22 | 86.25 ± 0.05 |
| Command A 03-2025 (111B) | 16.52 ± 0.23 | 0.00 ± 0.00 | 18.57 ± 0.58 | 30.78 ± 1.22 | 16.68 ± 0.75 | 15.71 ± 0.14 | 5.57 ± 0.35 | 37.10 ± 0.19 |
| Command R+ 08-2024 (104B) | 6.61 ± 0.18 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.83 ± 0.68 | 9.19 ± 0.07 | 16.00 ± 0.17 | 25.95 ± 0.17 |
| DeepSeek V3 (671B MoE) | 40.87 ± 0.25 | 59.22 ± 0.80 | 16.31 ± 0.79 | 15.78 ± 1.72 | 39.07 ± 0.33 | 41.73 ± 0.17 | 20.22 ± 0.13 | 72.92 ± 0.13 |
| DeepSeek V3.1 (671B MoE) | 51.30 ± 0.23 | 58.92 ± 0.70 | 51.64 ± 0.37 | 70.41 ± 0.60 | 44.17 ± 0.21 | 53.41 ± 0.16 | 21.47 ± 0.11 | 85.86 ± 0.05 |
| Llama 3 (70B) | 13.09 ± 0.17 | 32.92 ± 0.66 | 35.18 ± 0.43 | 15.33 ± 1.12 | 0.00 ± 0.00 | 24.03 ± 0.14 | 0.00 ± 0.00 | 49.89 ± 0.11 |
| Llama 3.1 (70B) | 19.87 ± 0.24 | 0.00 ± 0.00 | 35.08 ± 0.71 | 37.63 ± 1.30 | 0.00 ± 0.00 | 26.76 ± 0.15 | 27.30 ± 0.26 | 58.42 ± 0.20 |
| Llama 3.3 (70B) | 23.07 ± 0.15 | 0.00 ± 0.00 | 42.51 ± 0.31 | 3.70 ± 0.72 | 0.00 ± 0.00 | 29.50 ± 0.09 | 28.32 ± 0.21 | 60.06 ± 0.15 |
| Llama 4 Maverick (400B MoE) | 51.49 ± 0.20 | 67.87 ± 0.21 | 31.87 ± 0.39 | 80.00 ± 0.00 | 44.92 ± 0.11 | 52.54 ± 0.10 | 30.66 ± 0.15 | 81.86 ± 0.05 |
| Llama 4 Scout (109B MoE) | 45.54 ± 0.17 | 70.40 ± 0.32 | 31.74 ± 0.21 | 75.93 ± 0.19 | 13.19 ± 0.43 | 49.98 ± 0.09 | 25.53 ± 0.25 | 81.15 ± 0.04 |
| Mistral Large 2411 (123B) | 26.17 ± 0.29 | 7.00 ± 1.42 | 10.76 ± 0.40 | 41.41 ± 1.39 | 36.23 ± 0.41 | 24.14 ± 0.27 | 19.68 ± 0.28 | 55.08 ± 0.20 |
| Kimi K2 Instruct 0905 (1040B MoE) | 43.94 ± 0.19 | 53.02 ± 1.07 | 13.98 ± 0.36 | 58.81 ± 0.94 | 40.21 ± 0.47 | 41.64 ± 0.20 | 23.02 ± 0.20 | 71.96 ± 0.19 |
| Apertus (70B) | 13.09 ± 0.25 | 0.00 ± 0.00 | 0.00 ± 0.00 | 21.78 ± 1.43 | 0.00 ± 0.00 | 16.60 ± 0.09 | 15.17 ± 0.17 | 58.27 ± 0.17 |

Table 3: Performance of instruct models on BURMESE-SAN tasks. Best model for each task per size group is bold, second best is underlined.
| Model | MY | CR | NLI | QA | SA | TD | AS | MT |
|---|---|---|---|---|---|---|---|---|
| Small Models (< 14B) | | | | | | | | |
| Olmo 3 Think (7B) | 3.85 ± 0.16 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 5.88 ± 0.06 | 9.99 ± 0.23 | 17.61 ± 0.12 |
| SEA-LION v3.5 R (Llama) (8B) | 22.08 ± 0.20 | 0.00 ± 0.00 | 30.19 ± 0.64 | 54.00 ± 1.22 | 0.00 ± 0.00 | 22.28 ± 0.15 | 11.02 ± 0.32 | 48.28 ± 0.32 |
| Qwen 3 (Thinking) (4B) | 34.78 ± 0.29 | 40.78 ± 1.30 | 31.98 ± 0.70 | 60.37 ± 0.96 | 31.82 ± 0.53 | 35.77 ± 0.28 | 16.29 ± 0.40 | 56.00 ± 0.11 |
| Qwen 3 (Thinking) (8B) | 37.12 ± 0.22 | 55.80 ± 0.95 | 33.45 ± 0.63 | 65.19 ± 1.23 | 40.18 ± 0.37 | 40.22 ± 0.22 | 18.28 ± 0.18 | 59.53 ± 0.12 |
| Medium Models (14B–32B) | | | | | | | | |
| Olmo 3 Think (32B) | 5.51 ± 0.16 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 5.85 ± 0.07 | 8.80 ± 0.19 | 15.28 ± 0.14 |
| QwQ (32B) | 24.66 ± 0.27 | 0.00 ± 0.00 | 19.04 ± 0.69 | 55.96 ± 1.33 | 30.38 ± 0.49 | 18.58 ± 0.15 | 4.17 ± 0.27 | 38.97 ± 0.19 |
| Qwen 3 (Thinking) (14B) | 43.81 ± 0.24 | 61.45 ± 0.97 | 35.47 ± 0.45 | 73.70 ± 1.26 | 38.71 ± 0.35 | 45.07 ± 0.20 | 20.86 ± 0.12 | 65.04 ± 0.21 |
| Qwen 3 (Thinking) (32B) | 52.35 ± 0.26 | 71.32 ± 0.70 | 48.61 ± 0.54 | 77.33 ± 0.87 | 42.32 ± 0.35 | 48.49 ± 0.19 | 22.90 ± 0.16 | 43.43 ± 0.32 |
| Reka Flash 3.1 (21B) | 24.51 ± 0.28 | 3.80 ± 1.34 | 39.16 ± 0.65 | 61.00 ± 1.34 | 5.20 ± 0.75 | 26.51 ± 0.28 | 10.91 ± 0.21 | 56.18 ± 0.19 |
| Large Models (> 32B) | | | | | | | | |
| SEA-LION v3.5 R (Llama) (70B) | 44.02 ± 0.26 | 19.65 ± 1.58 | 41.37 ± 0.56 | 75.07 ± 0.67 | 8.83 ± 0.80 | 41.60 ± 0.29 | 21.55 ± 0.10 | 78.97 ± 0.13 |
| Qwen 3 (Thinking) (235B MoE) | 66.39 ± 0.22 | 77.97 ± 0.44 | 51.53 ± 0.47 | 82.44 ± 0.58 | 44.68 ± 0.26 | 61.37 ± 0.13 | 20.48 ± 0.09 | 85.48 ± 0.03 |
| Qwen 3 (Thinking) (30B MoE) | 53.54 ± 0.19 | 66.55 ± 0.77 | 48.84 ± 0.59 | 76.26 ± 0.78 | 38.92 ± 0.27 | 53.77 ± 0.16 | 21.41 ± 0.15 | 78.70 ± 0.08 |
| Qwen 3 Next (Thinking) (80B MoE) | 57.97 ± 0.18 | 72.50 ± 0.52 | 49.68 ± 0.54 | 80.04 ± 0.66 | 40.75 ± 0.30 | 56.35 ± 0.15 | 20.08 ± 0.12 | 79.88 ± 0.08 |
| DeepSeek V3.1 Thinking (671B MoE) | 59.46 ± 0.23 | 71.65 ± 0.67 | 48.27 ± 0.58 | 78.26 ± 0.86 | 40.91 ± 0.32 | 57.75 ± 0.15 | 21.21 ± 0.13 | 86.46 ± 0.04 |
| Deepseek R1 0528 (671B MoE) | 57.31 ± 0.30 | 72.57 ± 0.54 | 47.25 ± 0.70 | 66.07 ± 1.29 | 41.44 ± 0.31 | 56.41 ± 0.17 | 18.28 ± 0.19 | 85.71 ± 0.05 |
| GPT OSS (120B MoE mxfp4) | 48.71 ± 0.21 | 58.95 ± 0.78 | 24.25 ± 0.53 | 70.19 ± 0.89 | 39.11 ± 0.43 | 46.00 ± 0.16 | 17.34 ± 0.11 | 78.03 ± 0.08 |
| GPT OSS (20B MoE mxfp4) | 35.25 ± 0.24 | 37.58 ± 1.00 | 13.97 ± 0.93 | 60.07 ± 1.37 | 36.12 ± 0.44 | 34.42 ± 0.26 | 17.17 ± 0.16 | 67.28 ± 0.09 |

Table 4: Performance of Reasoning models on BURMESE-SAN tasks. Best model for each task per size group is bold, second best is underlined.
| Model | MY | CR | NLI | QA | SA | TD | AS | MT |
|---|---|---|---|---|---|---|---|---|
| Opus 4.1 (2025-08-05) | 45.10 ± 0.15 | 71.57 ± 0.35 | 0.00 ± 0.00 | 0.00 ± 0.00 | 48.35 ± 0.19 | 45.66 ± 0.08 | 24.69 ± 0.09 | 87.52 ± 0.03 |
| Sonnet 4 | 40.45 ± 0.30 | 24.33 ± 0.13 | 86.01 ± 0.03 | 0.00 ± 0.00 | 47.82 ± 0.25 | 43.22 ± 0.15 | 68.10 ± 0.67 | 0.00 ± 0.00 |
| Gemini 2 Flash | 59.23 ± 0.16 | 76.58 ± 0.43 | 41.30 ± 0.43 | 79.81 ± 0.33 | 41.48 ± 0.24 | 57.78 ± 0.12 | 22.09 ± 0.09 | 88.23 ± 0.04 |
| Gemini 2.5 Flash | 69.34 ± 0.16 | 83.58 ± 0.40 | 61.07 ± 0.46 | 84.15 ± 0.53 | 39.60 ± 0.21 | 66.16 ± 0.10 | 24.11 ± 0.14 | 89.49 ± 0.02 |
| Gemini 2.5 Pro | 72.35 ± 0.18 | 85.75 ± 0.38 | 67.55 ± 0.43 | 85.48 ± 0.53 | 45.56 ± 0.30 | 68.74 ± 0.11 | 24.21 ± 0.12 | 90.22 ± 0.02 |
| GPT 4.1 (2025-04-14) | 55.80 ± 0.18 | 68.80 ± 0.37 | 42.80 ± 0.38 | 77.15 ± 0.31 | 50.24 ± 0.27 | 54.05 ± 0.11 | 21.37 ± 0.24 | 79.73 ± 0.08 |
| GPT 4o (2024-11-20) | 51.61 ± 0.20 | 70.77 ± 0.53 | 39.48 ± 0.43 | 75.67 ± 0.66 | 50.94 ± 0.35 | 52.34 ± 0.13 | 21.97 ± 0.13 | 78.58 ± 0.11 |
| GPT 5 (2025-08-07) | 66.46 ± 0.19 | 79.87 ± 0.66 | 43.63 ± 0.44 | 81.37 ± 0.47 | 45.70 ± 0.30 | 60.16 ± 0.13 | 17.09 ± 0.07 | 87.46 ± 0.04 |

Table 5: Performance of Commercial models on BURMESE-SAN tasks. Best model for each task is bold, second best is underlined.
5. Evaluation Results

We present our findings organized around the five research questions that examine key aspects of Burmese language model capabilities, as described in Section 1. Tables 3, 4, and 5 report performance across all NLP tasks, where the MY column denotes overall performance. Figure 2 compares original models with their SEA-fine-tuned variants (left) and SEA-LION models with their quantized versions (right), highlighting the effects of regional fine-tuning and quantization.

Finding #1: Commercial models consistently outperform open-weight models (RQ1). Commercial models achieve substantially higher performance on Burmese tasks, led by Gemini 2.5 Pro (72.35%), Gemini 2.5 Flash (69.34%), and GPT-5 (66.46%). In contrast, the strongest open-weight models, ERNIE 4.5 (54.68%), Qwen 3 235B MoE (54.29%), and Llama 4 Maverick (51.49%), lag behind, with a gap of approximately 17.67% between the top commercial and open-weight models. This disparity is especially pronounced in tasks requiring cultural and language-specific reasoning.

Finding #2: Larger models tend to perform better, but scale alone is insufficient (RQ2). Performance generally improves with model scale, though gains are non-linear and diminish at larger sizes. While larger variants within families such as Qwen 3 and Gemma 3 outperform smaller ones, notable exceptions exist: DeepSeek V3.1 substantially exceeds V3 (51.30% vs. 40.87%), and the much larger Kimi K2 Instruct (1040B MoE, 43.94%) underperforms smaller models. These results highlight the critical role of architecture, training data, and tuning strategies beyond parameter count.

Finding #3: Southeast Asian fine-tuning benefits certain model families (RQ3). SEA fine-tuning yields selective improvements that depend strongly on the base model. While the Qwen-based SEA-LION v4 (32B) shows moderate gains (+4.96%) and the Gemma-based variant slightly degrades (-0.96%), Llama-based SEA-LION models benefit substantially from SEA fine-tuning, particularly at larger scales. For example, SEA-LION v3 (Llama 3.1) 70B improves markedly over its base counterpart, indicating that SEA-specific data is especially effective for Llama architectures. Task-wise, gains are most pronounced in machine translation (+19.29%) and question answering (+7.04%), highlighting the value of regional fine-tuning for cross-lingual and generation-centric tasks.
Figure 2: Left: Comparison of original models against SEA-fine-tuned variants; Right: SEA-LION models with their quantized versions, NVIDIA FP4 (NVFP4) and Dynamic FP8 (DynFP8). [The bar values from the original figure are tabulated below.]

| Model (Left panel) | Original | SEA-LION |
|---|---|---|
| Apertus-8B | 9.30 | 16.68 |
| Qwen-3-VL-4B | 20.42 | 23.31 |
| Qwen-3-VL-8B | 30.25 | 30.67 |
| Llama-3.1-8B | 8.45 | 17.88 |
| Llama-3.1-70B | 19.87 | 38.21 |
| Gemma-2-9B | 9.63 | 15.40 |
| Gemma-3-4B | 20.56 | 26.24 |
| Gemma-3-27B | 48.14 | 47.18 |
| Qwen-3-32B | 44.60 | 49.56 |

| Model (Right panel) | Original | NVFP4 | DynFP8 |
|---|---|---|---|
| SEA-LION-v3.5-R-Llama3.1-70B | 44.02 | 31.31 | 42.64 |
| SEA-LION-v4-Gemma-3-27B | 47.18 | 44.28 | 46.70 |
| SEA-LION-v4-Qwen-3-VL-32B | 49.56 | 49.44 | 49.60 |

Finding #4: Careful quantization preserves performance for most tasks (RQ4). Modern quantization methods can largely retain model performance when applied conservatively. For SEA-LION v4 (Qwen) 32B, both 8-bit DynFP8 and 4-bit NVFP4 quantization yield results comparable to full precision. DynFP8 similarly preserves performance for Gemma- and Llama-based models, whereas aggressive NVFP4 quantization causes notable degradation, particularly for reasoning-intensive models. These results indicate that quantization effectiveness depends on both the chosen method and the target task.
Finding #5: Burmese language capability has improved rapidly across model generations (RQ5). Burmese performance has increased substantially across successive model generations, with clear recent acceleration. Major open-weight families such as Llama, Qwen, and Gemma exhibit large generational gains (e.g., from Llama 3.3 70B at 23.07% to Llama 4 Maverick at 51.49%), while commercial models show steady progress from GPT-4o (51.61%) to GPT-5 (66.46%) and from Gemini 2 Flash (59.23%) to Gemini 2.5 Pro (72.35%). Similar trends are observed within the SEA-LION series, where newer releases consistently outperform earlier versions; for instance, SEA-LION v4 (Gemma 3) 4B achieves 26.24%, substantially exceeding SEA-LION v3 (Gemma 2) 9B at 15.40%.

Overall, commercial models continue to achieve the strongest performance, but the gap with open-weight models is steadily narrowing as architectures and training strategies improve. While model scale remains relevant, performance depends on architectural design, data quality, and instruction tuning rather than parameter count alone. Regional fine-tuning yields model-dependent benefits, particularly for Llama-based models. Finally, the temporal analysis highlights rapid and consistent improvements in Burmese language capability across model generations, offering practical guidance for model selection and deployment under diverse constraints.
6. Conclusion

We introduce BURMESE-SAN, the first comprehensive benchmark for evaluating large language models on Burmese across NLU, NLR, and NLG tasks, constructed with high-quality, linguistically natural data spanning diverse domains. Our evaluation reveals clear performance gaps between model families and generations, demonstrating that Burmese capability is strongly influenced by model architecture, instruction tuning, and training strategy rather than scale alone. In particular, Southeast Asian fine-tuned models, especially SEA-LION variants, consistently improve performance across generations, while recent architectural advances such as MoE and reasoning-focused training further accelerate progress. Although commercial models currently achieve the highest overall scores, our results indicate that carefully tuned open-weight models can significantly narrow this gap, especially as Burmese-focused data and training strategies continue to improve.

Together, these findings underscore the importance of language-specific adaptation and position BURMESE-SAN as a robust foundation for future research, evaluation, and deployment of LLMs for Burmese and other low-resource languages.
Acknowledgement

This research is supported by the National Research Foundation, Singapore, under its National Large Language Models Funding Initiative. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

The authors would also like to express their sincere gratitude to the native-speaker annotators and quality control contributors for their careful work and linguistic expertise. We also thank our internship students from King Mongkut's University of Technology Thonburi (KMUTT), Htoo Myat Min Bo and Wira Ye Yint, for their valuable assistance with data annotation and quality assurance.

Limitations

Benchmarking with formal written Burmese prompt templates only. In this work, although the evaluation data are natural and conversational, we focus on evaluating models using formally written native Burmese prompt templates to ensure clarity, consistency, and grammatical correctness. Model performance on informal-style prompt templates may differ from what is observed in our benchmark. Extending BURMESE-SAN to include spoken and colloquial prompt templates could be a valuable direction for future work, enabling a more comprehensive evaluation of Burmese language models across different registers and contexts.

Focus on Standard Burmese. We focus our study on the relatively better-resourced Standard Burmese of the central region of Myanmar and do not include other notable dialects such as Arakanese (Rakhine) in the southwest, Tavoyan in the southeast, and Intha in the east, along with others such as Yaw, Merguese (Myeik), and Palaw. Future work might explore translation or other community-driven data collection initiatives to extend coverage to dialects of the Burmese language.
Ethical Considerations

Our work on Burmese language technologies contributes to addressing key challenges in linguistic inclusion and improving technological accessibility for underrepresented language communities. The benchmark deliberately incorporates assessments of culturally specific knowledge to mitigate known biases in large language models; this emphasizes the importance of evaluating models within their cultural context rather than assuming universal applicability. Datasets included in BURMESE-SAN are from publicly accessible sources, and the authors check and include the license information of the datasets in the study.

For benchmark dataset quality assurance, a team of Burmese native speakers was involved in reviewing and annotating the data. The team, composed of university internship students, was recruited through faculty channels. Workload and compensation were communicated in advance and adhered to university guidelines and regulatory requirements. Given the nature of the toxicity detection (TD) task, the authors and quality assurance team were exposed to potentially offensive material. Measures were taken to mitigate harm: annotators were encouraged to report inappropriate content and had the option to discontinue their work at any time during label verification and native revision.

We do not anticipate any negative social impacts arising from this study. The BURMESE-SAN dataset and accompanying codebase will be released under the Creative Commons Attribution Share-Alike 4.0 (CC-BY-SA 4.0) license.

7. Bibliographical References

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.

Sara Court and Micha Elsner. 2024. Shortcomings of LLMs for low-resource translation: Retrieval and understanding are both the problem. In Proceedings of the Ninth Conference on Machine Translation, pages 1332-1354, Miami, Florida, USA. Association for Computational Linguistics.
Samuel Frontull and Thomas Ströhle. 2025. Compensating for data with reasoning: Low-resource machine translation with LLMs.
Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. MetricX-24: The Google submission to the WMT 2024 metrics shared task. In Proceedings of the Ninth Conference on Machine Translation, pages 492–504, Miami, Florida, USA. Association for Computational Linguistics.
Khin Aye. 2024. Burmese Spoken Grammar, 1st edition. Myanmar Language Commission.
Klaus Krippendorff. 2018. Content Analysis: An Introduction to Its Methodology. Sage Publications.
Lei Li, Wei Liu, Marina Litvak, Natalia Vanetik, Jiacheng Pei, Yinan Liu, and Siya Qi. 2021. Subjective bias in abstractive summarization. arXiv preprint arXiv:2106.10084.
Evan Miller. 2024. Adding error bars to evals: A statistical approach to language model evaluations.
Republic of the Union of Myanmar, Myanmar Language Commission. 2013. Burmese Grammar, 2nd edition. Myanmar Language Commission.
Yewei Song, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F. Bissyandé, and Jacques Klein. 2025. Is small language model the silver bullet to low-resource languages machine translation?
OpenAI Team. 2025. OpenAI GPT-5 system card.

Language Resource References

Anthropic. 2025. System card: Claude Opus 4 & Claude Sonnet 4. Accessed: 2026-02-18.
Thura Aung, Ye Kyaw Thu, and Myat Noe Oo. 2024. myOCR: Optical character recognition for Myanmar language with post-OCR error correction. In 2024 19th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pages 1–6.
Shuai Bai et al. 2025. Qwen3-VL technical report.
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 749–775, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3).
Gheorghe Comanici et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.
DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin. 2025. Sailor2: Sailing in South-East Asia with inclusive multilingual LLMs.
Ethnologue. 2019. Burmese. https://web.archive.org/web/20190820164330/https://www.ethnologue.com/language/mya. Accessed via Internet Archive on August 20, 2019.
Ethnologue. 2024. Myanmar. https://www.ethnologue.com/country/MM/. Accessed: 27 Sept. 2024.
Aaron Grattafiori et al. 2024. The Llama 3 herd of models.
Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, William Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah Smith, and Hannaneh Hajishirzi. 2024. OLMo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, Bangkok, Thailand. Association for Computational Linguistics.
Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, and Qian Liu. 2024. SailCompass: Towards reproducible and robust evaluation for Southeast Asian languages. arXiv preprint arXiv:2412.01186.
Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong, et al. 2023. Evaluating large language models: A comprehensive survey.
Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints.
Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.
Zar Zar Hlaing, Ye Kyaw Thu, Thepchai Supnithi, and Ponrudee Netisopakul. 2022. Improving neural machine translation with POS-tag features for low-resource language pairs. Heliyon.
Aung Kyaw Htet and Mark Dras. 2025. Myanmar XNLI: Building a dataset and exploring low-resource approaches to natural language inference with Myanmar.
Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, and Ai Ti Aw. 2025. MERaLiON-TextLLM: Cross-lingual understanding of large language models in Chinese, Indonesian, Malay, and Singlish.
Mathias Jenny. 2021. The national languages of MSEA: Burmese, Thai, Lao, Khmer, Vietnamese. In Paul Sidwell and Mathias Jenny, editors, The Languages and Linguistics of Mainland Southeast Asia: A Comprehensive Guide, pages 599–622. De Gruyter Mouton. Retrieved 2024-12-06.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.
Shengyi Jiang, Xiuwen Huang, Xiaonan Cai, and Nankai Lin. 2021. Pre-trained models and evaluation data for the Myanmar language. In The 28th International Conference on Neural Information Processing, Cham. Springer International Publishing.
Pride Kavumba, Naoya Inoue, Benjamin Heinzerling, Keshav Singh, Paul Reisert, and Kentaro Inui. 2019. When choosing plausible alternatives, clever Hans can be clever. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 33–42, Hong Kong, China. Association for Computational Linguistics.
Nang Aeindray Kyaw, Ye Kyaw Thu, Thazin Myint Oo, Hutchatai Chanlekha, Manabu Okumura, and Thepchai Supnithi. 2024. Enhancing hate speech classification in Myanmar language through lexicon-based filtering. In 2024 21st International Joint Conference on Computer Science and Software Engineering (JCSSE), pages 316–323.
Nathan Lambert et al. 2025. Tulu 3: Pushing frontiers in open language model post-training.
Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer Sarveswaran, and William Chandra Tjhi. 2023. BHASA: A holistic Southeast Asian linguistic and cultural evaluation suite for large language models.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research.
Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, and Lidong Bing. 2025. SeaExam and SeaBench: Benchmarking LLMs with local multilingual questions in Southeast Asia. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6119–6136, Albuquerque, New Mexico. Association for Computational Linguistics.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Damanhuri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V. Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Tai Ngee Chia, Ayu Purwarianti, Sebastian Ruder, William Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng-Xin Yong, and Samuel Cahyawijaya. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics.
Jann Railey Montalan, Jimson Paulo Layacan, David Demitri Africa, Richell Isaiah Flores, Michael T. Lopez II, Theresa Denise Magsajo, Anjanette Cayabyab, and William Chandra Tjhi. 2025. Batayan: A Filipino NLP benchmark for evaluating large language models.
MULTICSD Project Team. 2025. Burmese (Myanmar). https://sites.google.com/view/multicsd/global-languages/burmese-myanmar. Accessed: 2025-07-12.
Raymond Ng, Thanh Ngan Nguyen, Huang Yuli, Tai Ngee Chia, Leong Wai Yi, Wei Qi Leong, Xianbin Yong, Jian Gang Ngui, Yosephine Susanto, Nicholas Cheng, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Adithya Venkatadri Hulagadri, Kok Wai Teng, Yeo Yeow Tong, Bryan Siow, Wei Yi Teo, Tan Choon Meng, Brandon Ong, Zhi Hao Ong, Jann Railey Montalan, Adwin Chan, Sajeban Antonyrex, Ren Lee, Esther Choa, David Ong Tat-Wee, Bing Jie Darius Liu, William Chandra Tjhi, Erik Cambria, and Leslie Teo. 2025. SEA-LION: Southeast Asian languages in one network. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 512–526, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
John Okell. 2023. Burmese (Myanmar): An Introduction to the Spoken Language, Book 1. CC0 1.0 Universal (Creative Commons Zero) License, Open Source.
Team Olmo. 2025. Olmo 3.
Thazin Myint Oo, Thitipong Tanprasert, Ye Kyaw Thu, and Thepchai Supnithi. 2023. Transfer and triangulation pivot translation approaches for Burmese dialects. IEEE Access, 11:6150–6168.
OpenAI. 2024. GPT-4 technical report.
Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, and Philip S. Yu. 2025. A survey of multilingual large language models. Patterns, 6(1):101118.
Republic of the Union of Myanmar, Ministry of Education. 2019. University Technical Terms. Department of Higher Education.
Mya Ei San, Sasiporn Usanavasin, Ye Kyaw Thu, and Manabu Okumura. 2024. A study for enhancing low-resource Thai-Myanmar-English neural machine translation. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 23(4):1–24.
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, and Seungone Kim. 2024. MM-Eval: A multilingual meta-evaluation benchmark for LLM-as-a-judge and reward models.
Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xian Bin Yong, Weiqi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. SEA-HELM: Southeast Asian holistic evaluation of language models.
DeepSeek-AI Team. 2025a. DeepSeek-V3 technical report.
Gemini Team. 2025b. Gemini: A family of highly capable multimodal models.
Gemma Team. 2025c. Gemma 3 technical report.
NLLB Team. 2022. No Language Left Behind: Scaling human-centered machine translation.
NLLB Team. 2024a. Scaling neural machine translation to 200 languages. Nature, 630(8018):841–846.
OpenAI Team. 2024b. GPT-4o system card.
OpenAI Team. 2025d. gpt-oss-120b & gpt-oss-20b model card.
Qwen Team. 2025e. Qwen2.5 technical report.
Qwen Team. 2025f. QwQ-32B: Embracing the power of reinforcement learning.
Ye Kyaw Thu, Thura Aung, and Thepchai Supnithi. 2023. Neural sequence labeling based sentence segmentation for Myanmar language. In The 12th Conference on Information Technology and Its Applications, pages 285–296, Cham. Springer Nature Switzerland.
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M. Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, and Fahad Khan. 2025. All languages matter: Evaluating LMMs on culturally diverse 100 languages.
Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, AiTi Aw, and Nancy Chen. 2024. SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 370–390, Mexico City, Mexico. Association for Computational Linguistics.
Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. 2020. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China. Association for Computational Linguistics.
An Yang et al. 2025. Qwen3 technical report.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. 2024. Qwen2 technical report.
Wint Theingi Zaw, Ye Kyaw Thu, Zar Zar Hlaing, and Thepchai Supnithi. 2022. English to Burmese machine translation with Asian pivot languages. Journal of Intelligent Informatics and Smart Technology, Oct 2nd Issue:26-1–26-9.
Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, and Wenxuan Zhang. 2025. Babel: Open multilingual large language models serving over 90% of global speakers. arXiv preprint arXiv:2503.00865.

A. Overview of Tasks and Datasets

As described in Section 3, we selected seven tasks for inclusion in BURMESE-SAN. Table 1 lists the source datasets, their adaptations, and usage, all consistent with the original intent.

• Abstractive Summarization (AS) In this task, an LLM is given a paragraph and must generate a concise sentence summarizing its content. The evaluation focuses not only on identifying the key information but also on paraphrasing the content coherently. We use XL-Sum (Hasan et al., 2021), which contains annotated article-summary pairs.

• Causal Reasoning (CR) This task requires the LLM to identify the causal relationship between events. Given a premise and a set of statements, the model must determine which statement represents the cause or effect of the premise. We translate Balanced COPA (Kavumba et al., 2019) from scratch; it is designed to evaluate commonsense causal reasoning with paired alternatives.

• Machine Translation (MT) Here, an LLM is provided with text in one language and is expected to translate it into another. In this work, we evaluate both English-to-Burmese and Burmese-to-English translation using the FLORES+ dataset (Team, 2024a), which includes translations across multiple languages and domains.
• Natural Language Inference (NLI) This classification task requires the LLM to determine the relationship between two sentences (X and Y) as one of the following: (a) X implies Y, (b) X contradicts Y, or (c) X neither implies nor contradicts Y. We use myXNLI (Htet and Dras, 2025), which provides human-annotated examples for cross-lingual inference evaluation.

• Question Answering (QA) In this task, an LLM is given a passage and a question and must select the span from the passage that answers the question. We use Belebele (Bandarkar et al., 2024), a multiple-choice reading comprehension dataset designed to evaluate passage understanding.

• Toxicity Detection (TD) and Sentiment Analysis (SA) Both tasks involve analyzing natural language text. TD requires detecting hate speech or abusive language, while SA involves classifying sentiment as positive, negative, or neutral. We use myHateSpeech (Kyaw et al., 2024) for TD and GKLMIP-mya (Jiang et al., 2021) for SA. A minimal sample-schema sketch for the discriminative tasks is given after Table 6.

Task  Dataset                                License
QA    Belebele (Bandarkar et al., 2024)      CC BY-NC 4.0
SA    GKLMIP-mya (Jiang et al., 2021)        Unknown
TD    myHateSpeech (Kyaw et al., 2024)       CC BY-NC-SA 4.0
CR    Balanced COPA (Kavumba et al., 2019)   CC BY 4.0
NLI   myXNLI (Htet and Dras, 2025)           CC BY-NC 4.0
AS    XL-Sum (Hasan et al., 2021)            CC BY-NC-SA 4.0
MT    FLORES+ (Team, 2024a)                  CC BY-SA 4.0

Table 6: Licenses for all datasets included in BURMESE-SAN, with corresponding references.
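As a sketch of the shared format, the code below shows one way a discriminative BURMESE-SAN sample could be represented. The `MCSample` class and its field names are our own illustration, not the released data schema; the instance mirrors an English gloss of the CR example shown later in Figure 6.

```
from dataclasses import dataclass, field

@dataclass
class MCSample:
    """Illustrative multiple-choice schema (hypothetical, not the release format)."""
    task: str                                      # "QA", "SA", "TD", "CR", or "NLI"
    text: str                                      # passage, sentence, or premise
    question: str = ""                             # question or cause/effect prompt
    choices: list[str] = field(default_factory=list)
    answer: int = 0                                # index of the gold choice

# A CR sample: a premise plus two alternatives, where the model must
# pick the more plausible effect (cf. the Burmese original in Figure 6).
sample = MCSample(
    task="CR",
    text="My computer's sound is not working.",
    question="effect",
    choices=["I installed new speakers.", "All of my data was lost."],
    answer=0,
)
```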
B. Dataset Quality Assurance

To ensure dataset quality, we employed bilingual native speakers (ages 18–25), primarily university students enrolled in international programs in Thailand.

(ENG) Assist doctors while providing patient care.
(Acceptable: two sentences) ဆရာဝန်ေတွကို ကူညီပါ။ (Help doctors) လူနာေတွကိုလည်း တစ်ချိန်တည်းမှာ ကည့်ေပးပါ။ (Help patients at the same time)
(Acceptable: one sentence, different order from ENG) လူနာေတွကို ကည့်ရင်း ဆရာဝန်ေတွကို ကူညီပါ။ (While helping patients, help doctors)
(Acceptable: same order as ENG, incorrect grammar in Burmese, same context) ဆရာဝန်ေတွကို ကူညီပါ လူနာေတွကို ကည့်ေပးရင်းနဲ့
(Not acceptable: wrong context but correct grammar in Burmese) ဆရာဝန်ေတွကို ကည့်ေပးရင်းနဲ့ (While caring for doctors) လူနာေတွကို ကူညီပါ (Help patients)

Figure 3: Acceptable and Not Acceptable Grammar Errors in the Dataset.

Consonant error: misuse of consonants, vowels, and independent vowels (e.g., ဉဩ → ဥဩ, သဂုတ်လ → ဩဂုတ်လ, ကက်ဉ → ကက်ဥ)
Dialect error: caused by language variety tied to a community or a particular area (e.g., စပ်စု လတ်သာေအဖယ် → စပ်စုလိုက်သာေအဖယ်)
Encoding error: Burmese data sometimes comes in the Zawgyi (non-Unicode) encoding (e.g., စပ်စုလတ်သာေအ ဖယ် → စပ်စုလိုက်သာေအဖယ်)
Phonetic error: the misspelled word is pronounced the same as the intended correct word (e.g., ေခါင်း ညိမ့် → ေခါင်းညိတ်, မဟုတ်ပဲ → မဟုတ်ဘဲ)
Typographic error: (e.g., ိုင် → ်ုင်)
Sequence error: characters typed in the wrong order (e.g., ေကမ → ေကမွ)
Short error: shorthand from daily messaging (e.g., အ၆ → အေြခာက်)
Slang error: misspellings that have migrated into slang (e.g., မိြ → မိ)
Stack error: some words adopted from Pali use stacked consonants and can be misspelled in any of the above ways (e.g., မဂ်လာ → မဂလာ)

Figure 4: Different Types of Spelling Errors.
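Of the error types in Figure 4, encoding errors are the most amenable to automatic screening before manual review. Below is a minimal sketch using the open-source myanmar-tools Zawgyi detector; the 0.95 cutoff is an illustrative assumption, not a setting from our QC pipeline.

```
# pip install myanmar-tools
from myanmartools import ZawgyiDetector

detector = ZawgyiDetector()

def probably_zawgyi(text: str, threshold: float = 0.95) -> bool:
    """Flag strings likely to be Zawgyi-encoded rather than Unicode.

    get_zawgyi_probability returns a score near 1.0 for Zawgyi text;
    the 0.95 cutoff here is illustrative and should be tuned.
    """
    return detector.get_zawgyi_probability(text) > threshold

# Flagged strings would then be converted (e.g., with the ICU
# "Zawgyi-my" transliterator) and routed to native-speaker review.
```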
Each task may include different types of linguistic issues. QC members fix grammar and spelling errors, and the corrected datasets are later used for evaluating LLMs. Figures 3 and 4 show examples of acceptable and unacceptable grammar errors and of spelling errors. For grammatical reference, we used the official guideline published by the Myanmar Language Commission (Republic of the Union of Myanmar, Myanmar Language Commission, 2013) and the spoken grammar written by Khin Aye (2024).

To ensure the quality and reliability of the translated datasets, three QC members evaluated the English–Burmese text pairs for the QA, MT, AS, and CR tasks. Each member assigned a score of 0 (Disagree) or 1 (Agree) for each evaluation criterion, which varied by task. The criteria are defined in Table 7.

Criterion      Tasks          Definition
Completeness   All            Is all the intended information from the source instruction retained in the translated instruction?
Fluency        All            Is the output grammatically correct and natural in Burmese?
Sensibility    All            Is the translation logical and sensible given the context of the original instruction?
Faithfulness   Translation    Does the translated instruction stay true to the meaning of the original instruction?
Relevance      Summarization  Does the output (summary) contain only the most important content?
Coherence      Summarization  Is the output (summary) logically structured and easy to follow given the context of the original instruction?

Table 7: Quality Evaluation Criteria for BURMESE-SAN
Task  Criterion             Joint Agreement (%)
QA    Completeness          100.00
      Fluency               97.50
      Sensibility           93.33
MT    Completeness          100.00
      Fluency               99.76
      Sensibility           99.88
      Grammaticality        99.76
      Faithfulness          99.88
AS    Completeness          100.00
      Fluency               100.00
      Sensibility           100.00
      Relevance of Summary  100.00
      Fluency of Summary    100.00
      Coherence of Summary  100.00
CR    Completeness          100.00
      Fluency               97.00
      Sensibility           97.75

Table 8: Joint agreement (%) for each evaluation criterion across tasks requiring (re-)translation. Scores are before revision by native speakers.

In addition to the dataset quality checks, QC members responsible for translation were instructed to watch for the following issues:

• Literal Translation: Avoid overly direct word-for-word rendering that neglects the target language's natural usage and style.
• Cultural Mismatch: Identify translations that sound unnatural due to irrelevant or inappropriate cultural references.
• Incomplete Translation: Ensure that no parts of the source text are omitted or skipped.
• Misinterpretation: Verify that the intended meaning of the source text is accurately preserved in the translation.

After evaluation, we revised all translated datasets by reviewing samples that did not receive full agreement among the three annotators. As shown in Table 8, most criteria achieved high agreement, many reaching 100%, while a few, particularly Fluency and Sensibility in the QA and CR tasks, were slightly lower. These scores reflect the results after re-translation but before manual revision. To ensure high quality and consistency, samples without full agreement were further revised by native speakers. The final dataset contains only samples with full consensus across all evaluation criteria. Dataset statistics, including class distribution, are provided in Table 2.
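The joint-agreement figures in Table 8 correspond to the share of samples on which all three QC members assign the same 0/1 score for a criterion. A minimal sketch of that computation, on toy data rather than our actual annotations, follows.

```
def joint_agreement(triples):
    """Percentage of samples on which all three QC scores coincide.

    `triples` is a list of (r1, r2, r3) tuples of 0/1 scores for one
    criterion; the data below is a toy example, not our annotations.
    """
    full_consensus = sum(1 for t in triples if len(set(t)) == 1)
    return 100.0 * full_consensus / len(triples)

# Three samples, one without full agreement -> 66.67%.
fluency_scores = [(1, 1, 1), (1, 0, 1), (1, 1, 1)]
print(round(joint_agreement(fluency_scores), 2))  # 66.67
```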
C. Challenges with Developing a Burmese Benchmark

C.1. Issues with Native-Sourced and Native-Translated Data

Inconsistency in Technical Terms
A common challenge we encountered in Burmese datasets, especially those created or translated by native speakers, is the inconsistent use of loanwords and technical terms. While a Burmese dictionary for scientific and technical vocabulary exists (Republic of the Union of Myanmar, Ministry of Education, 2019), it is not widely adopted, and many modern terms are missing. As shown in Figure 5 (example a), different translators use different styles. For example, MYA1 and MYA2 provide different translations for the word "Theoretical" and use different transliterations for the city name "Beijing."

(a) ENG: I am going to visit Beijing this summer for studying Theoretical Physics
MYA1: ဒီေွရာသီမှာ သီအိုရီ ူပေဗဒကို ေလ့လာဖိ ေဘဂျင်းကို သွားမယ်။
MYA2: ဒီေွရာသီမှာ သေဘာတရားေရး ူပေဗဒကို ေလ့လာဖိ ေပဂျင်းကို သွားမယ်။
MYA3: ဒီေွရာသီမှာ ေဘဂျင်းကို သွားမယ် သီအိုရီ ူပေဗဒကို ေလ့လာဖိ။

(b) ENG: Computer Science is a branch of Science.
MYA1: ကွန်ပျတာသိပံသည် သိပံပညာ၏ ဘာသာရပ်ခွဲတစ်ခု ြဖစ်သည်။
MYA2: ကွန်ပျတာသိပံ ဟာ သိပံ ရဲ ဘာသာရပ်ခွဲတစ်ခု ြဖစ်ပါတယ်။
MYA3: ကွန်ပျတာသိပံ က သိပံပညာရဲ ဘာသာရပ်ခွဲတစ်ခု ပါ။

Figure 5: Examples of variation in Burmese translations by native speakers. (a) Differences in technical term usage, transliteration, and word order. (b) Differences in particle choice and formality.
Inconsistency in Syntax
Another problem we encountered is syntax inconsistency. As shown in Figure 5 (example a), MYA3's translation follows a similar structure to MYA1 but uses a different word order. Although the syntax is not strictly correct, the meaning remains understandable: MYA3 places the phrase "I am going to visit Beijing" before "for studying Theoretical Physics" to emphasize the trip itself. In Figure 5 (example b), the translations also vary in particle usage. Burmese has multiple particles with similar meanings, and different translators make different stylistic choices. For instance, MYA2 and MYA3 use colloquial language, whereas MYA1 employs formal expressions.

These examples illustrate the broader challenges of working with native-sourced and native-translated Burmese data. The lack of commonly followed standards leads to inconsistencies in terminology. As for syntax, although writing standards exist for both formal and informal styles, they are rarely followed. Burmese is a rich and flexible language, and in practice many speakers do not strictly adhere to formal grammar or standardized syntax, especially in informal contexts such as social media. This makes it difficult to ensure consistency, even among native speakers. Such variation can affect the quality and reliability of datasets, particularly for downstream tasks like machine translation or causal reasoning. To address this, we conducted a cross-revision process in which native speakers reviewed and refined each other's annotations or translations to improve consistency, clarity, and alignment with the meaning of the original native-sourced or native-translated datasets. With this approach, we ensure the high quality and consistency of our datasets, which reflect real-world usage.
C.2. Issues with English-Sourced and English-Adapted Data

English-sourced and English-adapted datasets were initially translated using either semi-automatic methods or manual translation. However, we found that automatic translations, such as those from Google Translate, while grammatically correct, were often unnatural, disfluent, and showed signs of translationese. Previous work likewise demonstrates that machine translation fails for low-resource languages, yielding low-quality output (Frontull and Ströhle, 2025; Court and Elsner, 2024) or a lack of localization (Song et al., 2025). To address this, we applied re-translation to make the samples sound more natural and native.

For the English-sourced dataset, Balanced COPA (Kavumba et al., 2019), no Burmese translation existed in previous work, so we needed to translate the dataset into Burmese. We used Google Translate, but several issues arose from literal translation of idioms, incorrect word choices, and reversed meanings, which impacted the semantic fidelity of the dataset. We therefore had a bilingual native speaker translate the corpus manually, rated the translation quality, and revised the remaining translation errors following the procedures discussed in Section 3.3 and Appendix B. These problems highlight the need for culturally and contextually aware translations. Similar issues were found in the original translated versions of English-adapted datasets such as FLORES+ and Belebele, where mistranslations, unclear question intents, and word-for-word translations often led to loss of meaning, especially in culturally sensitive or nuanced cases.
In another English-adapted dataset, XL-Sum, we identified several quality issues in the original translation, including inaccurate summaries, missing key information, inconsistent or misleading titles, and incomplete articles. These problems may affect the factual reliability and coherence of the data, highlighting the need for thorough human validation. Despite re-translation efforts, some samples still did not fully meet the fluency and sensibility criteria, as shown in Table 8. To ensure high-quality and consistent samples for evaluation, we conducted additional manual revisions. In BURMESE-SAN, we addressed these challenges through a rigorous translation and revision process in collaboration with native speakers.

D. Dataset Examples, Prompt Templates, and Models Evaluated

QA
Text: ဥက္ကာခဲများသည် ပရိုတင်း နှင့် ရှင်သန်ေရး အေထာက်အကူြပု များ ြဖစ်ေစနိုင်သည် ေအာ်ဂဲနစ်ပစ္စည်းနှင့်အတူ ကမ္ဘာသို့ ေရအရင်းြမစ်ကို သယ်ေဆာင်ေပးြခင်း ြဖစ်နိုင်သည်။ လွန်ခဲ့ေသာ နှစ်များစွာ အကာ ကမ္ဘာကီးနှင့် ကယ်တံခွန်များ တိုက်မိခဲ့စဉ် အချိန်ကတည်းက သိပ္ပံပညာရှင်များသည် ဂိုလ်များြဖစ်ေပါ်လာပံု၊ အထူးသြဖင့် ကမ္ဘာကီး ြဖစ်ေပါ်လာပံုကို နားလည်လိုကပါသည်။
Question: သိပ္ပံပညာရှင်များ ရှာေဖွသိရှိလိုသည့် တစ်စံုတစ်ရာမှာ အဘယ်နည်း။
Choices: (က) ကမ္ဘာကီးနှင့် ကယ်တံခွန်များ အြပင်းအထန်တိုက်မိခဲ့ချိန် (ခ) ပရိုတိန်းများ ြဖစ်ေပါ်ပံု (ဂ) သဘာဝြဒပ်အေကာင်း (ဃ) ကမ္ဘာကီး ြဖစ်တည်လာပံု
Answer: ဃ

SA
Text: ပစ္စည်းကလက်ေဆာင်ေရာပါလို့ေကာင်းပါတယ်။
Label: ကားေနခံစားချက် (neutral)

TD
Text: ညေန သွား ကျိတ် မယ် ဗျာ လုပ် လိုက် ေတာ့
Label: သန့်ရှင်း (Clean)

CR
Text: ကျွန်ုပ်၏ကွန်ြပူတာအသံသည် အလုပ်မလုပ်ပါ။
Question: အကျိုး (effect)
Choices: (0) စပီကာအသစ်ေတွ တပ်ဆင်ခဲ့တယ်။ (1) ကျွန်ုပ်၏ ေဒတာအားလံုးဆံုးရှုံးသွားခဲ့သည်။
Label: (0)
NLI
Sentence 1: ဤစာကို လက်ခံရရှိသည့် လူတိုင်း ၁၈ ေဒါ်လာေလာက်သာ ေပးခဲ့လျှင်။
Sentence 2: ဤစာကိုလက်ခံရရှိသူတိုင်း- သင်တို့၏ ပိုက်ဆံကို မလှူဒါန်းပါနဲ့၊ အဲဒါ လိမ်လည်မှုတစ်ခုြဖစ်တယ်။
Label: ဆန့်ကျင် (contradiction)

AS
Article: တကယ့်ကို သမိုင်းဝင်တဲ့ အခိုက်အတံ့ ြဖစ်ေကာင်း ဂျွန်ကယ်ရီ ေြပာ။ အေမရိကန် နိုင်ငံြခားေရး ဝန်ကီး တစ်ဦး အေနနဲ့ နှစ်ေပါင်း ၇၀ အတွင်း ကျူးဘားကို ပထမဆံုး သွားေရာက်သူ ြဖစ်လာတဲ့ ဂျွန်ကယ်ရီက ဟာဗားနား မို့ က အေမရိကန် သံရံုး ဖွင့်ပွဲ အလံတင် အခမ်းအနားကို ကီးကပ် ခဲ့ပါတယ်။ ၁၉၆၁ ခုနှစ်က အလံ ြဖုတ်ချခဲ့တဲ့ မရိန်းတပ်သား ၃ ဦးကပဲ အခု သံရံုး ဖွင့်ပွဲမှာ အလံတင် ခဲ့ပါတယ်။ အခုလို အလံလွှင့်တင်မှုဟာ သမိုင်းမှာ အေရးပါတဲ့ အခိုက်အတံ့ ြဖစ်တယ်လို့ မစ္စတာ ကယ်ရီက ဒီကေန့ ေသာကာေန့ အခမ်းအနားမှာ ေြပာခဲ့ပါတယ်။ ဒါေပမယ့် ကျူးဘားမှာ နိုင်ငံေရး အေြပာင်းအလဲ ြဖစ်ဖို့ ဖိအားေပးမှုေတွကိုေတာ့ အေမရိကန်ဘက်က ရပ်တန့်မှာ မဟုတ်ဘူးလို့ သူက သတိေပးခဲ့ပါတယ်။ အေမရိကန်ဘက်က ကျူးဘား သံရံုးကိုေတာ့ ဝါရှင်တန်မို့ မှာ ပီးခဲ့တဲ့ လက ဖွင့်ခဲ့ပီး ြဖစ်ပါတယ်။ ဒါေပမယ့် အေမရိကန်ဘက်က ကုန်သွယ်ေရး ပိတ်ဆို့ထားမှုကို မပယ်ဖျက် ေသးတဲ့ ကိစ္စနဲ့ ပတ်သက်ပီး ကျူးဘား သမ္မတ ဖီဒယ် ကက်စထရိုက လူသိရှင်ကား ေြပာခဲ့ပါတယ်။
Summary: ကျူးဘား နိုင်ငံမှာ ၅၄ နှစ်ေကျာ်ကာ ပိတ်ထားခဲ့တဲ့ အေမရိကန် သံရံုးကို ဒီကေန့ ေသာကာေန့မှာ ြပန်လည် ဖွင့်လှစ် လိုက်ပါတယ်။

MT
ENG: He recently lost against Raonic in the Brisbane Open.
MYA: သူသည် မကာေသးမီက ဘရစ်စ်ဘိန်းအိုးပင်းပိုင်ပွဲတွင် ေရာ်အိုနစ်ကို ရှုံးနိမ့်ခဲ့ရသည်။

Figure 6: Example data samples for each task in BURMESE-SAN.
သင့်ကို စာပိုဒ်တစ်ပိုဒ်၊ ေမးခွန်းတစ်ခု ှင့် ေရွးချယ်စရာ အေြဖ
ေလးခု ေပးထားပါလိမ့်မည်။ စာပိုဒ်ကိုအေြခခံပီး ေပးထား
ေသာ ေရွးချယ်စရာများထဲမှ တစ်ခုကို ေရွးချယ်၍ ေြဖဆိုပါ။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ-
အေြဖ- $OPTION
$OPTION ေနရာတွင် သင်ေရွးချယ်ထားေသာ အေြဖကို
အစားထိုးထည့်ပါ။ အေြဖအတွက် က၊ ခ၊ ဂ သိမဟုတ် ဃ
အကရာကို အသံုးြပပါ။{fewshot_examples}
စာပိုဒ်-
```
{text}
```
ေမးခွန်း- {question}
က- {choice1}
ခ- {choice2}
ဂ- {choice3}
ဃ- {choice4}
You will be given a paragraph, a question, and
four possible answers. Choose one of the given
options and answer based on the passage.
Answer using only the form below:
Answer: $OPTION
Substitute your chosen answer in place of
$OPTION. Use the letter A, B, C or D for the answer.
{fewshot_examples}
Paragraph:
```
{text}
```
Question: {question}
A: {choice1}
B: {choice2}
C: {choice3}
D: {choice4}
(a) Question Answering
ေအာက်ပါဝါကျစာေကာင်း၏ စိတ်ခံစားချက် မှာ အဘယ်နည်း။
စိတ်ခံစားချက် ဆိုသည်မှာ ေရးသား ေြပာဆိုေနစ် ထိုသူ၏ စိတ်
ထဲတွင် ြဖစ်ေနေသာ အြပသေဘာ၊ အပျက်သေဘာ၊ သိမဟုတ်
အြပမဟုတ် အပျက်မဟုတ်ေသာ ကားေန ခံစားချက် ြဖစ်သည်။
ေပးထားေသာ စာေကာင်းသည် အြပသေဘာ၊ အပျက်သေဘာ၊
သိမဟုတ် ကားေန မည်သည့် သေဘာှိသည်ကို စကားလံုးတစ်လံုး
တည်းြဖင့်သာ အမျိးအစား သတ်မှတ်ပါ။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ-
အေြဖ- $OPTION
$OPTION ေနရာတွင် သင်ေရွးချယ်ထားေသာ အေြဖကို အစားထိုး
ထည့်ပါ။ {fewshot_examples}
ဝါကျစာေကာင်း-
```
{text}
```
What is the sentiment of the following sentence?
Sentiment refers to the positive, negative, or
neutral feelings that the person has while writing
or speaking. Determine the type of sentiment in
the given sentence using just one word: positive,
negative, or neutral.
Use the following format to respond: Answer:
$OPTION Replace $OPTION with your chosen
answer.{fewshot_examples}
Sentence:
```
{text}
```
(b) Sentiment Analysis
သင်သည် အဆိပ်အေတာက် (toxicity) ှာေဖွစစ်ေဆးသည့်စနစ်
တစ်ခုြဖစ်သည်။ အန်းများကို ေအာက်ပါအတိုင်း အဓိပာယ်ဖွင့်
ဆိုထားပါသည်-
သန့်ှင်း- အဆိပ်အေတာက် (toxicity) လံုးဝမပါဝင်ပါ။
အမုန်းစကား- အဆိပ်အေတာက် (toxicity) သိမဟုတ်
တစ်စံုတစ်ေယာက်အား တိုက်ိုက် ပစ်မှတ်ထား တိုက်ခိုက်ြခင်းများ
ပါဝင်သည်။
အထက်ပါ အန်း အဓိပာယ်ဖွင့်ဆိုချက်များအရ၊ ေအာက်ေဖာ်ြပ
ပါ ဝါကျစာေကာင်း ကို သန့်ှင်း သိမဟုတ် အမုန်းစကား ဟူ၍
စကားလံုးတစ်လံုးတည်းြဖင့်သာ အမျိးအစား သတ်မှတ်ပါ။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ-
အေြဖ- $LABEL
$LABEL ကို ေရွးချယ်ထားသည့် အန်းြဖင့် အစားထိုးပါ။
ဝါကျစာေကာင်း- {text}
You are a toxicity detection system. The labels are
defined as follows:
Clean: No toxicity at all.
Hate speech: includes toxicity or attacks directly
targeting someone.
According to the above label definitions, classify
the sentence below as clean or hate speech using
only one word.
Answer using only the form below:
Answer: $LABEL
Replace $LABEL with the selected label.
Sentence: {text}
(c) Toxicity Detection
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ-
အေြဖ- $OPTION
$OPTION ေနရာတွင် သင်ေရွးချယ်ထားေသာ အေြဖကို
အစားထိုးထည့်ပါ။ အေြဖအတွက် က သိမဟုတ် ခ အကရာကို
အသံုးြပပါ။ {fewshot_examples}
ေပးထားေသာ အေြခအေနကို အေြခခံ၍ ေအာက်ပါေရွးချယ်စရာ
များထဲမှ မည်သည့်အရာက {question_translated} ြဖစ်ိုင်ေြခ
ပို၍ များသနည်း။
အေြခအေန-
```
{text}
```
ေအာက်ပါေရွးချယ်စရာများမှ အေကာင်းဆံုးအေြဖကို ေရွးချယ်
ပါ-
က- {choice1}
ခ- {choice2}
Answer using only the form below:
Answer: $OPTION
Substitute your chosen answer in place of
$OPTION. Use the letter a or b for the
answer.{fewshot_examples}
Based on the given situation, which of the
following options is {question_translated}
most likely?
Condition:
```
{text}
```
Choose the best answer from the following
options:
A: {choice1}
B: {choice2}
(d) Causal Reasoning
သင့်ကို SENTENCE_1 ှင့် SENTENCE_2 ဟူေသာ ဝါကျစာေကာင်း ှစ်
ခုကို ေပးထားပါမည်။ SENTENCE_1 ှင့် SENTENCE_2 တိအတွက်
ေအာက်ပါေဖာ်ြပချက်များထဲမှ မည်သည့်အချက်က အေကာင်းဆံုး ကိုက်
ညီမှိသည်ကို ဆံုးြဖတ်ပါ။
က- SENTENCE_1 မှန်ကန်လင် SENTENCE_2 သည် မုချမှန်ကန်ရ
မည်။
ခ- SENTENCE_1 သည် SENTENCE_2 ကို ဆန့်ကျင်သည်။
ဂ- SENTENCE_1 မှန်ကန်ေသာအခါ၊ SENTENCE_2 သည် မှန်ကန်ိုင်
သလို မမှန်ကန်ဘဲလည်း ှိိုင်ပါသည်။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ-
အေြဖ- $OPTION
$OPTION ေနရာတွင် သင်ေရွးချယ်ထားေသာ အေြဖကို အစားထိုးထည့်
ပါ။ အေြဖအတွက် က၊ ခ သိမဟုတ် ဂ အကရာကို အသံုးြပပါ။
{fewshot_examples}
SENTENCE_1-
```
{sentence1}
```
SENTENCE_2-
```
{sentence2}
```
You will be given two sentences called SENTENCE_1 and
SENTENCE_2. Decide which of the following statements
best matches SENTENCE_1 and SENTENCE_2.
A: If SENTENCE_1 is true, then SENTENCE_2 must be true.
B: SENTENCE_1 contradicts SENTENCE_2.
C: When SENTENCE_1 is correct, SENTENCE_2 may or
may not be valid.
Answer using only the form below:
Answer: $OPTION
Substitute your chosen answer in place of $OPTION.
Use the letter A, B or C for the answer.{fewshot_examples}
SENTENCE_1:
```
{sentence1}
```
SENTENCE_2:
```
{sentence2}
```
(e) Natural Language Inference
ေအာက်ပါ ြမန်မာေဆာင်းပါးကို ဝါကျ စာေကာင်း တစ်
ေကာင်း သိမဟုတ် ှစ်ေကာင်းပါေသာ စာပိုဒ်တစ်ပိုဒ်အြဖစ်
အကျ်းချပ်ေဖာ်ြပပါ။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ:
အကျ်းချပ်- $SUMMARY
$SUMMARY ကို အကျ်းချပ်ြဖင့် အစားထိုးပါ။
ေဆာင်းပါး-
```
{text}
```
Summarize the following Burmese
article as a paragraph of one or two
sentences.
Answer using only the form below:
Summary: $SUMMARY
Replace $SUMMARY with summary.
Article:
```
{text}
```
(f) Abstractive Summarization
ေအာက်ပါစာသားကို အဂလိပ်ဘာသာသိ ဘာသာြပန်ပါ။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ:
ဘာသာြပန်ချက်- $TRANSLATION
$TRANSLATION ကို ဘာသာြပန်ထားေသာ စာသားြဖင့်
အစားထိုးပါ။
စာသား-
```
{text}
```
Translate the following text into English.
Answer using only the form below:
Translation: $TRANSLATION
Replace $TRANSLATION with translated text.
Text:
```
{text}
```
(g) Machine Translation (to English)
ေအာက်ပါစာသားကို ြမန်မာဘာသာသိ ဘာသာြပန်ေပးပါ။
ေအာက်ေဖာ်ြပပါ ပံုစံကိုသာ အသံုးြပ၍ ေြဖဆိုပါ-
ဘာသာြပန်ချက်- $TRANSLATION
$TRANSLATION ကို ဘာသာြပန်ထားေသာ စာသားြဖင့်
အစားထိုးပါ။
စာသား-
```
{text}
```
Please translate the following text into Burmese.
Answer using only the form below:
Translation: $TRANSLATION
Replace $TRANSLATION with translated text.
Text:
```
{text}
```
(h) Machine Translation (to Myanmar)
Figure 7: Prompt templates used for BURMESE-SAN. English prompt versions are also provided.
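To illustrate how the templates in Figure 7 are instantiated and scored, the sketch below fills an abridged version of the English QA template and parses the model's reply. The helper names are ours, not part of the released evaluation harness; the Burmese templates differ mainly in using the answer marker "အေြဖ-" and the letters က, ခ, ဂ, ဃ in place of A-D.

```
import re

# Abridged English QA template from Figure 7 (illustrative, not verbatim).
QA_TEMPLATE = (
    "Answer using only the form below:\n"
    "Answer: $OPTION\n"
    "{fewshot_examples}"
    "Paragraph: {text}\n"
    "Question: {question}\n"
    "A: {choice1}\nB: {choice2}\nC: {choice3}\nD: {choice4}\n"
)

def build_prompt(text, question, choices, fewshot_examples=""):
    """Instantiate the template with one sample's fields."""
    c1, c2, c3, c4 = choices
    return QA_TEMPLATE.format(text=text, question=question,
                              fewshot_examples=fewshot_examples,
                              choice1=c1, choice2=c2, choice3=c3, choice4=c4)

# Replies are expected to follow "Answer: <letter>".
ANSWER_RE = re.compile(r"Answer:\s*([A-D])", re.IGNORECASE)

def parse_answer(reply: str):
    match = ANSWER_RE.search(reply)
    return match.group(1).upper() if match else None
```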
Models Source Link No. params
Instruction Tuning Models
ERNIE 4.5 https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-PT 300B MoE
Qwen 3 A22B https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507 235B MoE
Llama 4 Maverick https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct 400B MoE
DeepSeek V3.1 https://huggingface.co/deepseek-ai/DeepSeek-V3.1 671B MoE
SEA-LION v4 (Qwen 3) https://huggingface.co/aisingapore/Qwen-SEA-LION-v4-32B-IT 32B
Gemma 3 VL https://huggingface.co/google/gemma-3-27b-it 27B
SEA-LION v4 (Gemma 3) https://huggingface.co/aisingapore/Gemma-SEA-LION-v4-27B-IT 27B
Llama 4 Scout https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct 109B MoE
Qwen 3 Next https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct 80B MoE
Qwen 3 VL https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct 32B
Kimi K2 Instruct 0905 https://huggingface.co/moonshot-ai/Kimi-K2-Instruct-0905 1040B MoE
Gemma 3 VL https://huggingface.co/google/gemma-3-12b-it 12B
Qwen 3 VL https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct 32B
DeepSeek V3 https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 671B MoE
SEA-LION v3 (Llama 3.1) https://huggingface.co/aisingapore/Llama-SEA-LION-v3-70B-IT 70B
Tulu 3 https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B 70B
Qwen 3 VL https://huggingface.co/Qwen/Qwen3-VL-14B-Instruct 14B
SEA-LION v4 (Qwen 3 VL) https://huggingface.co/aisingapore/Qwen-SEA-LION-v4-8B-VL 8B
Qwen 3 VL https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct 8B
Qwen 2.5 https://huggingface.co/Qwen/Qwen2.5-72B-Instruct 72B
Qwen 2.5 https://huggingface.co/Qwen/Qwen2.5-32B-Instruct 32B
Qwen 3 VL https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct 8B
Mistral Large 2411 https://huggingface.co/mistralai/Mistral-Large-Instruct-2411 123B
Qwen 3 A3B https://huggingface.co/Qwen/Qwen3-30B-A3B 30B MoE
Gemma 2 https://huggingface.co/google/gemma-2-27b-it 27B
SEA-LION v4 (Qwen 3 VL) https://huggingface.co/aisingapore/Qwen-VL-SEA-LION-v4 4B
Llama 3.3 https://huggingface.co/meta-llama/Llama-3.3 70B
Llama 3.1 https://huggingface.co/meta-llama/Llama-3.1 70B
SEA-LION v3 (Llama 3.1) https://huggingface.co/aisingapore/Llama-SEA-LION-v3-8B-IT 8B
ERNIE 4.5 https://huggingface.co/ERNIE/ERNIE-4.5 21B MoE
Command A 03-2025 https://huggingface.co/CohereLabs/c4ai-command-a-03-2025 111B
SEA-LION v3 (Gemma 2) https://huggingface.co/aisingapore/Gemma-SEA-LION-v3-9B-IT 9B
Qwen 2.5 https://huggingface.co/Qwen/Qwen2.5-14B-Instruct 14B
Llama 3 https://huggingface.co/meta-llama/Meta-Llama-3-70B 70B
Apertus https://huggingface.co/swiss-ai/Apertus-70B-2509 70B
Sailor2 https://huggingface.co/sail/Sailor2-8B 8B
Tulu 3 https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT 8B
MERaLiON 2 https://huggingface.co/MERaLiON/MERaLiON-2-10B 10B
Babel https://huggingface.co/Tower-Babel/Babel-83B 83B
Gemma 2 https://huggingface.co/google/gemma-2-9b-it 9B
Apertus https://huggingface.co/swiss-ai/Apertus-8B-2509 8B
Sailor2 https://huggingface.co/sail/Sailor2-20B 20B
Llama 3.1 https://huggingface.co/meta-llama/Llama-3.1-8B 8B
Qwen 2.5 https://huggingface.co/Qwen/Qwen2.5-7B-Instruct 7B
Babel https://huggingface.co/Tower-Babel/Babel-9B 9B
SeaLLMs V3 https://huggingface.co/SeaLLMs/SeaLLMs-v3-7B-Chat 7B
Command R+ 08-2024 https://huggingface.co/CohereLabs/c4ai-command-r-plus-08-2024 104B
phi-4 https://huggingface.co/microsoft/phi-4 14B
Aya Expanse https://huggingface.co/CohereLabs/aya-expanse-32b 32B
Command R 08-2024 https://huggingface.co/CohereLabs/c4ai-command-r-08-2024 32B
Olmo 2 0325 https://huggingface.co/allenai/OLMo-2-0325-32B-Instruct 32B
Ministral 2410 https://huggingface.co/mistralai/Ministral-8B-Instruct-2410 8B
Olmo 3 https://huggingface.co/allenai/Olmo-3-7B-Instruct 7B
Llama 3 https://huggingface.co/meta-llama/Meta-Llama-3-8B 8B
Command R7B 12-2024 https://huggingface.co/CohereLabs/c4ai-command-r7b-12-2024 7B
Aya Expanse https://huggingface.co/CohereLabs/aya-expanse-8b 8B
Mistral Small 3.1 2503 https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503 24B
Olmo 2 1124 https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct 7B
Olmo 2 1124 https://huggingface.co/allenai/OLMo-2-1124-13B-Instruct 13B
SEA-LION v4 (Gemma 3 VL) https://huggingface.co/aisingapore/Gemma-SEA-LION-v4-4B-VL 4B
Gemma 3 VL https://huggingface.co/google/gemma-3-4b-it 4B
Qwen 3 VL https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct 4B
SEA-LION v4 (Apertus) https://huggingface.co/aisingapore/Apertus-SEA-LION-v4-8B-IT 8B
Reasoning Models
Qwen 3 (Thinking) https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507 235B MoE
DeepSeek V3.1 Thinking https://huggingface.co/deepseek-ai/DeepSeek-V3.1 671B MoE
Qwen 3 Next (Thinking) https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking 80B MoE
DeepSeek R1 0528 https://huggingface.co/deepseek-ai/DeepSeek-R1-0528 671B MoE
Qwen 3 (Thinking) https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507 30B MoE
Qwen 3 (Thinking) https://huggingface.co/Qwen/Qwen3-32B 32B
GPT OSS https://platform.openai.com/docs/models 120B MoE mxfp4
SEA-LION v3.5 R (Llama) https://huggingface.co/aisingapore/Llama-SEA-LION-v3.5-70B-R 70B
Qwen 3 (Thinking) https://huggingface.co/Qwen/Qwen3-14B 14B
Qwen 3 (Thinking) https://huggingface.co/Qwen/Qwen3-8B 8B
GPT OSS https://huggingface.co/openai/gpt-oss-20b 20B MoE mxfp4
Qwen 3 (Thinking) https://huggingface.co/Qwen/Qwen3 4B
QwQ https://huggingface.co/Qwen/QwQ-32B-Preview 32B
Reka Flash 3.1 https://huggingface.co/RekaAI/reka-flash-3.1 21B
SEA-LION v3.5 R (Llama) https://huggingface.co/aisingapore/Llama-SEA-LION-v3.5-8B-R 8B
Olmo 3 Think https://huggingface.co/allenai/Olmo-3-32B-Think 32B
Olmo 3 Think https://huggingface.co/allenai/Olmo-3-7B-Think 7B

Table 9: Links and sizes of models evaluated in our current experiments.