SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?
Summary
This paper introduces SEA-BED, a large-scale benchmark designed to evaluate multilingual text embedding models across 10 Southeast Asian (SEA) languages and 9 task types. The authors challenge the assumption that these models create a stable, perspective-independent semantic space, showing instead that performance varies significantly across languages and tasks. SEA-BED includes 169 datasets, with 120 authored by native speakers and 11 new datasets for Thai and Burmese, addressing gaps in existing benchmarks like MMTEB. Evaluations of 17 models reveal no single model performs uniformly well across all languages or tasks. Performance is highly conditional, with task difficulty and language-specific factors influencing results. The study highlights three key areas for improvement: expanding high-quality datasets, refining training algorithms to handle semantic and cross-lingual variation, and adapting architectures to better support SEA linguistic diversity. The authors emphasize the need for broader, more nuanced evaluation frameworks to uncover inconsistencies in semantic representation and guide future model development.
SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?
Wuttikorn Ponwitayarat1*,†, Peerat Limkonchotiwat2*, Raymond Ng2*, Jann Railey Montalan2,
Thura Aung3, Jian Gang Ngui2, Yosephine Susanto2, William Chandra Tjhi2,
Panuthep Tasawong1,†, Erik Cambria4, Ekapol Chuangsuwanich5, Sarana Nutanong1
1Vidyasirimedhi Institute of Science and Technology, 2AI Singapore,
3King Mongkut’s Institute of Technology Ladkrabang, 4Nanyang Technological University,
5Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University
{wuttikorn.p_s22,panuthep.t_s20,snutanon}@vistec.ac.th,
{peerat,raymond,railey,jiangangngui,yosephine,wtjhi}@aisingapore.org,
66011606@kmitl.ac.th, cambria@ntu.edu.sg, ekapolc@cp.eng.chula.ac.th
Abstract
Multilingual text embeddings are often assumed to encode meaning in a perspective-independent semantic space, yielding stable similarity judgments across tasks and languages. Our results show that this assumption does not hold in practice. We introduce SEA-BED, a large-scale benchmark covering 10 Southeast Asian (SEA) languages and diverse embedding tasks, designed to systematically examine how embedding performance varies across tasks, languages, and language-task combinations. Across extensive evaluations, we observe that no single model performs uniformly well across SEA languages; task difficulty differs markedly within languages, and success on one task does not reliably generalize to others. Language-task analyses further reveal highly non-uniform performance landscapes, where performance varies across different language-task combinations. These findings call for closer attention to performance measurements that provide an expansive view across languages and tasks to uncover inconsistencies in semantic representation. Based on these observations, we provide insights for future model development, including data, algorithmic, and architectural considerations.
1 Introduction
Text embedding plays a crucial role in NLP by transforming complex linguistic structures into fixed-size vectors that capture semantic locality. These embeddings are fundamental for various downstream tasks, including semantic textual similarity, retrieval, and re-ranking. New embedding models have also shifted toward global multilinguality. For instance, Wang et al. (2024a) proposed a multilingual model on Mistral-7B that supports 93 languages, while Jina-embeddings-v3 (Sturua et al., 2024a) includes 89 languages.

A common ideal motivating such multilingual development is that a single embedding model should approximate a robust semantic space handling multiple languages simultaneously. This ideal posits that semantic equivalence should primarily determine geometric similarity, independent of linguistic surface forms, an assumption that underlies the use of multilingual embeddings in cross-lingual retrieval (Conneau and Kiela, 2018), semantic search, and transfer learning. In practice, however, nothing in modern embedding architectures guarantees such ideal behavior. Multilingual text embeddings are learned from co-occurrence statistics in training data, and they reflect the linguistic distributions, translation conventions, cultural norms, and domain-specific usage patterns present in that data. Hence, real embedding spaces often diverge across tasks (the "task consistency problem") and languages (the "language consistency problem"), revealing gaps between the ideal of perspective independence and the structure that models actually capture.

* Equal contributions.
† Work was conducted while Wuttikorn Ponwitayarat and Panuthep Tasawong were visiting scholars at AI Singapore.
Link to SEA-BED: https://leaderboard.sea-lion.ai/embedding/SEA
Southeast Asia (SEA) presents a uniquely rich environment for exploring the real-world limitations of multilingual text embedding models. The region spans multiple language families (Austronesian, Tai-Kadai, Sino-Tibetan, Austroasiatic, among others), writing systems (Latin alphabet, abugida, logographic), and typological properties (isolating, analytic, and agglutinative structures). Despite the rich epistemic opportunities, SEA languages remain underrepresented in existing embedding benchmarks. A practical approach to multilingual benchmarking is translation. For example, XNLI (Conneau and Kiela, 2018), Tatoeba (community, 2021), and SIB-200 (Adelani et al., 2024) rely on samples translated from English to extend language and task coverage efficiently. While translation can approximate native data for certain tasks, it may also alter semantic or discourse properties in lower-resource settings. An alternative is to use native-authored datasets, which provide stronger linguistic and cultural grounding, motivating MMTEB's recent focus on provenance nativity (Enevoldsen et al., 2025).
However, MMTEB includes only 22 SEA datasets across 10 languages, resulting in limited coverage and task diversity. As a result, current benchmarks provide a narrow view of how embedding models behave in such an epistemically rich testing ground.

Proposed Benchmark. When constructing a multilingual benchmark, the first design decision concerns the evaluation scope: the aspects of model behavior we aim to observe. In this regard, SEA-BED spans 9 task types across 10 SEA languages, enabling broad cross-task and cross-lingual comparative analysis of multilingual embedding models. The second decision concerns data sourcing, with the goal of maximizing breadth and depth of coverage while ensuring data quality within the defined scope. SEA-BED includes 169 datasets, most of which are not covered by existing multilingual sentence embedding benchmarks. We combine native-authored datasets with carefully curated translated datasets, enabling systematic comparison between native and transferred provenance. We additionally contribute 11 new datasets for Thai and Burmese, enabling deeper examination of semantic similarity, relation understanding, and cross-lingual transfer. A comparison between SEA-BED and existing benchmarks is given in Table 1.

| Benchmark | # Languages | # SEA Languages | # Datasets | # Tasks | # New datasets | # SEA datasets | # Human-crafted datasets (only SEA languages) |
|---|---|---|---|---|---|---|---|
| MTEB-French (Ciancone et al., 2024a) | 1 | N/A | 18 | 8 | 3 | N/A | N/A |
| C-Pack (Xiao et al., 2024a) | 1 | N/A | 35 | 6 | 35 | N/A | N/A |
| SEB (Enevoldsen et al., 2024) | 4 | N/A | 24 | 4 | 24 | N/A | N/A |
| MMTEB (Enevoldsen et al., 2025) | 1,090 | 9 | 270 | 10 | 5 | 22 | 21 |
| SEA-BED (ours) | 10 | 10 | 169 | 9 | 11 | 169 | 120 (71.01%) |

Table 1: The statistics of our benchmark compared to existing text embedding benchmarks.

Proposed Study. With SEA-BED in place, we can systematically examine aspects of multilingual text embeddings that were previously unmeasurable in the SEA region, particularly how model behavior varies across tasks and languages. We focus on a single research question: how does embedding model behavior vary across tasks and Southeast Asian languages?
Specifically, we investigate performance patterns under various evaluation conditions. To this end, our analysis adopts three complementary comparison views: (i) a language-model view, which analyzes how model behavior varies across languages; (ii) a task-model view, which examines how different embedding models perform across task categories; and (iii) a language-task view, which aggregates model performance over language-task combinations to reveal conditional patterns that emerge at their intersection. These three comparative views provide a structured characterization of embedding behavior across languages, tasks, and model dimensions.

Key Results. Model performance across tasks and SEA languages shows clear drift across languages and task types, with reconfigurations of relative task difficulty rather than uniform scaling, especially in Burmese and Lao. Furthermore, there is no single model that performs uniformly well across SEA languages and task types. These results indicate that embedding performance is inherently task- and language-dependent. Insights from our analysis also point to three avenues for performance improvement: enhancing data quality and coverage, refining algorithmic handling of semantic and cross-lingual variation, and adapting architectural designs to better accommodate SEA linguistic diversity.
Our contributions are as follows.
• Resource. We introduce SEA-BED, a benchmark of 169 datasets across 9 task types and 10 SEA languages, including 11 new Thai and Burmese datasets that expand coverage of previously missing task-language combinations.
• Experimental Studies. We evaluate 17 embedding models through a set of studies to analyze performance variation across tasks, languages, and models.
• Insights for Future Model Development. Our findings reveal substantial instability in model performance and highlight concrete directions for improving multilingual embedding robustness in SEA settings.

[Figure 1: An overview of SEA-BED, featuring 169 datasets, 9 tasks, and 10 languages. Datasets per language: Indonesian 70, Thai 55, Vietnamese 41, Burmese 35, Filipino 31, Khmer 22, Malay 19, Lao 19, Tamil 18, Tetum 4. Datasets per task: Classification 73 (43.2%), Bitext Mining 26 (15.4%), Retrieval 20 (11.8%), Pair Classification 13 (7.7%), Multi-label Classification 11 (6.5%), STS 11 (6.5%), Clustering 10 (5.9%), Instruction Retrieval 4 (2.4%), Reranking 1 (0.6%).]

2 Proposed Benchmark: SEA-BED

2.1 Language-Task Coverage

We constructed SEA-BED with broad language-task coverage to extend existing evaluation resources for Southeast Asian languages. As shown in Figure 1, SEA-BED spans 10 languages and 9 task types, covering linguistic contexts that are substantially underrepresented in benchmarks such as MMTEB. This expanded structure enables the systematic investigation of embedding behavior across tasks and languages.
We adopt the MMTEB task taxonomy (Enevoldsen et al., 2025) to ensure comparability with prior work, and extend their coverage of SEA languages. The benchmark spans the following task types.
(i) Classification. Learn a classifier over sentence embeddings to assign labels to individual sentences.
(ii) Multi-label Classification. Predict multiple labels for each input text using a classifier trained on embeddings.
(iii) Pair Classification. Predict a binary relationship between two sentences based on their embedding similarity.
(iv) Semantic Textual Similarity (STS). Measure similarity between sentence pairs using continuous scores derived from distance metrics computed over their embeddings.
(v) Clustering. Group embedded texts into clusters based on semantic similarity, using k-means with k set to the number of unique labels.
(vi) Bitext Mining. Identify translation pairs across two languages by retrieving the closest match for each sentence in a source set.
(vii) Retrieval. Retrieve relevant documents for a given query by computing embedding similarity between the query and candidate texts.
(viii) Instruction Retrieval. Extend traditional retrieval by incorporating detailed instructions into queries, pairing each query with a corresponding detailed instruction that outlines the criteria for determining document relevance.
(ix) Reranking. Reorder a set of candidate documents based on embedding similarity to a query to improve relevance ranking.
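To make the protocol concrete, below is a minimal sketch of how two of these task types can be scored from precomputed embeddings. The random vectors are placeholders for real model outputs; the clustering recipe (k-means with k equal to the number of unique labels, scored by V-measure) follows the definition above, while Spearman correlation of cosine similarities is the usual MTEB-style STS metric and is assumed here.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between paired embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def sts_score(emb1, emb2, gold):
    """STS: Spearman correlation between cosine similarity and gold scores."""
    return spearmanr(cosine_sim(emb1, emb2), gold).correlation

def clustering_score(embeddings, labels, seed=0):
    """Clustering: V-measure of k-means, with k = number of unique labels."""
    k = len(set(labels))
    pred = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
    return v_measure_score(labels, pred)

# Toy data standing in for sentence embeddings produced by a model.
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(sts_score(e1, e2, gold=[0.1, 0.9, 0.5, 0.3, 0.7]))
print(clustering_score(rng.normal(size=(20, 8)), labels=[i % 4 for i in range(20)]))
```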
Note that summarization is omitted due to data availability constraints.

Table 2 summarizes the task subtypes and their language coverage. SEA-BED contains 19 subtypes across 9 task types and substantially increases the number of language-subtype pairs relative to MMTEB (from 10 to 19). Notably, several subtypes (Language Identification, Toxic Language Detection, Dialect Pairing, Instruction QA, and Article Reranking) are introduced for the first time in SEA language evaluations.

| Task type | Subtype | Reused (⟳) | Extended (△) | New (▲) |
|---|---|---|---|---|
| Classification | Language Identification | – | – | 9 |
| Classification | Sentiment | 4 | 1 | 2 |
| Classification | Topic Classification | 7 | 1 | 1 |
| Classification | Toxic Language Detection | – | – | 4 |
| Multi-label Classification | Sentiment | – | – | 2 |
| Multi-label Classification | Topic Classification | – | – | 4 |
| Multi-label Classification | Toxic Language Detection | – | – | 1 |
| Pair Classification | Textual Entailment | 3 | – | 6 |
| STS | Multilingual STS | 1 | – | 8 |
| STS | Cross-lingual STS | 1 | – | 2 |
| Clustering | Topic Clustering | 8 | 1 | – |
| Bitext Mining | Cross-lingual pairing | 8 | 1 | 1 |
| Bitext Mining | Dialect pairing | – | – | 8 |
| Bitext Mining | Written-forms pairing | – | – | 4 |
| Retrieval | Article Retrieval | 2 | – | 2 |
| Retrieval | Long Document Retrieval | – | – | 1 |
| Retrieval | Question Answering | 1 | – | 4 |
| Instruction Retrieval | Instruction QA | – | – | 3 |
| Reranking | Article Reranking | – | – | 2 |

Table 2: Task coverage of SEA-BED compared to MMTEB across the 10 languages (ind, tha, vie, mya, fil, khm, zsm, lao, tam, tet), summarized per subtype as the number of languages whose datasets are present in MMTEB and directly reused (⟳), present in MMTEB and extended in SEA-BED (△), or entirely new (▲).
2.2 Evaluation Data Sourcing

Data Sourcing Overview. We now turn our attention to how datasets were sourced and organized to enable systematic evaluation within the scope described in the previous subsection. SEA-BED consists of 169 datasets spanning 9 task types and 10 languages, as previously discussed in Table 1. While MMTEB contains 270 datasets, only 22 involve SEA languages, reflecting limited regional representation. In contrast, 147 of our datasets (86%) are not included in MMTEB, highlighting substantial gaps in existing benchmark coverage and making SEA-BED substantially more representative of SEA linguistic diversity. Of these, 120 were authored by native speakers of their respective languages, while the rest are sourced from translated datasets. This combination enables the study of provenance effects across task types and languages. We also introduce 11 new datasets for Thai and Burmese to expand coverage of previously underrepresented task-language combinations. Benchmark efficiency considerations (e.g., caching and downsampling) are described in Appendix B.

Domain Coverage. Domain diversity is essential for realistic and discriminative evaluation, as semantic representations vary substantially across domains and influence model robustness in real-world applications. MMTEB provides coverage of widely studied domains in SEA languages (e.g., news, non-fiction, and encyclopedias). SEA-BED complements and extends this foundation by spanning 17 domains across 10 SEA languages, including a broader range of formal, informal, and application-oriented texts. Much of this coverage is supported by new or newly sourced datasets, with some domains (academic, blogs, medical, and subtitles) newly introduced. Table 3 presents the domain coverage of the SEA-BED benchmark compared to MMTEB across languages, indicating reused, extended, and newly introduced domains. The full domain descriptions are provided in Appendix C.
| Domain | Reused (⟳) | Extended (△) | New (▲) |
|---|---|---|---|
| Academic | – | – | 1 |
| Blog | – | – | 5 |
| Constructed | – | – | 6 |
| Encyclopedia | 1 | 7 | 1 |
| Fiction | 3 | 1 | 5 |
| Government | 1 | 2 | 7 |
| Legal | 1 | – | 7 |
| Medical | – | – | 2 |
| News | – | 9 | – |
| Non-fiction | 1 | 7 | 1 |
| Religious | 5 | 1 | 4 |
| Reviews | – | 3 | 3 |
| Social | 1 | 1 | 4 |
| Spoken | – | 8 | 2 |
| Subtitles | – | – | 1 |
| Web | 1 | 1 | 5 |
| Written | – | 8 | 2 |

Table 3: Domain coverage of the SEA-BED benchmark compared to MMTEB across the 10 languages, summarized per domain as the number of languages whose data are present in MMTEB and directly reused (⟳), present in MMTEB but extended in SEA-BED (△), or entirely new (▲).

Data Quality Assurance. Data quality assurance is achieved through systematic review processes conducted by native speakers of SEA languages, all of whom are also proficient in English. These annotators verified and validated the data for grammatical correctness, native written style, appropriate language usage (excluding code-switching), and the accuracy of the gold standard annotations. This human verification process resulted in approximately 8% of the datasets being removed, reducing the number from 182 to 169 datasets. See Appendix A for the annotator guidelines and details.
New Datasets. In addition to curated datasets, we construct new Thai and Burmese datasets for semantic textual similarity, natural language inference, and multi-label classification. These new resources address task-level gaps in SEA languages and expand evaluation coverage in previously underrepresented settings. Given the importance of these tasks for downstream retrieval and re-ranking performance (Gao et al., 2021; Chuang et al., 2022), we release 4 Thai datasets comprising 3,147 samples and 7 Burmese datasets with 13,177 samples to support systematic evaluation.

As shown in Table 4, we construct these datasets by human verification of Google NMT-translated data from established English benchmarks for semantic textual similarity and natural language inference, including STSBenchmark (Cer et al., 2017), STS-2017 (Cer et al., 2017), STS-2022 (Chen et al., 2022), STS-2024 (Ousidhoum et al., 2024b), BIOSSES (Soğancıoğlu et al., 2017), and XNLI (Conneau et al., 2018), which serve as the original texts for dataset construction. We also translate the Thai multi-label dataset Prachathai67k (cstorm125, 2019) into Burmese,¹ providing a stronger starting point for creating Burmese resources. Translations were performed by native Thai and Burmese speakers (see Appendix A for annotator demographics and guidelines) following two instructions: (i) produce natural, conversational language, and (ii) make sentence subjects gender-neutral, given that both languages encode gender morphologically.

¹ Thai is a more culturally and politically compatible source for Burmese translation than English.

| Dataset | mya | tha |
|---|---|---|
| BIOSSES | 100 | 100 |
| STS17 | 250 | 250 |
| STS22 | 197 | 197 |
| STS24 | 2,600 | 2,600 |
| STSBenchmark | 2,880 | – |
| XNLI | 5,000 | – |
| Prachathai67k | 2,150 | – |
| Total number of samples | 13,177 | 3,147 |

Table 4: Statistics of the new evaluation datasets included in SEA-BED.
3 Experimental Settings

Models. We evaluate 13 open-source and 4 proprietary multilingual text embedding models on SEA-BED, spanning both encoder-based and decoder-based architectures. Our model selection aims to encompass representative design choices in contemporary multilingual embedding models, rather than providing an exhaustive comparison or establishing a single best-performing approach. All models are treated as black-box embedding functions, and our analysis focuses on observed performance variation across tasks and languages. Detailed model descriptions, training backgrounds, and per-model results are provided in Appendix I.

Evaluation Setup. We employ F1 for Bitext Mining, Classification, and Multi-label Classification. For Pair Classification, we use average precision (AP) as the main metric. For Clustering, we use the V-measure metric. For Retrieval, we use various metrics (i.e., nDCG@k, MRR@k, MAP@k, precision@k, and recall@k), with nDCG@10 as the primary metric. For Reranking, we use Mean Average Precision (MAP). In addition, we use nDCG@5 as the main metric for Instruction Retrieval, following Weller et al. (2024). We employ an averaging strategy similar to previous works (Muennighoff et al., 2023; Enevoldsen et al., 2025), where all tasks are weighted equally in the average and reported with the standard deviation (SD). We acknowledge that the metrics differ across tasks (e.g., F1 for classification and nDCG@10 for retrieval); thus, we analyze both individual and average results, rather than focusing solely on the average score. All experiments were run on eight H100 GPUs (80 GB).
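To illustrate the averaging, the sketch below recomputes the headline score of multilingual-e5-large-instruct from its per-task scores in Table 6; using the population standard deviation reproduces the ± value reported in the tables.

```python
import statistics

# Per-task scores of multilingual-e5-large-instruct from Table 6; each is
# already an average over that task's datasets, and the underlying metric
# differs by task (F1, AP, V-measure, nDCG@10, nDCG@5, MAP).
task_scores = {
    "Classification": 77.70, "Multi-label Classification": 87.84,
    "Pair Classification": 66.58, "STS": 75.59, "Clustering": 58.09,
    "Bitext Mining": 87.86, "Retrieval": 77.16,
    "Instruction Retrieval": 69.10, "Reranking": 77.24,
}

scores = list(task_scores.values())
mean = statistics.mean(scores)   # every task weighted equally
sd = statistics.pstdev(scores)   # population SD matches the reported ±
print(f"{mean:.2f}±{sd:.2f}")    # -> 75.24±9.06
```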
4 Experimental Results

4.1 Language-Model Comparisons

This study presents a language-wise analysis of behavior across languages and models to pinpoint which SEA languages and scripts see the largest performance drops. Note that datasets can be counted more than once in multilingual scenarios, resulting in a higher number of datasets than in the task-model comparison (Table 6).

Results. As shown in Table 5, we observe performance variation across languages that, when compared to the task-level results in Table 6, suggests that strong overall performance does not necessarily imply consistent multilingual coverage. For example, while multilingual-e5-large-instruct achieves the highest average score overall (78.93), its performance varies across languages, ranging from 69.40 points in Tetum to 84.60 points in Malay. We observe the same trend in proprietary models. Moreover, we found that some models do not fully support SEA languages: GritLM-7B does not support Burmese, Khmer, and Lao, while bge-multilingual-gemma2 does not support Lao.

| Model | ind | tha | vie | mya | fil | khm | zsm | lao | tam | tet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of datasets | (70) | (55) | (41) | (35) | (31) | (22) | (19) | (19) | (18) | (4) | (314) |
| multilingual-e5-large-instruct (560M) | 79.50 | 81.11 | 78.00 | 78.37 | 79.19 | 78.13 | 84.60 | 83.94 | 77.09 | 69.40 | 78.93±3.98 |
| Qwen3-Embedding-8B (8B) | 79.73 | 81.49 | 78.99 | 74.91 | 78.05 | 75.46 | 82.39 | 78.20 | 75.95 | 67.44 | 77.26±4.02 |
| bge-m3 (568M) | 78.09 | 77.59 | 75.91 | 73.12 | 75.78 | 76.23 | 82.54 | 82.26 | 77.51 | 65.53 | 76.46±4.55 |
| multilingual-e5-large (560M) | 78.59 | 79.89 | 78.93 | 70.28 | 77.98 | 72.11 | 80.10 | 79.91 | 77.83 | 63.55 | 75.92±5.22 |
| bge-multilingual-gemma2 (9B) | 79.93 | 80.58 | 78.76 | 70.01 | 79.61 | 74.39 | 83.38 | 65.82 | 80.96 | 65.05 | 75.85±6.31 |
| LaBSE (471M) | 73.98 | 70.20 | 72.60 | 73.63 | 76.99 | 74.06 | 82.87 | 79.84 | 76.59 | 69.11 | 74.99±3.99 |
| multilingual-mpnet-base (278M) | 74.60 | 73.91 | 72.66 | 61.19 | 52.02 | 64.44 | 75.48 | 65.63 | 63.31 | 50.78 | 65.40±8.53 |
| e5-mistral-7b-instruct (7B) | 79.23 | 74.77 | 75.37 | 48.85 | 78.10 | 56.49 | 78.82 | 27.99 | 66.73 | 66.73 | 65.32±15.74 |
| GritLM-7B (7B) | 80.47 | 72.84 | 77.37 | 45.05 | 77.49 | 52.58 | 78.41 | 30.07 | 60.42 | 69.67 | 64.44±16.13 |
| Qwen3-Embedding-0.6B (595M) | 75.60 | 75.85 | 75.13 | 49.08 | 63.11 | 44.10 | 69.51 | 29.78 | 61.12 | 63.38 | 60.67±14.55 |
| multilingual-MiniLM-L12 (118M) | 71.48 | 70.42 | 69.90 | 54.48 | 47.28 | 39.92 | 69.58 | 45.34 | 27.88 | 47.69 | 54.40±14.53 |
| Gemma-SEA-LION-v3-9B-IT (9B) | 49.86 | 41.67 | 51.90 | 30.80 | 54.14 | 39.53 | 49.24 | 22.01 | 29.20 | 25.06 | 39.34±11.27 |
| Sailor2-8B-Chat (8B) | 49.54 | 35.98 | 42.94 | 30.14 | 46.16 | 28.57 | 30.75 | 18.31 | 28.57 | 25.76 | 33.67±9.33 |
| Proprietary models | | | | | | | | | | | |
| embed-multilingual-v3.0 | 79.72 | 80.99 | 78.93 | 76.13 | 78.99 | 77.01 | 82.42 | 83.34 | 78.87 | 66.76 | 78.32±4.39 |
| jina-embeddings-v3 | 77.35 | 78.64 | 76.10 | 75.10 | 74.25 | 74.73 | 77.91 | 77.91 | 76.14 | 65.11 | 75.32±3.68 |
| voyage-3 | 75.56 | 69.78 | 73.68 | 48.19 | 71.43 | 35.02 | 69.13 | 24.27 | 67.28 | 61.48 | 59.58±16.83 |
| text-embedding-3-small | 78.34 | 55.24 | 70.06 | 32.79 | 68.08 | 30.15 | 69.78 | 23.97 | 35.38 | 65.09 | 52.89±19.18 |

Table 5: Language-model performance view, where each cell reports scores averaged over all evaluated task types.
This is because the training datasets for these languages are smaller and of lower quality. Although previous works demonstrate the use of LLMs to generate more datasets (Zhang et al., 2025; Muennighoff et al., 2024), applying this methodology to low-resource languages is underexplored and might not be effective. Thus, although these models performed well overall in this experimental study, their lack of support for some SEA languages renders them less suitable for real-world applications involving SEA languages.
Discussion. The experimental results demonstrate the "language consistency" problem, where models exhibit inconsistent performance across different languages. Although multilingual-e5-large-instruct performs best overall, no model performs best for all languages: while multilingual-e5-large-instruct performs well on Burmese, Khmer, Malay, and Lao, Qwen3-Embedding-8B performs well on Thai and Vietnamese, bge-multilingual-gemma2 performs well on Filipino and Tamil, and GritLM-7B performs well on Indonesian and Tetum. This inconsistency of model performance across languages renders the overall evaluation results inconclusive for all models. Making an embedding model support all languages equally well remains a challenge and an open question in the field of text understanding. Note that we also experimented with tokenizer analysis and language similarity to understand the underrepresented languages in Appendix E and Appendix F, respectively.

4.2 Task-Model Comparisons

To understand per-task performance, we ask which tasks remain particularly challenging for state-of-the-art models across SEA languages.

Results. As shown in Table 6, the experimental results demonstrate that multilingual-e5-large-instruct performs best on our benchmark, achieving 75.24 points on the average score. The second-best model (Qwen3-Embedding-8B) trails multilingual-e5-large-instruct by only 0.06 points on average, despite a roughly 14-fold difference in model parameters (560M vs. 8B).
Moreover, although Gemma-SEA-LION-v3 and Sailor2 were specifically trained for SEA languages, these models did not perform well on our text embedding benchmark because they were designed for generation rather than embedding. For the proprietary models, in contrast to previous work (Muennighoff et al., 2023), which found that proprietary models outperform open-source models, we found that all proprietary models perform below multilingual-e5-large-instruct and Qwen3-Embedding-8B. This suggests that the proprietary models may be primarily trained on English and not optimized for SEA languages.

| Model | Dim. | Clf | M. Clf | Pr. Clf | STS | Clust | Btxt | Rtrvl | In. Rtrvl | Rrnk | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of datasets | | (73) | (11) | (13) | (11) | (10) | (26) | (20) | (4) | (1) | (169) |
| multilingual-e5-large-instruct (560M) | 1024 | 77.70 | 87.84 | 66.58 | 75.59 | 58.09 | 87.86 | 77.16 | 69.10 | 77.24 | 75.24±9.06 |
| Qwen3-Embedding-8B (8B) | 4096 | 78.60 | 90.57 | 63.10 | 75.31 | 52.93 | 84.78 | 81.99 | 70.81 | 78.51 | 75.18±10.84 |
| bge-multilingual-gemma2 (9B) | 3584 | 78.13 | 90.89 | 73.87 | 72.53 | 49.14 | 82.02 | 80.55 | 71.52 | 69.04 | 74.19±10.85 |
| multilingual-e5-large (560M) | 1024 | 78.24 | 88.94 | 65.79 | 69.61 | 47.83 | 84.51 | 78.25 | 66.06 | 79.00 | 73.14±11.66 |
| bge-m3 (568M) | 4096 | 75.98 | 89.89 | 68.73 | 73.27 | 42.23 | 86.18 | 73.56 | 58.51 | 75.98 | 71.59±13.48 |
| GritLM-7B (7B) | 4096 | 77.47 | 88.76 | 63.86 | 64.69 | 46.29 | 63.63 | 65.97 | 67.60 | 73.37 | 67.96±10.92 |
| e5-mistral-7b-instruct (7B) | 4096 | 76.65 | 88.32 | 63.81 | 63.50 | 49.48 | 65.30 | 72.93 | 54.46 | 75.33 | 67.75±11.24 |
| Qwen3-Embedding-0.6B (595M) | 1024 | 74.47 | 88.19 | 60.36 | 65.74 | 43.94 | 56.53 | 76.24 | 65.80 | 75.03 | 67.37±11.58 |
| multilingual-mpnet-base (278M) | 768 | 73.79 | 87.28 | 70.79 | 70.15 | 41.12 | 68.12 | 58.28 | 52.44 | 64.01 | 65.11±12.55 |
| LaBSE (471M) | 768 | 75.19 | 86.65 | 62.32 | 68.32 | 41.39 | 86.84 | 53.72 | 39.73 | 61.23 | 63.93±16.31 |
| multilingual-MiniLM-L12 (118M) | 768 | 70.50 | 84.88 | 65.70 | 64.59 | 31.50 | 53.23 | 52.47 | 48.66 | 62.27 | 59.31±14.25 |
| Gemma-SEA-LION-v3-9B-IT (9B) | 3584 | 75.87 | 89.94 | 57.77 | 38.85 | 39.94 | 15.31 | 22.03 | 11.02 | 65.49 | 46.25±26.18 |
| Sailor2-8B-Chat (8B) | 3584 | 76.43 | 90.21 | 56.71 | 37.25 | 38.51 | 4.31 | 10.09 | 3.29 | 47.05 | 40.43±29.25 |
| Proprietary models | | | | | | | | | | | |
| embed-multilingual-v3.0 | 1024 | 78.52 | 89.98 | 66.11 | 73.11 | 48.99 | 88.32 | 78.17 | 65.59 | 77.77 | 74.06±11.89 |
| jina-embeddings-v3 | 1024 | 77.40 | 88.97 | 63.61 | 73.17 | 50.90 | 81.86 | 76.28 | 69.11 | 72.49 | 72.64±10.30 |
| voyage-3 | 1024 | 75.72 | 88.70 | 60.23 | 61.97 | 45.15 | 55.62 | 62.91 | 61.77 | 74.62 | 65.19±12.01 |
| text-embedding-3-small | 1536 | 72.88 | 88.19 | 60.16 | 52.31 | 39.34 | 43.12 | 65.18 | 52.87 | 71.25 | 60.59±14.65 |

Table 6: Task-model performance view, where each cell reports scores averaged over all evaluated languages.
Discussion. We found that task performance consistency is the main challenge for current text embedding models: a robust model should perform well on all tasks, yet, as shown in Table 6, no model achieves the highest score on every task, and model performance varies considerably depending on the task. This emphasizes that the task consistency problem posed by our benchmark remains challenging for embedding models. In short, when using multilingual text embeddings for SEA languages, it is essential to select the model based on the specific task at hand, as there is no all-purpose model that suits every solution.

4.3 Language-Task Comparisons

Let us now turn our attention to language-task comparisons, averaging performance across all models.
Results. Table 7 presents our third comparative view: language-task. Substantial variation is observed across both dimensions, with no task exhibiting uniformly strong performance across all languages. For relatively well-studied tasks such as classification and bitext mining, performance is generally strong for higher-resource languages (ind and tha), while noticeable degradation persists for more resource-constrained languages, e.g., lao, mya, and tet. In contrast, clustering remains challenging across most languages, exhibiting lower and more variable performance, indicating that some task categories pose intrinsic difficulties that are not alleviated by language coverage alone. Bitext mining exhibits mixed behavior, with strong performance in some languages but substantial variation across others, highlighting how task difficulty interacts with language-specific factors.

Discussion. The results reveal a landscape of conditional behaviors across tasks and languages, underscoring the importance of disentangling evaluation dimensions for meaningful assessment.

| Lang. | Clf | M. Clf | Pr. Clf | STS | Clust | Btxt | Rtrvl | In. Rtrvl | Rrnk |
|---|---|---|---|---|---|---|---|---|---|
| ind | 77.77 | 92.49 | 64.51 | 61.04 | 43.55 | 75.65 | 72.71 | 34.82 | 69.28 |
| tha | 73.73 | 76.60 | 70.31 | 68.73 | 38.49 | 70.99 | 67.71 | 70.43 | 71.86 |
| vie | 78.93 | 96.49 | 63.67 | 77.15 | 38.56 | 76.36 | 56.80 | 42.75 | – |
| mya | 78.78 | 69.71 | 67.10 | 61.95 | 28.09 | 48.83 | 55.44 | – | – |
| fil | 72.78 | 90.06 | 50.63 | 68.41 | 52.69 | 69.56 | – | – | – |
| khm | 68.56 | 100.00 | 66.41 | 63.74 | 23.83 | 54.21 | – | – | – |
| zsm | 72.94 | – | 71.50 | 75.39 | – | 75.98 | 46.27 | – | – |
| lao | 70.55 | – | 64.36 | 59.11 | 18.23 | 51.79 | – | – | – |
| tam | 84.35 | – | 68.14 | 52.49 | 38.75 | 61.23 | 46.27 | – | – |
| tet | 99.81 | – | – | – | – | 45.75 | – | – | – |

Table 7: Language-task performance view, where each cell reports scores averaged over all evaluated models.
5 Insights for Future Model Development

From the main results and ablation studies, embedding models for SEA languages can be enhanced in three aspects: (i) datasets, (ii) training algorithms, and (iii) architecture.

Dataset. A common technique to improve downstream tasks is to introduce data aligned with the domain of the target task. However, in resource-constrained settings, model developers often have to resort to machine-assisted dataset generation. As shown in Appendix G, we examined the possibility of using MT for low-resource languages and found that machine translation from English to SEA languages yields only a marginal difference between human and machine translations. Various English datasets are not yet available in SEA languages, e.g., MS MARCO (Nguyen et al., 2016b) and NQ (Kwiatkowski et al., 2019), and some datasets are available in only a subset of SEA languages, e.g., only Thai or Indonesian, like Mr. TyDi (Clark et al., 2020) and MIRACL (Zhang et al., 2023). As demonstrated by previous works, making these datasets available in SEA languages would yield more robust embedding representations.

Training Algorithm. From our ablation study in Appendix F, we found many false positive and false negative occurrences during the testing of the models in Table 5. As shown in Figure 3a, the results for multilingual-e5-large-instruct demonstrate that the similarity of positive pairs is high (more than 0.87 in all cases). However, the similarity of negative pairs is also high (ranging from 0.74 to 0.81), resulting in overlap between positive and negative pairs (Figure 3c). Moreover, for the worst-performing model in SEA-BED (multilingual-MiniLM-L12-v2), Figures 3b and 3d show that the contrast between positive and negative samples is better than for the robust models, even though this model did not employ contrastive learning, unlike the SOTA models. This highlights an inconsistency in robust models that necessitates correction. To mitigate it, recent works (Limkonchotiwat et al., 2022; Li and Li, 2023; Wang et al., 2024c) demonstrate ways of contrasting positive and negative samples more effectively than the vanilla contrastive learning (Gao et al., 2021) employed in current models. However, these techniques have not been well explored for SEA languages; their effects and failure cases need further study. Applying them could mitigate the overlap issue.
Architecture. In line with the findings of previous works discussed in Appendix E, the tokenizer plays a crucial role in embeddings. We discovered that the current models' tokenizers do not include a Tamil token; adding Tamil would enhance the representation of these models. However, Appendix H also shows that adding tokens is not effective for all languages; in most cases, non-Latin scripts benefit the most. We can also employ token-addition techniques (Cui et al., 2024; Nguyen et al., 2024), adding new tokens with minimal effort during continual pre-training to preserve previous knowledge while incorporating new knowledge into the model.
In summary, future work can benefit from the insights gained from our discussions. The fastest and most cost-effective option is to obtain more datasets using machine translation; in particular, we can utilize models that demonstrate robust performance for English-to-SEA translation on SEA translation benchmarks (Susanto et al., 2025), such as ChatGPT or Google Gemini. Next, we can focus on applying and adapting novel training objectives to SEA embeddings. We also need to study how to move from an English-centric to a SEA-centric approach, leaving a gap for future work to propose new training objectives for the multilingual scenario. Lastly, we can focus on architectural changes, although these require a new round of pre-training. Not all languages will benefit from such changes; as previous studies have shown, new tokens must be added to the model carefully.

6 Related Work

6.1 Text Embedding Benchmarks

Existing text embedding benchmarks primarily focus on high-resource languages. Notable examples include SentEval (Conneau and Kiela, 2018), which provides a preliminary benchmark for understanding text embeddings in STS and transfer learning. USEB (Wang et al., 2021) is an unsupervised embedding benchmark focusing on pair-text classification. BEIR (Thakur et al., 2021) is a heterogeneous benchmark focusing only on 18 information retrieval datasets. MTEB (Muennighoff et al., 2023) is a large-scale extension of BEIR that covers not only retrieval but also diverse tasks, i.e., bitext mining, classification, and semantic textual similarity. However, these benchmarks primarily focus on English, and several works extend MTEB from English to Chinese (Xiao et al., 2024b), German (Wehrli et al., 2023), and French (Ciancone et al., 2024b).
Recently, an attempt has been made to create a multilingual version of MTEB, called MMTEB (Enevoldsen et al., 2025). The MMTEB multilingual benchmark evaluates 10 tasks and 270 datasets; notably, only 22 of these datasets are from SEA languages. Thus, results from MMTEB might not be representative of performance in SEA languages, given the reliance on machine-translated datasets.

6.2 SEA Benchmarks

There have been many efforts to formulate SEA benchmarks. NusaCrowd (Cahyawijaya et al., 2023) proposed a large-scale Indonesian benchmark focusing on natural language understanding and generation, especially for decoder models. VN-MTEB (Pham et al., 2026) is a Vietnamese text embedding benchmark with 41 datasets across six tasks, supported by a scalable pipeline using LLM-based translation, semantic filtering, and LLM-as-a-judge for quality assurance in low-resource settings. SEACrowd (Lovenia et al., 2024) and SEA-VL (Cahyawijaya et al., 2025) are data collection projects that gather SEA benchmarks in their own repositories. Experiments in these SEA projects focus primarily on large language models, particularly the Llama (Dubey et al., 2024) and T5 (Raffel et al., 2020) families. Moreover, these benchmarks do not accurately measure the effectiveness of embeddings on SEA texts: previous works studied large language models and generative outputs, while embeddings have not been systematically evaluated for SEA languages.
7 Conclusion

Our analyses show that multilingual embedding performance in SEA languages is highly conditional, varying across models, tasks, and language-task combinations. Language-model comparisons reveal that no single model consistently performs well across all languages, with substantial disparities persisting even among the strongest models. Task-model analyses reveal clear differences in task difficulty: while classification and bitext mining approach saturation for well-resourced languages, clustering and semantic similarity remain challenging and unstable. Language-task comparisons demonstrate uneven performance within individual languages, showing that success on one task does not reliably generalize to others.

Our results suggest that observed performance gaps arise from interrelated limitations in data coverage, training objectives, and architectural design. Dataset availability plays a central role: models trained predominantly on English-centric or weakly aligned multilingual data struggle to generalize across languages and tasks. Training algorithms need to address the high positive-negative similarity overlap in low-resource languages, suggesting the application of cross-lingual transfer to ensure that task-relevant semantic structures learned in one language generalize to others. Architectural factors, including tokenizer design and language coverage, introduce additional structural constraints that disproportionately affect non-Latin and underrepresented languages. Consequently, improving multilingual embeddings for SEA languages requires coordinated advances across datasets, training algorithms, and architectures.
Acknowledgement

This research is supported by the National Research Foundation, Singapore, under its National Large Language Models Funding Initiative. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

Limitations

While SEA-BED substantially expands the landscape of multilingual sentence-embedding evaluation, several limitations remain.

First, the coverage is uneven across the 10 SEA languages. Although the benchmark encompasses the region's major language families and scripts, some languages are represented in fewer task categories due to the limited availability of publicly accessible datasets. For extremely low-resource languages such as Tetum, the limited availability of high-quality datasets undermines the reliability of evaluation results, necessitating careful interpretation and preventing definitive conclusions. This asymmetry limits the breadth of task-language combinations that can be examined uniformly.

Second, the scope of evaluation data is regionally bounded. SEA-BED focuses on locally grounded and native-verified data from Southeast Asian linguistic communities. While this enables strong internal validity for SEA-specific probing, it does not capture cultural or pragmatic phenomena unique to other low-resource regions, nor does it fully represent global diversity.
Finally, the benchmark incorporates machine-generated and machine-derived data. While this supports broader task coverage and scalability, such data may differ systematically from human-authored text. Given persistent coverage constraints, the inclusion of machine-derived data is often unavoidable in multilingual evaluation. Accordingly, we distinguish evaluation results by data source where applicable to examine differences between human-authored and machine-derived conditions in Appendix G. However, we do not conduct controlled studies isolating the causal effects of machine generation, nor do we claim that machine-derived data uniformly approximates human-authored data across tasks.

Ethical Statement

For the annotator details, we hired annotators (graduate students) who speak SEA languages natively (see Appendix A for more details). We first ran an annotation pilot and selected only the annotators who passed the annotation test, i.e., an English test and an NLP understanding test, to verify that annotators understand the task and can perform the work to a high standard. In addition, the payment rate for each annotator is 18 USD/hr, which is higher than the average rate.

References

A. N. Azhar, M. L. Khodra, and A. P. Sutiono. 2019. Multi-label aspect categorization with convolutional neural networks and extreme gradient boosting. In Proceedings of the 2019 International Conference on Electrical Engineering and Informatics (ICEEI), pages 35–40.

David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. 2023. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects.

David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. 2024. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects.
Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. 2022. IndicXNLI: Evaluating multilingual inference for Indian languages.

Pawitsapak Akarajaradwong, Pirat Pothavorn, Chompakorn Chaksangchaichot, Panuthep Tasawong, Thitiwat Nopparatbundit, and Sarana Nutanong. 2025. NitiBench: A comprehensive studies of LLM frameworks capabilities for Thai legal question answering.

Vesa Akerman, David Baines, Damien Daspit, Ulf Hermjakob, Taeho Jang, Colin Leong, Michael Martin, Joel Mathew, Jonathan Robie, and Marcus Schwarting. 2023. The eBible corpus: Data and model benchmarks for Bible translation for low-resource languages. arXiv preprint arXiv:2304.09919.

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata. 2017. Hate speech detection in the Indonesian language: A dataset and preliminary study.

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze Buschhoff, Charvi Jain, Alexander Arno Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, and Nicolas Flores-Herr. 2024. Tokenizer choice for LLM training: Negligible or crucial? In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 3907–3924. Association for Computational Linguistics.

Arfinda Ilmania, Abdurrahman, Samuel Cahyawijaya, and Ayu Purwarianti. 2018. Aspect detection and sentiment classification using deep neural network for Indonesian aspect-based sentiment analysis. In Proceedings of the 2018 International Conference on Asian Language Processing (IALP), pages 62–67. IEEE.
Catherine Arnett and Benjamin Bergen. 2025. Why do language models perform worse for morphologically complex languages? In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025, pages 6607–6623. Association for Computational Linguistics.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856.

Laksmita Widya Astuti, Yunita Sari, and Suprapto. 2023. Code-mixed sentiment analysis using transformer for Twitter social media data. International Journal of Advanced Computer Science and Applications, 14(10).

Nofa Aulia and Indra Budi. 2019. Hate speech detection on Indonesian long text documents using machine learning approach. In Proceedings of the 2019 5th International Conference on Computing and Artificial Intelligence, ICCAI '19, page 164–169, New York, NY, USA. Association for Computing Machinery.

Thura Aung, Eaint Kay Khaing Kyaw, Ye Kyaw Thu, Thazin Myint Oo, and Thepchai Supnithi. 2025. Enhancing Burmese news classification with Kolmogorov-Arnold network head fine-tuning. In 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pages 1–6.

Thura Aung and Pyi Hein San. 2025. AskCovidDrBot: Retrieval-based TF-IDF English and Burmese bilingual chatbot for COVID-19 domain. GitHub repository.

Bianka Buschbeck and Miriam Exel. 2020. A parallel evaluation data set of software documentation with document structure annotation.
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Halim Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Farid Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Akbarianto Wibowo, Cuk Tho, Ichwanul Muslim Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Purwarianti. 2023. NusaCrowd: Open source initiative for Indonesian NLP resources. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13745–13818. Association for Computational Linguistics.
Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Mohamed Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Edri W. Flores, Kenneth Chen Ko Han, Anjela Gail D. Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M. Alif Al Hakim, Muhammad Rizky Sya’ban, Kun Kerdthaisong, Lester James Validad Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, and Peerat Limkonchotiwat. 2025. Crowdsource, crawl, or generate? Creating SEA-VL, a multicultural vision-language dataset for Southeast Asia. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18685–18717, Vienna, Austria. Association for Computational Linguistics.
Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Khodra, Ayu Purwarianti, and Pascale Fung. 2021. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8875–8898, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Jasper Kyle Catapang and Moses Visperas. 2023. Emotion-based morality in Tagalog and English scenarios (EMoTES-3K): A parallel corpus for explaining (im)morality of actions. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 1–6, Tokyo, Japan. Association for Computational Linguistics.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
Andreas Chandra. 2020. Indonesian news dataset. Online. Accessed: 2024-02-13.
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.
Xi Chen, Ali Zeynali, Chico Camargo, Fabian Flöck, Devin Gaffney, Przemyslaw Grabowicz, Scott Hale, David Jurgens, and Mattia Samory. 2022. SemEval-2022 task 8: Multilingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1094–1106, Seattle, United States. Association for Computational Linguistics.
Jirat Chiaranaipanich, Naiyarat Hanmatheekuna, Jitkapat Sawatphol, Krittamate Tiankanon, Jiramet Kinchagawat, Amrest Chinkamol, Parinthapat Pengpun, Piyalitt Ittichaiwong, and Peerat Limkonchotiwat. 2024. Can general-purpose large language models generalize to English-Thai machine translation?
Antonius Rachmat Chrismanto, Anny Kartika Sari, and Yohanes Suyanto. 2022. SpamID-Pair: A novel Indonesian post–comment pairs dataset containing emoji. International Journal of Advanced Computer Science and Applications, 13(11).
Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic, Shang-Wen Li, Scott Yih, Yoon Kim, and James Glass. 2022. DiffCSE: Difference-based contrastive learning for sentence embeddings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4207–4218, Seattle, United States. Association for Computational Linguistics.
Mathieu Ciancone, Imene Kerboua, Marion Schaeffer, and Wissam Siblini. 2024a. Extending the massive text embedding benchmark to French. CoRR, abs/2405.20468.
Mathieu Ciancone, Imene Kerboua, Marion Schaeffer, and Wissam Siblini. 2024b. MTEB-French: Resources for French sentence embedding evaluation and analysis.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics.
Tatoeba community. 2021. Tatoeba: Collection of sentences and translations.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin, Dan John Velasco, and Charibeth Cheng. 2020a. Investigating the true performance of transformers in low-resource languages: A case study in automatic corpus creation. arXiv preprint arXiv:2010.11574.
Jan Christian Blaise Cruz, Julianne Agatha Tan, and Charibeth Cheng. 2020b. Localization of fake news detection via multitask transfer learning. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2596–2604.
lukkiddd and cstorm125. 2019. prachathai67k. https://github.com/PyThaiNLP/prachathai-67k.
Yiming Cui, Ziqing Yang, and Xin Yao. 2024. Efficient and effective text encoding for Chinese LLaMA and Alpaca.
Hoang-Quan Dang, Duc-Duy-Anh Nguyen, and Trong-Hop Do. 2022. Multi-task solution for aspect category sentiment analysis on Vietnamese datasets. In 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), pages 404–409.
Mai Hoang Dao, Thinh Hung Truong, and Dat Quoc Nguyen. 2021. Intent detection and slot filling for Vietnamese. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreyansh Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, and Pratyush Kumar. 2022. Towards leaving no Indic language behind: Building monolingual corpora, benchmark and models for Indic languages. Annual Meeting of the Association for Computational Linguistics.
Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin. 2025. Sailor2: Sailing in South-East Asia with inclusive multilingual LLM. arXiv preprint arXiv:2502.12982.
Kerenza Doxolodeo and Adila Alfa Krisnadhi. 2024. AC-IQuAD: Automatically constructed Indonesian question answering dataset by leveraging Wikidata. Language Resources and Evaluation.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, et al. 2024. The Llama 3 herd of models. CoRR, abs/2407.21783.
Kenneth C. Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzeminski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çagatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafal Poswiata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Suppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, and Niklas Muennighoff. 2025. MMTEB: Massive multilingual text embedding benchmark. CoRR, abs/2502.13595.
Kenneth C. Enevoldsen, Márton Kardos, Niklas Muennighoff, and Kristoffer L. Nielbo. 2024. The Scandinavian embedding benchmarks: Comprehensive assessment of multilingual and monolingual text embedding. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024.
Ridi Fe. 2019. Indonesia sentiment analysis dataset. https://github.com/ridife/dataset-idsa.
Christian Federmann, Tom Kocmi, and Ying Xin. 2022. NTREX-128 – News test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computational Linguistics.
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 878–891. Association for Computational Linguistics.
Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. 2022. MASSIVE: A 1M-example multilingual natural language understanding dataset with 51 typologically-diverse languages.
H Fujita and H Perez-Meana. 2021. An empirical investigation of online news classification on an open-domain, large-scale and high-quality dataset in Vietnamese. In New Trends in Intelligent Software Methodologies, Tools and Techniques: Proceedings of the 20th International Conference on New Trends in Intelligent Software Methodologies, Tools and Techniques (SoMeT_21), volume 337, page 367. SAGE Publications Limited.
Jay Gala, Pranjal A Chitale, A K Raghavan, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar M, Janki Atul Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M Khapra, Raj Dabre, and Anoop Kunchukuttan. 2023. IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. Transactions on Machine Learning Research.
Valfrid Galinato, Lawrence Amores, Gino Ben Magsino, and David Rafael Sumawang. 2023. Context-based profanity detection and censorship using bidirectional encoder representations from transformers.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, and Francisco Guzmán. 2022. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 19–35.
Tri Wahyu Guntara, Alham Fikri Aji, and Radityo Eko Prasojo. 2020. Benchmarking multidomain English-Indonesian machine translation. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora, pages 35–43, Marseille, France. European Language Resources Association.
Mika Hämäläinen, Pattama Patpong, Khalid Alnajjar, Niko Partanen, and Jack Rueter. 2021. Detecting depression in Thai blog posts: A dataset and a baseline. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 20–25, Online. Association for Computational Linguistics.
Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M Sohel Rahman, and Rifat Shahriyar. 2021. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703.
Rommel Hernandez Urbano Jr, Jeffrey Uy Ajero, Angelic Legaspi Angeles, Maria Nikki Hacar Quintos, Joseph Marvin Regalado Imperial, and Ramon Llabanes Rodriguez. 2021. A BERT-based hate speech classifier from transcribed online short-form videos. In 2021 5th International Conference on E-Society, E-Education and E-Technology.
Ahmad Fathan Hidayatullah, Siwi Cahyaningtyas, and Rheza Daffa Pamungkas. 2020. Attention-based CNN-BiLSTM for dialect identification on Javanese text. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, pages 317–324.
Vong Anh Ho, Duong Huynh-Cong Nguyen, Danh Hoang Nguyen, Linh Thi-Van Pham, Duc-Vu Nguyen, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2020. Emotion recognition for Vietnamese social media text. In Computational Linguistics: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Hanoi, Vietnam, October 11–13, 2019, Revised Selected Papers 16, pages 319–333. Springer.
Aung Kyaw Htet and Mark Dras. 2024. Myanmar XNLI: Building a dataset and exploring low-resource approaches to natural language inference with Myanmar. Preprint (Version 1) available at Research Square.
Muhammad Okky Ibrohim and Indra Budi. 2018. A dataset and preliminaries study for abusive language detection in Indonesian social media. Procedia Computer Science, 135:222–229. The 3rd International Conference on Computer Science and Computational Intelligence (ICCSCI 2018): Empowering Smart Technology in Digital Era for a Better Life.
Muhammad Okky Ibrohim and Indra Budi. 2019. Multi-label hate speech and abusive language detection in Indonesian Twitter. In Proceedings of the Third Workshop on Abusive Language Online, pages 46–57, Florence, Italy. Association for Computational Linguistics.
Ahmad Izzan, Christian Wibisono, and Ilham Firdausi Putra. 2025. Netifier: Negativity classifier. GitHub repository.
Jakarta Artificial Intelligence Research. 2023. IndoQA: Building Indonesian QA dataset.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.
Shengyi Jiang, Sihui Fu, Nankai Lin, and Yingwen Fu. 2021a. Pre-trained models and evaluation data for the Khmer language. Tsinghua Science and Technology.
Shengyi Jiang, Sihui Fu, Nankai Lin, and Yingwen Fu. 2022. Pretrained models and evaluation data for the Khmer language. Tsinghua Science and Technology, 27(4):709–718.
Shengyi Jiang, Xiuwen Huang, Xiaonan Cai, and Nankai Lin. 2021b. Pre-trained models and evaluation data for the Myanmar language. In The 28th International Conference on Neural Information Processing, Cham. Springer International Publishing.
Sarah Samson Juan, Suhaila Saee, and Fitri Suraya Mohamad. 2022. Social versus physical distancing: Analysis of public health messages at the start of COVID-19 outbreak in Malaysia using natural language processing. In Proceedings of the 8th International Conference on Computational Science and Technology, volume 835 of Lecture Notes in Electrical Engineering, pages 577–589. Springer Singapore.
A. H. Khine, K. T. Nwet, and K. M. Soe. 2017. Automatic Myanmar news classification. In 15th Proceedings of International Conference on Computer Applications, pages 401–408.
Dhamir Raniah Kiasati Desrul and Ade Romadhony. 2019. Abusive language detection on Indonesian online news comments. In 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), pages 320–325.
Fajri Koto and Ikhwan Koto. 2020. Towards computational linguistics in Minangkabau language: Studies on sentiment analysis and machine translation. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (PACLIC), Vietnam.
Fajri Koto, Jey Han Lau, and Timothy Baldwin. 2020a. Liputan6: A large-scale Indonesian dataset for text summarization. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 598–608.
Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020b. IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. CoRR, abs/2011.00677.
Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. MADLAD-400: A multilingual and document-level large audited dataset.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Matthew Lamm, Jennimaria Palomaki, Chris Alberti, Daniel Andor, Eunsol Choi, Livio Baldini Soares, and Michael Collins. 2020. QED: A framework and dataset for explanations in question answering.
Moritz Laurer, Wouter van Atteveldt, Andreu Salleras Casas, and Kasper Welbers. 2022. Less annotating, more classifying – Addressing the data scarcity issue of supervised machine learning with deep transfer learning and BERT-NLI. Preprint. Publisher: Open Science Framework.
Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
Xianming Li and Jing Li. 2023. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871.
Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, and Madian Khabsa. 2023. XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13142–13152. Association for Computational Linguistics.
Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, and Sarana Nutanong. 2022. ConGen: Unsupervised control and generalization distillation for sentence representation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6467–6480, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
E. D. Livelo and C. Cheng. 2018. Intelligent dengue infoveillance using gated recurrent neural learning and cross-label frequencies. In 2018 IEEE International Conference on Agents (ICA), pages 2–7.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Jann Montalan, Ryan Hadiwijaya, Joanito Agili Lopo, William Nixon, Börje Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus Irawan, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Damanhuri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V. Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Ngee Tai Chia, Ayu Purwarianti, Sebastian Ruder, William Chandra Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng Xin Yong, and Samuel Cahyawijaya. 2024. SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 5155–5203. Association for Computational Linguistics.
Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, and Sarana Nutanong. 2022. A large English-Thai parallel corpus from the web and machine-generated text. Lang. Resour. Evaluation, 56(2):477–499.
Luong Luc Phan, Phuc Huynh Pham, Kim Thi-Thanh Nguyen, Sieu Khai Huynh, Tham Thi Nguyen, Luan Thanh Nguyen, Tin Van Huynh, and Kiet Van Nguyen. 2021. SA2SL: From aspect-based sentiment analysis to social listening system for business intelligence. In Knowledge Science, Engineering and Management, pages 647–658, Cham. Springer International Publishing.
Son T. Luu, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2021. A large-scale dataset for hate speech detection on Vietnamese social media texts. In Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices, pages 415–426, Cham. Springer International Publishing.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. IndoNLI: A natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10511–10527, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Rahmad Mahendra, Mei Silviana Saputri, and Mirna Adriani. 2018. Emotion classification on Indonesian Twitter dataset. In Proceedings of the 2018 International Conference on Asian Language Processing (IALP), pages 90–95. IEEE.
Min Si Thu and Khin Myat Noe. Myanmar-agriculture-1k.
Sepideh Mollanorozy, Marc Tanti, and Malvina Nissim. 2023. Cross-lingual transfer learning with Persian. In Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 89–95, Dubrovnik, Croatia. Association for Computational Linguistics.
Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2024. Generative representational instruction tuning.
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2023. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 2006–2029. Association for Computational Linguistics.
Huyen TM Nguyen, Hung V Nguyen, Quyen T Ngo, Luong X Vu, Vu Mai Tran, Bach X Ngo, and Cuong A Le. 2018a. VLSP shared task: Sentiment analysis. Journal of Computer Science and Cybernetics, 34(4):295–310.
Kiet Nguyen, Son Quoc Tran, Luan Thanh Nguyen, Tin Van Huynh, Son Thanh Luu, and Ngan Luu-Thuy Nguyen. 2022. VLSP 2021 - ViMRC challenge: Vietnamese machine reading comprehension. VNU Journal of Science: Computer Science and Communication Engineering, 38(2).
Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, and Ngan Luu-Thuy Nguyen. 2018b. UIT-VSFC: Vietnamese students’ feedback corpus for sentiment analysis. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pages 19–24.
Luan Thanh Nguyen, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2021a. Constructive and toxic speech detection for open-domain social media comments in Vietnamese. In Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices, pages 572–583, Cham. Springer International Publishing.
Minh-Tien Nguyen, Dac Viet Lai, Phong-Khac Do, Duc-Vu Tran, and Minh-Le Nguyen. 2016a. VSoLSCSum: Building a Vietnamese sentence-comment dataset for social context summarization. In Proceedings of the 12th Workshop on Asian Language Resources (ALR12), pages 38–48, Osaka, Japan. The COLING 2016 Organizing Committee.
Nhung Thi-Hong Nguyen, Phuong Phan-Dieu Ha, Luan Thanh Nguyen, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2021b. Vietnamese complaint detection on e-commerce websites.
Phu-Vinh Nguyen, Minh-Nam Tran, Long Nguyen, and Dien Dinh. 2025. Advancing Vietnamese information retrieval with learning objective and benchmark.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016b. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2024. SeaLLMs - Large language models for Southeast Asia. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 294–304, Bangkok, Thailand. Association for Computational Linguistics.
Tran Nhiem. 2023. Vietnamese instruction data corpus for large-scale finetuning of language models.
Hiroki Nomoto, Kenji Okano, David Moeljadi, and Hideo Sawada. 2018. TUFS Asian Language Parallel Corpus (TALPCo), pages 436–439.
Hiroki Nomoto, Kenji Okano, Sunisa Wittayapanyanon, and Junta Nomura. 2019. Interpersonal meaning annotation for Asian language corpora: The case of TUFS Asian Language Parallel Corpus (TALPCo), pages 846–849.
Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, Seid Muhie Yimam, and Saif M. Mohammad. 2024a. SemRel2024: A collection of semantic textual relatedness datasets for 13 languages. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics.
Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, and Saif M. Mohammad. 2024b. SemEval-2024 task 1: Semantic textual relatedness for African and Asian languages. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024). Association for Computational Linguistics.
Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. 2024. Disentangling length from quality in direct preference optimization.
Kitsuchart Pasupa, Ponruedee Netisopakul, and Ratthawut Lertsuksakda. 2016. Sentiment analysis on Thai children stories. Artificial Life and Robotics, 21(3):357–364.
Patomporn Payoungkhamdee, Peerachet Porkaew, Atthasith Sinthunyathum, Phattharaphon Songphum, Witsarut Kawidam, Wichayut Loha-Udom, Prachya Boonkwan, and Vipas Sutantayawalee. 2021. LimeSoda: Dataset for fake news detection in healthcare domain. In 2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pages 1–6.
Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, and Peerat Limkonchotiwat. 2024. Seed-free synthetic data generation framework for instruction-tuning LLMs: A case study in Thai. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 438–457, Bangkok, Thailand. Association for Computational Linguistics.
Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, and Viet Hoang. 2026. VN-MTEB: Vietnamese massive text embedding benchmark. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1705–1725, Rabat, Morocco. Association for Computational Linguistics.
Wannaphong Phatthiyaphaibun. 2020. PyThaiNLP/thai-lao-parallel-corpus: Thai Lao parallel corpus v0.5.
Wannaphong Phatthiyaphaibun. 2025. Lao news classification.
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai natural language processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.
Inggrid Yanuar Risca Pratiwi, Rosa Andrie Asmara, and Faisal Rahutomo. 2017. Study of hoax news detection using naïve Bayes classifier in Indonesian language. In 2017 11th International Conference on Information, Communication Technology and System (ICTS), pages 73–78.
Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti. 2019. Improving bi-LSTM performance for Indonesian sentiment analysis using paragraph vector. In Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pages 1–5. IEEE.
Rifki Afina Putri and Alice Oh. 2022. IDK-MRC: Unanswerable questions for Indonesian machine reading comprehension. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6918–6933, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Deepak Kumar, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Trans. Assoc. Comput. Linguistics, 10:145–162.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Riccosan and Karen Etania Saputra. 2023. Multilabel multiclass sentiment and emotion dataset from Indonesian mobile application review. Data in Brief, 50.
Riccosan, Karen Etania Saputra, Galih Dea Pratama, and Andry Chowanda. 2022. Emotion dataset from Indonesian public opinion. Data in Brief, 43:108465.
Morgane Rivière, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozinska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucinska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju-yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjösund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, and Lilly McNealus. 2024. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118.
Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thái, Rapid Sun, Vichet Chea, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, and Chenchen Ding. 2019. Asian language treebank. In Proceedings of O-COCOSDA. National Institute of Information and Communication Technology (NICT), Japan.
Muhammad Razif Rizqullah, Ayu Purwarianti, and Alham Fikri Aji. 2023. QASiNa: Religious domain question answering using Sirah Nabawiyah. In 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), pages 1–6.
Ken Nabila Setya and Rahmad Mahendra. 2018. Semi-supervised textual entailment on Indonesian Wikipedia data. In Proceedings of the 2018 International Conference on Computational Linguistics and Intelligent Text Processing (CICLing).
M. Si Thu. 2024. Burmese microbiology 1k dataset (1.1).
AI Singapore. 2024. SEA-LION (Southeast Asian Languages in One Network): A family of large language models for Southeast Asia. https://github.com/aisingapore/sealion.
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. 2025. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation.
Rohayani Sitepu et al. 2024. Sentiment analysis in Karonese tweet using machine learning algorithms. Indonesian Journal of Electrical Engineering and Informatics (IJEEI), 12(4):2482–2489.
Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. BIOSSES: A semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14):i49–i58.
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. 2024a. jina-embeddings-v3: Multilingual embeddings with task LoRA. CoRR, abs/2409.10173.
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. 2024b. jina-embeddings-v3: Multilingual embeddings with task LoRA.
Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, and Charin Polpanumas. 2019. PyThaiNLP/wisesight-sentiment: First release.
Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Montalan, Jian Gang Ngui, Xian Bin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. SEA-HELM: Southeast Asian holistic evaluation of language models. CoRR, abs/2502.14301.
Gemma Team. 2024. Gemma.
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
C Tho, Y Heryadi, L Lukas, and A Wibowo. 2021. Code-mixed sentiment analysis of Indonesian language and Javanese language using lexicon based approach. Journal of Physics: Conference Series, 1869(1):012084.
Jörg Tiedemann. 2020. The Tatoeba translation challenge – Realistic data sets for low resource and multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online. Association for Computational Linguistics.
Kanokorn Trakultaweekoon, Santipong Thaiprayoon, Pornpimon Palingoon, and Anocha Rugchatjaroen. 2019. The first Wikipedia questions and factoid answers corpus in the Thai language. In 2019 14th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pages 1–4. IEEE.
Co Van Dinh, Son T. Luu, and Anh Gia-Tuan Nguyen. 2022. Detecting spam reviews on Vietnamese e-commerce websites. In Intelligent Information and Database Systems, pages 595–607, Cham. Springer International Publishing.
Kobkrit Viriyayudhakorn and Charin Polpanumas. 2021. iapp_wiki_qa_squad.
Kexin Wang, Nils Reimers, and Iryna Gurevych. 2021. TSDAE: Using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning. CoRR, abs/2104.06979.
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024a. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, Bangkok, Thailand. Association for Computational Linguistics.
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024b. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.
Xinghao Wang, Junliang He, Pengyu Wang, Yunhua Zhou, Tianxiang Sun, and Xipeng Qiu. 2024c. DenoSent: A denoising objective for self-supervised sentence representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19180–19188.
Silvan Wehrli, Bert Arnrich, and Christopher Irrgang. 2023. German text embedding clustering benchmark. In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023), pages 187–201, Ingolstadt, Germany. Association for Computational Linguistics.
Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. 2024. FollowIR: Evaluating and teaching information retrieval models to follow instructions.
Andika William and Yunita Sari. 2020. CLICK-ID: A novel dataset for Indonesian clickbait headlines.
Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, and Sebastian Ruder. 2023. NusaX: Multilingual parallel sentiment dataset for 10 Indonesian local languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834, Dubrovnik, Croatia. Association for Computational Linguistics.
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024a. C-Pack: Packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, pages 641–649. ACM.
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024b. C-Pack: Packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, page 641–649, New York, NY, USA. Association for Computing Machinery.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
Evi Yulianti, Ajmal Kurnia, Mirna Adriani, and Yoppy Setyo Duto. 2021. Normalisation of Indonesian-English code-mixed text and its effect on emotion classification. International Journal of Advanced Computer Science and Applications.
Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2023. MIRACL: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114–1131.
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.

Appendix

A Annotator Demographics and Guidelines

Data Assurance Annotators. We hired two native speakers of each SEA language to check the quality of the datasets before adding them to SEA-BED. These annotators are graduate students who passed both an English proficiency test and an NLP test (i.e., they understand the concepts behind each task in SEA-BED). We gave them the following guidelines: Please re-check each dataset for (i) the correctness of the text, i.e., the writing style is natural and understandable to a native speaker; and (ii) the correctness of the gold label, e.g., the correct class is assigned in a text classification dataset. We asked them to check 182 datasets. Datasets built with machine translation that had low-quality outputs or poor-quality labels were removed from SEA-BED.
New Dataset Annotators. In this work, our collaborators translated data from English to Thai and Burmese for the STS and NLI tasks. They are Thai and Burmese undergraduate and graduate students studying in Thailand, aged 20 to 25, who speak English in addition to their native language (Thai or Burmese). We used three Thai annotators and one Burmese annotator to create the new datasets, and we removed examples containing special characters that cannot be displayed in Google Sheets. We gave them the following guidelines: Translate the selected datasets so that they read like natural, everyday conversation in your native language, and make the subject of each sentence gender-neutral, since both Thai and Burmese have words or morphemes that express the gender of the speaker. As a result, the quality of our new human-crafted datasets is higher than that of data produced with machine translation or LLM generation, which has been observed to be less native-like or unrepresentative of natural language use (Lovenia et al., 2024; Singh et al., 2025).

Language | ISO 639-3 | Number of speakers
Indonesian | ind | ∼200 million
Thai | tha | ∼60 million
Vietnamese | vie | ∼85 million
Burmese | mya | ∼43 million
Filipino | fil | ∼45 million
Khmer | khm | ∼17 million
Malay | zsm | ∼33 million
Lao | lao | ∼7 million
Tamil | tam | ∼85 million
Tetum | tet | ∼1.3 million

Table 8: Overview of Southeast Asian languages, including ISO language names, ISO 639-3 codes, and approximate numbers of speakers (L1 and L2 combined), based on Ethnologue (2023/2024).

B Benchmark Efficiency

Caching Embeddings. To improve run-time efficiency, we use embedding caching: embedded texts are stored in memory and in cache files, and when a previously seen text is input to the same model, we reuse the cached embedding instead of computing a new one, which decreases the run-time of our benchmark.
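To make the mechanism concrete, the following is a minimal sketch of such an embedding cache (our illustration, not the benchmark's released implementation; the class name and file layout are assumptions): embeddings are kept in an in-memory dictionary and persisted to a per-model cache file, so the same model never re-embeds a text it has already seen.

```python
import hashlib
import os
import pickle

class EmbeddingCache:
    """In-memory plus on-disk cache of text embeddings, keyed per model."""

    def __init__(self, model_name, cache_dir="cache"):
        os.makedirs(cache_dir, exist_ok=True)
        self.path = os.path.join(cache_dir, model_name.replace("/", "_") + ".pkl")
        # Reload embeddings persisted by earlier benchmark runs, if any.
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.memory = pickle.load(f)
        else:
            self.memory = {}

    def _key(self, text):
        # Hash the text so arbitrarily long inputs make compact keys.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, texts, embed_fn):
        """Return embeddings for `texts`, calling `embed_fn` only on unseen ones."""
        missing = [t for t in texts if self._key(t) not in self.memory]
        if missing:
            for text, vector in zip(missing, embed_fn(missing)):
                self.memory[self._key(text)] = vector
            with open(self.path, "wb") as f:  # persist for later tasks and runs
                pickle.dump(self.memory, f)
        return [self.memory[self._key(t)] for t in texts]
```

With this scheme, each unique text is embedded at most once per model across the whole benchmark run, which is where the run-time savings described above come from.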
Downsampling. Enevoldsen et al. (2025) proposed a downsampling technique for the English benchmark, decreasing the number of samples by 98%. However, as shown in Table 9, when we applied the same technique to the bitext mining datasets of our benchmark, the performance of almost every model increased. This is likely because the challenging samples were removed from the dataset, inflating the scores of most models. Moreover, the ranking of the models changed, in contrast to the findings of Enevoldsen et al. (2025), where rankings remained largely unchanged. Therefore, we did not apply the downsampling technique to our benchmark.

Model | 100% dataset | 30% dataset | Rank change after downsampling
multilingual-e5-large-instruct (560M) | 87.86 | 93.03 | 0
Qwen3-Embedding-8B (8B) | 84.78 | 90.31 | ↓1
bge-multilingual-gemma2 (9B) | 82.02 | 90.71 | ↑3
multilingual-e5-large (560M) | 84.51 | 88.19 | ↓1
bge-m3 (568M) | 86.18 | 91.89 | ↑1
GritLM-7B (7B) | 63.63 | 69.68 | 0
e5-mistral-7b-instruct (7B) | 65.30 | 73.42 | 0
Qwen3-Embedding-0.6B (595M) | 56.53 | 62.95 | 0
multilingual-mpnet-base (278M) | 68.12 | 73.97 | 0
LaBSE (471M) | 86.84 | 90.51 | ↓2
multilingual-MiniLM-L12 (118M) | 53.23 | 59.06 | 0
Gemma-SEA-LION-v3-9B-IT (9B) | 15.31 | 3.21 | ↓1
Sailor2-8B-Chat (8B) | 4.31 | 6.01 | ↑1

Table 9: We evaluate 13 models on bitext mining using 100% and 30% dataset sizes. We also indicate the rank change of each model before and after downsampling to show the performance discrepancy.
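The rank-change column in Table 9 can be computed as follows; the scores below are a three-model subset of the table, used only to illustrate the computation.

```python
# Scores from Table 9 (subset): model -> (score on 100% data, score on 30% data).
scores = {
    "multilingual-e5-large-instruct": (87.86, 93.03),
    "bge-m3": (86.18, 91.89),
    "LaBSE": (86.84, 90.51),
}

def rank(values):
    """Rank 1 = highest score."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

before = rank({m: s[0] for m, s in scores.items()})
after = rank({m: s[1] for m, s in scores.items()})
for m in scores:
    shift = before[m] - after[m]  # positive = moved up after downsampling
    print(f"{m}: rank {before[m]} -> {after[m]} ({shift:+d})")
```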
C Domains

For domains in the SEA-BED benchmark, we include the following:

• Academic: Formal writing and research publications commonly found in scholarly journals, theses, and dissertations.
• Blog: Informal, conversational writing on a variety of topics published on websites or personal blogs.
• Constructed: Artificially created text or speech, often used in experiments to target particular abilities.
• Encyclopedic: Structured, reference-based texts offering thorough and factual information on various topics.
• Fiction: Narrative writing that involves creative content, such as novels, short stories, and other storytelling forms.
• Government: Documents, reports, and publications officially issued by government agencies.
• Legal: Documents and texts concerning laws, legal processes, contracts, and legal theories.
• Medical: Scientific and clinical publications focused on healthcare, treatments, patient care, and medical studies.
• News: News articles and reports that address current events, political developments, economic trends, and other timely topics.
• Non-fiction: Texts grounded in real events and factual information, including biographies, essays, and documentaries.
• Religious: Writings concerning religious teachings, doctrines, sacred texts, and discussions on spirituality.
• Reviews: Analytical assessments of books, films, music, products, or services.
• Social: Messages and conversations shared on social media, online forums, and other digital platforms.
• Spoken: Spoken content such as speeches, dialogues, interviews, and recorded discussions.
• Subtitles: Written transcriptions or translations of spoken content from films, videos, or multimedia presentations.
• Web: Web-based content spanning diverse topics, often featuring hyperlinks and multimedia elements.
• Written: A broad category encompassing all forms of text-based communication, both print and digital.
D Performance Changes Analysis

Here, we examine how SEA-focused performance contrasts with the broader multilingual benchmark (MMTEB). To study the robustness of embeddings across world and SEA languages, we compare the ranking changes between our benchmark and the multilingual text embedding benchmark, MMTEB. We use the task-average metric (Table 6), as in MMTEB. As shown in Figure 2, Qwen3-Embedding-8B performed best on the world results of MMTEB, which cover 1,090 languages.² However, when we focus only on SEA languages using SEA-BED, Qwen3-Embedding-8B drops from first to second place, and Qwen3-Embedding-0.6B drops from second to eighth. This is because some of the linguistic and dialect knowledge required for SEA languages differs from that of other language groups. It highlights that the challenges, gaps, and model capabilities measured by MMTEB and by our benchmark differ, particularly for embedding models that do not fully support SEA languages. Even models that perform well on MMTEB are not guaranteed to achieve the same performance on SEA languages.

²We obtained the model rankings on Nov 28th, 2025.

Figure 2: Ranking difference between MMTEB and SEA-BED.
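The comparison behind Figure 2 amounts to computing, per model, the difference between its MMTEB rank and its SEA-BED rank. A sketch under the assumption that both leaderboards are available as rank dictionaries; only the two shifts stated above are taken from the text, and the dictionaries stand in for the full leaderboards.

```python
mmteb_rank = {"Qwen3-Embedding-8B": 1, "Qwen3-Embedding-0.6B": 2}
seabed_rank = {"Qwen3-Embedding-8B": 2, "Qwen3-Embedding-0.6B": 8}

for model in mmteb_rank:
    shift = seabed_rank[model] - mmteb_rank[model]  # positive = drops on SEA-BED
    print(f"{model}: MMTEB rank {mmteb_rank[model]} -> "
          f"SEA-BED rank {seabed_rank[model]} ({shift:+d})")
```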
E Tokenizer Analysis

This section examines whether limited coverage of SEA vocabulary in multilingual tokenizers correlates with poor downstream results; scripts such as Lao and Khmer are often underrepresented in tokenizers. Previous works (Ali et al., 2024; Arnett and Bergen, 2025; Liang et al., 2023) demonstrated that the vocabulary of a tokenizer affects model performance on downstream tasks: when a multilingual tokenizer represents more vocabulary for some languages, performance on those languages has also been observed to improve. In this study, we investigate whether the vocabulary in the tokenizer affects overall performance on SEA-BED. To answer this question, we count the SEA tokens in the tokenizer of each text embedding model and compare the counts with the performance in Table 5. As shown in Table 10, the language with the most tokens represented in a tokenizer is Filipino, with an average of 2.94 percent of vocabulary tokens across 13 models; yet Filipino performance (Table 5) is lower than Indonesian. Surprisingly, there are no Tetum tokens at all in the 13 models, and performance on Tetum is also the worst among the SEA languages. Moreover, the results are mixed for languages that do not use Latin script, i.e., Thai, Burmese, Lao, and Tamil.
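A sketch of one way to produce the counts in Table 10 (our reading of the procedure; the paper's exact counting rules may differ). Here a vocabulary token is attributed to a language if it contains a character from that language's Unicode block, which works for the non-Latin scripts but cannot separate the Latin-script languages (Indonesian, Filipino, Malay, Tetum).

```python
from transformers import AutoTokenizer

# Unicode blocks for the non-Latin SEA scripts covered by SEA-BED.
SCRIPT_RANGES = {
    "tha": (0x0E00, 0x0E7F),  # Thai
    "lao": (0x0E80, 0x0EFF),  # Lao
    "mya": (0x1000, 0x109F),  # Myanmar
    "khm": (0x1780, 0x17FF),  # Khmer
    "tam": (0x0B80, 0x0BFF),  # Tamil
}

def vocabulary_coverage(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    vocab = list(tokenizer.get_vocab())
    coverage = {}
    for lang, (lo, hi) in SCRIPT_RANGES.items():
        hits = sum(any(lo <= ord(ch) <= hi for ch in token) for token in vocab)
        coverage[lang] = 100.0 * hits / len(vocab)  # percent of the vocabulary
    return coverage

print(vocabulary_coverage("intfloat/multilingual-e5-large-instruct"))
```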
F Language Similarity

To further understand the relationship between language similarity and performance, we analyze bitext retrieval datasets in SEA languages. In particular, we study the language similarity of robust and non-robust models, i.e., the top-performing and worst-performing embedding models, to identify desirable properties for improving on our benchmark. We use the dialect pairing subset in this experiment, with a batch size of 128 for the negative pair evaluation, and cosine similarity as the main metric, where higher values indicate greater embedding similarity between language pairs.

As shown in Figure 3, the top-performing model, multilingual-e5-large-instruct, shows consistently high similarity for positive samples, especially Indonesian-Malay (0.9682), Indonesian-Filipino (0.9305), and Thai-Vietnamese (0.9168), indicating strong cross-lingual embeddings.

Model | ind | tha | vie | mya | fil | khm | zsm | lao | tam | tet
multilingual-e5-large-instruct (560M) | 1.20 | 1.61 | 0.73 | 0.91 | 3.59 | 0.66 | 0.20 | 0.56 | 0.98 | 0.00
Qwen3-Embedding-8B (8B) | 0.39 | 1.70 | 0.84 | 0.02 | 1.13 | 0.03 | 0.11 | 0.02 | 0.02 | 0.00
bge-multilingual-gemma2 (9B) | 0.59 | 0.50 | 0.55 | 0.45 | 3.04 | 0.03 | 0.11 | 0.02 | 0.13 | 0.00
multilingual-e5-large (560M) | 1.20 | 1.61 | 0.73 | 0.91 | 3.59 | 0.66 | 0.20 | 0.56 | 0.98 | 0.00
bge-m3 (568M) | 1.20 | 1.61 | 0.73 | 0.91 | 3.59 | 0.66 | 0.20 | 0.56 | 0.98 | 0.00
GritLM-7B (7B) | 0.27 | 0.19 | 0.55 | 0.45 | 3.04 | 0.03 | 0.11 | 0.02 | 0.13 | 0.00
multilingual-mpnet-base (278M) | 1.20 | 1.61 | 0.73 | 0.91 | 3.59 | 0.66 | 0.20 | 0.56 | 0.98 | 0.00
LaBSE (471M) | 1.12 | 0.45 | 0.81 | 0.45 | 4.65 | 0.54 | 0.19 | 0.29 | 1.28 | 0.00
e5-mistral-7b-instruct (7B) | 0.27 | 0.19 | 0.55 | 0.45 | 3.04 | 0.03 | 0.11 | 0.02 | 0.13 | 0.00
Qwen3-Embedding-0.6B (595M) | 0.39 | 1.70 | 0.84 | 0.02 | 1.13 | 0.03 | 0.11 | 0.02 | 0.02 | 0.00
multilingual-MiniLM-L12 (118M) | 1.20 | 1.61 | 0.73 | 0.91 | 3.59 | 0.66 | 0.20 | 0.56 | 0.98 | 0.00
Gemma-SEA-LION-v3-9B-IT (9B) | 0.59 | 0.50 | 0.55 | 0.45 | 3.04 | 0.03 | 0.11 | 0.02 | 0.13 | 0.00
Sailor2-8B-Chat (8B) | 0.39 | 1.70 | 0.84 | 0.02 | 1.13 | 0.03 | 0.11 | 0.02 | 0.02 | 0.00
Average | 0.77 | 1.15 | 0.71 | 0.53 | 2.94 | 0.31 | 0.15 | 0.25 | 0.52 | 0.00

Table 10: The percentage of vocabulary tokens covering each language, for each model.
However, multilingual-e5-large-instruct unexpectedly maintains high similarity for negative samples (0.75-0.81), indicating limited distinction between unrelated sentence pairs and highlighting a gap for improvement. In contrast, multilingual-MiniLM-L12-v2 struggles with related positive pairs, showing lower similarity for Indonesian-Filipino (0.4601) and notably weak similarity with Burmese (around 0.12-0.59). Interestingly, this model achieves low similarity for negative pairs, mostly under 0.08, clearly distinguishing unrelated samples. Although it falls short in overall embedding quality, multilingual-MiniLM-L12-v2's distinct negative-sample separation provides valuable insight into desirable characteristics for embedding models. These findings suggest that a balanced approach, achieving both strong cross-lingual similarity for positive examples and clear differentiation for negative examples, is essential for improving future embedding models.

Figure 3: Cross-lingual similarity on the bitext mining task (dialect pairing subset). Panels: (a) the top-performing model, multilingual-e5-large-instruct, on the positive parallel samples; (b) the worst-performing model, multilingual-MiniLM-L12-v2, on the positive parallel samples; (c) the top-performing model on the negative parallel samples; (d) the worst-performing model on the negative parallel samples.
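A minimal sketch of the measurement in this section, assuming sentence-transformers is installed and that parallel positive pairs are available per language pair; negatives are formed here by misaligning the target side, standing in for the negative pairs evaluated with a batch size of 128.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def mean_cosine(src_texts, tgt_texts):
    """Average cosine similarity over aligned sentence pairs."""
    a = model.encode(src_texts, normalize_embeddings=True)
    b = model.encode(tgt_texts, normalize_embeddings=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def positive_similarity(src_texts, tgt_texts):
    return mean_cosine(src_texts, tgt_texts)

def negative_similarity(src_texts, tgt_texts):
    # Rotate the target list by one so every pair is unrelated.
    return mean_cosine(src_texts, tgt_texts[1:] + tgt_texts[:1])
```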
G Machine vs. Human Datasets

This section tests whether human-crafted data yields results different from machine-generated data. We split the experiment into translation and generation studies.

Machine Translation vs. Human-translated Datasets. To compare machine-translated and human-translated data, we evaluate our new Thai and Burmese STS datasets from Table 4 against versions translated by Google's MT system (see the sketch at the end of this section). As shown in Table 11, Thai results differ by less than 2 Spearman points across all settings, aligning with prior work showing that English-Thai NMT is already reliable for practical use (Lowphansirikul et al., 2022; Chiaranaipanich et al., 2024). In contrast, the gap between Burmese human- and machine-translated datasets is larger in most cases. We found that the Google NMT output for Burmese sometimes code-switches between Thai and Burmese characters, as shown in Figure 4. This emphasizes that, for underrepresented languages, human-created evaluation datasets remain better than machine translations.

Figure 4: Code-switching between Thai and Burmese words (translated by Google NMT). The figure shows two Burmese sentences in which Thai characters appear mid-word; their English glosses are "Pass Charoen Pradit Road, enter the Office of the President's Auditorium." and "The chemical elements to the left of the periodic table have a much lower ionization energy."

Machine Generation vs. Human Datasets. While many recent works introduce machine-generated datasets for scalability, we argue that fully machine-generated data remains unstable and should not dominate benchmarks, as it can distort model performance and research conclusions. To illustrate this, we compare human-crafted and machine-generated datasets using the top five models from our previous study. Keeping the same tasks and languages, we evaluate only datasets created by either humans or machines. The results for both settings are reported in Tables 14, 15, and 16. Our results reveal two main effects: (i) shifts in average performance and (ii) shifts in model ranking. First, machine-generated datasets almost always reduce performance relative to human-crafted ones, with the sole exception of bge-multilingual-gemma2. This aligns with our results for machine-translated data (Table 11), indicating that machine-generated inputs can degrade model performance. Second, rankings become unstable, making evaluations unreliable. A robust benchmark should align with human-based outcomes, yet machine-generated datasets fail to preserve ranking consistency. Tetum offers a clear example: in Table 14, bge-multilingual-gemma2 leaps from 30.91 to 99.18 points. This occurs because the Tetum machine-generated set resembles a simple language-detection task from MADLAD-400, where all models score above 90, suggesting data leakage or in-domain overlap. Additional analysis of the machine-generated datasets is provided in Appendix H.
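The machine- vs. human-translation comparison in Table 11 reduces to scoring the same STS test set under two translations and reporting the Spearman gap. A sketch with a hypothetical evaluate_sts helper; the model is assumed to expose a sentence-transformers-style encode method.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_sts(model, sentence_pairs, gold_scores):
    """Spearman correlation (x100) between cosine similarities and gold scores."""
    left = model.encode([a for a, _ in sentence_pairs], normalize_embeddings=True)
    right = model.encode([b for _, b in sentence_pairs], normalize_embeddings=True)
    similarities = np.sum(left * right, axis=1)
    return 100.0 * spearmanr(similarities, gold_scores).correlation

# Per-language gap, as reported in the Diff. columns of Table 11:
# diff = abs(evaluate_sts(model, machine_pairs, gold)
#            - evaluate_sts(model, human_pairs, gold))
```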
H Dataset Analysis

Performance Analysis. In addition to the study in Appendix G, we analyze the correlation between the percentage of vocabulary-token coverage and performance scores for the two top-performing and two lowest-performing embedding models, as shown in Figure 5. The results indicate that the vocabulary size of a model does not have a direct effect on its performance in the text embedding benchmark for SEA languages. Although some models' tokenizers cover SEA vocabularies with more tokens, their benchmark performance is not significantly higher than that of models with lower vocabulary coverage. This indicates that simply increasing vocabulary size does not necessarily lead to better performance in text embedding tasks for SEA languages.

Figure 5: Correlation between the percentage of vocabulary-token coverage and performance score for the two top-performing models, multilingual-e5-large-instruct and Qwen3-Embedding-8B (top), and the two lowest-performing models, Qwen3-Embedding-0.6B and multilingual-MiniLM-L12 (bottom). Both values are normalized to a [0, 1] scale for comparability across languages and models.
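The relationship shown in Figure 5 can be checked directly from the paper's own numbers. Below, Spearman correlation for multilingual-e5-large-instruct, with coverage copied from Table 10 and per-language scores from Table 14 (human-crafted datasets).

```python
from scipy.stats import spearmanr

# Per-language values for multilingual-e5-large-instruct, in the order
# ind, tha, vie, mya, fil, khm, zsm, lao, tam, tet.
coverage = [1.20, 1.61, 0.73, 0.91, 3.59, 0.66, 0.20, 0.56, 0.98, 0.00]   # Table 10
score = [82.06, 83.12, 79.54, 78.27, 80.33, 79.63, 88.68, 87.07, 76.89, 41.06]  # Table 14

rho, p_value = spearmanr(coverage, score)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")  # weak, non-significant correlation
```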
Model | Original (eng) | Mach. (mya) | Hum. (mya) | Diff. (mya) | Mach. (tha) | Hum. (tha) | Diff. (tha)
multilingual-e5-large-instruct (560M) | 82.87 | 74.82 | 75.06 | 0.24 | 79.66 | 79.80 | 0.14
Qwen3-Embedding-8B (8B) | 81.17 | 74.02 | 75.81 | 1.79 | 80.75 | 80.74 | 0.01
bge-multilingual-gemma2 (9B) | 84.64 | 75.51 | 72.87 | 2.64 | 80.25 | 78.97 | 1.28
multilingual-e5-large (560M) | 80.00 | 71.49 | 71.55 | 0.06 | 76.44 | 76.50 | 0.06
bge-m3 (568M) | 80.86 | 74.57 | 71.96 | 2.61 | 77.75 | 76.07 | 1.68
GritLM-7B (7B) | 82.65 | 65.60 | 66.03 | 0.43 | 74.64 | 74.81 | 0.17
e5-mistral-7b-instruct (7B) | 81.86 | 62.36 | 64.63 | 2.27 | 74.57 | 74.57 | 0.00
Qwen3-Embedding-0.6B (595M) | 80.11 | 67.10 | 69.23 | 2.13 | 77.88 | 77.79 | 0.09
multilingual-mpnet-base (278M) | 80.54 | 72.34 | 71.16 | 1.18 | 72.61 | 72.60 | 0.01
LaBSE (471M) | 73.50 | 69.06 | 70.04 | 0.98 | 69.29 | 68.83 | 0.46
multilingual-MiniLM-L12 (118M) | 78.89 | 69.27 | 67.26 | 2.01 | 72.25 | 72.23 | 0.02
Gemma-SEA-LION-v3-9B-IT (9B) | 60.42 | 46.50 | 49.29 | 2.79 | 55.97 | 56.01 | 0.04
Sailor2-8B-Chat (8B) | 57.94 | 48.79 | 52.89 | 4.10 | 55.85 | 54.36 | 1.49
Proprietary models
embed-multilingual-v3.0 | 82.62 | 74.56 | 73.92 | 0.64 | 78.40 | 78.63 | 0.23
jina-embeddings-v3 | 78.37 | 75.92 | 75.61 | 0.31 | 76.40 | 76.36 | 0.04
voyage-3 | 81.77 | 69.54 | 68.20 | 1.34 | 77.42 | 73.64 | 3.78
text-embedding-3-small | 82.37 | 54.86 | 55.65 | 0.79 | 66.47 | 66.45 | 0.02

Table 11: Model performance on machine-translated vs. human-translated versions of our STS datasets.
Discussion. In contrast to previous works, we conclude that the number of tokens present in the tokenizer might not strongly correlate with performance in a language. SEA languages span diverse scripts, and a larger per-language vocabulary alone might not yield significant improvement. As the language performance of GritLM-7B and bge-multilingual-gemma2 shows (Table 5), omitting SEA languages from the training data results in poor performance on those languages. To achieve promising results, adding more SEA training datasets at the training step improves downstream task performance more than adding tokens to the tokenizer. Moreover, we report dataset statistics across Southeast Asian languages in Table 12, including the number of tokens and samples per language, to support decision-making when adding datasets for each language.

Language name | Number of tokens | Number of samples
Indonesian | 56,742,156 | 1,056,710
Thai | 214,798,897 | 1,337,357
Vietnamese | 33,534,467 | 910,496
Burmese | 21,080,124 | 234,913
Filipino | 13,624,124 | 248,089
Khmer | 20,661,441 | 194,112
Malay | 15,792,761 | 365,304
Lao | 15,196,896 | 168,032
Tamil | 96,633,008 | 322,393
Tetum | 88,928,018 | 48,728

Table 12: Dataset statistics across Southeast Asian languages, including the number of tokens and samples per language. All statistics are computed using the tokenizer of multilingual-e5-large-instruct, the top-performing model in our evaluation.
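A sketch of how the Table 12 statistics can be reproduced with the stated tokenizer; iter_dataset_texts is a hypothetical iterator over (language, text) pairs from the benchmark datasets, stubbed out below so the snippet runs.

```python
from collections import Counter
from transformers import AutoTokenizer

def iter_dataset_texts():
    # Hypothetical stand-in for iterating over all SEA-BED datasets.
    yield from [("tha", "สวัสดีครับ"), ("ind", "Selamat pagi")]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")
token_counts, sample_counts = Counter(), Counter()

for language, text in iter_dataset_texts():
    sample_counts[language] += 1
    token_counts[language] += len(tokenizer(text, add_special_tokens=False)["input_ids"])

for language in sample_counts:
    print(language, token_counts[language], sample_counts[language])
```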
I Models

To evaluate text embedding on SEA texts, we experiment with 13 open-source models, spanning encoder and decoder architectures, as follows:

• multilingual-e5-large (Wang et al., 2024b). The multilingual-e5-large model is trained on over 100 languages using a combination of contrastive pre-training on diverse multilingual text pairs and supervised fine-tuning on high-quality labeled datasets, with mined hard negatives and knowledge distillation techniques.

• multilingual-e5-large-instruct (Wang et al., 2024b). The multilingual-e5-large-instruct model is similar to multilingual-e5-large, with additional fine-tuning on instruction data.

• e5-mistral-7b-instruct (Wang et al., 2024a). The e5-mistral-7b-instruct model is a text embedding model based on Mistral-7B (Jiang et al., 2023), fine-tuned with contrastive learning on synthetic instruction data across 93 languages. Using a two-step prompting strategy, the model learns from diverse embedding tasks and achieves strong multilingual performance with under 1,000 training steps.

• multilingual-mpnet-base-v2 (Reimers and Gurevych, 2020). The multilingual-mpnet-base-v2 model is trained on parallel data for over 50
languages via multilingual knowledge distillation, using paraphrase-mpnet-base-v2 (Reimers and Gurevych, 2019) as the teacher model, xlm-roberta-base (Conneau et al., 2019) as the student model, and an MSE loss to align their embeddings (see the sketch after this list).

• LaBSE (Feng et al., 2022). The LaBSE model is trained on over 109 languages using a dual-encoder transformer architecture based on BERT (Devlin et al., 2018), leveraging a translation ranking loss to produce sentence embeddings that align semantically similar sentences across languages in a shared vector space.

• multilingual-MiniLM-L12-v2 (Reimers and Gurevych, 2020). The multilingual-MiniLM-L12-v2 model is trained with the same multilingual knowledge distillation approach as multilingual-mpnet-base-v2, with paraphrase-MiniLM-L12-v2 (Reimers and Gurevych, 2019) as the teacher model, Multilingual-MiniLM-L12-H384 (Wang et al., 2020) as the student model, and an MSE loss to align their embeddings.

• bge-m3 (Chen et al., 2024). The BGE-M3 model is trained on over 100 languages using a combination of contrastive pre-training on diverse multilingual corpora and supervised fine-tuning on high-quality labeled and synthetic datasets, leveraging hard negative mining and a self-knowledge distillation framework that integrates dense, sparse, and multi-vector retrieval signals.

• bge-multilingual-gemma2 (Chen et al., 2024). The bge-multilingual-gemma2 model is built on Gemma-2-9b (Team, 2024) and trained on diverse multilingual data across tasks such as retrieval, classification, and clustering.

• GritLM-7B (Muennighoff et al., 2024). The GritLM-7B model is built on the Mistral-7B
(Jiang et al., 2023) architecture and trained with Generative Representational Instruction Tuning (GRIT), a unified framework combining contrastive learning for embeddings and next-token prediction for generation, with task-specific instructions and a joint loss that enables strong performance on both tasks.

• Qwen3-Embedding-0.6B and Qwen3-Embedding-8B (Zhang et al., 2025). The Qwen3-Embedding-0.6B and Qwen3-Embedding-8B models were trained on multiple languages using a multi-stage pipeline that combines large-scale weakly supervised pre-training on synthetic multilingual data with supervised fine-tuning and model merging to enhance robustness and generalization.

• Sailor2-8B-Chat (Dou et al., 2025). The Sailor2-8B-Chat model, based on an expanded Qwen2.5-7B (Yang et al., 2024), was trained on 13 SEA languages using two-stage continual pre-training with balanced, high-quality data, followed by two-stage instruction tuning and preference tuning with length-regularized DPO (Park et al., 2024).

• Gemma-SEA-LION-v3-9B-IT (Singapore, 2024). The Gemma-SEA-LION-v3-9B-IT model is fine-tuned from the Gemma2 9B (Rivière et al., 2024) base model on English and multiple SEA languages (such as Indonesian, Thai, and Vietnamese), using a combination of full-parameter fine-tuning, on-policy alignment, and model merging.
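As referenced in the multilingual-mpnet-base-v2 and multilingual-MiniLM-L12-v2 entries above, here is a minimal PyTorch sketch of the multilingual knowledge-distillation objective of Reimers and Gurevych (2020). It assumes teacher and student are callables that return differentiable sentence embeddings of the same dimensionality; the actual training code differs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, source_batch, target_batch):
    """For parallel (source, translation) pairs, pull both student embeddings
    toward the teacher's embedding of the source sentence."""
    with torch.no_grad():
        anchor = teacher(source_batch)       # teacher embeds the source side only
    student_src = student(source_batch)      # student on the source sentences
    student_tgt = student(target_batch)      # student on the translations
    return F.mse_loss(student_src, anchor) + F.mse_loss(student_tgt, anchor)
```

Minimizing this loss forces translations of the same sentence to land at the same point of the teacher's embedding space, which is what transfers the teacher's (often English-centric) semantic space to other languages.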
Moreover, we also evaluate the following proprietary models:

• text-embedding-3-small.³ A highly efficient embedding model suitable for various downstream applications.
• embed-multilingual-v3.0.⁴ Designed for multilingual representation learning across over 100 languages.
• voyage-3.⁵ Provides efficient, high-quality embeddings optimized for retrieval across diverse domains.
• jina-embeddings-v3 (Sturua et al., 2024b). Designed for efficient semantic similarity and search applications, supporting various multilingual scenarios.

³https://openai.com/index/new-embedding-models-and-api-updates
⁴https://cohere.com/blog/introducing-embed-v3
⁵https://blog.voyageai.com/2024/09/18/voyage-3/

All proprietary models were accessed and evaluated using their latest publicly available versions at experimentation time (April 4th, 2025). The full model links are shown in Table 13.

Model | Hugging Face link
multilingual-e5-large-instruct | https://huggingface.co/intfloat/multilingual-e5-large-instruct
Qwen3-Embedding-8B | https://huggingface.co/Qwen/Qwen3-Embedding-8B
bge-multilingual-gemma2 | https://huggingface.co/BAAI/bge-multilingual-gemma2
multilingual-e5-large | https://huggingface.co/intfloat/multilingual-e5-large
bge-m3 | https://huggingface.co/BAAI/bge-m3
GritLM-7B | https://huggingface.co/GritLM/GritLM-7B
e5-mistral-7b-instruct | https://huggingface.co/intfloat/e5-mistral-7b-instruct
Qwen3-Embedding-0.6B | https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
multilingual-mpnet-base | https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
LaBSE | https://huggingface.co/sentence-transformers/LaBSE
multilingual-MiniLM-L12 | https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Gemma-SEA-LION-v3-9B-IT | https://huggingface.co/aisingapore/Gemma-SEA-LION-v3-9B-IT
Sailor2-8B-Chat | https://huggingface.co/sail/Sailor2-8B-Chat

Table 13: Models and Hugging Face links used for the evaluation.
Model | ind | tha | vie | mya | fil | khm | zsm | lao | tam | tet | Avg.
Number of datasets (→) | (51) | (50) | (36) | (32) | (28) | (18) | (15) | (16) | (17) | (2) | (265)
multilingual-e5-large-instruct (560M) | 82.06 | 83.12 | 79.54 | 78.27 | 80.33 | 79.63 | 88.68 | 87.07 | 76.89 | 41.06 | 77.67±12.71
Qwen3-Embedding-8B (8B) | 82.98 | 82.85 | 80.67 | 75.35 | 78.78 | 75.86 | 86.87 | 80.70 | 75.66 | 36.00 | 75.57±13.65
bge-m3 (568M) | 81.28 | 79.67 | 77.48 | 73.70 | 76.87 | 76.56 | 88.07 | 85.03 | 77.70 | 36.91 | 75.33±13.43
multilingual-e5-large (560M) | 81.90 | 81.72 | 80.67 | 70.56 | 79.42 | 72.63 | 82.92 | 82.81 | 78.03 | 32.24 | 74.29±14.58
bge-multilingual-gemma2 (9B) | 83.28 | 82.08 | 80.29 | 70.43 | 80.67 | 74.03 | 85.18 | 66.50 | 80.99 | 30.91 | 73.44±15.27
LaBSE (471M) | 75.72 | 72.59 | 74.82 | 73.98 | 78.27 | 74.66 | 89.21 | 83.41 | 77.05 | 43.46 | 74.42±11.34
multilingual-mpnet-base (278M) | 77.09 | 76.06 | 74.01 | 60.52 | 51.01 | 62.34 | 77.23 | 65.09 | 63.15 | 34.54 | 64.10±12.83
e5-mistral-7b-instruct (7B) | 82.40 | 76.50 | 77.32 | 46.95 | 79.17 | 54.68 | 81.45 | 23.12 | 66.61 | 37.21 | 62.54±20.00
GritLM-7B (7B) | 83.32 | 74.64 | 78.80 | 42.56 | 78.48 | 50.20 | 80.76 | 24.36 | 60.09 | 40.72 | 61.39±19.77
Qwen3-Embedding-0.6B (595M) | 78.83 | 77.16 | 76.70 | 47.62 | 62.83 | 39.19 | 71.48 | 24.31 | 60.67 | 30.69 | 56.95±19.22
multilingual-MiniLM-L12 (118M) | 73.54 | 72.49 | 71.26 | 53.68 | 46.20 | 35.01 | 70.63 | 42.72 | 27.57 | 30.54 | 52.36±17.52
Gemma-SEA-LION-v3-9B-IT (9B) | 51.93 | 42.10 | 52.09 | 28.02 | 53.69 | 35.63 | 52.25 | 16.07 | 27.58 | 0.08 | 35.94±17.17
Sailor2-8B-Chat (8B) | 51.30 | 36.18 | 42.23 | 27.43 | 45.19 | 22.91 | 29.08 | 11.84 | 27.22 | 1.45 | 29.48±14.42
Proprietary models
embed-multilingual-v3.0 | 83.03 | 82.94 | 80.67 | 76.94 | 80.16 | 78.02 | 86.66 | 86.64 | 78.86 | 38.08 | 77.20±13.42
jina-embeddings-v3 | 80.37 | 80.35 | 77.25 | 75.61 | 74.94 | 74.59 | 81.96 | 79.56 | 75.94 | 33.89 | 73.45±13.41
voyage-3 | 78.28 | 71.60 | 75.25 | 46.42 | 72.18 | 29.32 | 73.03 | 18.49 | 67.37 | 25.62 | 55.76±22.19
text-embedding-3-small | 82.04 | 55.86 | 71.64 | 30.68 | 68.52 | 24.25 | 71.76 | 18.36 | 34.01 | 35.25 | 49.24±22.02

(a) Human-crafted datasets
Model | ind | tha | vie | mya | fil | khm | zsm | lao | tam | tet | Avg.
Number of datasets (→) | (19) | (5) | (5) | (3) | (3) | (4) | (4) | (3) | (1) | (2) | (49)
multilingual-e5-large-instruct (560M) | 72.04 | 61.03 | 66.92 | 79.47 | 68.54 | 71.38 | 69.31 | 67.23 | 80.45 | 97.73 | 73.41±9.79
Qwen3-Embedding-8B (8B) | 69.95 | 67.83 | 66.91 | 70.19 | 71.24 | 73.66 | 65.62 | 64.85 | 80.86 | 98.89 | 73.00±9.68
bge-m3 (568M) | 69.14 | 56.74 | 64.57 | 66.90 | 65.62 | 74.74 | 61.77 | 67.47 | 74.39 | 94.15 | 69.55±9.65
multilingual-e5-large (560M) | 68.58 | 61.60 | 66.43 | 67.30 | 64.61 | 69.77 | 69.52 | 64.45 | 74.34 | 94.86 | 70.15±8.89
bge-multilingual-gemma2 (9B) | 71.17 | 65.64 | 67.80 | 65.55 | 69.77 | 76.00 | 76.63 | 62.19 | 80.40 | 99.18 | 73.43±10.14
LaBSE (471M) | 68.42 | 46.30 | 56.59 | 69.90 | 65.05 | 71.38 | 59.14 | 60.84 | 68.88 | 94.85 | 66.14±12.01
multilingual-mpnet-base (278M) | 67.93 | 52.53 | 62.97 | 68.37 | 61.37 | 73.89 | 68.93 | 68.51 | 66.07 | 67.01 | 65.76±5.47
e5-mistral-7b-instruct (7B) | 70.13 | 57.49 | 61.31 | 69.09 | 68.11 | 64.61 | 68.93 | 53.93 | 68.74 | 96.26 | 67.86±10.82
GritLM-7B (7B) | 72.24 | 54.86 | 67.10 | 71.61 | 68.20 | 63.31 | 69.58 | 60.49 | 66.04 | 98.62 | 69.21±11.01
Qwen3-Embedding-0.6B (595M) | 66.02 | 62.77 | 63.78 | 64.62 | 65.68 | 66.20 | 62.14 | 58.99 | 67.51 | 96.07 | 67.38±9.84
multilingual-MiniLM-L12 (118M) | 65.86 | 49.79 | 60.10 | 62.95 | 56.71 | 61.99 | 65.64 | 59.30 | 33.12 | 64.84 | 58.03±9.50
Gemma-SEA-LION-v3-9B-IT (9B) | 44.34 | 37.41 | 50.56 | 60.52 | 58.38 | 57.09 | 37.94 | 53.72 | 56.71 | 50.04 | 50.67±7.89
Sailor2-8B-Chat (8B) | 44.86 | 33.92 | 48.10 | 59.03 | 55.23 | 56.18 | 36.99 | 52.86 | 51.61 | 50.08 | 48.89±7.77
Proprietary models
embed-multilingual-v3.0 | 69.76 | 61.41 | 66.40 | 67.52 | 68.06 | 72.48 | 66.53 | 65.73 | 79.09 | 95.43 | 71.24±9.20
jina-embeddings-v3 | 68.40 | 61.53 | 67.83 | 69.73 | 67.76 | 75.37 | 62.74 | 69.11 | 79.70 | 96.33 | 71.85±9.59
voyage-3 | 67.32 | 51.57 | 62.38 | 67.12 | 64.41 | 60.69 | 54.49 | 55.08 | 65.71 | 97.34 | 64.61±12.12
text-embedding-3-small | 67.53 | 49.03 | 58.65 | 55.30 | 63.93 | 56.71 | 62.36 | 53.88 | 58.56 | 94.92 | 62.09±12.03

(b) Machine-generated datasets

Table 14: Language-model performance view, where each cell reports scores averaged over all evaluated task types, separated into human-crafted datasets (a) and machine-generated datasets (b).
Model | Dim. | Clf | M. Clf | Pr. Clf | STS | Clust | Btxt | Rtrvl | In. Rtrvl | Rrnk | Avg.
Number of datasets (→) | | (64) | (9) | (7) | (10) | (10) | (20) | (18) | (1) | (1) | (140)
multilingual-e5-large-instruct (560M) | 1024 | 80.33 | 86.61 | 69.10 | 73.08 | 53.18 | 87.62 | 80.41 | 96.38 | 77.24 | 78.22±11.72
Qwen3-Embedding-8B (8B) | 4096 | 81.48 | 89.35 | 65.45 | 73.38 | 43.91 | 84.67 | 83.89 | 96.18 | 78.51 | 77.42±14.49
bge-m3 (568M) | 4096 | 78.59 | 88.78 | 71.88 | 71.49 | 33.73 | 86.34 | 77.92 | 87.73 | 75.98 | 74.72±15.73
multilingual-e5-large (560M) | 1024 | 80.97 | 87.87 | 67.63 | 68.07 | 37.03 | 84.51 | 81.66 | 96.01 | 79.00 | 75.86±16.09
bge-multilingual-gemma2 (9B) | 3584 | 81.19 | 89.44 | 78.53 | 69.57 | 41.98 | 81.86 | 82.81 | 96.89 | 69.04 | 76.81±14.80
LaBSE (471M) | 768 | 77.53 | 85.13 | 63.45 | 67.92 | 32.53 | 86.61 | 56.82 | 79.92 | 61.23 | 67.90±16.06
multilingual-mpnet-base (278M) | 768 | 76.18 | 85.56 | 75.02 | 67.84 | 33.87 | 67.67 | 61.05 | 86.80 | 64.01 | 68.67±14.92
e5-mistral-7b-instruct (7B) | 4096 | 79.03 | 86.96 | 65.99 | 60.77 | 40.91 | 64.06 | 75.51 | 94.31 | 75.33 | 71.43±14.85
GritLM-7B (7B) | 4096 | 79.80 | 87.16 | 65.82 | 62.27 | 36.67 | 62.15 | 68.64 | 93.50 | 73.37 | 69.93±15.66
Qwen3-Embedding-0.6B (595M) | 1024 | 77.02 | 86.83 | 62.33 | 64.14 | 34.79 | 55.19 | 78.23 | 94.30 | 75.03 | 69.76±16.89
multilingual-MiniLM-L12 (118M) | 768 | 72.85 | 83.02 | 69.86 | 64.23 | 23.51 | 52.14 | 55.10 | 83.58 | 62.27 | 62.95±17.35
Gemma-SEA-LION-v3-9B-IT (9B) | 3584 | 78.67 | 88.34 | 57.89 | 32.62 | 28.74 | 16.05 | 23.74 | 31.16 | 65.49 | 46.97±24.64
Sailor2-8B-Chat (8B) | 3584 | 79.44 | 88.80 | 56.63 | 32.32 | 26.28 | 4.25 | 10.94 | 10.57 | 47.05 | 39.59±28.86
Proprietary models
embed-multilingual-v3.0 | 1024 | 81.59 | 88.87 | 68.29 | 71.24 | 40.69 | 88.34 | 81.36 | 96.87 | 77.77 | 77.22±15.39
jina-embeddings-v3 | 1024 | 80.25 | 87.72 | 65.98 | 69.56 | 44.38 | 81.83 | 78.94 | 96.51 | 72.49 | 75.30±14.02
voyage-3 | 1024 | 78.26 | 87.36 | 62.59 | 61.50 | 36.50 | 54.22 | 66.19 | 87.29 | 74.62 | 67.61±15.46
text-embedding-3-small | 1536 | 75.76 | 86.53 | 61.95 | 47.54 | 30.24 | 40.82 | 67.57 | 83.60 | 71.25 | 62.81±18.36

(a) Human-crafted datasets
Model | Dim. | Clf | M. Clf | Pr. Clf | STS | Clust | Btxt | Rtrvl | In. Rtrvl | Rrnk | Avg.
Number of datasets (→) | | (9) | (2) | (6) | (1) | (0) | (6) | (2) | (3) | (0) | (29)
multilingual-e5-large-instruct (560M) | 1024 | 65.76 | 93.34 | 64.24 | 80.61 | - | 92.13 | 43.04 | 60.00 | - | 71.30±16.96
Qwen3-Embedding-8B (8B) | 4096 | 65.45 | 96.05 | 60.38 | 79.19 | - | 86.67 | 61.99 | 62.35 | - | 73.15±13.13
bge-m3 (568M) | 4096 | 64.12 | 94.87 | 65.76 | 76.84 | - | 83.29 | 27.70 | 48.76 | - | 65.91±20.76
multilingual-e5-large (560M) | 1024 | 65.79 | 93.78 | 63.00 | 72.67 | - | 84.57 | 42.48 | 56.07 | - | 68.34±15.96
bge-multilingual-gemma2 (9B) | 3584 | 64.19 | 97.45 | 70.37 | 78.44 | - | 84.81 | 56.81 | 63.07 | - | 73.59±13.15
LaBSE (471M) | 768 | 64.52 | 93.48 | 60.60 | 69.13 | - | 90.89 | 21.12 | 26.34 | - | 60.87±26.24
multilingual-mpnet-base (278M) | 768 | 62.87 | 94.99 | 67.01 | 74.76 | - | 75.91 | 29.14 | 40.98 | - | 63.67±20.61
e5-mistral-7b-instruct (7B) | 4096 | 65.81 | 94.43 | 62.16 | 68.97 | - | 87.72 | 45.87 | 41.18 | - | 66.59±18.22
GritLM-7B (7B) | 4096 | 66.83 | 95.98 | 62.50 | 69.54 | - | 90.38 | 37.93 | 58.96 | - | 68.87±18.12
Qwen3-Embedding-0.6B (595M) | 1024 | 62.89 | 94.34 | 58.17 | 68.95 | - | 80.57 | 55.39 | 56.30 | - | 68.09±13.48
multilingual-MiniLM-L12 (118M) | 768 | 59.81 | 93.24 | 62.21 | 65.30 | - | 73.23 | 24.94 | 37.01 | - | 59.39±20.94
Gemma-SEA-LION-v3-9B-IT (9B) | 3584 | 63.11 | 97.12 | 56.86 | 51.31 | - | 2.03 | 4.12 | 4.31 | - | 39.84±34.25
Sailor2-8B-Chat (8B) | 3584 | 62.72 | 96.58 | 56.18 | 47.11 | - | 5.28 | 1.18 | 0.86 | - | 38.56±34.35
Proprietary models
embed-multilingual-v3.0 | 1024 | 64.52 | 94.97 | 63.15 | 76.86 | - | 87.86 | 44.68 | 55.49 | - | 69.65±16.55
jina-embeddings-v3 | 1024 | 64.41 | 94.61 | 60.96 | 80.38 | - | 82.28 | 48.38 | 59.98 | - | 70.14±14.86
voyage-3 | 1024 | 64.13 | 94.69 | 57.55 | 62.92 | - | 80.80 | 28.42 | 53.27 | - | 63.11±19.43
text-embedding-3-small | 1536 | 59.77 | 95.62 | 58.51 | 61.84 | - | 84.55 | 39.99 | 42.63 | - | 63.27±18.91

(b) Machine-generated datasets
Table 15: Task-model performance view, where each cell reports scores averaged over all evaluated languages, separated into human-crafted datasets (a) and machine-generated datasets (b). "-" indicates that no dataset is available for the corresponding task.

J Example of Our Evaluation Tool

Similar to previous text embedding benchmarks (Muennighoff et al., 2023; Enevoldsen et al., 2025), the SEA-BED evaluation tool can be run with a few lines of Python, as shown in Figure 6. We will release all evaluation tools, code, results, and datasets in the final version of our paper.

Figure 6: Example usage of the SEA-BED evaluation framework for Semantic Textual Similarity (STS) and Pair Classification tasks.
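Since SEA-BED follows the interface of MTEB-style benchmarks, a run looks roughly like the snippet below, mirroring Figure 6. The task selection here is a placeholder written against the public mteb package; SEA-BED's released task names and package may differ.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Placeholder task selection; SEA-BED's released task names may differ.
tasks = mteb.get_tasks(task_types=["STS", "PairClassification"], languages=["tha", "mya"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-large")
```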
K Task Examples

Figures 7 to 15 provide examples for each task covered in the SEA-BED benchmark.

Language | Btxt | Clf | Clust | In. Rtrvl | M. Clf | Pr. Clf | Rtrvl | Rrnk | STS
Indonesian | 75.98 | 81.00 | 43.55 | - | 92.49 | 76.28 | 72.71 | 69.28 | 46.43
Thai | 70.99 | 75.67 | 38.49 | 82.98 | 77.00 | 70.31 | 74.69 | 71.86 | 68.32
Vietnamese | 76.36 | 80.62 | 38.56 | - | 96.49 | 71.86 | 56.80 | - | -
Burmese | 48.83 | 85.10 | 28.09 | - | 69.71 | 67.10 | 55.44 | - | 61.19
Filipino | 69.56 | 75.60 | 52.69 | - | - | 50.63 | - | - | -
Khmer | 54.21 | 74.39 | 23.83 | - | - | - | - | - | -
Malay | 75.98 | 77.21 | - | - | - | - | - | - | -
Lao | 51.79 | 75.91 | 18.23 | - | - | - | - | - | -
Tamil | 61.23 | 84.35 | 38.75 | - | - | 67.14 | 46.27 | - | 37.21
Tetum | 31.09 | - | - | - | - | - | - | - | -

(a) Human-crafted datasets

Language | Btxt | Clf | Clust | In. Rtrvl | M. Clf | Pr. Clf | Rtrvl | Rrnk | STS
Indonesian | 74.84 | 67.61 | - | 34.82 | - | 59.58 | - | - | 75.65
Thai | - | 59.11 | - | 57.88 | - | - | 25.87 | - | 71.99
Vietnamese | - | 67.96 | - | 42.75 | - | 55.49 | - | - | 77.15
Burmese | - | 66.15 | - | - | - | - | - | - | 68.03
Filipino | - | 36.12 | - | - | 90.06 | - | - | - | 68.41
Khmer | - | 39.36 | - | - | 100.00 | 66.41 | - | - | 63.74
Malay | - | 55.84 | - | - | - | 71.50 | 46.27 | - | 75.39
Lao | - | 59.81 | - | - | - | 64.36 | - | - | 59.11
Tamil | - | - | - | - | - | - | - | - | 67.78
Tetum | 75.08 | 99.81 | - | - | - | - | - | - | -

(b) Machine-generated datasets

Table 16: Language-task performance view, where each cell reports scores averaged over all evaluated models, separated into human-crafted datasets (a) and machine-generated datasets (b). "-" indicates that no dataset is available for the corresponding language-task combination.

L Data Links

The complete dataset information, including citations, languages, domains, annotation creators, and licenses, is shown in Tables 17 and 18.
Task | Text | Label
Language Identification | sadiissss kae pasti neng tegal yakin inyong | Jawa Ngapak
Language Identification | kowe gurung ngerti betapa bahagianya dichat arek e seh wkwkwk | Jawa Timur
Sentiment | Namiss ko yung pusa ko nung bata pa ako, lagi ko sya katabi matulog and malambing. | Positive
Sentiment | Sa tiktok ang pugad nila. Dapat ma ban na ang tiktok app | Negative
Topic Classification | மீண் டும் ெசல் வராகவன் படத் தல் நடிப்பீர் களா? | tamil-cinema
Topic Classification | இத் தைகேயாைர பாதுகாக் கேவ இத் தைகய ஸ் கூட் டர் வடிவைமக் கப் பட் டுள் ளது. | business
Toxic Language Detection | Yang blm pada move on mending bsk piknik aja mumpung long weekend hehehe | Non hate speech
Toxic Language Detection | Karna dia ga punya program hanya modal bacot doang | Hate speech

Figure 7: Classification examples.

Task | Text | Label
Sentiment | saran ku dan pengalaman ku , mending beli mobil niaga L300 atau canter . irit dan bandel . | [fuel (positive), machine (positive), others (neutral), part (neutral), price (neutral), service (neutral)]
Sentiment | Sudah dari dulu Toyota selalu kasih produk super mahal dengan fitur pas pasan | [fuel (neutral), machine (neutral), others (neutral), part (negative), price (negative), service (neutral)]
Topic Classification | เกียวกับซิมนะคะ พอดีซือซิมมาใหม่ไม่สามารถเปดใช้งานได้ | ["report", "phone_issues"]
Topic Classification | เบอร์โทร โดน ระงับใช้บริการ ต้องทําอย่างไงค่ะ | ["enquire", "suspend"]
Toxic Language Detection | Prabowo Sudah Kalah Menyebut Bantuan Jokowi Hanya Pencitraan Adalah Ratapan Pilu' | [Hate speech, Hate speech Individual, Hate speech Week]
Toxic Language Detection | Wah bangke emang nih truk' | [Hate speech, Abusive, Hate speech Individual, Hate speech Week]

Figure 8: Multi-label Classification examples.
Task | Sentence 1 | Sentence 2 | Label
Textual Entailment | Làm sao anh biết ? Tất cả đây là thông tin của họ lần nữa . | Thông tin này thuộc về họ . | Entailment
Textual Entailment | Conceptually kem skimming có hai kích thước cơ bản - sản phẩm và địa lý . | Sản phẩm và địa lý là những gì làm cho kem skimming làm việc . | Neutral
Textual Entailment | Vui vẻ dành cho người lớn và trẻ em . | Vui vì chỉ có trẻ con . | Contradiction

Figure 9: Pair Classification examples.

Task | Sentence 1 | Sentence 2 | Score
Multilingual STS | လူတစ်ေယာက်သည် ေဘ့စ်ေဘာအသင်းတွင် ှိေနသည်။ | လူတစ်ဦးသည် အသင်းတစ်သင်းတွင် ဘတ်စကတ်ေဘာ ကစားေနသည်။ | 2.4
Multilingual STS | Istilah benda hitam pertama kali diperkenalkan oleh Gustav Kirchhoff tahun 1860. | Istilah "benda hitam" pertama kali diperkenalkan oleh Gustav Robert Kirchhoff pada tahun 1862. | 5
Multilingual STS | ชายคนหนึงในรถสีเขียวกําลังทําความสะอาดถนนในเมือง | ชายและหญิงสามคนข้ามถนนในเมืองทีพลุกพล่าน | 1.7
Cross-lingual STS | This triggered a revolution in the earth sciences. | இக் ேகாட் பாடு புவ அறவயல் துைறகளில் புரட் சகரமான மாற் றங் கைள ஏற் படுத் தற் று. | 4
Cross-lingual STS | The up-regulation of miR-146a was also detected in cervical cancer tissues. | miR-146a ၏အသံုးအှန်းသည်သားအိမ်ေခါင်းကင်ဆာ တွင်ထိန်းချပ်ိုင်သည်ကိုေတှိရသည်။ | 4
Cross-lingual STS | A person is on a baseball team. | มีคนเล่นบาสเก็ตบอลในทีม | 2.4

Figure 10: STS examples.

Task | Text | Cluster
Topic Clustering | ဤရာသီ၏ ေရာဂါြဖစ်ပွားမ ကနဦးလူနာများသည် ဇူလိုင်လ အေှာင်းပိုင်းတွင် ေပလာက သည်။ | health
Topic Clustering | အပင်များသည် အစာချက်ြခင်းကို ေနမှတဆင့်ြပလုပ်သည်။ အရိပ်လဲေပးပါသည်။ | science/technology
Topic Clustering | ဤစာရွက်စာတမ်းများကို ဂုဏ်ြပရန် ိုင်ငံြခား အစိုးရများ၏ စိတ်ထက်သန်မမှာ ေြပာင်းလဲ ိုင်ပါသည်။ | politics
Topic Clustering | ဝါှင်တန်၏ အတလန်နာသရက်ှာကို ၅-၃ ြဖင့် အိုင်ရေသာ ပွဲတွင် ၂ ဂိုးသွင်းပီး ၂ ဂိုး ဖန်တီးေပးခဲ့သည်။ | sports

Figure 11: Clustering examples.
Task | First set sentence | Second set sentence
Cross-lingual pairing | Paris is the most beautiful city in the world | Paris adalah kota tercantik di dunia.
Dialect pairing | Andrea Maisi đã mở tỉ số cho Ý ở phút thứ tư với một quả try. | ແອນເດຣຍ ມາຊີ ໄດ້ເປດການທໍາຄະແນນໃນນາທີທີສີໃຫ້ແກ່ອິຕາລີ.
Written-forms pairing | โรซาลีเล่าว่า "คุณต้องรู้ถึงอันตรายต่าง ๆ และดูว่าคุณพอจะทําอะไรได้บ้าง ทีปรึกษาของฉัน ตอนทีเราดู การปะทุขนาดเล็กของภูเขาไฟเอตนา มีเศษวัตถุขนาดเล็กตกลงมา เขาจะบอกเราให้เข้าไปเก็บตัวอย่าง คุณต้องได้รับการฝกอย่างดี อย่าวิง ให้อยู่กับที และมองขึนด้านบน ถ้ามีวัตถุขนาดใหญ่ตกลงมาใส่ ก็จะได้หลบออกด้านข้าง" สําหรับผู้ทีต้องการชมภูเขาไฟคุกรุ่น โรซาลี แนะนําให้ไป วานูอาตู มีภูเขาไฟชือ ยาซูร์ ซึงมีการปะทุขนาดเล็ก คล้ายกับดอกไม้ไฟ มีความสวยงาม และเปนภูเขาไฟทีไปง่าย เธอบอกว่า สามารถขับรถขึนไปเกือบถึงปากปล่อง จากนันก็มีบันไดคอนกรีต ทีสามารถเดินขึนไปได้ และยังมีม้านังให้นังเล่นด้วย นอกจากภูเขาไฟบนโลก เธอยังพบภูเขาไฟทีคุกรุ่น 71 ลูก บนดวงจันทร์ไอโอของดาวพฤหัสฯ ด้วย | โรซาลี โลเปส นักภูเขาไฟวิทยาของนาซาโปรดปรานการปนภูเขาไฟทียังคุกรุ่นอยู่ เพือไปชมการปะทุขนาดเล็ก โดยเธอได้เยือนภูเขาไฟทีคุกรุ่นในทุกทวีปทัวโลกมาแล้ว 63 ลูก

Figure 12: Bitext mining examples.
Task | Query | Relevant Document
Article Retrieval | Hà Nội: Đưa vào hoạt động trạm biến áp 110kV Bắc Thành Công | Việc đầu tư dự án 'Xây dựng mới Trạm 110kV Bắc Thành Công và nhánh rẽ' sẽ góp phần giảm được tổn hao công suất và điện năng của lưới điện trong khu vực, nâng cao chất lượng điện năng.
Long Document Retrieval | มะแว้งต้นมีประโยชน์อย่างไรในเชิงสรรพคุณ? | มะแว้งต้น ประโยชน์ดีๆ สรรพคุณเด่นๆ และข้อมูลงานวิจัย\nหน้าแรก > บทความ ทังหมด > มะแว้งต้น\nชือสมุนไพร มะแว้งต้น\nชืออืนๆ/ชือท้องถิน มะแคว้งขม, มะแคว้งดํา, มะแคว้ง (ภาคเหนือ) ,หมากแข้ง , หมากแข้งขม (ภาคอีสาน) , มะแว้ง (ภาคกลาง) , แว้งกาม (สงขลา,สุราษฎร์ธานี,ภาคใต้) , สะกังแค (กะเหรียง- แม่ฮ่องสอน) , หมากแซ้งคง (ไทยใหญ่ – แม่ฮ่องสอน , ฉาน) , เทียนเฉีย ,ชือเทียน เฉีย (จีนกลาง)\nชือวิทยาศาสตร์ Solanum indicum L. (มีหนาม) Solanum sanitwongsei (ไร้หนาม)\nชือพ้องทางวิทยาศาสตร์ Solanum violaceum (มีหนาม)\nชือสามัญ Sparrow's Brinjal , Indian nightshade\nถินกําเนิดมะแว้งต้น\nมีการคาดการณ์กันว่าถินกําเนิดดังเดิมของมะแว้งต้นนันอยู่ในเขตร้อนของทวีปเอเชียซึงอาจอยู่ในประเทศ แถบเอเชียใต้ เช่น อินเดีย บังคลาเทศ เนปาล ฯลฯ รวมถึงประเทศแถบเอเชียตะวันออกเฉียงใต้ เช่น ไทย, พม่า , ลาว ,กัมพูชา ฯลฯ ...
Question Answering | Dimana Jamie Richard Vardy lahir? | Jamie Richard Vardy (lahir dengan nama Gill; 11 January 1987) adalah pemain sepak bola Inggris yang bermain di klub Premiere League Leicester City dan tim nasional Inggris. Ia bermain sebagai striker, namun juga bisa bermain di sayap.

Figure 13: Retrieval examples.
Task | Query | Instruction | Relevant Document
Instruction Question Answering | Stellar คืออะไร | Stellar: เครือข่ายโอนเงินไร้พรมแดน เทคโนโลยีการโอนเงินระหว่างธนาคารในตอนนี ถือว่าลําหน้ามาก ๆ นะครับ ทุกวันนีเราสามารถโอนเงินจากบัญชีของเราไปยังบัญชีของธนาคารอืน ๆ ได้อย่างสะดวก รวดเร็ว และไม่มีค่าธรรมเนียมใด ๆ ซึงไม่ใช่ทุกประเทศบนโลกนีจะมีสิงอํานวยความสะดวกเหมือนกับประเทศไทยนะครับ ในประเทศอืน ๆ การโอนเงินระหว่างบัญชียังมีค่าธรรมเนียม จะถูกจะแพงก็แล้วแต่ประเทศไป และยังใช้เวลาประมาณนึงอีกด้วยครับ ... | Stellar คือ เครือข่ายการโอนเงินแบบกระจายศูนย์ (decentralized) ทีมีเปาหมายจะเปนช่องทางการจ่ายเงินทีเร็ว ปลอดภัย ไร้พรมแดน และมีค่าธรรมเนียมทีตํา ด้วยการใช้งานเทคโนโลยีบล็อกเชนทําให้ Stellar สามารถเชือมต่อทังบุคคลธรรมดากับองค์กร (เช่น ธนาคาร) และทําให้ผู้ใช้งานเหล่านีสามารถส่งผ่านสินทรัพย์ไป-มาได้อย่างรวดเร็ว Stellar มีเปาหมายทีจะ disrupt ระบบการจ่ายเงินทีใช้กันอยู่ทุกวันนี ลองคิดถึงการโอนเงินข้ามประเทศดู ทุกวันนีการโอนเงินข้ามประเทศมีค่าธรรมเนียมการโอนทีแพง ...

Figure 14: Instruction Retrieval examples.

Task | Query | Positive | Negative
Article Reranking | kapankah Radin Inten II dilahirkan? | Radin Inten II (Lampung, 1834 - Lampung, 5 Oktober 1858) adalah seorang pahlawan nasional Indonesia.\nNamanya diabadikan sebagai sebuah Bandara Radin Inten II dan perguruan tinggi IAIN Raden Intan di Lampung. | Akhirnya, Waleson menemukan cara lain. Ia berhasil memperalat Radin Ngerapat. Maka pengkhianatan pun terjadi. Radin Ngerapat mengundang Radin Inten II untuk mengadakan pertemuan. Dikatakannya bahwa ia ingin membicarakan bantuan yang diberikannya kepada Radin Inten II. Tanpa curiga, Radin Inten II memenuhi undangan itu. Pertemuan diadakan malam tanggal 5 Oktober 1856 di suatu tempat dekat Kunyanya. Radin Inten II ditemani oleh satu orang pengikutnya. Radin Ngerapat disertai pula oleh beberapa orang. Akan tetapi, di tempat yang cukup tersembunyi, beberapa orang serdadu Belanda sudah disiapkan untuk bertindak bila diperlukan. Radin Ngerapat mempersilahkan Radin Inten II dan pengiringnya memakan makanan yang sengaja dibawanya terlebih dahulu.

Figure 15: Reranking examples.
Type | Name | Languages | Domains | Sample creation | Annotation creators | License

Classification
ABUSIVE (Ibrohim and Budi, 2018) | ['ind'] | ['Social', 'Written'] | found | human-annotated | CC BY-SA 4.0
AbusiveNewsComment (Kiasati Desrul and Romadhony, 2019) | ['ind'] | ['Social', 'Web', 'News', ...] | found | human-annotated | CC BY-SA 4.0
BookmebusReviews | ['khm'] | ['Reviews', 'Written'] | found | human-annotated |
Clickbait* (William and Sari, 2020) | ['ind'] | ['News', 'Written'] | found | expert-annotated |
CodeMixed (Tho et al., 2021) | ['ind'] | ['Social', 'Web'] | found | manual curation | CC BY 3.0
CyberbullyingLGBT | ['tha'] | ['Social', 'Written'] | found | derived |
Depression (Hämäläinen et al., 2021) | ['tha'] | ['Social', 'Web', 'News', ...] | found | human-annotated | CC BY-NC-ND 4.0
EMoTES3K (Catapang and Visperas, 2023) | ['fil'] | ['Morality', 'Written'] | found | human-annotated | Apache license 2.0
Emoji | ['tha'] | ['Social', 'Written'] | found | human-annotated | GPL-3.0
EmoT (Mei Silviana Saputri and Adriani, 2018) | ['ind'] | ['Social', 'Written'] | found | human-annotated | MIT
EmotionOpinion (Riccosan et al., 2022) | ['ind'] | ['Social', 'Written'] | found | human-annotated | CC BY-SA 4.0
EmotCMT (Yulianti et al., 2021) | ['ind'] | ['Social', 'Written'] | found | derived | MIT
Fakenews (Cruz et al., 2020b) | ['fil'] | ['News', 'Written'] | found | human-annotated |
GeneralAmy (Phatthiyaphaibun et al., 2023) | ['tha'] | ['Social', 'Written'] | found | human-annotated | CC BY 3.0
GeneratedReviewsENTH (Lowphansirikul et al., 2022) | ['tha'] | ['Conversation', 'Web', 'Written', ...] | found | human-annotated | CC BY-SA 4.0
GKLMIPSentiment (Jiang et al., 2021b) | ['mya'] | ['Social', 'Web', 'Written'] | found | derived |
GooglePlayReview | ['ind'] | ['Reviews', 'Written'] | found | human-annotated | CC BY 4.0
HateSpeech (Alfina et al., 2017) | ['ind'] | ['Social', 'Written'] | found | human-annotated |
HateSpeech* | ['fil'] | ['Social', 'Written'] | found | human-annotated | Apache license 2.0
HoaxNews (Pratiwi et al., 2017) | ['ind'] | ['News', 'Written'] | found | human-annotated | CC BY 4.0
HSDNofaaulia (Aulia and Budi, 2019) | ['fil'] | ['Social', 'Written'] | found | human-annotated |
IMDB (Maas et al., 2011) | ['ind'] | ['Reviews', 'Written'] | found | human-annotated |
Indonglish (Astuti et al., 2023) | ['ind'] | ['Social', 'Written'] | found | expert-annotated |
JaDiIde (Hidayatullah et al., 2020) | ['ind'] | ['Social', 'Written'] | found | derived |
Karonese (Sitepu et al., 2024) | ['ind'] | ['Social', 'Web'] | found | derived |
KhineMyanmarNews (Khine et al., 2017) | ['mya'] | ['News', 'Written'] | found | derived | GPL-3.0
Krathu500 | ['tha'] | ['Social', 'Web', 'News', ...] | found | human-annotated |
LazadaReview | ['fil'] | ['Reviews', 'Written'] | found | derived |
LEMSentiment (Koto et al., 2020b) | ['ind'] | ['Social', 'Review', 'Written'] | found | human-annotated | CC BY-SA 4.0
LimeSoda (Payoungkhamdee et al., 2021) | ['tha'] | ['Healthcare', 'Written'] | found | human-annotated | CC BY 4.0
MADLAD400 (Kudugunta et al., 2023) | ['tet'] | ['Web'] | found | derived | ODC-BY
MassiveIntent* (FitzGerald et al., 2022) | ['ind', 'tha', 'vie', ...] | ['Spoken'] | found | human-annotated | CC BY 4.0
MassiveScenario* (FitzGerald et al., 2022) | ['ind', 'tha', 'vie', ...] | ['Spoken'] | found | human-annotated | CC BY 4.0
Minang (Koto and Koto, 2020) | ['ind'] | ['Encyclopedic', 'Written'] | found | derived | MIT
MultiLingualSentiment* (Mollanorozy et al., 2023) | ['ind', 'tha', 'vie'] | ['Reviews', 'Written'] | found | derived |
MurasuNews | ['tam'] | ['News', 'Written'] | found | derived | CC0
News (Khine et al., 2017) | ['mya'] | ['News', 'Written'] | found | derived | GPL-3.0
News | ['zsm'] | ['News', 'Written'] | found | derived |
News | ['khm'] | ['Encyclopedic', 'Web', 'News', ...] | found | derived |
News | ['tam'] | ['News', 'Written'] | found | derived | CC BY-SA 4.0
News (Phatthiyaphaibun, 2025) | ['lao'] | ['News', 'Written'] | found | derived |
NewsDataset | ['ind'] | ['News', 'Written'] | found | derived |
NusaX (Winata et al., 2023) | ['ind'] | ['Social', 'Economics', 'Healthcare', ...] | found | expert-annotated | CC BY-SA 4.0
PhoATIS (Dao et al., 2021) | ['vie'] | ['Spoken'] | found | expert-annotated |
PHElectionsSA | ['fil'] | ['Social'] | found | human-annotated |
PHElectionsTD | ['fil'] | ['Social'] | found | human-annotated |
Profanity (Galinato et al., 2023) | ['fil'] | ['Social'] | found | human-annotated |
ReviewShopping (Phatthiyaphaibun et al., 2023) | ['tha'] | ['Reviews', 'Written'] | found | human-annotated | CC BY 3.0
SIB200 (Adelani et al., 2023) | ['ind', 'tha', 'vie', ...] | ['News', 'Written'] | found | expert-annotated | CC BY-SA 4.0
SEATranslationeseResampled (Lovenia et al., 2024) | ['ind', 'tha', 'vie', ...] | ['News', 'Social', 'Culture', ...] | found | derived | Apache license 2.0
SentEmoMobileApps (Riccosan and Saputra, 2023) | ['ind'] | ['Reviews', 'Written'] | found | human-annotated |
SentimentAnalysis (Fe, 2019) | ['ind'] | ['Social', 'Written'] | found | derived | CC BY-NC-ND 4.0
ShopeeReviews* (Purwarianti and Crisdayanti, 2019) | ['fil'] | ['Social', 'Written'] | found | human-annotated | MPL-2.0
SMSA | ['ind'] | ['Reviews', 'Written'] | found | derived | MIT
SpamidPair (Chrismanto et al., 2022) | ['ind'] | ['Social', 'Written'] | found | human-annotated | CC BY 4.0
SpamReviews (Van Dinh et al., 2022) | ['vie'] | ['Reviews', 'Written'] | found | human-annotated | CC BY-NC 4.0
StudentFeedback (Nguyen et al., 2018b) | ['vie'] | ['Reviews', 'Written'] | found | human-annotated | MIT
Chunk 86 · 1,997 chars
al., 2022) [’ind’] [’Social’, ’Written’] found human-annotated CC BY 4.0 SpamReviews (Van Dinh et al., 2022) [’vie’] [’Reviews’, ’Written’] found human-annotated CC BY-NC 4.0 StudentFeedback (Nguyen et al., 2018b) [’vie’] [’Reviews’, ’Written’] found human-annotated MIT TCAS61 (Phatthiyaphaibun et al., 2023) [’tha’] [’Social’, ’Written’] found human-annotated CC BY 3.0 The40ThaiChildrenStories (Pasupa et al., 2016) [’tha’] [’Encyclopaedic’, ’Written’] found human-annotated ThuraMyanmarNews (Aung et al., 2025) [’mya’] [’News’, ’Written’] found derived MIT TiktokHatespeech (Hernandez Urbano Jr et al., 2021) [’fil’] [’Social’, ’Written’] found human-annotated CC BY-SA 4.0 Tweets (Juan et al., 2022) [’zsm’] [’Reviews’, ’Written’] found derived TyphoonYolandaTweets [’fil’] [’Social’, ’Written’] found human-annotated CC BY 4.0 UITViCTSD (Nguyen et al., 2021a) [’vie’] [’Social’, ’Written’] found human-annotated UITViHSD (Luu et al., 2021) [’vie’] [’Social’, ’Written’] found human-annotated UITViSFD (Luc Phan et al., 2021) [’vie’] [’Social’, ’Written’] found human-annotated UITVION (Fujita and Perez-Meana, 2021) [’vie’] [’Social’, ’Written’] found human-annotated UITVSMEC (Ho et al., 2020) [’vie’] [’Social’, ’Written’] found human-annotated VaccinesTweets [’ind’] [’Social’, ’Written’] found human-annotated ViOCD (Nguyen et al., 2021b) [’vie’] [’Reviews’, ’Written’] found human-annotated VLSP2016Sentiment (Nguyen et al., 2018a) [’vie’] [’Reviews’, ’Written’] found human-annotated WisesightSentiment (Suriyawongkul et al., 2019) [’tha’] [’Social’, ’News’, ’Written’] found expert-annotated CC0-1.0 WongnaiReviews [’tha’] [’Reviews’, ’Written’] found derived LGPL-3.0 Multi-label Classification BurmesePrachathai67k (Phatthiyaphaibun et al., 2023) [’mya’] [’News’, ’Web’, ’Written’] created human-annotated Apache license 2.0 CASA (Arfinda Ilmania and Purwarianti, 2018) [’ind’] [’Reviews’,
Chunk 87 · 1,991 chars
tated CC0-1.0 WongnaiReviews [’tha’] [’Reviews’, ’Written’] found derived LGPL-3.0 Multi-label Classification BurmesePrachathai67k (Phatthiyaphaibun et al., 2023) [’mya’] [’News’, ’Web’, ’Written’] created human-annotated Apache license 2.0 CASA (Arfinda Ilmania and Purwarianti, 2018) [’ind’] [’Reviews’, ’Written’] found human-annotated MIT Dengue (Livelo and Cheng, 2018) [’fil’] [’Social’, ’Written’] found derived GLP-3.0 GKLMIPNews (Jiang et al., 2021a) [’khm’] [’News’, ’Written’] found derived HateSpeech (Ibrohim and Budi, 2019) [’ind’] [’Social’, ’Written’] found human-annotated CC BY-SA 4.0 HoASA (A. N. Azhar and Sutiono, 2019) [’ind’] [’Reviews’, ’Written’] found human-annotated MIT Netifier (Izzan et al., 2025) [’ind’] [’Social’, ’Written’] found human-annotated CC BY-SA 4.0 Prachathai67k (Phatthiyaphaibun et al., 2023) [’tha’] [’News’, ’Web’, ’Written’] found derived Apache license 2.0 TrueVoiceIntent [’tha’] [’Conversation’] found derived VLSP2018SAHotel (Dang et al., 2022) [’vie’] [’Reviews’, ’Written’] found human-annotated VLSP2018SARestaurant (Dang et al., 2022) [’vie’] [’Reviews’, ’Written’] found human-annotated Pair Classification BurmeseXNLI (Conneau et al., 2018) [’mya’] [’Non-fiction’, ’Fiction’, ’Government’] created human-annotated CC BY-NC 4.0 IDKMRCNLI [’ind’] [’Encyclopaedic’, ’News’, ’Written’] found IndicXNLI* (Aggarwal et al., 2022) [’tam’] [’Non-fiction’, ’Fiction’, ’Government’] found expert-annotated CC BY-NC 4.0 IndoNLI* (Mahendra et al., 2021) [’ind’] [’Encyclopaedic’, ’Web’, ’News’, ...] found expert-annotated CC BY-SA 4.0 MultilingualNLI26lang2mil7 (Laurer et al., 2022) [’ind’, ’vie’] [’Non-fiction’, ’Fiction’, ’Government’] found machine-translated and reviewed MyXNLI (Htet and Dras, 2024) [’mya’] [’Non-fiction’, ’Fiction’, ’Government’] found human-annotated CC BY-NC 4.0 NewsPHNLI (Cruz et al., 2020a) [’fil’] [’News’, ’Written’] found
Chunk 88 · 1,983 chars
LI26lang2mil7 (Laurer et al., 2022) [’ind’, ’vie’] [’Non-fiction’, ’Fiction’, ’Government’] found machine-translated and reviewed MyXNLI (Htet and Dras, 2024) [’mya’] [’Non-fiction’, ’Fiction’, ’Government’] found human-annotated CC BY-NC 4.0 NewsPHNLI (Cruz et al., 2020a) [’fil’] [’News’, ’Written’] found human-annotated GPL-3.0 PAWS [’fil’] [’Web’] found human-annotated SQuADNLI [’ind’] [’Encyclopaedic’, ’News’, ’Written’] found TyDIQANLI [’ind’] [’Encyclopaedic’, ’News’, ’Written’] found WReTE (Setya and Mahendra, 2018) [’tha’] [’Encyclopaedic’, ’Web’, ’News’] found expert-annotated MIT XNLI* (Conneau et al., 2018) [’tha’, ’vie’] [’Non-fiction’, ’Fiction’, ’Government’] found expert-annotated CC BY-NC 4.0 XNLITranslated (Conneau et al., 2018) [’khm’, ’zsm’, ’lao’] [’Non-fiction’, ’Fiction’, ’Government’] machine-translated and verified machine-translated and reviewed CC BY-NC 4.0 Table 17: The datasets included in SEA-BED (part 1). -- 34 of 35 -- Type Name Languages Domains Sample creation Annotations creators License STS Biosses (So˘gancıo˘glu et al., 2017) [’tha’, ’mya’] [’Medical’] created human-annotated GPL-3.0 BiossesCrosslingual (So˘gancıo˘glu et al., 2017) [’tha’, ’mya’] [’Medical’] created human-annotated GPL-3.0 IndicCrosslingual* (Ramesh et al., 2022) [’tam’] [News, Non-fiction, Web, ...] found expert-annotated CC0-1.0 SemRel2024* (Ousidhoum et al., 2024a) [’ind’] [’Spoken’, ’Written’] found human-annotated STS17 (Cer et al., 2017) [’tha’, ’mya’] [’News’, ’Web’, ’Written’] created human-annotated STS17Crosslingual (Cer et al., 2017) [’tha’, ’mya’] [’News’, ’Web’, ’Written’] created human-annotated STS22 (Chen et al., 2022) [’tha’, ’mya’] [’News’, ’Written’] created human-annotated STS22Crosslingual (Chen et al., 2022) [’tha’, ’mya’] [’News’, ’Written’] created human-annotated STS24 (Ousidhoum et al., 2024b) [’tha’, ’mya’] [’Spoken’, ’Written’] created
Chunk 89 · 1,998 chars
ews’, ’Web’, ’Written’] created human-annotated STS22 (Chen et al., 2022) [’tha’, ’mya’] [’News’, ’Written’] created human-annotated STS22Crosslingual (Chen et al., 2022) [’tha’, ’mya’] [’News’, ’Written’] created human-annotated STS24 (Ousidhoum et al., 2024b) [’tha’, ’mya’] [’Spoken’, ’Written’] created human-annotated STS24Crosslingual (Ousidhoum et al., 2024b) [’tha’, ’mya’] [’Spoken’, ’Written’] created human-annotated STSBenchmark (Cer et al., 2017) [’ind’, ’tha’, ’vie’, ...] [’News’, ’Web’, ’Written’] machine-translated and verified machine-translated and reviewed CC BY-SA 4.0 Clustering EMoTES3K (Catapang and Visperas, 2023) [’fil’] [’Morality’, ’Written’] found human-annotated Apache license 2.0 MurasuNews [’tam’] [’News’, ’Written’] found derived CC0 News (Phatthiyaphaibun, 2025) [’lao’] [’News’, ’Written’] found derived News (Jiang et al., 2022) [’khm’] [’News’, ’Written’] found derived News (Chandra, 2020) [’ind’] [’News’, ’Written’] found derived News [’tam’] [’News’, ’Written’] found derived CC BY-SA 4.0 News (Khine et al., 2017) [’mya’] [’News’, ’Written’] found derived SIB200 (Adelani et al., 2023) [’ind’, ’tha’, ’vie’, ...] [’News’, ’Written’] found expert-annotated CC BY-SA 4.0 UITVION (Fujita and Perez-Meana, 2021) [’vie’] [’Social’, ’Written’] found human-annotated ViOCD (Nguyen et al., 2021b) [’vie’] [’Reviews’, ’Written’] found human-annotated BitextMining ALT (Riza et al., 2019) [’ind’, ’tha’, ...] [’News’, ’Written’] found expert-annotated CC BY 4.0 BibleNLP* (Akerman et al., 2023) [’ind’, ’tha’, ’vie’, ...] [’Religious’, ’Written’] found expert-annotated CC BY 4.0 Flores* (Goyal et al., 2022) [’ind’, ’tha’, ’vie’, ...] [’Non-fiction’, ’Encyclopaedic’, ’Written’] found human-annotated Embassy (Phatthiyaphaibun, 2020) [’tha’, ’lao’] [’Government’, ’News’] found human-annotated CC0-1.0 IN22Conv* (Gala et al., 2023) [’tam’] [’Social’, ’Spoken’, ’Fiction’,
Chunk 90 · 1,996 chars
ated CC BY 4.0 Flores* (Goyal et al., 2022) [’ind’, ’tha’, ’vie’, ...] [’Non-fiction’, ’Encyclopaedic’, ’Written’] found human-annotated Embassy (Phatthiyaphaibun, 2020) [’tha’, ’lao’] [’Government’, ’News’] found human-annotated CC0-1.0 IN22Conv* (Gala et al., 2023) [’tam’] [’Social’, ’Spoken’, ’Fiction’, ...] found expert-annotated CC BY 4.0 IN22Gen* (Gala et al., 2023) [’tam’] [’Web’, ’Legal’, ’Government’, ...] found expert-annotated CC BY 4.0 IndoGeneral (Guntara et al., 2020) [’ind’] [’General’, ’Writen’] found derived CC BY-SA 4.0 IndoIdentic (Gala et al., 2023) [’ind’] [’News’, ’Spoken’, ’Web’, ...] found derived IndoNLG (Cahyawijaya et al., 2021) [’ind’] [’religion’] found derived IndoNews (Guntara et al., 2020) [’ind’] [’News’, ’Written’] found derived CC BY-SA 4.0 IndoReligious (Guntara et al., 2020) [’ind’] [’Religion’, ’Writen’] found derived CC BY-SA 4.0 Liputan6 (Koto et al., 2020a) [’ind’] [’News’, ’Written’] found human-annotated CC BY-SA 4.0 MADLAD400 (Kudugunta et al., 2023) [’tet’] [’Web’] found derived ODC-BY NTREX* (Federmann et al., 2022) [’ind’, ’tha’, ’vie’, ...] [’News’, ’Written’] found expert-annotated CC BY-SA 4.0 NusaxMiners* (Winata et al., 2023) [’ind’] [’Reviews’, ’Written’] found human-annotated CC BY-SA 4.0 QED (Lamm et al., 2020) [’ind’, ’tha’, ’vie’, ...] [’Education’, ’Social’, ’Spoken’, ...] found human-annotated CC BY-SA SCBMTEnTh2020 (Lowphansirikul et al., 2022) [’tha’] [’conversation’, ’Web’, ’Government’, ...] found human-annotated CC BY-SA 4.0 SoftwareDocumentation (Buschbeck and Exel, 2020) [’ind’, ’tha’, ’vie’, ...] [’Web’, ’Product’] found expert-annotated CC BY-NC 4.0 TALPCo (Nomoto et al., 2018, 2019) [’ind’, ’tha’, ’vie’, ...] [’Conversation’, ’spoken’] found human-annotated CC BY-4.0 Tatoeba* (Tiedemann, 2020) [’ind’, ’tha’, ’vie’, ...] [’Written’] found human-annotated CC BY-2.0 TED2020 (Reimers and Gurevych, 2020) [’ind’,
Chunk 91 · 1,996 chars
uct’] found expert-annotated CC BY-NC 4.0 TALPCo (Nomoto et al., 2018, 2019) [’ind’, ’tha’, ’vie’, ...] [’Conversation’, ’spoken’] found human-annotated CC BY-4.0 Tatoeba* (Tiedemann, 2020) [’ind’, ’tha’, ’vie’, ...] [’Written’] found human-annotated CC BY-2.0 TED2020 (Reimers and Gurevych, 2020) [’ind’, ’tha’, ’vie’, ...] [’Education’, ’Social’, ’Spoken’, ...] found human-annotated CC BY-NC-ND 4.0 ThaiGov [’tha’] [’Government’, ’News’] found human-annotated PDDL USEmbassy (Phatthiyaphaibun et al., 2023) [’tha’] [’News’] found derived CC0-1.0 VSoLSCSum (Nguyen et al., 2016a) [’vie’] [’Social’, ’Written’] found human-annotated CC BY-4.0 XLSum (Hasan et al., 2021) [’ind’, ’tha’, ’vie’, ...] [’News’, ’Written’] found human-annotated CC BY-NC-SA 4.0 Retrieval ACIQuAD (Doxolodeo and Krisnadhi, 2024) [’ind’] [’Encyclopaedic’, ’Written’] found expert-annotated CC-BY 4.0 Agricutlure1K (Min Si Thu,Khin Myat Noe) [’mya’] [’Encyclopaedic’, ’Written’] found expert-annotated CC BY-SA 4.0 AskCovidDrBot (Aung and San, 2025) [’mya’] [’Encyclopaedic’, ’Written’] found human-annotated MIT ChatGPTOpenQA [’zsm’] [’Encyclopaedic’, ’Written’] found LM-generated CC BY-NC-SA 2.0 ContextSearch (Nguyen et al., 2025) [’tha’] [’STEM’, ’Humanities’, ’Social Sciences’, ...] found human-annotated MIT IAppWiki (Viriyayudhakorn and Polpanumas, 2021) [’tha’] [’Encyclopaedic’, ’Web’, ’News’] found expert-annotated MIT IDKMRC (Putri and Oh, 2022) [’tnd’] [’Encyclopaedic’, ’Written’] found human-annotated CC BY-SA 4.0 IndicQA* (Doddapaneni et al., 2022) [’tam’] [’Web’, ’Written’] machine-translated and verified human-annotated CC BY 4.0 IndoNLG (Cahyawijaya et al., 2021) [’ind’] [’Religion’, ’Writen’] found human-annotated CC BY-SA 4.0 IndoQA (Jakarta Artificial Intelligence Research, 2023) [’ind’] [’Web’] found expert-annotated CC BY-ND 4.0 MLDR (Chen et al., 2024) [’tha’] [’Encyclopaedic’, ’Written’] found
Chunk 92 · 1,914 chars
d human-annotated CC BY 4.0 IndoNLG (Cahyawijaya et al., 2021) [’ind’] [’Religion’, ’Writen’] found human-annotated CC BY-SA 4.0 IndoQA (Jakarta Artificial Intelligence Research, 2023) [’ind’] [’Web’] found expert-annotated CC BY-ND 4.0 MLDR (Chen et al., 2024) [’tha’] [’Encyclopaedic’, ’Written’] found LM-generated MIT MLQA* (Lewis et al., 2019) [’vie’] [’Encyclopaedic’, ’Written’] found human-annotated CC BY-SA 3.0 MIRACL* (Zhang et al., 2023) [’ind’, ’tha’] [’Encyclopaedic’, ’Written’] found expert-annotated Apache license 2.0 Microbiology1K (Si Thu, 2024) [’mya’] [’Encyclopaedic’, ’Written’] found human-annotated CC BY-SA 4.0 QASiNa (Rizqullah et al., 2023) [’ind’] [’Religion’, ’Writen’] found human-annotated MIT ThaiWikiQA (Trakultaweekoon et al., 2019) [’tha’] [’Encyclopaedic’, ’Written’] found human-annotated CC BY-NC-SA 3.0 TyDiQA (Clark et al., 2020) [’ind’, ’tha’] [’Encyclopaedic’, ’Written’] found human-annotated Apache license 2.0 ViQuAD2_0 (Nguyen et al., 2022) [’vie’] [’Encyclopaedic’, ’Written’] found expert-annotated MIT WangchanXLegalThaiCCLRAG (Akarajaradwong et al., 2025) [’tha’] [’Legal’, ’Written’] found human-annotated MIT XQuAD (Artetxe et al., 2019) [’tha’, ’vie’] [’Web’, ’Written’] found human-annotated CC BY-SA 4.0 Instruction Retrieval AlpacaInstruct [’ind’] None found LM-generated Apache license 2.0 Vietnamese52KAlpaca (Nhiem, 2023) [’vie’] None found LM-generated WangchanThaiInstruct [’tha’] [’Medical’, ’Finance’, ’Legal’, ...] found human-annotated CC BY-SA 4.0 WangchanXSyntheticInstructThai120k (Pengpun et al., 2024) [’tha’] [’Encyclopaedic’, ’Written’] found LM-generated MIT Reranking MIRACL (Zhang et al., 2023) [’ind’, ’tha’] [’Encyclopaedic’, ’Written’] found expert-annotated Apache license 2.0 Table 18: The datasets included in SEA-BED (part 2). -- 35 of 35 --
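Read as data, Tables 17 and 18 share a single metadata schema per dataset: name, task type, language codes, domains, how samples were created, who produced the annotations, and (where listed) a license. As a minimal, hypothetical sketch of how one might slice this catalogue, the Python snippet below models a few transcribed rows as records and filters them by language and task; the `DatasetEntry` class, the `CATALOGUE` sample, and `filter_catalogue` are illustrative assumptions introduced here, not part of the SEA-BED release.

```python
# Hypothetical sketch (not shipped with SEA-BED): model a few rows of the
# dataset tables above and filter them by language code or task type.
# Requires Python 3.9+ for built-in generic annotations like list[str].
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    name: str              # dataset name as listed in Tables 17-18
    task: str              # e.g. "Classification", "STS", "Retrieval"
    languages: list[str]   # ISO 639-3 codes, e.g. "tha", "mya"
    domains: list[str]     # e.g. ["News", "Written"]
    sample_creation: str   # "found", "created", or "machine-translated and verified"
    annotation: str        # e.g. "human-annotated", "expert-annotated", "derived"
    license: str = ""      # license string when the tables list one, else empty

# Three rows transcribed from the tables above.
CATALOGUE = [
    DatasetEntry("WisesightSentiment", "Classification", ["tha"],
                 ["Social", "News", "Written"], "found",
                 "expert-annotated", "CC0-1.0"),
    DatasetEntry("MyXNLI", "Pair Classification", ["mya"],
                 ["Non-fiction", "Fiction", "Government"], "found",
                 "human-annotated", "CC BY-NC 4.0"),
    DatasetEntry("MIRACL", "Reranking", ["ind", "tha"],
                 ["Encyclopaedic", "Written"], "found",
                 "expert-annotated", "Apache license 2.0"),
]

def filter_catalogue(entries, language=None, task=None):
    """Return entries covering `language` (if given) and matching `task` (if given)."""
    return [e for e in entries
            if (language is None or language in e.languages)
            and (task is None or e.task == task)]

if __name__ == "__main__":
    # Print the Thai-language entries among the sample rows.
    for entry in filter_catalogue(CATALOGUE, language="tha"):
        print(f"{entry.name}: {entry.task}, license: {entry.license or 'not listed'}")
```

Extending `CATALOGUE` to the full tables would make it straightforward to reproduce per-language or per-task dataset counts of the kind the benchmark overview reports.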