Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation

Summary

This paper introduces a benchmark to evaluate large language models (LLMs) on Thai local dialects—Isan, Lanna, and Dambro—across five NLP tasks: summarization, question answering, translation, conversation, and food-related queries. The benchmark uses inputs, instructions, and labels in local dialects, with human evaluation focusing on fluency and dialect-specific accuracy. Results show a significant performance drop for LLMs when processing local dialects compared to standard Thai. Only proprietary models like GPT-4o and Gemini2 demonstrate notable fluency. Traditional metrics like BLEU and ROUGE-L are inadequate due to dialect-specific spelling and tokenization issues. The proposed human evaluation metric, involving native speakers, addresses these limitations by assessing generation and fluency separately. The study highlights the need for improved dialect-aware LLMs and underscores the challenges in handling underrepresented linguistic varieties.

Peerat Limkonchotiwat1,4,*, Kanruethai Masuk2,*, Surapon Nonesung3,*,
Chalermpun Mai-On2,*, Sarana Nutanong2, Wuttikorn Ponwitayarat2,*,
Potsawee Manakul3,*
1AI Singapore, 2Vidyasirimedhi Institute of Science and Technology,
3SCB10X, 4National University of Singapore
Correspondence: peerat@aisingapore.org
Abstract
Large language models show promising results
in various NLP tasks. Despite these successes,
the robustness and consistency of LLMs in un-
derrepresented languages remain largely unex-
plored, especially concerning local dialects. Ex-
isting benchmarks also focus on main dialects,
neglecting LLMs’ ability on local dialect texts.
In this paper, we introduce a Thai local dialect
benchmark covering Northern (Lanna), North-
eastern (Isan), and Southern (Dambro) Thai,
evaluating LLMs on five NLP tasks: summa-
rization, question answering, translation, con-
versation, and food-related tasks. Furthermore,
we propose a human evaluation guideline and
metric for Thai local dialects to assess gen-
eration fluency and dialect-specific accuracy.
Results show that LLM performance declines
significantly in local Thai dialects compared
to standard Thai, with only proprietary models
like GPT-4o and Gemini2 demonstrating some
fluency1.
1 Introduction
Large language models (LLMs) play a crucial role
in natural language processing (NLP) by signifi-
cantly enhancing downstream performance across
diverse tasks and languages. Studies have shown
that LLMs surpass traditional multilingual mod-
els (Devlin et al., 2019; Conneau et al., 2020) in
tasks such as question answering and summariza-
tion (Touvron et al., 2023; Pipatanakul et al., 2023).
One key reason is that LLMs have significantly
larger parameter counts and training datasets, of-
ten exceeding traditional models by a factor of 26,
from 0.3 billion to 8 billion parameters. As a
result, LLMs demonstrate greater robustness than
traditional models across most scenarios, particularly
in low-resource languages such as Thai (Pengpun
et al., 2024; Phatthiyaphaibun et al., 2024),
Indonesian (Cahyawijaya et al., 2024, 2023), and
Tagalog (Montalan et al., 2024; Gamboa and Lee,
2025).
*Equal contributions
1Data and Evaluation Tool: https://github.com/mrpeerat/Thai_local_benchmark
Researchers are increasingly evaluating LLMs
in underrepresented languages, particularly Thai,
to assess their robustness and consistency in low-
resource languages. Thai-H6 (Kim et al., 2025)
introduces an evaluation benchmark for assessing
LLMs’ understanding of Thai cultural knowledge.
SEACrowd (Lovenia et al., 2024) introduces a
Southeast Asian benchmark for evaluating LLMs
on local and cross-lingual texts, audios, and im-
ages, focusing on natural language generation and
understanding. The Thai LLM benchmark2 extends
SEACrowd by incorporating LLM-as-a-Judge and
Thai Exam datasets for a Thai language evalua-
tion. However, these benchmarks primarily focus
on the Central Thai dialect. Therefore, challenges
observed in the Central dialect may also reflect
underlying issues in local dialects.
Although there are 69 million Thai speakers,
only about 20 million have Central Thai as their
first language. The rest primarily grow up speaking
other dialects, including approximately 15.2 mil-
lion who speak Northeastern Thai (Isan), 6 million
who speak Northern Thai (Lanna), and 4.5 million
who speak Southern Thai (Dambro) or other local
varieties. As shown in Figure 1a, a distinct fea-
ture of Thai local dialects is the blending of words
between Central Thai and local dialects, as well
as among different local dialects. Moreover, each
local dialect has distinct words, some unique to
that dialect and others shared with some dialects
but absent from the rest. Recent studies examine
the evaluation of fluency and understanding in lo-
cal dialects of Indonesian (Syarfina et al., 2022;
Cahyawijaya et al., 2023; Veria Septianingtias and
Ariyani, 2024) and Filipino (Tabajunda, 2018; Guevara
et al., 2024). Evaluating the fluency and understanding
of LLMs in Thai local dialects remains underexplored.
2https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard
arXiv:2504.05898v1 [cs.CL] 8 Apr 2025
[Figure 1, panel (a): one sentence written in several varieties, with tokens shared between local dialects underlined, tokens specific to one dialect in squares, and wh-question tokens that differ across dialects in circles.
Central Thai: พวกเราตอนเทียงมีใครสนใจไปกินส้มตําไก่ย่างทีร้านครัวคุณกรไหม?
English: Anyone interested in joining us for papaya salad and grilled chicken at Khun Korn's Kitchen for lunch?
Northern: หมู่เฮาตอนเทียงมีไผสนใจไปกินส้มตําไก่ย่างตีร้านครัวคุณกรก่อ?
Southern: โบ๋เราตอนเทียงยังใครทีอาดไปกินส้มตําไก่ย่างทีร้านครัวคุณกรมังม่าย?
Panel (b): local dialect units are identified in the example texts หมู่เฮาตอนเทียงมีไผอยากไปกินตําบักหุ่งร้านครัวคุณกรบ่? ("Guys, does anyone want papaya salad at Korn's restaurant for lunch?") and [บ่อ]มี[ไผ]อยาก ("No one wants to go."). The identified units include [หมู่เฮา] (We), [ไผ] (Who), [บ่?] and [บ่อ] (No/Or Not), and [ตําบักหุ่ง] (Papaya Salad). The example is scored Generation = 0.5 and Fluency = 1.0.]
Figure 1: (a) We show some unique characteristics of Thai local dialects compared to Central Thai: 1. shared
tokens in local dialects; 2. unique words for each local dialect; and 3. WH-question tokens that are different in each
local dialect. (b) Illustration of our local dialect evaluation metric. The example demonstrates the assessment of
generation and fluency, highlighting variations in spelling and pronunciation among different Thai dialects.
In this paper, we introduce a benchmark for
evaluating local Thai dialects–Isan, Lanna, and
Dambro–to assess LLMs’ understanding of these
dialects. Specifically, we evaluate LLMs on five
tasks – summarization, question answering, trans-
lation, conversation, and food – using instructions,
inputs, and labels written in local dialects. Addi-
tionally, we introduce human evaluation alongside
a new Thai local dialect guideline and metric to
assess LLMs’ capabilities in understanding and
generating local dialects. The new metric evaluates
the generation and fluency of local dialects, reward-
ing or penalizing the model for fluency in a manner
consistent with human speakers. Our experimental
results show that LLM performance significantly
declines on summarization and translation tasks
when inputs are written in local dialects, compared
to inputs in Central Thai. Furthermore, only pro-
prietary models–GPT and Gemini–are capable of
speaking local Thai dialects, as shown by both hu-
man evaluation and our new metric.
Our contributions are as follows: We introduce a
benchmark for evaluating three Thai local dialects.
We also propose a guideline and metric to assess
the generation and fluency of LLMs in Thai local
dialects. Lastly, we are the first to investigate Thai
local dialects in the context of LLMs, presenting
comprehensive evaluations using both traditional
evaluation metrics and human assessment.
2 Benchmark and Metric
2.1 Benchmark Formulation
Data Selection. We create a new benchmark for
evaluating three Thai local dialects–Isan, Lanna,
and Dambro–drawing on examples from the Thai
LLM leaderboard to assess model performance.
We select generation tasks—Question Answering
(QA), summarization, and translation—because
they are more complex and better suited for assess-
ing model knowledge than more straightforward
multiple-choice QA tasks. We randomly selected
20 samples for each task, resulting in 80 samples
per dialect, where the translation task was both
English to local dialects and vice versa. In addi-
tion, we include 10 topics on food (i.e., asking
for a food recipe of [FOOD_NAME]) and conver-
sations between two individuals discussing Thai
traditional and cultural topics (i.e., talking about
[TOPIC_NAME]). The full topics can be found in
Appendix A.7.
Gold Standard. Our dataset uses the gold standard
provided by the original Central Thai examples.
Note that we do not create labels for the food and
conversation topics, as these tasks are open-ended
and lack a definitive correct answer. Instead, we
rely on human evaluation to assess these tasks. The
number of samples is 100, all written in Central
Thai.
Translation. We enlisted native speakers of each
local dialect to translate the 100 samples from Cen-
tral Thai into their respective dialects. This in-
cludes translating the input, context (for QA and
summarization tasks), prompts, and labels, except
for those related to food and conversation samples.
Furthermore, all annotators are asked to review
translations completed by their peers within
the same region to ensure quality and consistency.
Therefore, the total number of samples is 400, com-
parable to previous LLM-as-a-Judge and Human
evaluation benchmarks (Zheng et al., 2023; Qin
et al., 2024). For the full information of data statis-
tics, prompts, topics, annotator demographics, and
guidelines, see Appendix A.
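The sample counts quoted in this section can be cross-checked with a small sketch. It assumes our reading of the text: four labeled settings (QA, summarization, and translation in both directions) with 20 samples each, plus 10 food and 10 conversation topics, for each of four varieties (Central Thai and the three local dialects). The constant and function names are illustrative and do not come from the released tool.

```python
# Illustrative bookkeeping of the benchmark size described in Section 2.1.
# All names below are hypothetical; only the numbers come from the text.
SAMPLES_PER_TASK = 20
LABELED_TASKS = [
    "qa",
    "summarization",
    "translation_en_to_local",
    "translation_local_to_en",
]
OPEN_ENDED_TOPICS = {"food": 10, "conversation": 10}
VARIETIES = ["Central", "Isan", "Lanna", "Dambro"]


def samples_per_variety() -> int:
    """Labeled samples plus open-ended topics for one language variety."""
    labeled = SAMPLES_PER_TASK * len(LABELED_TASKS)
    open_ended = sum(OPEN_ENDED_TOPICS.values())
    return labeled + open_ended


def total_samples() -> int:
    """Samples across Central Thai and the three local dialects."""
    return samples_per_variety() * len(VARIETIES)
```

Under these assumptions the sketch gives 100 samples per variety and 400 in total, matching the figures stated in the text.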

Model            Dialect   QA      ∆Diff    Summarization   ∆Diff   Translation (E-L)   ∆Diff   Translation (L-E)   ∆Diff
Llama-3.1-8b     Central   60.74   -        20.07           -       11.41               -       44.61               -
Typhoon1.5-8b    Central   33.36   -        11.52           -       5.14                -       47.35               -
Llama-3.1-70b    Central   56.44   -        23.03           -       18.11               -       51.07               -
Typhoon1.5-70b   Central   46.99   -        10.83           -       8.46                -       50.54               -
GPT-4o           Central   56.03   -        14.67           -       21.92               -       62.82               -
Gemini2          Central   53.73   -        17.70           -       20.76               -       58.10               -
Llama-3.1-8b     Isan      58.84   1.90     16.47           3.60    3.17                8.24    6.59                38.02
Typhoon1.5-8b    Isan      28.48   4.88     8.54            2.98    0.83                4.31    31.6                15.75
Llama-3.1-70b    Isan      57.59   ↑1.15    18.77           4.26    8.98                9.13    33.74               17.33
Typhoon1.5-70b   Isan      45.32   1.67     10.87           ↑0.04   2.82                5.64    41.81               8.73
GPT-4o           Isan      44.21   11.82    10.39           4.28    14.29               7.63    55.26               7.56
Gemini2          Isan      40.58   13.15    12.68           5.02    13.31               7.45    51.74               6.36
Llama-3.1-8b     Lanna     55.66   5.08     18.05           2.02    2.46                8.95    7.27                37.34
Typhoon1.5-8b    Lanna     24.39   8.97     7.28            4.24    0.49                4.65    18.48               28.87
Llama-3.1-70b    Lanna     50.97   5.47     17.41           5.62    3.73                14.38   15.92               35.15
Typhoon1.5-70b   Lanna     44.74   2.25     10.17           0.66    4.76                3.70    39.01               11.53
GPT-4o           Lanna     45.73   10.30    9.78            4.89    12.02               9.90    54.83               7.99
Gemini2          Lanna     40.86   12.87    11.70           6.0     7.68                13.08   47.28               10.82
Llama-3.1-8b     Dambro    61.38   ↑0.64    16.91           3.16    7.22                4.19    24.02               20.59
Typhoon1.5-8b    Dambro    31.34   2.02     9.18            2.34    4.20                0.94    27.81               19.54
Llama-3.1-70b    Dambro    66.96   ↑10.52   17.2            5.83    7.66                10.45   27.10               23.97
Typhoon1.5-70b   Dambro    52.10   ↑5.11    9.74            1.09    0.78                7.68    39.19               11.35
GPT-4o           Dambro    49.28   6.75     12.87           1.80    12.95               8.97    45.32               17.50
Gemini2          Dambro    47.91   5.82     12.07           5.63    9.13                11.63   23.46               34.64
Table 1: The main results of our benchmark. We calculate the ∆Diff by comparing Central and local dialects for the
same model; ↑ marks a local-dialect score that is higher than the Central Thai score. E denotes English and L denotes the local dialect.
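The ∆Diff column can be reproduced from the score columns. The sketch below assumes ∆Diff is the Central Thai score minus the local-dialect score for the same model and task, with an upward arrow when the local-dialect score is higher; this convention is inferred from the table values, and the function name is hypothetical.

```python
def delta_diff(central: float, local: float) -> str:
    """Format the gap between a Central Thai score and a local-dialect score.

    Assumed convention: a plain number is a drop relative to Central Thai,
    and an arrow-prefixed number is an improvement.
    """
    diff = round(central - local, 2)
    if diff < 0:
        return f"↑{abs(diff):.2f}"
    return f"{diff:.2f}"
```

For example, Llama-3.1-8b on QA (60.74 Central, 58.84 Isan) yields a drop of 1.90, while Llama-3.1-70b on QA (56.44 Central, 57.59 Isan) yields an improvement of 1.15.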
2.2 Human Evaluation for Local Dialects
Traditional evaluation metrics, such as BLEU (Pa-
pineni et al., 2002) and ROUGE-L (Lin, 2004),
struggle to assess dialectal text accurately. A key
challenge is their reliance on tokenization, which
fails when dialect words are missing from standard
dictionaries, resulting in inaccurate segmentation.
Furthermore, the absence of a standardized writ-
ing system leads to multiple valid spellings for the
same word. Therefore, these traditional metrics
that compare generated text to a fixed reference fail
to account for such variations, often misclassify-
ing correct responses as errors. These limitations
underscore the urgency of developing more flex-
ible evaluation methods that reflect the linguistic
diversity of dialects.
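To make the spelling problem concrete, the following minimal sketch computes a simplified ROUGE-L score (the F1 of longest-common-subsequence precision and recall) over tokens that are assumed to be segmented already. It is not the evaluation code used in the paper, the hand segmentation of the example is our own assumption, and the original ROUGE-L definition uses a weighted F-measure rather than plain F1.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hand-segmented toy example (an assumption): the candidate differs from the
# reference only in the spelling of the first token.
reference_tokens = ["บ่", "มี", "ไผ", "อยาก", "ไป"]
candidate_tokens = ["บ่อ", "มี", "ไผ", "อยาก", "ไป"]
spelling_variant_score = rouge_l_f1(reference_tokens, candidate_tokens)
```

Because the metric compares token strings exactly, the alternative spelling counts as a mismatch and the score falls below 1.0 even though a native speaker may read both forms as the same word.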
To bridge these gaps, we define our local dialect
evaluation metric as follows.
1 Annotators evaluate whether the generated out-
put includes accurate local units, covering all
forms of writing: nouns, phrases, and sentences.
2 Different dialect spellings are allowed if they
share the

Chunk 7 · 1,989 chars

guistic
diversity of dialects.
To bridge these gaps, we conduct our local di-
alect evaluation metric as follows.
1 Annotators evaluate whether the generated out-
put includes accurate local units, covering all
forms of writing: nouns, phrases, and sentences.
2 Different dialect spellings are allowed if they
share the same pronunciation. We ensure this
by transliterating the word and comparing its
phonetic transcriptions.
3 We divide our assessment into two categories:
language Generation (similar to recall) and lan-
guage Fluency (similar to precision). First, an-
notators (native local dialect speakers) evaluate
language Generation by assigning a score of 0
(no local dialect generated), 0.5 (partially gen-
erated), or 1 (fully generated). Next, annotators
assess the correctness and naturalness (Fluency)
of the generated dialectal words using the same
scoring system as Generation: 0 (completely
incorrect), 0.5 (partially correct), and 1 (fully
correct and natural). The final score is calculated
by the average over two annotators.
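The scoring step in item 3 can be sketched as follows. This is a minimal illustration, assuming each annotator gives one Generation rating and one Fluency rating per output from the set {0, 0.5, 1} and that the final score is the mean over annotators; the function name and the dictionary output are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence

ALLOWED_RATINGS = {0.0, 0.5, 1.0}


def final_scores(
    generation_ratings: Sequence[float],
    fluency_ratings: Sequence[float],
) -> Dict[str, float]:
    """Average per-annotator Generation and Fluency ratings for one output."""
    if not generation_ratings or not fluency_ratings:
        raise ValueError("each category needs at least one rating")
    for rating in (*generation_ratings, *fluency_ratings):
        if rating not in ALLOWED_RATINGS:
            raise ValueError(f"rating must be 0, 0.5, or 1, got {rating}")
    return {
        "generation": mean(generation_ratings),
        "fluency": mean(fluency_ratings),
    }
```

If both annotators rate an output 0.5 for Generation and 1.0 for Fluency, the result is a Generation score of 0.5 and a Fluency score of 1.0, the values shown in the Figure 1(b) example.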
Figure 1(b) illustrates an example of our evalua-
tion using the proposed metrics. Some extracted
units have the same meaning but are written differ-
ently. Both writings are considered correct because
they share the same pronunciation. In addition, the
averages of annotator agreements (Appendix A.4)
from three dialects are 0.7969 and 0.7449 for Gen-
eration and Fluency, respectively. We discussed
the possibility of replacing humans with LLMs in
Appendix B.
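The pronunciation rule for spelling variants could be approximated as below. The sketch assumes a transliteration function is available; the lookup table here is a toy stand-in with hypothetical, simplified romanizations chosen only for illustration, and it is not the transliteration procedure used by the annotators.

```python
from typing import Callable

# HYPOTHETICAL, simplified romanizations for illustration only.
# Tone and vowel length are deliberately ignored in this toy table.
TOY_TRANSCRIPTIONS = {
    "บ่": "bo",
    "บ่อ": "bo",
    "ไผ": "phai",
    "ใคร": "khrai",
}


def toy_transliterate(word: str) -> str:
    """Return the toy transcription, or the word itself if it is unknown."""
    return TOY_TRANSCRIPTIONS.get(word, word)


def same_unit(
    a: str,
    b: str,
    transliterate: Callable[[str], str] = toy_transliterate,
) -> bool:
    """Treat two spellings as the same unit if their transcriptions match."""
    return a == b or transliterate(a) == transliterate(b)
```

With this toy table, the two spellings of the negation particle are accepted as the same unit, while the Isan and Central Thai words for "who" are kept apart. A real setup would need a transliteration resource that covers dialect vocabulary.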
3 Experimental Setups and Results
3.1 Setup
To evaluate Thai local dialects, we use well-known
Thai and multilingual LLMs with 8 and 70 billion
parameters, namely Typhoon1.5 3 and Llama3.1 4.
We also evaluate proprietary LLMs, such as GPT-
4o and Gemini2, in our benchmark. We use BLEU
for translation and ROUGE-L for QA and summa-
rization according to the original benchmark (Love-
nia et al., 2024), including the generation setting for
a fair comparison with previous benchmark

Chunk 8 · 1,992 chars

Typhoon1.5 3 and Llama3.1 4.
We also evaluate proprietary LLMs, such as GPT-
4o and Gemini2, in our benchmark. We use BLEU
for translation and ROUGE-L for QA and summa-
rization according to the original benchmark (Love-
nia et al., 2024), including the generation setting for
a fair comparison with previous benchmark works.
3https://huggingface.co/collections/scb10x/typhoon-15x-6648d7b07ab33d141d6648b6
4https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f


A vs B                            Isan                  Lanna                 Dambro
                                  A    B    Both  None  A    B    Both  None  A    B    Both  None
Typhoon1.5-8b vs Llama3.1-8b      1    -    -     19    -    -    -     20    1    -    -     19
Typhoon1.5-70b vs Llama3.1-70b    -    1    -     19    2    -    -     18    4    -    -     16
Gemini2 vs Typhoon1.5-70b         20   -    -     -     19   -    -     1     16   2    1     1
Gemini2 vs GPT-4o                 4    3    13    -     18   -    2     -     16   2    2     -
Table 2: Human fluency preference on conversation and food topics. The full details are described in Section 3.3.
3.2 Traditional Metric Results
Table 1 shows a significant decline in translation
performance for both 8B and 70B models, with the
largest drop observed in Lanna. Results in Isan and
Dambro were more mixed. In particular, QA per-
formance improved in both Dambro and Isan. One
possible explanation is that ROUGE-L measures
only text overlap, assessing correctness based on
matching words rather than fluency. Appendix C
provides LLMs’ outputs, revealing that ROUGE-L
assigned high scores even when Llama-3.1-70B
failed to produce local dialects. In particular, none
of the models produced responses in Thai local
dialects, even when explicitly prompted to do so. If
reference labels are not written in local dialects,
models may still achieve high scores despite failing
to generate dialectal text; Table A.2 shows that
QA and summarization exhibit higher word overlap
between Central Thai and local dialects than translation does.
This highlights the need for alternative evaluation
techniques that assess fluency rather than relying
solely on exact match metrics.
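The word-overlap argument can be illustrated with a small sketch. It assumes overlap is measured as the share of distinct local-dialect reference tokens that also occur in the Central Thai reference; the exact definition behind Table A.2 is not given in this section, so both the formula and the hand-segmented example tokens are assumptions.

```python
from typing import Sequence


def token_overlap(central_tokens: Sequence[str], local_tokens: Sequence[str]) -> float:
    """Share of distinct local-dialect tokens that also appear in Central Thai."""
    local = set(local_tokens)
    if not local:
        return 0.0
    return len(local & set(central_tokens)) / len(local)


# Hand-segmented toy references loosely based on the Figure 1 example.
central_reference = ["พวกเรา", "มี", "ใคร", "ไป", "กิน", "ไหม"]
isan_reference = ["หมู่เฮา", "มี", "ไผ", "ไป", "กิน", "บ่"]
example_overlap = token_overlap(central_reference, isan_reference)
```

In this toy case half of the dialect tokens are shared with Central Thai, so a model that answers in Central Thai can still match a sizable part of the dialect reference under an overlap-based metric.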
3.3 Human Evaluation
To verify that existing models fail to generate Thai
local dialects, and that traditional metrics do not
capture this failure, we conduct a human evaluation
focused on fluency in food and conversation topics.
We recruit three native speakers per dialect to
assess the fluency of LLM outputs, selecting one
of four options: (A) for Prefer A, (B) for Prefer B,
(Both) for liking both models equally, or (None)
for disliking both outputs. In addition,
we average the answers from the three annotators to
formulate the final answer. The annotator guideline
can be found in Appendix A.6.
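The text does not spell out how categorical answers from three annotators are averaged. One plausible reading is a majority vote, sketched below; this is a swapped-in technique under that assumption, not the confirmed procedure, and the "Tie" outcome and the function name are hypothetical.

```python
from collections import Counter
from typing import Sequence

OPTIONS = ("A", "B", "Both", "None")


def aggregate_preference(votes: Sequence[str]) -> str:
    """Majority vote over annotator choices; returns "Tie" without a winner."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    ranked = counts.most_common()
    top_option, top_count = ranked[0]
    # A tie occurs when the runner-up has as many votes as the leader.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "Tie"
    return top_option
```

With three annotators, two matching votes decide the outcome, and three different votes produce a tie that would need a separate resolution rule.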
Table 2 confirms that all open-source models fail
to generate Thai local dialects, as expected. Typhoon
and Llama received the highest number of
"None" ratings, meaning annotators disliked both outputs. For
the Dambro dialect, Typhoon demonstrates some
ability to generate local dialects, aligning with Ta-
ble 1, where it showed improvement in Dambro
texts. Gemini generates local dialects more flu-
ently and outperforms other models in nearly every
comparison. However, these results reflect only a
preference-based evaluation, while the actual flu-
ency of LLMs in local dialects remains unmeasured.
This underscores the need for an alternative metric
to assess dialect fluency beyond exact match scores
or subjective human ratings.
3.4 Our Local Dialect Metric
To better assess LLMs’ fluency and generative abil-
ity in local dialects, we evaluate them using our pro-
posed metric. As described in Section 2.2, two na-
tive speakers evaluate Typhoon1.5-70B and Gem-
ini2 for QA, summarization, and translation.
Table 3 shows that when we focus only on
evaluating the generation and fluency performance in
Thai local dialects, only Gemini2 speaks Thai local
dialects. Although Typhoon1.5-70b outperformed
Gemini2 on the QA task under the traditional
metric in Table 1, we found a significant gap
between these models. Gemini2 achieves
∼92.58 points in the Lanna dialect and ∼84.49
points in Isan, while Typhoon achieves less
than 31 points and 12 points in Dambro and Lanna,
respectively. Moreover, these results agree with
Table 2, where Gemini2 is preferred over
Typhoon1.5, while offering a deeper analysis through
separate Generation and Fluency scores. This
emphasizes the robustness of our metric, which
can measure the ability to speak the local dialect
fluently, unlike the traditional metrics (Table 1) and
human preference (Table 2).
Model     Isan                   Lanna                  Dambro
          G.     F.     Avg.     G.     F.     Avg.     G.     F.     Avg.
Typhoon   10.34  15.74  13.04    10.71  13.22  11.97    30.50  30.00  30.25
Gemini2   86.57  82.41  84.49    91.33  93.99  92.58    75.50  68.50  71.83
Table 3: The results from our proposed metric. Note that
G. is Generation and F. is Fluency.
4 Conclusion
Our study highlights the significant performance
gap in LLMs when processing Thai local dialects.
While proprietary models like GPT-4o and Gemini2
demonstrate some capability in understanding and
generating local dialects, most LLMs struggle with
fluency and accuracy. Our proposed benchmark,
human evaluation, and new guideline and metrics
provide valuable insights into these limitations,
paving the way for future improvements in
multilingual and dialect-aware LLMs.
Limitation
• We select only the main local dialects in Thailand
since some dialects do not have a writing
system and exist only in spoken form.
• We experiment only with top-performing
Thai LLMs, such as Typhoon1.5, Llama3.1,
Gemini2, and GPT-4o. We acknowledge other
Thai LLMs, such as WangchanLION and
OpenThaiGPT. However, we found that only
the Typhoon model is available with both 8 and 70
billion parameters, covering all experiments in our
paper.
• Extending the proposed metric to other lan-
guages. We acknowledge that our paper does
not explain extending our metric to other lan-
guages. Our proposed metric can extend to
other languages when human annotators who
can speak local dialects are available. We will
study this in our future work.
• Limited number of annotators. We use only
five annotators for each local dialect. How-
ever, the agreement score demonstrated a high
agreement in the annotation part.
Ethical Statement
All annotators are volunteers. In addition, we
provide the annotator details in Appendix A.3.
Moreover, we will release our Thai local dialect
data and human evaluation results, for both the training
and our human evaluation methods, to the public
under the original license of each dataset, such as
XL-SUM (Hasan et al., 2021) and iAPP (Viriyayudhakorn
and Polpanumas, 2021). In addition, the data
that we created by hand will be released under
the CC-BY-SA license, similar to
previous works.
Acknowledgement
This research is supported by the National Re-
search Foundation, Singapore under its National
Large Language Models Funding Initiative. Any
opinions, findings and conclusions or recommen-
dations expressed in this material are those of the
author(s) and do not reflect the views of National
Research Foundation, Singapore. We also would
like to thank all the annotators in this work for their
volunteer work.
References
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji,
Genta Winata, Bryan Wilie, Fajri Koto, Rahmad
Mahendra, Christian Wibisono, Ade Romadhony,
Karissa Vincentio, Jennifer Santoso, David Moel-
jadi, Cahya Wirawan, Frederikus Hudi, Muham-
mad Satrio Wicaksono, Ivan Parmonangan, Ika Al-
fina, Ilham Firdausi Putra, Samsul Rahmadani, Yu-
lianti Oenang, Ali Septiandri, James Jaya, Kaustubh
Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith
Stevens, Made Nindyatama Nityasya, Muhammad
Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru,
Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu,
Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ich-
wanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Gra-
ham Neubig, Timothy Baldwin, Sebastian Ruder, Pas-
cale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Pur-
warianti. 2023. NusaCrowd: Open source initiative
for Indonesian NLP resources. In Findings of the As-
sociation for Computational Linguistics: ACL 2023,
pages 13745–13818, Toronto, Canada. Association
for Computational Linguistics.
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki
Putri, Wawan Cenggoro, Jhonson Lee, Salsabil Ak-
bar, Emmanuel Dave, Nuurshadieq Nuurshadieq,
Muhammad Mahendra, Rr Putri, Bryan Wilie, Genta
Winata, Alham Aji, Ayu Purwarianti, and Pascale
Fung. 2024. Cendol: Open instruction-tuned genera-
tive large language models for Indonesian languages.
In Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 14899–14914, Bangkok, Thai-
land. Association for Computational Linguistics.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Lin-
guistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.
Lance Calvin Lim Gamboa and Mark Lee. 2025. Fil-
ipino benchmarks for measuring sexist and homo-
phobic bias in multilingual language models from
Southeast Asia. In Proceedings of the First Workshop
on Language Models for Low-Resource Languages,
pages 123–134, Abu Dhabi, United Arab Emirates.
Association for Computational Linguistics.
Rowena Cristina L. Guevara, Rhandley D. Cajote,
Michael Gringo Angelo R. Bayona, and Crisron
Rudolf G. Lucas. 2024. Philippine languages
database: A multilingual speech corpora for devel-
oping systems for low-resource languages. In Pro-
ceedings of the 3rd Annual Meeting of the Special
Interest Group on Under-resourced Languages @
LREC-COLING 2024, pages 264–271, Torino, Italia.
ELRA and ICCL.
Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Is-
lam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang,
M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-
sum: Large-scale multilingual abstractive summariza-
tion for 44 languages. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021,
pages 4693–4703, Online. Association for Computa-
tional Linguistics.
Dahyun Kim, Sukyung Lee, Yungi Kim, Attapol Ruther-
ford, and Chanjun Park. 2025. Representing the
under-represented: Cultural and core capability
benchmarks for developing Thai large language mod-
els. In Proceedings of the 31st International Con-
ference on Computational Linguistics, pages 4114–
4129, Abu Dhabi, UAE. Association for Computa-
tional Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana
Akbar, Lester James Validad Miranda, Jennifer
Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek
Mansurov, Joseph Marvin Imperial, Onno P. Kamp-
man, Joel Ruben Antony Moniz, Muhammad
Ravi Shulthan Habibi, Frederikus Hudi, Jann Rai-
ley Montalan, Ryan Ignatius Hadiwijaya, Joan-
ito Agili Lopo, William Nixon, Börje F. Karls-
son, James Jaya, Ryandito Diandaru, Yuze Gao,
Patrick Amadeus Irawan, Bin Wang, Jan Chris-
tian Blaise Cruz, Chenxi Whitehouse, Ivan Halim
Parmonangan, Maria Khelli, Wenyu Zhang, Lucky
Susanto, Reynard Adha Ryanda, Sonny Lazuardi Her-
mawan, Dan John Velasco, Muhammad Dehan Al
Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah
Flynn, Muhammad Farid Adilazuarda, Haochen Li,
Johanes Lee, R. Damanhuri, Shuo Sun, Muham-
mad Reza Qorib, Amirbek Djanibekov, Wei Qi
Leong, Quyet V. Do, Niklas Muennighoff, Tan-
rada Pansuwan, Ilham Firdausi Putra, Yan Xu,
Tai Ngee Chia, Ayu Purwarianti, Sebastian Ruder,
William Chandra Tjhi, Peerat Limkonchotiwat, Al-
ham Fikri Aji, Sedrick Keh, Genta Indra Winata,
Ruochen Zhang, Fajri Koto, Zheng Xin Yong, and
Samuel Cahyawijaya. 2024. SEACrowd: A multi-
lingual multimodal data hub and benchmark suite
for Southeast Asian languages. In Proceedings of
the 2024 Conference on Empirical Methods in Natu-
ral Language Processing, pages 5155–5203, Miami,
Florida, USA. Association for Computational Lin-
guistics.
Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong,
Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji,
and William Chandra Tjhi. 2024. Kalahi: A handcrafted,
grassroots cultural LLM evaluation suite for Filipino.
Preprint, arXiv:2409.15380.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet,
and Peerat Limkonchotiwat. 2024. Seed-free synthetic data
generation framework for instruction-tuning LLMs: A case study
in Thai. In Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics, ACL 2024 -
Student Research Workshop, Bangkok, Thailand, August 11-16,
2024, pages 438–457. Association for Computational Linguistics.
Wannaphong Phatthiyaphaibun, Surapon Nonesung,
Peerat Limkonchotiwat, Can Udomcharoenchaikit,
Jitkapat Sawatphol, Ekapol Chuangsuwanich, and
Sarana Nutanong. 2024. CHIE: Generative MRC
evaluation for in-context QA with correctness, help-
fulness, irrelevancy, and extraneousness aspects. In
Proceedings of the 2nd GenBench Workshop on Gen-
eralisation (Benchmarking) in NLP, pages 154–164,
Miami, Florida, USA. Association for Computational
Linguistics.
Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee
Manakul, Sittipong Sripaisarnmongkol, Ruangsak
Patomwong, Pathomporn Chokchainant, and Kasima
Tharnpipitchai. 2023. Typhoon: Thai large language
models. Preprint, arXiv:2312.13951.
Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao,
Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei
Liu, Pengfei Liu, and Dong Yu. 2024. InFoBench:
Evaluating instruction following ability in large lan-
guage models. In Findings of the Association for
Computational Linguistics: ACL 2024, pages 13025–
13048, Bangkok, Thailand. Association for Compu-
tational Linguistics.
Tengku Syarfina, Satwiko Budiono, and Denny Adrian
Nurhuda. 2022. Language variations in Jayapura City:
A study of dialectology. EAI.
Dastin Tabajunda. 2018. Linguistic schoolscape as
public communication: A study of announcements
and signages in De La Salle University – Dasmariñas.
In Proceedings of the 32nd Pacific Asia Conference
on Language, Information and Computation, Hong
Kong. Association for Computational Linguistics.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
Grave, and Guillaume Lample. 2023. Llama: Open
and efficient foundation language models. Preprint,
arXiv:2302.13971.
Tajudin Nur Veria Septianingtias, Wahya and Farida
Ariyani. 2024. Lexical variation in the Lampung
language, Indonesia. Cogent Arts & Humanities,
11(1):2309740.
Kobkrit Viriyayudhakorn and Charin Polpanumas. 2021.
iapp_wiki_qa_squad.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang,
Joseph E. Gonzalez, and Ion Stoica. 2023. Judging
LLM-as-a-judge with MT-bench and chatbot arena.
In Thirty-seventh Conference on Neural Information
Processing Systems Datasets and Benchmarks Track.
A Data Statistics and Annotators
A.1 Data Statistics
As shown in Table 4, we have a total of 100 samples for each local dialect.
We use summarization, QA, and translation for the traditional metrics,
BLEU and ROUGE-L. In contrast, we use the conversation and food topics for
human evaluation because, unlike the previous tasks, they have no single
correct answer. The food and conversation topics are also culturally
relevant: the food topics cover local dishes, such as those of Southern or
Northern Thailand, and the conversation topics cover events and culture in
each part of Thailand. Translation has 40 samples because it covers both
forward and backward translation. The QA samples are from iAPP
(Viriyayudhakorn and Polpanumas, 2021), the summarization samples are from
XL-SUM (Hasan et al., 2021), and the translation, food, and conversation
samples are formulated by annotators.
Dialect 	Summarization 	QA 	Translation 	Conversation 	Food
Isan 	20 	20 	40 	10 	10
Lanna 	20 	20 	40 	10 	10
Dambro 	20 	20 	40 	10 	10
Table 4: Data statistics: number of samples per task for each dialect.
A.2 Word Overlap
As shown in Table 5, translation has the least word overlap for Isan and
Lanna, while for Dambro the QA overlap is marginally lower (0.9008 vs.
0.9036). This is consistent with the results in Table 1, where the least
word overlap yields the lowest penalty, directly affecting downstream task
performance. However, the word-overlap score still cannot explain the
fluency of LLMs in Thai local dialects.
Dialect 	Summarization 	QA 	Translation
Isan 	0.9628 	0.9167 	0.9067
Lanna 	0.9767 	0.93318 	0.9072
Dambro 	0.9598 	0.9008 	0.9036
Table 5: Data statistics: word overlap.
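This appendix does not spell out how the word-overlap score is computed. The sketch below is therefore only one plausible reading, not the authors' implementation: it assumes the score is the fraction of unique tokens in a dialect text that also occur in its Central Thai counterpart, and that both texts are already tokenized. The function name `word_overlap` and the toy sentences are hypothetical.

```python
def word_overlap(dialect_tokens, central_tokens):
    """Fraction of unique dialect tokens that also occur in the Central Thai text.

    Assumed definition for illustration only; the paper may use a different one.
    """
    dialect_vocab = set(dialect_tokens)
    if not dialect_vocab:
        return 0.0
    shared = dialect_vocab & set(central_tokens)
    return len(shared) / len(dialect_vocab)


# Toy, pre-tokenized sentences (illustrative only, not benchmark data).
dialect = ["ข้อย", "บ่", "กิน", "ข้าว"]   # Isan-style wording
central = ["ฉัน", "ไม่", "กิน", "ข้าว"]   # Central Thai wording
score = word_overlap(dialect, central)
print(f"word overlap: {score:.2f}")
```

Under this definition, a score close to 1 means the dialect text shares almost all of its vocabulary with the Central Thai text, which is in line with the values above 0.9 reported in Table 5.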
A.3 Annotators
We list the annotators' demographics as follows. For Lanna, we have two
women and three men, all between 25 and 30 years old. For Isan, we have
three women and two men, aged either 25 to 30 or 31 to 35. For Dambro, we
have three women and two men, all between 25 and 30 years old. All
annotators were born outside Central Thailand and speak both their local
dialect and Central Thai.
A.4 Annotator Agreement
In this study, we calculate the annotator agreement of our proposed metric
using Cohen's Kappa. We use two annotators in this experiment, and both
annotate the same samples (100% sample overlap). As shown in Table 6, Isan
and Dambro have similar agreement scores (roughly 0.7 to 0.8), indicating
substantial agreement. Moreover, we find almost perfect agreement on the
Lanna generation score (0.9315). We attribute this to the models producing
Lanna better than the other dialects, which makes it easier for the
annotators to assign a score.
Dialect 	Generation 	Fluency
Lanna 	0.9315 	0.7319
Isan 	0.7078 	0.7077
Dambro 	0.7514 	0.7951
Table 6: Data statistics: annotator agreement (Cohen's Kappa) of our
proposed metric.
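For readers unfamiliar with the agreement measure in Table 6, the following is a minimal sketch of Cohen's Kappa for two annotators who label the same samples. It is not the authors' evaluation code; the function name `cohen_kappa` and the binary toy labels are hypothetical, and the actual scoring scale used by the annotators may differ.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten shared samples (1 = acceptable, 0 = not).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

In the commonly used Landis and Koch interpretation, values from 0.61 to 0.80 are read as substantial agreement and values above 0.80 as almost perfect agreement, which matches the terms used in this section.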
A.5 Example Code
Since we adopt the full generation setup from Lovenia et al. (2024), we can
run all the experiments with one line of code, similar to SEACrowd. For example:
#!/bin/bash
# Model to evaluate
MODEL_NAME='meta-llama/Llama-3.1-8B-Instruct'
# Run the benchmark for the south and north dialect splits
python evaluation/main_local_prompt_batch.py south "${MODEL_NAME}" 0 1
python evaluation/main_local_prompt_batch.py north "${MODEL_NAME}" 0 1
Listing 1: Example Code

In addition, the output is written in CSV format and includes the scores
for each task. The benchmark can be run on a single A100 or H100 (80 GB)
in under one hour.
A.6 Guideline for Annotators
Translation. Following Section 3.2, we translate from Central Thai to
Thai local dialects using the following guideline:
Translation Task
In this task, we need your help to revise the translation of the prompt
and its response from English to your native language.
Compare the original and translated texts, then edit the translation to
be more human (write your revision in the edit column).
The goal is to make the translation results look more like human writing.
Note that the edit columns can't be null. You need to edit all the
translations.
Do not remove or edit emojis, hashtags, or special characters.
If the text represents gender (she/he), please change it to the general
context (I, you, they, them).
All texts should be edited since they are not 100% natural. There is no
empty field in the revision column.
Model Comparison Guideline. From Section 3.3,
we ask three annotators to compare Model A against Model B. Annotators
are blinded to model identities for all examples, and we average the
answers from the three annotators to obtain the final answer for each
sample.
Guidelines for Evaluating Model A and Model B
Evaluation Objective: Assess the performance of Model A compared to
Model B in responding to provided questions/instructions/prompts.
Evaluate based on the following two aspects:
1. Accuracy of local dialect/language usage
2. Accuracy of the response
Evaluation Options: You must choose only one option for each criterion:
(i) A is better than B: Model A performs better than Model B in the given aspect
(ii) B is better than A: Model B performs better than Model A in the given aspect
(iii) A and B are equally good: Both models perform equally well in the given aspect
(iv) A and B are equally poor: Both models perform equally poorly in the given aspect
Evaluation Justification: Provide a short reason explaining why you chose
that score (repeating reasons is allowed if the outputs are similar).
A.7 Food and Conversation Topics
We provide the full food and conversation topics in
Figure 2.
A.8 Prompts
We show the prompts used in our experiments in Figure 3. Note that all
prompts are written in local dialects, and we also add instructions to
respond in the local dialect.
B LLM-as-a-judge
We acknowledge that a limitation of our proposed metric is that it relies
entirely on human judgment: native speakers judge the fluency and
generation scores of the LLMs' outputs. We tried to use LLM-as-a-judge to
address this limitation, but found that LLMs (i.e., Gemini2 and GPT4o) do
not understand Thai local dialects well enough to assess fluency, resulting
in inaccurate judgments compared to native speakers. Although Gemini2
achieves a high score in Table 3, when it acts as a judge it fails to give
fluency scores similar to those of humans: our preliminary results show
that the correlation between Gemini2 and humans on Lanna is only 53.6
points (Spearman's correlation). Therefore, we omit LLM-as-a-judge from
our work.
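To make the comparison between an LLM judge and human annotators concrete, the sketch below computes Spearman's rank correlation between two lists of scores using only the standard library. It is an illustration under stated assumptions, not the authors' code: the function names and the toy score lists are hypothetical, and reading the reported 53.6 points as a coefficient scaled to 0 to 100 is our assumption.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two items")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical fluency scores for six outputs (not data from the paper).
human_scores = [5, 4, 4, 2, 1, 3]
llm_judge_scores = [4, 5, 3, 3, 1, 2]
rho = spearman(human_scores, llm_judge_scores)
print(f"Spearman's rho = {rho:.3f} ({100 * rho:.1f} points)")
```

A coefficient near 1 would mean the judge ranks outputs in almost the same order as native speakers; a moderate value indicates that the judge's ranking frequently disagrees with theirs.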
C Output Examples
In this section, we show incorrect and correct examples, focusing on the
local dialect output. We divide the outputs into two categories: (i)
correct answers with no local dialects, and (ii) correct answers with
local dialects, to make the models' handling of Thai local dialects
easier to understand. Note that we use the outputs from Typhoon1.5-70b
and Gemini2.

Food (Isan) 	Food (Lanna) 	Food (Dambro) 	Conversation
แจ่วบักเขือ 	ลาบเหนือ 	แก๋งพุงปลา 	ประเพณีท้องถิน
อ่อมน้องวัว 	ขนมจีนนําเงียว 	แกงส้ม 	อาชีพ
ตําบักหุ่ง 	แกงฮังเล 	ลอกอผัดไข่ 	การคมนาคม
หมกหน่อไม้ 	นําพริกหนุ่ม 	ข้าวผัดเคย 	ครอบครัว
ซอยจุ๊ 	นําพริกอ่อง 	แกงส้มปล๊าพงยอดพร๊าว 	กลับบ้านช่วงเทศกาล
แกงขีเหล็ก 	แกงหัวปลี 	ข้าวผัดเคย 	การเมืองท้องถิน
นําตกคอหมูย่าง 	แกงหยวก 	ปล่าซายท๋อดขีมิน 	สภาพอากาศ
ปลาส้ม 	ไส้อัว 	ผัดผักเหลียง 	อาหารท้องถิน
แจ่วฮ้อน 	แคบหมู 	แก๋งเลียง 	เครืองแต่งกายท้องถิน
ตับหวาน 	แกงกระด้าง 	ผัดหลูกต๋อ 	การศึกษาท้องถิน
Figure 2: Food and Conversation topics. Note that these food names are very local in each part of Thailand.
Dialect 	Summarization 	Translation 	QA 	Food 	Conversation
Central	
จงสรุปข้อความด้านล่าง\n
ข้อความ: [INPUT]\nสรุป:
แปลข้อความต่อไปนีจาก [SOURCE] เป็น

Chunk 22 · 1,992 chars

กายท้องถิน
ตับหวาน 	แกงกระด้าง 	ผัดหลูกต๋อ 	การศึกษาท้องถิน
Figure 2: Food and Conversation topics. Note that these food names are very local in each part of Thailand.
Dialect 	Summarization 	Translation 	QA 	Food 	Conversation
Central	
จงสรุปข้อความด้านล่าง\n
ข้อความ: [INPUT]\nสรุป:
แปลข้อความต่อไปนีจาก [SOURCE] เป็น [TARGET]
ให้คุณแปลโดยตรงโดยไม่ต้องมีข้อมูลเพิมเติมใดๆ\n
ข้อความ: [INPUT]\nคําแปล:
โปรดอ้างอิงถึงข้อความด้านล่างนีและตอบคําถามต่อไปนี
ให้ตอบโดยใช้แค่ข้อความทีอยู่ในบทความ:\n
ข้อความ: [CONTEXT]\nคําถาม: [QUESTION]\nคําตอบ:
จงเขียนวิธีทํา [INPUT]	
จงสร้างบทสนทนาของคนภาคกลาง
โดยพูดคุยเกียวกับ [INPUT] บทสนทนาไม่ควรเกิน 5 turns
Isan	
หือสรุปข้อความตางล่างหือเป็นภาษาเหนือ\n
ข้อความ: [INPUT]\nสรุป:
แปลข้อความต่อไปนีจาก [SOURCE] เป๋น [TARGET]
หือตัวแปลโดยตรงโดยตีบะต้องมีข้อมูลเพิมเติมใดๆ\n
ข้อความ: [INPUT]\nคําแปล:
โปรดอ้างอิงถึงข้อความตางล่างนีและตอบคําถามต่อไปนีหือเป็นภาษาเหนือ
หือตอบโดยใช้ก่าข้อความตีอยู่ในบทความ:\n
ข้อความ: [CONTEXT]\nคําถาม: [QUESTION]\nคําตอบ:
หือเขียนวิธียะ [INPUT]
หือเป๋นภาษาเหนือ
หือเขียนกําอู้ของคนเหนือ
โดยอู้เกียวกับ [INPUT] บทสนทนาบะควรเกิน 5 turns
North	
จงสรุปข้อความทางลุ่ม\n
ข้อความ: [INPUT]\nสรุป:
แปลข้อความต่อไปนีจาก [SOURCE] เป็น [TARGET]
ให้เจ้าแปลโดยตรงโดยบ่ต้องมีข้อมูลเพิมเติมใดๆ\n
ข้อความ: [INPUT]\nคําแปล:
โปรดอ้างอิงถึงข้อความทางลุมพีและตอบคําถามต่อไปนีเป็นภาษาอีสาน
ให้ตอบโดยใช้แค่ข้อความทีอยู่ในบทความ:\n
ข้อความ:[CONTEXT]:\nคําถาม: [QUESTION]\nคําตอบ:
จงเขียนวิธีเฮ็ด [INPUT]	
จงเขียนบทเว่าของคนภาคอีสาน
โดยเว่าเกียวกับ [INPUT] บทสนทนาบ่ควรเกิน 5 turns
South
ให้สรุปข้อความข้างล่างเป็นภาษาใต้\n
ข้อความ:[INPUT]\nสรุปว่า:
แปลข้อความต่อจากนี [SOURCE] เป็น [TARGET]
ให้การแปลของเติลแปลตรงตัวเลย
ไม่ต้องไสข้อมูลเพิมเติมไหร\nข้อความ: [INPUT]\nคําแปล:
ช่วยใช้ข้อความข้างล่างนีตอบคําถามเป็นภาษาใต้
โดยตอบแค่คําตอบทีมีอยู๋ในบทความ:\n
ข้อความ: [CONTEXT]\nคําถาม: [QUESTION]\nคําตอบ:	
ให้เขียนวิธีทํา [INPUT] เป็นภาษาใต้	
ให้เขียนบทพูดของคนใต้
โดยแหลงเกียวกับ [INPUT] บทพูดไม่ควรเกิน 5 ตา
English version
Dialect 	Summarization 	Translation 	QA 	Food 	Conversation
Central
Please summarize the following text\n
Text: [INPUT]\nSummary:
Translate this text from [SOURCE] to [TARGET]
Please translate it without any additional information\n
Text: [INPUT]\nTranslation:
Please refer to the provided context and answer the question
Please answer using only the text that appears in the context:\n
Context: [CONTEXT]\nQuestion: [QUESTION]\nAnswer:
Write how to make the food [INPUT]
Please create a conversation
where the conversation topic is [INPUT]; the conversation should not exceed 5 turns
Figure 3: The Thai prompts used in our experiments (Thai version) and an English translation of the Central Thai prompts (English version).
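The prompts in Figure 3 are templates with bracketed placeholders such as [INPUT], [SOURCE], [TARGET], [CONTEXT], and [QUESTION]. A minimal sketch of how such templates could be filled is given below. It is an assumption about the mechanics rather than the benchmark's actual code: `PROMPT_TEMPLATES` and `fill_prompt` are hypothetical names, the templates are English renderings of the Central prompts, and the line breaks inside each template are approximated.

```python
import re

# English renderings of three Central prompts from Figure 3 (wording adapted).
PROMPT_TEMPLATES = {
    "summarization": "Please summarize the following text\nText: [INPUT]\nSummary:",
    "translation": (
        "Translate this text from [SOURCE] to [TARGET]. "
        "Please translate it without any additional information\n"
        "Text: [INPUT]\nTranslation:"
    ),
    "qa": (
        "Please refer to the provided context and answer the question. "
        "Please answer using only the text that appears in the context:\n"
        "Context: [CONTEXT]\nQuestion: [QUESTION]\nAnswer:"
    ),
}

# Matches placeholders written as upper-case names in square brackets.
PLACEHOLDER = re.compile(r"\[([A-Z]+)\]")


def fill_prompt(task, **fields):
    """Replace every [PLACEHOLDER] in the task template with a supplied value."""
    template = PROMPT_TEMPLATES[task]

    def substitute(match):
        key = match.group(1).lower()
        if key not in fields:
            raise KeyError(f"missing value for placeholder [{match.group(1)}]")
        return fields[key]

    # A single pass means inserted values are never re-scanned for placeholders.
    return PLACEHOLDER.sub(substitute, template)


prompt = fill_prompt(
    "translation",
    source="English",
    target="Southern Thai",
    input="I woke up very early today to go watch the buffalo race.",
)
print(prompt)
```

Raising an error on a missing field keeps an unfilled placeholder from silently reaching the model, which would otherwise distort the evaluation of that sample.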
Instruction 	Typhoon1.5 	Comment 	Annotators prefer
แปลข้อความต่อจากนี ภาษาอังกฤษ เป็น ภาษาใต้ ให้การแปลของเติลแปลตรงตัวเลย ไม่ต้องไส
ข้อมูลเพิมเติมไหร
ข้อความ: I woke up very early today to go watch the buffalo race.
The weather was so hot that when I got back home, I had to take a rest because I had a headache.
คําแปล:
ผมถูกตืนเช้ามากวันนีเพือไปดูการแข่งควาย
อากาศร้อนมากจนเมือผมกลับบ้าน ผมต้องพักเพราะผมมีหัวใจ
The translation is almost correct.
However, there are no local dialects at all in the response
เอ้อ ตืนแต่เช้าเหอวันนี หวังไปแหลงควายชน
อากาศร้อนฉิบหาย พอเห้อกลับถึงบ้าน ต้องแหลงพัก
เพราะว่าปวดหัวเติบเลย
แปลข้อความต่อจากนี ภาษาอังกฤษ เป็น ภาษาใต้ ให้การแปลของเติลแปลตรงตัวเลย ไม่ต้องไส	
ข้อมูลเพิมเติมไหร	
ข้อความ: Why do I have to do this work again? I’ve revised it so many times, but it’s never	
finished. The person who keeps asking for revisions doesn’t even know what they’re doing. I’m so	
exhausted.	
คําแปล:
ทําไมฉันต้องทํางานนีอีก? ฉันแก้ไขมันหลายครังแล้ว	
แต่มันไม่เคยเสร็จสิน คนทีขอให้แก้ไขไม่รู้ว่าเขาทําอะไร ฉันเหนือยมาก	
The translation is almost correct.
However, there are no local dialects at all in the response
เอ้อ เหอ ไซร กูต้องมาทําไหม่ตะงานนีหวา?	
กูแก้มาตังหลายรอบแล้วเห้อ มันก็ไม่แล้วสักที	
ไอ้คนสังแก้ก็ไม่รู้เรืองอีท่าไหน กูเหนือยฉิบหายเลยหว่ะ.
แปลข้อความต่อไปนีจาก ภาษาอังกฤษ เป๋น ภาษาเหนือ หือตัวแปลโดยตรงโดยตีบะต้องมีข้อมูลเพิม	
เติมใดๆ	
ข้อความ: I never knew before that Beer was so good at rock climbing. It’s impressive, and I’ll use	
him as an example of discipline and consistency in exercise.	
คําแปล:
ผมไม่เคยรู้มาก่อนว่าเบียร์เก่งในการปีนเขาขนาดนี	
มันน่าประทับใจ และผมจะใช้เขาเป็นตัวอย่างของความมีวินัย	
และความสมําเสมอในการออกกําลังกาย
The translation is almost correct.
However, there are no local dialects at all in the response
เมือก่อนบ่เกยฮู้เลยว่าเบียร์เก่งปีนผาขนาดนี	
น่าทึงแต้ๆ แล้วจะเอาเปินไปเป็นตัวอย่างเรืองมีวินัย	
และสมําเสมอในการออกกําลังกาย
แปลข้อความต่อไปนีจาก ภาษาอังกฤษ เป๋น ภาษาเหนือ หือตัวแปลโดยตรงโดยตีบะต้องมีข้อมูลเพิม	
เติมใดๆ	
ข้อความ: On rainy days, you have to take an umbrella, but today, Gam forgot hers. So, she stayed	
home and relaxed.	
คําแปล:
ในวันฝนตก, สูต้องเอาจ้อง (ร่ม), แต่ว่าวันนี,	
แก๋มลืมของสู. ดังนัน, สูเลยอยู่บ้านแล้วก็พักผ่อน. 	The translation is almost correct with local dialects. 	-
แปลข้อความต่อไปนีจาก ภาษาอังกฤษ เป๋น ภาษาเหนือ หือตัวแปลโดยตรงโดยตีบะต้องมีข้อมูลเพิม	
เติมใดๆ	
ข้อความ: People who like to tease others are no-good. If you want to be a good person, don’t tease	
others. But Korn often gets teased, which makes him shout, "What the heck is your problem?"	
คําแปล:
คนตีชอบล้อเลียนคนอืน บะดี ถ้าอยากเป๋นคนดี	
อย่าล้อเลียนคนอืน แต่ก่อนมักจะโดนล้อเลียน	
ทําให้เปินฮ้องว่า "อะหยังนักหนาของสู?"
The translation is almost correct with local dialects. 	-
Figure 4: Example from the translation task.
As shown in Figure 4, while Typhoon translates the text correctly, its
output contains no local dialect in this case. In contrast, Gemini2 (the
annotators' preferred model) translates from English into the local
dialect correctly. Moreover, we also observe that both models code-switch
between Central Thai and local dialects, as underscored in Figure 1. This
suggests that the gap in making LLMs speak Thai local dialects is wide
and needs more attention.
