Are Large Language Models for Education Reliable for All Languages?

Summary

This study evaluates whether large language models (LLMs) are reliable for educational tasks across non-English languages. The authors tested six frontier models—GPT-4o, Gemini 2.0 Flash, Claude 3.7 Sonnet, Llama 3.1, Mistral Large, and Command-A—on four specific tasks: identifying student misconceptions, selecting targeted feedback, interactive tutoring, and grading translations. These tasks were assessed in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, and Czech) alongside English. Results indicate that while English performance remains dominant, the gap is not uniform. Performance generally correlates with the amount of language data in training corpora, with lower-resource languages like Telugu showing poorer outcomes. However, GPT-4o and Gemini 2.0 Flash demonstrated consistent, high-level performance across most languages and tasks. In contrast, models like Mistral and Command-A showed significant inconsistency, particularly failing in tutoring tasks for Czech and Telugu. The feedback selection task proved the most difficult for all models, often resulting in worse-than-random accuracy. Notably, using prompts in the target language rarely improved performance compared to English prompts; in many cases, English prompts yielded better results. The authors conclude that while some LLMs are reliable across languages, practitioners must verify model performance for specific target languages and tasks before deployment to avoid introducing educational harm or bias.

PDF viewer

Chunks(46)

Chunk 0 · 1,976 chars

Are Large Language Models for Education Reliable for All Languages?
Vansh Gupta*† Sankalan Pal Chowdhury∗†
Vilém Zouhar† Donya Rooein‡ Mrinmaya Sachan†
{guptav,spalchowd,vzouhar,msachan}@ethz.ch donya.rooein@unibocconi.it
†ETH Zurich ‡Bocconi University
Abstract
Large language models (LLMs) are increas-
ingly being adopted in educational settings.
These applications expand beyond English,
though current LLMs remain primarily English-
centric. In this work, we ascertain if their
use in education settings in non-English lan-
guages is warranted. We evaluated the perfor-
mance of popular LLMs on four educational
tasks: identifying student misconceptions, pro-
viding targeted feedback, interactive tutoring,
and grading translations in eight languages
(Mandarin, Hindi, Arabic, German, Farsi, Tel-
ugu, Ukrainian, Czech) in addition to English.
We find that the performance on these tasks
somewhat corresponds to the amount of lan-
guage represented in training data, with lower-
resource languages having poorer task perfor-
mance. However, at least some models are
able to more or less maintain their levels of
performance across all languages. Thus, we
recommend that practitioners first verify that
the LLM works well in the target language for
their educational task before deployment.
1 Introduction
Education is a multilingual, multicultural endeav-
our. AI-based technologies have recently shown
the potential to improve students’ learning expe-
riences, and educational systems worldwide are
increasingly adopting these tools (Gligorea et al.,
2023). From personalized instruction and targeted
feedback to appropriate content generation and in-
teractive tutoring, these tools offer solutions to key
educational challenges (Leon, 2024; Rooein et al.,
2024; Mosher et al., 2024). Large language models
such as GPT, Gemini, and Llama (OpenAI, 2023;
Team, 2024; Roumeliotis et al., 2023) have become
*Equal Contribution
0We release the collected dataset and code at

Chunk 1 · 1,991 chars

-
teractive tutoring, these tools offer solutions to key
educational challenges (Leon, 2024; Rooein et al.,
2024; Mosher et al., 2024). Large language models
such as GPT, Gemini, and Llama (OpenAI, 2023;
Team, 2024; Roumeliotis et al., 2023) have become
*Equal Contribution
0We release the collected dataset and code at github.com/
eth-lre/multilingual-educational-llm-bias. The dataset com-
prises 313,500 automatically evaluated model outputs across
seven languages, four tasks, and six models.
particularly influential, with early evidence sug-
gesting their ability to support teachers or scaffold
student learning (Kasneci et al., 2023; Alqahtani
et al., 2023).
Although most of these LLMs are trained on
multilingual corpora (OpenAI, 2019; Nvidia, 2022;
Peng et al., 2023; Gu and Dao, 2023), they are still
overwhelmingly English-centric (Argoub, 2022;
Ruder et al., 2022, Table 1). Inadequate adaptation
to local languages in an educational setting risks
diminishing their utility and exacerbating existing
inequalities by privileging dominant languages and
cultures. The question of multilingualism arises in
every domain where LLMs are applied (Lai et al.,
2023; Ahuja et al., 2023, 2024). However, it is es-
pecially important in the field of education, which
has seen wide use of LLMs despite the high stakes
(Alhafni et al., 2024; Raheja et al., 2023; Naismith
et al., 2023). Without rigorous evaluation tailored
to educational tasks across languages, deploying
LLMs in classrooms may introduce new forms
of harm, including misinformation, misalignment
with curricula, or culturally inappropriate content
(Almasoud et al., 2025).
In this work, we present an empirical investiga-
tion of the capabilities of frontier LLMs on educa-
tional tasks across several languages. We identify
four education-related tasks (identifying student
misconceptions, providing targeted feedback, inter-
active tutoring, and translation grading) with well-
defined language-agnostic metrics. We then

Chunk 2 · 1,993 chars

pirical investiga-
tion of the capabilities of frontier LLMs on educa-
tional tasks across several languages. We identify
four education-related tasks (identifying student
misconceptions, providing targeted feedback, inter-
active tutoring, and translation grading) with well-
defined language-agnostic metrics. We then evalu-
ate several frontier LLMs (Claude, Gemini, GPT4o,
Llama, and Mistral) on these tasks in eight lan-
guages (Mandarin, Hindi, Arabic, German, Farsi,
Telugu, Ukrainian, and Czech) in addition to En-
glish.
Our results show that though performance in
English still dominates, other languages are not
too far behind, at least for GPT4o and Gemini-2.0-
flash, which emerge as the best models. We also
find that using prompts in the language of the task
arXiv:2504.17720v2 [cs.CL] 5 Aug 2025

-- 1 of 20 --

is rarely helpful compared to English prompts.
2 Methods
We select our set of tasks based on 3 desiderata:
• Relevant to Education: We focus on tasks that
LLMs would encounter specifically in the role
of tutors, teachers, or teaching assistants. We
do not cover tasks like question answering or
solving math questions, which, while possibly
being relevant to education, are more general
tasks that are primarily studied in other contexts.
• Have a Language Component: We avoid tasks
whose formulation uses purely notation, for ex-
ample, solving a math equation. If the equation
is provided in mathematical notation, the task
would remain unchanged between different lan-
guages, making the question of multilingual per-
formance moot.
• Language Invariant Evaluation: Finally, we
need evaluation metrics that remain comparable
across languages to compare performance across
different languages efficiently. This means we
cannot rely on language-dependent metrics like
BLEU or COMET (Papineni et al., 2002; Rei
et al., 2020).
Based on these, we selected following four tasks:
Task 1: Misconception identification. An im-
portant aspect of teaching is fixing student

Chunk 3 · 1,990 chars

o compare performance across
different languages efficiently. This means we
cannot rely on language-dependent metrics like
BLEU or COMET (Papineni et al., 2002; Rei
et al., 2020).
Based on these, we selected following four tasks:
Task 1: Misconception identification. An im-
portant aspect of teaching is fixing student miscon-
ceptions, which first requires identifying the stu-
dent misconception (Liu et al., 2023). We build this
task on the EEDI Math Questions Dataset, which
contains thousands of multiple-choice questions
with four answer choices. For many of the wrong
choices, we have expert-annotated misconceptions
that could lead to a student picking the said choice.
We leverage these to build our task. The LLM
is given a multiple-choice question, the student’s
(incorrect) answer, and four possible misconcep-
tions. The candidate misconceptions include the
true misconception identified by experts and three
distractors chosen at random from the other mis-
conceptions present in the dataset. The LLM must
pick the correct misconception from these four op-
tions (see Example 1 for an example). We evaluate
the LLM performance by reporting accuracy in pre-
dicting the student misconception. Since the model
must pick one of four options, a random baseline
has an accuracy of 25%.
Task 2: Feedback selection. A key step towards
fixing students’ misconceptions is generating feed-
back to alleviate them. The EEDI dataset discussed
above also includes feedback for all the choices
we use for this part. The LLM is again given a
multiple-choice question, the student’s answer, and
this time, a set of four possible feedbacks, out of
which the LLM must select the feedback corre-
sponding to the student’s answer. Note that while
there are 4 possible feedbacks, one corresponds
to the correct answer. This one is easily identifi-
able as it reinforces the student’s answer, while the
feedbacks corresponding to wrong answers all try
to make the student realize their mistake. As

Chunk 4 · 1,994 chars

the feedback corre-
sponding to the student’s answer. Note that while
there are 4 possible feedbacks, one corresponds
to the correct answer. This one is easily identifi-
able as it reinforces the student’s answer, while the
feedbacks corresponding to wrong answers all try
to make the student realize their mistake. As an
example, see Option C in both parts of 2, which
are the only options in their respective questions
that do not start with a negative tone. Therefore,
if the selected answer is also the correct answer to
the problem, the LLM might be able to pick the
correct feedback using some shallow semantics,
which we want to avoid. Therefore, we ensure that
the selected answer is always incorrect. The ran-
dom baseline has an accuracy of 25%, or 33% if
choosing among responses to the wrong answer.
Task 3: Tutoring. For more complex misconcep-
tions, a single-turn feedback often does not suffice,
and fixing the misconception requires a multi-turn
conversation between the student and the teacher,
also known as tutoring. (Bloom, 1984; Cohen et al.,
1982) This involves a teacher LLM trying to help
the student identify and fix an error in their solu-
tion. We evaluate the tutoring ability of the LLM
by having it tutor a weaker LLM, which acts as
the student. Both the teacher and the student are
given the question, but only the teacher LLM can
access the correct answer. The student LLM is
instructed to stick to the wrong solution unless it
sees strong justification to shift. The teacher and
the student take turns to send messages, with the
teacher’s goal being to get the student model to
the correct answer, without revealing the answer
themselves. The teacher LLM is considered to get
a success if the student LLM states the answer. If
the teacher reveals the answer before the student
has gotten to it, it is counted as telling. An ad-
justed success occurs when there is a success but
no telling. The task is finally evaluated by Tutoring
score (Pal Chowdhury et al.,

Chunk 5 · 1,999 chars

e teacher LLM is considered to get
a success if the student LLM states the answer. If
the teacher reveals the answer before the student
has gotten to it, it is counted as telling. An ad-
justed success occurs when there is a success but
no telling. The task is finally evaluated by Tutoring
score (Pal Chowdhury et al., 2024), which is the
harmonic mean between success rate and adjusted
success rate.
This task differs from the other tasks on this list

-- 2 of 20 --

Language family Script Wikipedia CommonCrawl Speakers
English Germanic Latin 6973K 42.8% 1500M
Mandarin Sino-tibetan Hanzi1 1480K 5.8% 1184M
Hindi Indo-Iranian Brahmic 165K 0.20% 609M
Arabic Afro-Asiatic Abjad 1259K 0.68% 411M
German Germanic Latin 3021K 5.5% 411M
Farsi Indo-Iranian Abjad 1034K 0.74% 134M
Telugu Dravidian Brahmic 111K 0.02% 96M
Ukrainian Slavic Cyrillic 1371K 0.62% 39M
Czech Slavic Latin 566M 0.10% 12M
Table 1: Language information, number of speakers (Ethnologue 2025), and global representations of tested
languages in NLP (Wikipedia Articles and proportion in CommonCrawl in March 2025).
in at least two significant ways. First, it is a multi-
turn conversation task, so there is no scope for
guessing the answer. Secondly, the final evaluation
depends on the performance of the student LLM,
so the multilingual capabilities of the student LLM
also restrict the applicability of this task. These
factors make this task both slower to run and more
complex for the LLMs.
Task 4: Translation grading. A common field
of education that has seen an increase in the use of
LLMs is Language learning (Klimova et al., 2024;
Zhu et al., 2024). A representative task from this
field is to assign a grade to a translation provided
by a student. While we lack proper datasets across
languages with translations and their appropriate
grades, we can approximate this task by the fact
that the machine translation of a sentence should re-
ceive a higher grade than the exact translation with
one word replaced by a random

Chunk 6 · 1,987 chars

grade to a translation provided
by a student. While we lack proper datasets across
languages with translations and their appropriate
grades, we can approximate this task by the fact
that the machine translation of a sentence should re-
ceive a higher grade than the exact translation with
one word replaced by a random word. We use En-
glish sentences from Duolingo’s English→Spanish
SLAM dataset (Settles, 2018), which are machine
translated to other languages. We chose this dataset
because it is meant to be used for translation, so
it should contain fewer hard-to-translate sentences.
We filter out simple sentences that do not end with
a full stop or have fewer than five words. For each
translated sentence, we then create a correspond-
ing perturbed translation by replacing one of the
words in the sentence with a different word selected
at random from the other sentences in the dataset,
disrupting both the fluency and adequacy of the
translation. The LLM judges both the original and
perturbed versions on a scale from 1 (completely
incorrect) to 5 (perfect), with the expectation that it
should assign a strictly lower score to the perturbed
version. A model assigning all scores at random
would therefore score around 40%.
1Alternately referred to as Kanji, Hanja or Hantu
Figure 1: Multidimensional Scaling projection of lan-
guages based on syntax features from URIEL/lang2vec.
Languages used in our experiments are highlighted and
shown with full names, others are in ISO 639/set 2.
Language selection. We choose eight languages
for experiments: Mandarin, Hindi, Arabic, Ger-
man, Farsi, Telugu, Ukranian, Czech in addition
to English for comparison. This language selec-
tion reflects diverse linguistic properties, varying
levels of representation in training data, and differ-
ent language families (Foundation, 2024), see Ta-
ble 1. Hindi (Indo-Aryan) and Telugu (Dravidian)
represent major languages from the Indian subcon-
tinent that use the Brahmic script and are

Chunk 7 · 1,996 chars

his language selec-
tion reflects diverse linguistic properties, varying
levels of representation in training data, and differ-
ent language families (Foundation, 2024), see Ta-
ble 1. Hindi (Indo-Aryan) and Telugu (Dravidian)
represent major languages from the Indian subcon-
tinent that use the Brahmic script and are under-
represented in both CommonCrawl and Wikipedia.
German and Mandarin, on the other hand, are ex-
amples of languages well represented in both Com-
monCrawl and Wikipedia Farsi and Arabic offer
insights into LLM performance on a right-to-left
Abjad script, whereas Ukrainian and Czech allow
us to study generalisation in medium resource mor-
phologically rich languages, using the Cyrillic and

-- 3 of 20 --

Language Questions Misconception Feedback Translation
Mandarin 0.593 0.607 0.666 0.781
Hindi 0.455 0.546 0.593 0.831
Arabic 0.596 0.659 0.605 0.793
German 0.623 0.642 0.697 0.792
Farsi 0.574 0.644 0.708 0.840
Telugu 0.518 0.569 0.578 0.639
Ukrainian 0.607 0.642 0.682 0.821
Czech 0.611 0.626 0.663 0.805
Table 2: Average COMETDA,XL
23 scores for different languages for different components of the tasks. The Questions
are used for both Misconception and Feedback tasks. The Tutoring task is not translated.
Latin scripts, respectively.
To assess the typological diversity of our se-
lected languages, we used the URIEL typologi-
cal database (Littell et al., 2017) with lang2vec,
which provides dense vector representations of lan-
guages based on a range of typological, phyloge-
netic, and geographical features. As recommended
by the package, we extracted syntax features with
k-NN predictions for the missing values for a set
of 40 languages, constructed as the union of our
core experimental languages and the most widely
spoken languages worldwide according to Ethno-
logue (Eberhard et al., 2025). We projected each
language feature vector into two dimensions us-
ing Multidimensional Scaling, producing a 2D lan-
guage similarity plot. This allows us to

Chunk 8 · 1,996 chars

ges, constructed as the union of our
core experimental languages and the most widely
spoken languages worldwide according to Ethno-
logue (Eberhard et al., 2025). We projected each
language feature vector into two dimensions us-
ing Multidimensional Scaling, producing a 2D lan-
guage similarity plot. This allows us to visualise
(see Figure 1) the relative syntactic diversity of our
selected languages and confirm that they span a
broad typological space. The visualisation demon-
strates that our language selection (highlighted) is
well distributed across the typological landscape.
Translation. We obtain our tasks in all the above-
mentioned languages by machine translation. Fol-
lowing the GPT4 Technical Report (OpenAI, 2023,
Figure 5), we use Azure Translate to translate all
our examples to the target languages. However,
this introduces an additional noise source for tasks
performed in languages other than English. In fact,
after reviewing some of the translations manually,
it does look like the translations, though decent,
are not as easy to follow as their English coun-
terparts. This finding is further corroborated by
COMETDA,XL
23 (Rei et al., 2023) scores of the trans-
lations (see Table 7). This means that any dif-
ferences we observe between English and other-
language performance cannot be conclusively at-
tributed to the LLM being tested. However, we can
still compare the performance of different LLMs
across the same language, as the same translation
was used for all LLMs. Further, if at least one LLM
performs well in a task on a given language, we
can be reasonably certain that the translation for
that task-language pair was also good enough.
Models and prompts. We evaluate six state-of-
the-art LLMs praised for their multilingual ca-
pabilities: GPT-4o (OpenAI, 2023), Gemini 2.0
Flash (Team, 2024), Claude 3.7 Sonnet (Anthropic,
2024), Llama 3.1 405B (Grattafiori et al., 2024),
Mistral Large 2407 (AI, 2024; Jiang et al., 2023),
and Command-A (Cohere et

Chunk 9 · 1,998 chars

gh.
Models and prompts. We evaluate six state-of-
the-art LLMs praised for their multilingual ca-
pabilities: GPT-4o (OpenAI, 2023), Gemini 2.0
Flash (Team, 2024), Claude 3.7 Sonnet (Anthropic,
2024), Llama 3.1 405B (Grattafiori et al., 2024),
Mistral Large 2407 (AI, 2024; Jiang et al., 2023),
and Command-A (Cohere et al., 2025). We leave
all sampling parameters to their defaults. For
prompts, we use a simple chain of thought prompt-
ing method, where the model is first asked to ex-
plain why it would pick a certain answer, and then
asked to choose it in a separate prompt. Based on
literature (Mondshine et al., 2024; Huang et al.,
2023), it is unclear whether or not it is beneficial
to translate the prompt itself to the target language
or keep it in English, so we try both options.2,3
For each task, we use 1000 examples for report-
ing our results, sampled at random from the dataset,
except for 200 examples in the tutoring task, which
is multi-turn.
3 Results
In this section, we describe the results of five pop-
ular large language models on the four tasks de-
scribed in Section 2. The main results are shown
in Tables 3 to 6.
English is easiest for LLMs. The gap between
English and other languages is large in general. On
2A weaker model roleplays the student model used in the
tutoring task to be consistent with the original work. We only
use the original prompts because it does not work well with
non-English prompts.
3We machine-translate the prompts and manually verify
(with L1/L2 language knowledge) the translation adequacy.

-- 4 of 20 --

Input Options
Question: Which num-
ber is the greatest?
Student Answer:
5.0001
Right Answer:5.2
A: Believes the mean is total frequency divided by something,
B (correct): Thinks the more digits a number has the greater it is, regardless of place value,
C: Believes parallel lines have gradients that multiply to give -1,
D: When multiplying by a multiple of 10, gives an answer 10 times bigger than it should be
Question: What is

Chunk 10 · 1,998 chars

e mean is total frequency divided by something,
B (correct): Thinks the more digits a number has the greater it is, regardless of place value,
C: Believes parallel lines have gradients that multiply to give -1,
D: When multiplying by a multiple of 10, gives an answer 10 times bigger than it should be
Question: What is the
lowest common multi-
ple of 8 and 4?
Student answer: 4
Right Answer: 8
A: Subtracts instead of adds when answering worded problems,
B (correct): Confuses factors and multiples,
C: Rounds up instead of down,
D: Adds instead of multiplying when expanding bracket
Example 1: Two examples of the misconception identification task (English).
Input Options
Question: 6 pencils
cost £1.50. How
much do 3 pencils
cost?
Student answer:
25p
A: I think you have made an arithmetic error when halving £1.50. Use short division to divide by two,
B: I think you have used the incorrect notation for money. Consider how the monetary values in the
question are written,
C (correct answer): If 6 pencils cost £1.50, then 3 pencils cost half of £1.50, which is £0.75 or 75p.,
D (student answer): I think you have found the cost for one pencil. The question asks for the cost of
3 pencils.
Question: A film
starts at 8.50pm.
The film lasts
2 hours and 52
minutes. What time
does the film finish?
Student answer:
11.02pm
A (student answer): This isn’t quite right. Remember that there are 60 minutes in an hour, not 100 :),
B: I think you’ve confused your method a little. Noticing that 2 hours and 52 minutes is just 8 minutes
less than 3 hours is super, just make sure you add and subtract in the correct directions though :),
C: Almost there! Take care to notice how many hours and minutes you’re adding here. Is your answer
2 hours and 52 minutes later than 8.50pm?,
D (correct answer): Adding 2 hours to 8.50pm gives 10.50pm. Adding 10 minutes on takes us to
11.00pm, and adding the remaining 42 minutes gives 11.42pm.
Example 2: Two examples of the feedback selection task (English).
Math

Chunk 11 · 1,996 chars

hours and minutes you’re adding here. Is your answer
2 hours and 52 minutes later than 8.50pm?,
D (correct answer): Adding 2 hours to 8.50pm gives 10.50pm. Adding 10 minutes on takes us to
11.00pm, and adding the remaining 42 minutes gives 11.42pm.
Example 2: Two examples of the feedback selection task (English).
Math Problem Student’s (Incorrect) Solution Correct Solution
Sam sells bread. He has a target of selling
120 crates of bread in a week. One week he
was closed on Monday and Friday. Over the
weekend he sold 20 crates. On Tuesday he
sold 15 crates, on Wednesday 12 crates, and
Thursday 18 crates. By how many crates was
Sam off from his target for the week?
Sam had 5 days to sell bread because he was
closed on Monday and Friday. He sold a total
of 20 + 15 + 12 + 18 = 65 crates of bread from
Tuesday to Thursday. Adding the 20 crates he
sold over the weekend, Sam sold a total of 65 +
20 = 85 crates of bread in a week. Sam was off
from his target by 120 - 85 = 35 crates of bread.
During the whole week
Sam sold 15 + 12 + 18
+ 20 = 65 crates. Sam
was off his target by
120 - 65 = 55 crates.
Sophia is thinking of taking a road trip in her
car, and would like to know how far she can
drive on a single tank of gas. She has traveled
100 miles since last filling her tank, and she
needed to put in 4 gallons of gas to fill it up
again. The owner’s manual for her car says that
her tank holds 12 gallons of gas. How many
miles can Sophia drive on a single tank of gas?
Sophia used 4 out of the 12 gallons of gas in her
tank, so there are 12-4 = 8 gallons of gas left
in the tank. If Sophia can drive 100 miles on
4 gallons of gas, then she can drive 100/4 = 25
miles per gallon. Therefore, with 8 gallons of
gas left in the tank, Sophia can drive 25 x 8 =
200 miles on a single tank of gas.
To find miles per gal-
lon, divide 100 miles
/ 4 gallons = 25 miles
per gallon. To find how
far Olivia can go on a
single tank, multiply 25
miles per gallon × 12
gallons = 300 miles.
Example 3:

Chunk 12 · 1,995 chars

s per gallon. Therefore, with 8 gallons of
gas left in the tank, Sophia can drive 25 x 8 =
200 miles on a single tank of gas.
To find miles per gal-
lon, divide 100 miles
/ 4 gallons = 25 miles
per gallon. To find how
far Olivia can go on a
single tank, multiply 25
miles per gallon × 12
gallons = 300 miles.
Example 3: Two examples of the tutoring task.
English Source Original Translation Perturbed Translation Language
It is a kind of tomato.
Mandarin
Hindi
Arabic
Es ist eine Art Tomate Katze ist eine Art Tomate German
Farsi
Telugu
Ukrainian
Je to druh rajˇcete. matka to druh rajˇcete. Czech
Example 4: A single example of the translation grading task for non-English languages.

-- 5 of 20 --

English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English 97.6% 96.2% 95.1% 94.0% 95.0% 95.3% 97.6% 96.2% 95.1% 94.0% 95.0% 95.3%
Mandarin · 95.8% 95.2% · 92.5% · 92.1% · 92.8% · 92.9% 96.5% 95.0% · 91.7% 93.8% · 92.9% 94.1%
Hindi · 94.5% · 93.2% · 91.9% · 89.8% · 91.8% · 93.2% · 95.5% · 93.6% · 89.6% · 90.4% · 90.6% · 91.4%
Arabic · 95.9% · 93.0% · 92.0% ⋆ 86.0% · 92.6% · 93.4% · 95.9% · 93.0% · 92.8% · 90.9% · 92.0% 94.0%
German · 96.0% 96.2% 94.6% ⋆ 84.6% 95.1% 95.2% · 95.9% 96.6% 94.0% ⋆74.0% 94.9% 95.2%
Farsi · 94.8% · 93.3% · 93.0% · 87.5% · 92.7% · 93.1% · 95.1% · 94.4% ⋆68.0% · 88.3% ⋆66.9% · 93.6%
Telugu · 95.2% · 92.2% · 89.9% · 86.9% · 89.7% ⋆ 85.5% · 94.2% · 90.8% ⋆68.6% ⋆ 83.6% ⋆35.5% ⋆77.9%
Ukranian · 95.7% 94.9% · 92.9% 93.3% 94.4% 94.9% · 95.6% · 94.3% ⋆56.6% · 90.4% 94.2% 93.9%
Czech 96.9% 95.1% 94.5% 92.3% 94.5% 94.1% 96.6% 95.8% ⋆70.2% ⋆ 81.6% ⋆41.0% 94.5%
Table 3: Results (accuracy) for the misconception identification task. We mark results significantly lower (at least
10%=⋆, at least 5%=⋆, otherwise ·) than English with a one-sided 95% confidence t-test.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English

Chunk 13 · 1,992 chars

s (accuracy) for the misconception identification task. We mark results significantly lower (at least
10%=⋆, at least 5%=⋆, otherwise ·) than English with a one-sided 95% confidence t-test.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English 53.4% 38.2% 17.0% 51.1% 48.5% 39.7% 53.4% 38.2% 17.0% 51.1% 48.5% 39.7%
Mandarin · 49.6% ⋆ 29.7% · 12.3% · 43.0% · 40.1% · 31.8% ⋆ 41.1% ⋆19.2% ⋆ 5.8% ⋆30.3% ⋆30.3% ⋆ 27.8%
Hindi · 48.7% 35.6% · 13.0% · 43.6% · 40.5% · 31.6% ⋆32.1% ⋆13.4% ⋆ 6.2% · 44.3% ⋆18.6% ⋆18.8%
Arabic · 49.6% ⋆ 28.7% · 13.9% · 45.3% ⋆ 38.8% · 33.3% · 48.8% ⋆10.7% 16.3% 48.1% ⋆27.8% ⋆ 28.9%
German 52.5% · 32.1% 15.0% · 46.4% · 42.4% · 32.8% 50.6% · 30.8% 15.6% · 44.4% ⋆ 39.4% 37.6%
Farsi 50.2% ⋆ 27.9% · 11.3% · 44.9% · 41.3% ⋆ 30.9% · 45.9% · 31.6% 16.3% · 44.0% ⋆33.5% · 35.5%
Telugu · 45.2% ⋆ 27.6% · 10.4% · 43.4% ⋆34.0% ⋆ 26.3% ⋆13.9% ⋆12.7% ⋆ 6.1% ⋆ 37.7% ⋆15.5% ⋆9.5%
Ukranian 50.3% · 33.2% · 13.0% · 44.8% · 41.3% · 32.2% ⋆35.9% ⋆19.6% ⋆ 8.1% 52.8% ⋆31.0% ⋆ 27.2%
Czech 49.9% 37.8% · 14.1% · 46.5% · 41.6% ⋆ 30.7% ⋆ 42.7% ⋆ 26.1% 19.2% · 46.6% ⋆ 35.5% · 35.6%
Table 4: Results (accuracy) for the feedback selection task. We mark results significantly lower (at least 10%=⋆,
at least 5%=⋆, otherwise ·) than English with a one-sided 95% confidence t-test.
Harmonic mean Success/1-Telling
GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English 94.7% 97.0% 22.1% 93.0% 82.0% 95.5% 96.0/2.5% 97.5/1.0% 96.5/84.0% 93.5/1.0% 82.0/0.0% 96.0/1.0%
Mandarin 89.8% 89.0% 26.4% 79.7% 79.7% 88.2% 94.0/8.0% 90.5/3.0% 90.0/74.5% 80.5/1.5% 80.0/0.5% 93.0/9.0%
Hindi 90.5% 92.7% 24.2% ⋆ 72.2% 73.5% · 88.4% 95.0/8.5% 93.0/0.5% 89.5/75.5% 77.5/10.0% 73.5/0.0% 91.0/5.0%
Arabic 91.4% 89.7% 24.3% · 84.2% 75.2% 87.4% 94.5/5.9% 90.0/0.5% 91.0/77.0% 86.0/3.5% 75.5/0.5% 93.0/10.5%
German 90.7% 91.2% 23.4% 84.2% 77.2% · 86.3% 92.5/3.5% 92.0/1.5% 88.0/74.5% 85.0/1.5% 77.5/0.5%

Chunk 14 · 1,989 chars

3.0/9.0%
Hindi 90.5% 92.7% 24.2% ⋆ 72.2% 73.5% · 88.4% 95.0/8.5% 93.0/0.5% 89.5/75.5% 77.5/10.0% 73.5/0.0% 91.0/5.0%
Arabic 91.4% 89.7% 24.3% · 84.2% 75.2% 87.4% 94.5/5.9% 90.0/0.5% 91.0/77.0% 86.0/3.5% 75.5/0.5% 93.0/10.5%
German 90.7% 91.2% 23.4% 84.2% 77.2% · 86.3% 92.5/3.5% 92.0/1.5% 88.0/74.5% 85.0/1.5% 77.5/0.5% 90.5/8.0%
Farsi · 85.6% ⋆ 81.3% 28.7% 77.2% · 65.8% · 77.8% 89.0/6.5% 87.5/11.5% 91.5/74.5% 78.0/1.5% 69.5/7.0% 91.0/23.0%
Telugu ⋆50.1% ⋆39.5% 27.7% · 58.9% ⋆2.9% ⋆40.7% 77.5/40.5% 77.5/51.0% 85.5/69.0% 61.0/4.0% 59.0/57.5% 63.5/33.5%
Ukranian 91.2% 91.5% 23.5% · 81.2% 71.5% 90.9% 93.0/3.5% 92.0/1.0% 91.5/78.0% 84.0/5.5% 71.5/0.0% 93.5/5.0%
Czech ⋆43.8% ⋆44.1% 17.2% 70.2% ⋆2.9% ⋆21.5% 65.5/32.5% 73.5/42.0% 90.0/80.5% 71.5/2.5% 52.5/51.0% 77.0/64.5%
Table 5: Results (harmonic mean, success, and telling) for the tutoring task. We mark results significantly lower (at
least 10%=⋆, at least 5%=⋆, otherwise ·) than English with a one-sided 95% confidence t-test when occurring in
both success and telling. Telling is flipped such that higher is better.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
Mandarin 100.0% 99.3% 98.9% 99.9% 99.5% 99.9% 99.9% 99.4% 24.8% 99.6% 99.4% 99.9%
Hindi 91.5% 74.1% 92.1% 77.6% 82.4% 77.9% 93.8% 88.5% 56.5% 86.5% 87.6% 81.3%
Arabic 98.6% 97.9% 99.2% 98.8% 97.5% 99.0% 98.8% 98.3% 67.2% 98.6% 97.8% 97.9%
German 98.2% 97.9% 97.9% 98.2% 98.2% 98.2% 98.5% 98.3% 29.9% 98.0% 98.3% 97.8%
Farsi 95.3% 93.5% 96.0% 96.4% 92.3% 96.6% 96.8% 96.0% 67.0% 96.4% 94.1% 96.2%
Telugu 77.2% 33.7% 81.0% 51.9% 48.7% 25.2% 82.8% 46.8% 40.7% 82.1% 67.1% 15.6%
Ukranian 98.0% 97.3% 96.9% 96.5% 97.3% 98.3% 98.1% 97.9% 85.3% 97.7% 98.4% 98.2%
Czech 98.7% 98.3% 98.9% 98.3% 97.5% 98.8% 99.3% 98.8% 80.8% 98.7% 99.5% 99.2%
Table 6: Results (accuracy) for the translation grading task.
average4 across all tasks (excluding translation) and
models, English has 70.9%, in contrast to

Chunk 15 · 1,998 chars

6%
Ukranian 98.0% 97.3% 96.9% 96.5% 97.3% 98.3% 98.1% 97.9% 85.3% 97.7% 98.4% 98.2%
Czech 98.7% 98.3% 98.9% 98.3% 97.5% 98.8% 99.3% 98.8% 80.8% 98.7% 99.5% 99.2%
Table 6: Results (accuracy) for the translation grading task.
average4 across all tasks (excluding translation) and
models, English has 70.9%, in contrast to 63.1%
(Hindi), 55.3% (Czech), 67.8% (Ukrainian), 49.7%
(Telugu), 66.2% (Farsi), 66.8% (German), 64.6%
(Mandarin) and 67.4% (Arabic). This in itself does
not make it clear if the loss is due to the LLMs be-
4Averaging here is done to give a general idea, but we must
note that the scores are not equivalent. We use Accuracy for
tasks 1 and 2 but Tutoring Score for Task 3
ing weak or the translation quality being poor. The
poor performance on Telugu is largely driven by
Command-A and Mistral. The former is unsurpris-
ing as Telugu is the only language in our list that is
not officially supported by it (Cohere et al., 2025).
On the other hand, Mistral lists only 12 supported
languages of which we test only Hindi, Arabic,
German and Chinese. Telugu also has the lowest
representation in CommonCrawl and Wikipedia,

-- 6 of 20 --

Language Questions Misconception Feedback Translation
Mandarin 0.593 0.607 0.666 0.781
Hindi 0.455 0.546 0.593 0.831
Arabic 0.596 0.659 0.605 0.793
German 0.623 0.642 0.697 0.792
Farsi 0.574 0.644 0.708 0.840
Telugu 0.518 0.569 0.578 0.639
Ukrainian 0.607 0.642 0.682 0.821
Czech 0.611 0.626 0.663 0.805
Table 7: Average COMETDA,XL
23 scores for different languages for different components of the tasks. The Questions
are used for both Misconception and Feedback tasks. The Tutoring task is not translated.
so the result is expected. Manual analysis of the
low tutoring performance for Czech reveals that
the interactions switch between various language
formality styles, to the point that it becomes dis-
tracting. Additionally, the language used in Czech
classrooms is particular and likely not represented
on the internet.
Model performance

Chunk 16 · 1,999 chars

expected. Manual analysis of the
low tutoring performance for Czech reveals that
the interactions switch between various language
formality styles, to the point that it becomes dis-
tracting. Additionally, the language used in Czech
classrooms is particular and likely not represented
on the internet.
Model performance and consistency. Mistral is
the most inconsistent across non-English languages
(average deviation5=0.186). For example, it com-
pletely fails the tutoring task for both Czech and
Telugu, despite performing reasonably on other
languages in the same task. Command-A is not
much better (average deviation5=0.161). On the
other hand, Gemini is the most consistent (aver-
age deviation5=0.078) and also has the second-best
performance (average score 75.0%). GPT4o, is
the best performing model (average 78.6%) while
Claude performs the worst (average 49.3%) mostly
due to Feedback and Tutoring tasks.
Task difficulty. The worst performance is ob-
served in the Feedback task despite the similarity
to the Misconception identification task. While
Claude is still the standout worst performer with a
worse-than-random performance, all models strug-
gle. Further analysis in Table 11 shows that all
models tended to default the feedback correspond-
ing to the correct answer, with the models’ chain
of thoughts being “regardless of the student’s mis-
take, this is the feedback that gives the student the
most information about the correct answer.” Most
models perform well in the Translation evaluation
task, with the accuracy being even higher than hu-
man annotators, who were presented with attention
checks with similar perturbations (Kocmi et al.,
2024; Zouhar et al., 2025). They also do well in the
Misconceptions task, with most percentage scores
5We calculate the standard deviation across the six lan-
guages for each task and then calculate the mean.
(at least in the English prompt setting) being in
the 90s. The tutoring task seems to have the most
inconsistent performance

Chunk 17 · 1,997 chars

t al., 2025). They also do well in the
Misconceptions task, with most percentage scores
5We calculate the standard deviation across the six lan-
guages for each task and then calculate the mean.
(at least in the English prompt setting) being in
the 90s. The tutoring task seems to have the most
inconsistent performance across models and lan-
guages. In general, all models struggle in Czech
and Telugu, while Claude struggles in all languages.
Avoiding telling seems to be the more challenging
part of the problem for all the models, although
success rates are not very consistent either.
English and translated prompts. Excluding
for the Tutoring task (which did not use native
prompts), using English prompts yields better per-
formance than using translated prompts (averages
72.7% and 67.2%). The exceptions to these are
GPT, Llama, Gemini, and Mistral in the transla-
tion task though in most cases, the difference is
not very large. Note that some of the poor per-
formance could be attributed to the prompts being
translated and checked for correctness rather than
being written in the target language directly, which
could introduce some translationese. Regardless,
we believe it is best to keep prompts in English. As
a further note for English-speaking developers de-
signing multilingual applications, keeping prompts
in English ensures that the chains-of-thought re-
main English, making it easier to run sanity checks.
4 Related Work
LLMs, trained on vast multilingual texts, have dom-
inated tasks such as text generation, translation, and
dialogue (Brown et al., 2020), making them promis-
ing tools in Intelligent Tutoring Systems (ITS; Cor-
bett et al., 1997; Pal Chowdhury et al., 2024). Prior
work explores their use in educational contexts,
such as dynamic student interactions (Schmucker
et al., 2023), simulating expert and novice behavior
(Liu et al., 2023), and math word problem reason-
ing (Opedal et al., 2023).
Beyond mathematical context, LLMs have also
been explored

Chunk 18 · 1,999 chars

Pal Chowdhury et al., 2024). Prior
work explores their use in educational contexts,
such as dynamic student interactions (Schmucker
et al., 2023), simulating expert and novice behavior
(Liu et al., 2023), and math word problem reason-
ing (Opedal et al., 2023).
Beyond mathematical context, LLMs have also
been explored for other forms of learning. Cui

-- 7 of 20 --

and Sachan (2023) investigate LLMs in adaptive
and personalized exercise generation for language
learners, while (Wang et al., 2023) examines how
conversational tutoring strategies can aid student
understanding. Additionally, LLMs have been
used to assess grammatical correctness and trans-
lation accuracy (Kocmi and Federmann, 2023;
Omelianchuk et al., 2024; Freitag et al., 2024), fa-
cilitate automated essay scoring (Pack et al., 2024),
and provide corrective feedback in second language
writing (Han et al., 2024). While LLMs excel
in English, their abilities in other languages of-
ten vary, reflecting an over-representation of high-
resource languages in pre-training corpora. For ex-
ample, Koto et al. (2023) introduces IndoMMLU,
which reveals significant performance disparities
between Indonesian and English contexts. Sim-
ilarly, Holtermann et al. (2024) examines LLMs
across 137 languages and attributes discrepancies
in performance to tokenisation strategies. Li et al.
(2024); Armengol-Estapé et al. (2022) further find a
strong correlation between pre-training data propor-
tions and performance, reaffirming the gap between
high- and low-resource languages. For Catalan,
Armengol-Estapé et al. (2022) find that while GPT-
3 performed well in generative tasks, its compre-
hension capabilities were limited by the language’s
moderate representation.
Recent research has increasingly explored the ap-
plication of LLMs in multilingual educational con-
texts, though challenges persist in balancing perfor-
mance across languages. Systematic reviews of AI-
based language learning tools highlight the preva-
lence

Chunk 19 · 1,997 chars

s were limited by the language’s
moderate representation.
Recent research has increasingly explored the ap-
plication of LLMs in multilingual educational con-
texts, though challenges persist in balancing perfor-
mance across languages. Systematic reviews of AI-
based language learning tools highlight the preva-
lence of NLP and machine learning techniques for
error correction, feedback provision, and assess-
ment in non-English contexts, though they note
persistent gaps in dialogic competence and teacher
preparedness (Alhusaiyan, 2025). Studies evalu-
ating LLMs’ cross-lingual capabilities reveal per-
formance disparities, with models demonstrating
stronger skill tagging accuracy for English-centric
curricula compared to underrepresented languages
like Irish or Marathi (Kwak and Pardos, 2024). Bib-
liometric analyses indicate growing research inter-
est in AI for foreign language education, particu-
larly in vocabulary acquisition and writing support,
though most studies still focus on high-resource
European and Asian languages (Do˘gan and Talan,
2024). These works collectively underscore both
the transformative potential and current limitations
of LLMs in achieving equitable multilingual edu-
cational support.
To address multilingual education more directly,
projects like Kaleidoscope (Salazar et al., 2025)
and Aya (Üstün et al., 2024) by Cohere For AI aim
to support culturally diverse languages, while SEA-
HELM (Susanto et al., 2025) and ECLeKTic (Gold-
man et al., 2025) emphasise culturally grounded
evaluations in Southeast Asian and cross-lingual
contexts, respectively. These efforts highlight the
need for multilingual benchmarks that move be-
yond English-centric evaluations.
Prior pedagogical studies tend to assess single
LLMs in monolingual settings. We fill this gap
by benchmarking LLMs in multiple tasks. Specif-
ically, we conduct zero-shot experiments across
multiple models and languages to better analyze
their real-world applicability.
5 Conclusion
We

Chunk 20 · 1,991 chars

ond English-centric evaluations.
Prior pedagogical studies tend to assess single
LLMs in monolingual settings. We fill this gap
by benchmarking LLMs in multiple tasks. Specif-
ically, we conduct zero-shot experiments across
multiple models and languages to better analyze
their real-world applicability.
5 Conclusion
We analyse the performance of six well-known
state-of-the-art LLMs across six languages other
than English on four educational tasks. We find
that while performance in English continues to be
better than in other languages, the drop to other
models is not always large. In particular, we find
that GPT4o and Gemini 2.0 perform consistently
well across all languages, with a few exceptions.
We also note that English prompts work as well, if
not better, than prompts written in the target lan-
guage, when solving multilingual tasks. This opens
up opportunities for porting applications developed
for English into different languages. However, we
note that certain models perform poorly in some
tasks and languages, so we recommend first verify-
ing that a model works well in a particular language
on a specific educational task before deployment.
However, to answer the question posed by the title,
we believe that atleast some language models are
reliable across languages.
Limitations
The shown experiments could naturally be better
extended to more languages. The selected lan-
guages reflect a balance between author familiar-
ity, which is necessary for meaningful qualitative
analysis, and linguistic diversity, as evidenced by
their spread in URIEL feature space. Similarly, we
only covered six LLMs. In both cases, the cost of
experiments (see Table 8) becomes prohibitively
expensive, which motivated the data release in this
paper to enable further research.
Additionally, translation quality remains a con-

-- 8 of 20 --

Model API Total Miconception Feedback Tutoring Translation
Mistral Mistral API $530 $170 $170 $120 $70
Claude Anthropic $600 $190 $190 $135

Chunk 21 · 1,999 chars

ble 8) becomes prohibitively
expensive, which motivated the data release in this
paper to enable further research.
Additionally, translation quality remains a con-

-- 8 of 20 --

Model API Total Miconception Feedback Tutoring Translation
Mistral Mistral API $530 $170 $170 $120 $70
Claude Anthropic $600 $190 $190 $135 $85
Command Cohere $520 $165 $165 $120 $70
Llama Together.ai $600 $190 $190 $135 $80
GPT4o Open AI $80 $25 $25 $18 $12
Gemini Google Genai $30 $10 $10 $6 $4
Table 8: Approximate costs for the experiments. Does not include taxes or currency conversion charges. The total is
about $2360 with approximately an additional $500 spent on preliminary experiments.
cern, as previously discussed. A more thorough
evaluation would involve human translations for ev-
ery task, similar to the MMLU multilingual bench-
mark (Xuan et al., 2025), but doing so for all our
tasks would be resource-intensive.
Finally, the set of tasks is not a complete repre-
sentation of problems in the education space, pri-
marily because most of the more complex tasks
lack well-defined language-agnostic metrics.
Acknowledgements
Sankalan Pal Chowdhury is partially funded by
the ETH-EPFL JDPLS Program. Donya Rooein
is supported by the European Research Council
(ERC) under the European Union’s Horizon 2020
research and innovation program (grant agreement
No. 949944, INTEGRATOR).
References
Kabir Ahuja, Harshita Diddee, Rishav Hada, Milli-
cent Ochieng, Krithika Ramesh, Prachi Jain, Ak-
shay Nambi, Tanuja Ganu, Sameer Segal, Mohamed
Ahmed, Kalika Bali, and Sunayana Sitaram. 2023.
MEGA: Multilingual evaluation of generative AI.
In Proceedings of the 2023 Conference on Empir-
ical Methods in Natural Language Processing, pages
4232–4267. Association for Computational Linguis-
tics.
Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma,
Ishaan Watts, Ashutosh Sathe, Millicent Ochieng,
Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika
Bali, and Sunayana Sitaram. 2024. MEGAVERSE:
Benchmarking large language

Chunk 22 · 1,989 chars

Methods in Natural Language Processing, pages
4232–4267. Association for Computational Linguis-
tics.
Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma,
Ishaan Watts, Ashutosh Sathe, Millicent Ochieng,
Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika
Bali, and Sunayana Sitaram. 2024. MEGAVERSE:
Benchmarking large language models across lan-
guages, modalities, models and tasks. In Proceed-
ings of the 2024 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies (Volume
1: Long Papers), pages 2598–2637. Association for
Computational Linguistics.
Mistral AI. 2024. Large enough: Introducing mistral
large 2. Accessed: 2024-09-08.
Bashar Alhafni, Sowmya Vajjala, Stefano Bannò,
Kaushal Kumar Maurya, and Ekaterina Kochmar.
2024. LLMs in education: Novel perspec-
tives, challenges, and opportunities. Preprint,
arXiv:2409.11917.
Eman Alhusaiyan. 2025. A systematic review of current
trends in artificial intelligence in foreign language
learning. Saudi Journal of Language Studies, 5(1):1–
16.
Abdullah M. Almasoud, Muhammad Rafay Naeem,
Muhammad Imran Taj, Ibrahim Ghaznavi, and Ju-
naid Qadir. 2025. Toward inclusive educational AI:
Auditing frontier LLMs through a multiplexity lens.
ArXiv, abs/2501.03259.
Tariq Alqahtani, H. Badreldin, Mohammed A. Alrashed,
Abdulrahman I. Alshaya, S. Alghamdi, Khalid Bin
saleh, Shuroug A. Alowais, Omar A. Alshaya, I. Rah-
man, Majed S Al Yami, and Abdulkareem M. Al-
bekairy. 2023. The emergent role of artificial intel-
ligence, natural learning processing, and large lan-
guage models in higher education and research. Re-
search in social & administrative pharmacy : RSAP.
Anthropic. 2024. Introducing claude 3.5 sonnet. Ac-
cessed: 2024-09-08.
Sabrina Argoub. 2022. The NLP divide: English is not
the only natural language - polis.
Jordi Armengol-Estapé, Ona de Gibert Bonet, and Maite
Melero. 2022. On the multilingual capabilities of
very large-scale English language models. In

Chunk 23 · 1,982 chars

cy : RSAP.
Anthropic. 2024. Introducing claude 3.5 sonnet. Ac-
cessed: 2024-09-08.
Sabrina Argoub. 2022. The NLP divide: English is not
the only natural language - polis.
Jordi Armengol-Estapé, Ona de Gibert Bonet, and Maite
Melero. 2022. On the multilingual capabilities of
very large-scale English language models. In Pro-
ceedings of the Thirteenth Language Resources and
Evaluation Conference, pages 3056–3068. European
Language Resources Association.
Benjamin S. Bloom. 1984. The 2 sigma problem: The
search for methods of group instruction as effec-
tive as one-to-one tutoring. Educational Researcher,
13(6):4–16.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, and 12 others. 2020. Lan-
guage models are few-shot learners. Preprint,
arXiv:2005.14165.

-- 9 of 20 --

Peter A. Cohen, James A. Kulik, and Chen-Lin C. Kulik.
1982. Educational outcomes of tutoring: A meta-
analysis of findings. American Educational Research
Journal, 19(2):237–248.
Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay
Alammar, Yazeed Alnumay, Sophia Althammer,
Arkady Arkhangorodsky, Viraat Aryabumi, Dennis
Aumiller, Raphaël Avalos, and 1 others. 2025. Com-
mand a: An enterprise-ready large language model.
arXiv preprint arXiv:2504.00698.
Albert T. Corbett, Kenneth R. Koedinger, and John R.
Anderson. 1997. Chapter 37 - intelligent tutoring
systems. In Marting G. Helander, Thomas K. Lan-
dauer, and Prasad V. Prabhu, editors, Handbook of
Human-Computer Interaction (Second Edition), sec-
ond edition edition, pages 849–874. North-Holland,
Amsterdam.
Peng Cui and Mrinmaya Sachan. 2023. Adaptive and
personalized exercise generation for online language
learning. In Proceedings of the 61st Annual Meet-
ing of the Association for Computational

Chunk 24 · 1,994 chars

s, Handbook of
Human-Computer Interaction (Second Edition), sec-
ond edition edition, pages 849–874. North-Holland,
Amsterdam.
Peng Cui and Mrinmaya Sachan. 2023. Adaptive and
personalized exercise generation for online language
learning. In Proceedings of the 61st Annual Meet-
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 10184–10198. Asso-
ciation for Computational Linguistics.
Yunus Do˘gan and Tarık Talan. 2024. Artificial intelli-
gence in foreign language learning: A bibliometric
analysis. Journal of Pedagogical Research, 9(2):206–
230.
David M. Eberhard, Gary F. Simons, and Charles D.
Fennig, editors. 2025. Ethnologue: Languages of the
World, 28 edition. SIL International, Dallas, Texas.
Online version: http://www.ethnologue.com.
Wikimedia Foundation. 2024. List of wikipedias by
language group. Accessed: 2024-09-08.
Markus Freitag, Nitika Mathur, Daniel Deutsch, Chi-
Kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian
Thompson, Frederic Blain, Tom Kocmi, Jiayi Wang,
David Ifeoluwa Adelani, Marianna Buchicchio,
Chrysoula Zerva, and Alon Lavie. 2024. Are LLMs
breaking MT metrics? results of the WMT24 metrics
shared task. In Proceedings of the Ninth Conference
on Machine Translation, pages 47–81. Association
for Computational Linguistics.
Ilie Gligorea, Marius Cioca, Romana Oancea, A. Gorski,
Hortensia Gorski, and Paul Tudorache. 2023. Adap-
tive learning using artificial intelligence in e-learning:
A literature review. Education Sciences.
Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger,
Avinatan Hassidim, Yossi Matias, Joshua Maynez,
Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani,
Laura Rimell, Idan Szpektor, Reut Tsarfaty, and
Matan Eyal. 2025. ECLeKTic: A novel challenge
set for evaluation of cross-lingual knowledge transfer.
Preprint, arXiv:2502.21228.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri,
Abhinav Pandey, Abhishek Kadian, Ahmad Al-
Dahle, Aiesha Letman, Akhil Mathur, Alan Schel-
ten, Alex Vaughan, Amy

Chunk 25 · 1,988 chars

tor, Reut Tsarfaty, and
Matan Eyal. 2025. ECLeKTic: A novel challenge
set for evaluation of cross-lingual knowledge transfer.
Preprint, arXiv:2502.21228.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri,
Abhinav Pandey, Abhishek Kadian, Ahmad Al-
Dahle, Aiesha Letman, Akhil Mathur, Alan Schel-
ten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh
Goyal, Anthony Hartshorn, Aobo Yang, Archi Mi-
tra, Archie Sravankumar, Artem Korenev, Arthur
Hinsvark, and 542 others. 2024. The llama 3 herd of
models. Preprint, arXiv:2407.21783.
Albert Gu and Tri Dao. 2023. Mamba: Linear-
Time sequence modeling with selective state spaces.
Preprint, arXiv:2312.00752.
Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim,
Hyunseung Lim, Yoonsu Kim, Tak Yeon Lee, Hwa-
jung Hong, Juho Kim, So-Yeon Ahn, and Alice Oh.
2024. LLM-as-a-tutor in EFL writing education: Fo-
cusing on evaluation of student-LLM interaction. In
Proceedings of the 1st Workshop on Customizable
NLP: Progress and Challenges in Customizing NLP
for a Domain, Application, Group, or Individual
(CustomNLP4U), pages 284–293. Association for
Computational Linguistics.
Carolin Holtermann, Paul Röttger, Timm Dill, and Anne
Lauscher. 2024. Evaluating the elementary multi-
lingual capabilities of large language models with
multiq. Preprint, arXiv:2403.03814.
Haoyang Huang, Tianyi Tang, Dongdong Zhang,
Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei.
2023. Not all languages are created equal in LLMs:
Improving multilingual capability by cross-lingual-
thought prompting. Preprint, arXiv:2305.07004.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, Lélio Renard Lavaud,
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao,
Thibaut Lavril, Thomas Wang, Timothée Lacroix,
and William El Sayed. 2023. Mistral 7b. Preprint,
arXiv:2310.06825.
Enkelejda Kasneci, Kathrin Seßler, S. Küchemann,
M. Bannert, Daryna

Chunk 26 · 1,995 chars

ian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, Lélio Renard Lavaud,
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao,
Thibaut Lavril, Thomas Wang, Timothée Lacroix,
and William El Sayed. 2023. Mistral 7b. Preprint,
arXiv:2310.06825.
Enkelejda Kasneci, Kathrin Seßler, S. Küchemann,
M. Bannert, Daryna Dementieva, F. Fischer, Urs
Gasser, G. Groh, Stephan Günnemann, Eyke Hüller-
meier, Stephan Krusche, Gitta Kutyniok, Tilman
Michaeli, Claudia Nerdel, J. Pfeffer, Oleksandra Po-
quet, Michael Sailer, Albrecht Schmidt, T. Seidel,
and 4 others. 2023. ChatGPT for good? on oppor-
tunities and challenges of large language models for
education. Learning and Individual Differences.
Blanka Klimova, Marcel Pikhart, and Liqaa Habeb Al-
Obaydi. 2024. Exploring the potential of ChatGPT
for foreign language education at the university level.
Frontiers in Psychology, 15:1269319.
Tom Kocmi and Christian Federmann. 2023. Large lan-
guage models are state-of-the-art evaluators of trans-
lation quality. In Proceedings of the 24th Annual
Conference of the European Association for Machine
Translation, pages 193–203. European Association
for Machine Translation.
Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis,
Roman Grundkiewicz, Marzena Karpinska, Maja
Popovi´c, Mrinmaya Sachan, and Mariya Shmatova.
2024. Error span annotation: A balanced approach

-- 10 of 20 --

for human evaluation of machine translation. In Pro-
ceedings of the Ninth Conference on Machine Trans-
lation, pages 1440–1453. Association for Computa-
tional Linguistics.
Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Bald-
win. 2023. Large language models only pass primary
school exams in indonesia: A comprehensive test on
IndoMMLU. Preprint, arXiv:2310.04928.
Yerin Kwak and Zachary A. Pardos. 2024. Bridging
large language model disparities: Skill tagging of
multilingual educational content. British Journal of
Educational Technology, 55(5):2039–2057.
Viet Dac Lai, Nghia Ngo, Amir Pouran Ben Veyseh,
Hieu

Chunk 27 · 1,998 chars

ndonesia: A comprehensive test on
IndoMMLU. Preprint, arXiv:2310.04928.
Yerin Kwak and Zachary A. Pardos. 2024. Bridging
large language model disparities: Skill tagging of
multilingual educational content. British Journal of
Educational Technology, 55(5):2039–2057.
Viet Dac Lai, Nghia Ngo, Amir Pouran Ben Veyseh,
Hieu Man, Franck Dernoncourt, Trung Bui, and
Thien Huu Nguyen. 2023. ChatGPT beyond En-
glish: Towards a comprehensive evaluation of large
language models in multilingual learning. In Find-
ings of the Association for Computational Linguistics:
EMNLP 2023, pages 13171–13189. Association for
Computational Linguistics.
Maikel Leon. 2024. Leveraging generative AI for on-
demand tutoring as a new paradigm in education.
International Journal on Cybernetics & Informatics.
Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani,
Ninghao Liu, and Mengnan Du. 2024. Quantifying
multilingual performance of large language models
across languages. Preprint, arXiv:2404.11553.
Patrick Littell, David R. Mortensen, Ke Lin, Katherine
Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL
and lang2vec: Representing languages as typological,
geographical, and phylogenetic vectors. In Proceed-
ings of the 15th Conference of the European Chap-
ter of the Association for Computational Linguistics:
Volume 2, Short Papers, pages 8–14. Association for
Computational Linguistics.
Naiming Liu, Shashank Sonkar, Zichao Wang, Simon
Woodhead, and Richard G. Baraniuk. 2023. Novice
learner and expert tutor: Evaluating math reasoning
abilities of large language models with misconcep-
tions. Preprint, arXiv:2310.02439.
Itai Mondshine, Tzuf Paz-Argaman, Asaf
Achi Mordechai, and Reut Tsarfaty. 2024. HeSum: a
novel dataset for abstractive text summarization in
Hebrew. In Proceedings of the Seventh Workshop
on Technologies for Machine Translation of Low-
Resource Languages (LoResMT 2024), pages 26–36.
Association for Computational Linguistics.
Maggie A. Mosher, Lisa Dieker, and Rebecca Hines.
2024.

Chunk 28 · 1,989 chars

Tsarfaty. 2024. HeSum: a
novel dataset for abstractive text summarization in
Hebrew. In Proceedings of the Seventh Workshop
on Technologies for Machine Translation of Low-
Resource Languages (LoResMT 2024), pages 26–36.
Association for Computational Linguistics.
Maggie A. Mosher, Lisa Dieker, and Rebecca Hines.
2024. The past, present, and future use of artificial
intelligence in teacher education. Journal of Special
Education Preparation.
Ben Naismith, Phoebe Mulcaire, and Jill Burstein. 2023.
Automated evaluation of written discourse coherence
using GPT-4. In Proceedings of the 18th Workshop
on Innovative Use of NLP for Building Educational
Applications (BEA 2023), pages 394–403. Associa-
tion for Computational Linguistics.
Nvidia. 2022. Transformer model.
Kostiantyn Omelianchuk, Andrii Liubonko, Oleksandr
Skurzhanskyi, Artem Chernodub, Oleksandr Korni-
ienko, and Igor Samokhin. 2024. Pillars of gram-
matical error correction: Comprehensive inspection
of contemporary approaches in the era of large lan-
guage models. In Proceedings of the 19th Workshop
on Innovative Use of NLP for Building Educational
Applications (BEA 2024), pages 17–33. Association
for Computational Linguistics.
Andreas Opedal, Niklas Stoehr, Abulhair Saparov, and
Mrinmaya Sachan. 2023. World models for math
story problems. Preprint, arXiv:2306.04347.
OpenAI. 2019. Language models are unsupervised mul-
titask learners.
OpenAI. 2023. GPT-4 technical report. Preprint,
arXiv:2303.08774.
Austin Pack, Alex Barrett, and Juan Escalante. 2024.
Large language models and automated essay scoring
of english language learner writing: Insights into
validity and reliability. Computers and Education:
Artificial Intelligence, 6:100234.
Sankalan Pal Chowdhury, Vilém Zouhar, and Mrinmaya
Sachan. 2024. Autotutor meets large language mod-
els: A language model tutor with rich pedagogy and
guardrails. In Proceedings of the Eleventh ACM Con-
ference on Learning @ Scale, L@S ’24, page 5–15,
New York, NY, USA.

Chunk 29 · 1,994 chars

d Education:
Artificial Intelligence, 6:100234.
Sankalan Pal Chowdhury, Vilém Zouhar, and Mrinmaya
Sachan. 2024. Autotutor meets large language mod-
els: A language model tutor with rich pedagogy and
guardrails. In Proceedings of the Eleventh ACM Con-
ference on Learning @ Scale, L@S ’24, page 5–15,
New York, NY, USA. Association for Computing
Machinery.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318. Association for
Computational Linguistics.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Al-
balak, Samuel Arcadinho, Stella Biderman, Huanqi
Cao, Xin Cheng, Michael Chung, Matteo Grella,
Kranthi Kiran GV, Xuzheng He, Haowen Hou, Ji-
aju Lin, Przemyslaw Kazienko, Jan Kocon, Jiaming
Kong, Bartlomiej Koptyra, Hayden Lau, and 15 oth-
ers. 2023. RWKV: Reinventing RNNs for the trans-
former era. Preprint, arXiv:2305.13048.
Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop
Kang. 2023. CoEdIT: Text editing by task-specific
instruction tuning. Preprint, arXiv:2305.09857.
Ricardo Rei, Nuno M. Guerreiro, Jos
textasciitilde A© Pombal, Daan van Stigt, Mar-
cos Treviso, Luisa Coheur, José G. C. de Souza,
and André Martins. 2023. Scaling up CometKiwi:
Unbabel-IST 2023 submission for the quality esti-
mation shared task. In Proceedings of the Eighth
Conference on Machine Translation, pages 841–848.
Association for Computational Linguistics.

-- 11 of 20 --

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon
Lavie. 2020. COMET: A neural framework for MT
evaluation. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Process-
ing (EMNLP), pages 2685–2702. Association for
Computational Linguistics.
Donya Rooein, Paul Röttger, Anastassia Shaitarova, and
Dirk Hovy. 2024. Beyond flesch-kincaid: Prompt-
based metrics improve difficulty classification of ed-
ucational

Chunk 30 · 1,999 chars

dings of the 2020 Conference
on Empirical Methods in Natural Language Process-
ing (EMNLP), pages 2685–2702. Association for
Computational Linguistics.
Donya Rooein, Paul Röttger, Anastassia Shaitarova, and
Dirk Hovy. 2024. Beyond flesch-kincaid: Prompt-
based metrics improve difficulty classification of ed-
ucational texts. In Proceedings of the 19th Workshop
on Innovative Use of NLP for Building Educational
Applications (BEA 2024), pages 54–67. Association
for Computational Linguistics.
Konstantinos I. Roumeliotis, Nikolaos D. Tselikas, and
Dimitrios K. Nasiopoulos. 2023. Llama 2: Early
adopters’ utilization of meta’s new open-source pre-
trained model. Preprints.
Sebastian Ruder, Ivan Vuli´c, and Anders Søgaard.
2022. Square one bias in NLP: Towards a multi-
dimensional exploration of the research manifold. In
Findings of the Association for Computational Lin-
guistics: ACL 2022, pages 2340–2354. Association
for Computational Linguistics.
Israfel Salazar, Manuel Fernández Burda, Shayekh Bin
Islam, Arshia Soltani Moakhar, Shivalika Singh,
Fabian Farestam, Angelika Romanou, Danylo
Boiko, Dipika Khullar, Mike Zhang, Dominik
Krzemi´nski, Jekaterina Novikova, Luísa Shimabu-
coro, Joseph Marvin Imperial, Rishabh Maheshwary,
Sharad Duwal, Alfonso Amayuelas, Swati Rajwal,
Jebish Purbey, and 25 others. 2025. Kaleidoscope:
In-language exams for massively multilingual vision
evaluation. Preprint, arXiv:2504.07072.
Robin Schmucker, Meng Xia, Amos Azaria, and Tom
Mitchell. 2023. Ruffle&riley: Towards the auto-
mated induction of conversational tutoring systems.
Preprint, arXiv:2310.01420.
Burr Settles. 2018. Data for the 2018 duolingo
shared task on second language acquisition modeling
(SLAM).
Yosephine Susanto, Adithya Venkatadri Hulagadri,
Jann Railey Montalan, Jian Gang Ngui, Xian Bin
Yong, Weiqi Leong, Hamsawardhini Rengara-
jan, Peerat Limkonchotiwat, Yifan Mai, and
William Chandra Tjhi. 2025. SEA-HELM: South-
east asian holistic evaluation of language models.
Preprint,

Chunk 31 · 1,998 chars

age acquisition modeling
(SLAM).
Yosephine Susanto, Adithya Venkatadri Hulagadri,
Jann Railey Montalan, Jian Gang Ngui, Xian Bin
Yong, Weiqi Leong, Hamsawardhini Rengara-
jan, Peerat Limkonchotiwat, Yifan Mai, and
William Chandra Tjhi. 2025. SEA-HELM: South-
east asian holistic evaluation of language models.
Preprint, arXiv:2502.14301.
Gemini Team. 2024. Gemini: A family of highly capa-
ble multimodal models. Preprint, arXiv:2312.11805.
Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin
Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhan-
dari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Fred-
die Vargus, Phil Blunsom, Shayne Longpre, Niklas
Muennighoff, Marzieh Fadaee, Julia Kreutzer, and
Sara Hooker. 2024. Aya model: An instruction fine-
tuned open-access multilingual language model. In
Proceedings of the 62nd Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 15894–15939. Association for
Computational Linguistics.
Lingzhi Wang, Mrinmaya Sachan, Xingshan Zeng, and
Kam-Fai Wong. 2023. Strategize before teaching: A
conversational tutoring system with pedagogy self-
distillation. In Findings of the Association for Com-
putational Linguistics: EACL 2023, pages 2268–
2274. Association for Computational Linguistics.
Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng,
Yunze Xiao, Yun Xing, Junjue Wang, Huitao Li,
Xin Li, Kunyu Yu, Nan Liu, Qingyu Chen, Dou-
glas Teodoro, Edison Marrese-Taylor, Shijian Lu,
Yusuke Iwasawa, Yutaka Matsuo, and Irene Li. 2025.
MMLU-ProX: A multilingual benchmark for ad-
vanced large language model evaluation. Preprint,
arXiv:2503.10497.
Tiffany Zhu, Kexun Zhang, and William Yang Wang.
2024. Embracing AI in education: Understanding
the surge in large language model use by secondary
students. Preprint, arXiv:2411.18708.
Vilém Zouhar, Tom Kocmi, and Mrinmaya Sachan.
2025. AI-assisted human evaluation of machine
translation. Preprint, arXiv:2406.12419.

-- 12 of 20 --

English prompt Translated prompt
Language

Chunk 32 · 1,996 chars

bracing AI in education: Understanding
the surge in large language model use by secondary
students. Preprint, arXiv:2411.18708.
Vilém Zouhar, Tom Kocmi, and Mrinmaya Sachan.
2025. AI-assisted human evaluation of machine
translation. Preprint, arXiv:2406.12419.

-- 12 of 20 --

English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English 0.0% 0.7% 0.1% 2.1% 0.0% 0.9% 0.0% 0.7% 0.1% 2.1% 0.0% 0.9%
Mandarin 0.5% 1.6% 0.0% 3.2% 0.0% 0.2% 0.5% 1.6% 0.0% 3.2% 0.0% 0.2%
Hindi 0.0% 1.6% 0.4% 2.3% 0.0% 0.4% 0.0% 0.9% 0.2% 2.5% 0.0% 0.1%
Arabic 0.1% 1.7% 0.2% 2.1% 0.0% 0.2% 0.0% 1.1% 0.3% 2.2% 0.0% 0.1%
German 0.5% 1.6% 0.3% 2.3% 0.0% 0.2% 0.5% 1.6% 0.3% 2.3% 0.0% 0.2%
Farsi 0.0% 1.8% 0.2% 2.0% 0.0% 0.3% 0.0% 1.6% 0.2% 2.9% 0.0% 0.1%
Telugu 0.0% 0.1% 0.0% 2.2% 0.0% 0.4% 0.0% 0.3% 0.1% 1.7% 0.0% 0.0%
Ukranian 0.1% 1.6% 0.1% 2.2% 0.0% 0.3% 0.0% 1.8% 0.4% 1.6% 0.0% 0.1%
Czech 0.1% 1.6% 0.1% 1.9% 0.0% 0.7% 0.0% 1.4% 0.0% 0.7% 0.0% 0.5%
Table 9: Response error rate for the misconception identification task.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English 0.0% 0.3% 0.0% 1.3% 0.0% 0.0% 0.0% 0.3% 0.0% 1.3% 0.0% 0.0%
Mandarin 0.0% 0.1% 0.0% 1.5% 0.0% 0.2% 0.0% 0.0% 0.1% 1.6% 0.0% 0.1%
Hindi 0.0% 0.0% 0.0% 1.1% 0.0% 0.1% 0.0% 0.0% 0.0% 1.0% 0.0% 0.2%
Arabic 0.0% 0.0% 0.0% 1.5% 0.0% 0.1% 0.0% 0.0% 0.0% 2.1% 0.0% 0.2%
German 0.0% 0.0% 0.0% 1.1% 0.0% 0.1% 0.0% 0.0% 0.0% 0.8% 0.0% 0.2%
Farsi 0.0% 0.0% 0.0% 1.2% 0.0% 0.0% 0.0% 0.0% 0.0% 1.1% 0.0% 0.1%
Telugu 0.0% 0.0% 0.0% 1.7% 0.0% 0.1% 0.0% 0.2% 0.0% 1.8% 0.0% 0.1%
Ukranian 0.0% 0.0% 0.0% 0.9% 0.0% 0.0% 0.0% 0.0% 0.0% 1.3% 0.0% 0.0%
Czech 0.0% 0.0% 0.0% 1.1% 0.0% 0.0% 0.0% 0.0% 0.0% 3.0% 0.0% 0.0%
Table 10: Response error rate for the feedback selection task.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English

Chunk 33 · 1,994 chars

0.0% 0.0% 0.0% 0.9% 0.0% 0.0% 0.0% 0.0% 0.0% 1.3% 0.0% 0.0%
Czech 0.0% 0.0% 0.0% 1.1% 0.0% 0.0% 0.0% 0.0% 0.0% 3.0% 0.0% 0.0%
Table 10: Response error rate for the feedback selection task.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
English 23.7% 45.8% 75.0% 27.8% 23.0% 35.1% 23.7% 45.8% 75.0% 27.8% 23.0% 35.1%
Mandarin 26.9% 54.5% 81.5% 36.9% 33.8% 47.7% 32.6% 71.9% 89.7% 24.3% 49.3% 54.5%
Hindi 30.3% 42.3% 79.6% 34.9% 29.9% 47.9% 55.5% 78.9% 87.7% 33.2% 68.9% 71.3%
Arabic 28.0% 54.1% 79.5% 35.3% 32.9% 44.0% 21.9% 81.7% 73.6% 22.4% 36.4% 49.5%
German 25.1% 48.8% 79.4% 32.7% 30.4% 45.4% 22.3% 54.5% 77.0% 29.6% 32.9% 36.0%
Farsi 28.5% 52.5% 82.5% 31.6% 32.6% 45.5% 21.6% 52.1% 75.1% 29.3% 30.3% 28.9%
Telugu 29.2% 55.6% 81.5% 35.2% 33.1% 52.4% 78.3% 73.8% 89.5% 37.9% 70.9% 78.8%
Ukranian 27.3% 49.4% 80.3% 32.7% 33.5% 45.2% 47.3% 69.7% 87.1% 20.7% 46.7% 49.7%
Czech 27.9% 39.5% 80.2% 30.2% 31.8% 49.0% 34.9% 59.5% 67.8% 23.0% 33.1% 38.0%
Table 11: Rate of defaulting to the correct answer for the feedback selection task.
English prompt Translated prompt
Language GPT4o LLama Claude Gemini Mistral Cmd-A GPT4o LLama Claude Gemini Mistral Cmd-A
Mandarin 0.0% 0.1% 0.5% 0.0% 0.0% 0.0% 0.0% 0.1% 4.2% 0.0% 0.0% 0.0%
Hindi 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.2% 0.0% 0.0% 0.0%
Arabic 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 3.7% 0.0% 0.0% 0.0%
German 0.0% 0.0% 0.1% 0.0% 0.0% 0.0% 0.0% 0.0% 3.5% 0.0% 0.0% 0.0%
Farsi 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Telugu 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.5% 0.0% 0.0% 0.0%
Ukranian 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 1.5% 0.0% 0.0% 0.0%
Czech 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 2.8% 0.0% 0.0% 0.0%
Table 12: Response error rate for the translation grading task.

-- 13 of 20 --

A Experiment Prompts
A.1 Task: Misconception Identification
We used a sequence of 3 prompts:
System prompt:
You are an expert math tutor who knows

Chunk 34 · 1,996 chars

0.0% 1.5% 0.0% 0.0% 0.0%
Czech 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 2.8% 0.0% 0.0% 0.0%
Table 12: Response error rate for the translation grading task.

-- 13 of 20 --

A Experiment Prompts
A.1 Task: Misconception Identification
We used a sequence of 3 prompts:
System prompt:
You are an expert math tutor who knows about all grade-school level math misconceptions. Your task is
to select the accurate type of misconceptions your student has based on the (incorrect) answer he/
she gives to a multiple-choice math question. You will be given 4 misconceptions types. Your selected
misconception type should correspond to the given question and answer. Explain your reasoning
User message 1:
Question: {QUESTION}
Selected Answer: {SELECTED_ANSWER}
Misconceptions:
A. {Misconception 1}
B. {Misconception 2}
C. {Misconception 3}
D. {Misconception 4}
The position of the Misconception corresponding to the selected answer rotates from question to question.
The subsequent assistant message is stored as the chain-of-thought. Thereafter, we sent the second user
message.
User message 2:
Now based on your above explanation, output the option corresponding to the correct misconception.
Only say 'A', 'B', 'C', or 'D' without any other text. Do not say anything else.
The response to this part is the final answer. We regenerate until an answer of ‘A’, ‘B’, ‘C’, or ‘D’ is
received, up to 20 times. If no answer is received, a response of ‘E’ is saved.
This method is used for all models except Gemini. In case of Gemini, we use the generate_content
method, which is recommended for non-chat tasks and allows for a single user message. In this case, after
obtaining the chain-of-thought, we make a new query with the same system prompt but with the following
user message:
Gemini message:
You have previously given the following answer and explanation:
{COT}
Now based on your above explanation, output the option corresponding to the correct misconception.
Only say 'A', 'B', 'C', or 'D' without any

Chunk 35 · 1,994 chars

ought, we make a new query with the same system prompt but with the following
user message:
Gemini message:
You have previously given the following answer and explanation:
{COT}
Now based on your above explanation, output the option corresponding to the correct misconception.
Only say 'A', 'B', 'C', or 'D' without any other text. Do not say anything else.
Note that the last part is identical to User Message 2
When using translated prompts, the System Prompt and, User Message 2 and Gemini Message are
translated to the target language.

-- 14 of 20 --

A.2 Task: Feedback Selection
System prompt:
You are an expert math tutor who specialises in providing precise and helpful feedback for grade-
school level math questions. Your task is to select the correct explanation for a student's given
answer to a multiple-choice math question.
You will be provided with:
- A math question
- A specific answer chosen by the student (which can be correct or incorrect).
- Four possible explanations (labelled A, B, C, and D).
Your selected explanation should accurately correspond to the given answer. Provide your reasoning
for selecting the explanation.
User message 1:
Question: {QUESTION}
Selected Answer: {SELECTED_ANSWER}
Feedbacks:
A. {Feedback 1}
B. {Feedback 2}
C. {Feedback 3}
D. {Feedback 4}
The position of the Feedback corresponding to the selected answer rotates from question to question. If
it is placed at positions A, B, or C, the feedback corresponding to the correct answer is at position D.
Otherwise, it is at C. The subsequent assistant message is stored as the chain-of-thought. Thereafter, we
sent the second user message.
User message 2:
Now based on your above explanation, output the option corresponding to the correct explanation. Only
say 'A', 'B', 'C', or 'D' without any other text. Do not say anything else.
The response to this part is the final answer. We regenerate until an answer of ‘A’, ‘B’, ‘C’, or ‘D’ is
received, up to 20 times. If no answer is received, a

Chunk 36 · 1,992 chars

above explanation, output the option corresponding to the correct explanation. Only
say 'A', 'B', 'C', or 'D' without any other text. Do not say anything else.
The response to this part is the final answer. We regenerate until an answer of ‘A’, ‘B’, ‘C’, or ‘D’ is
received, up to 20 times. If no answer is received, a response of ‘E’ is saved.
This method is used for all models except Gemini. In case of Gemini, we use the generate_content
method, which is recommended for non-chat tasks and allows for a single user message. In this case, after
obtaining the chain-of-thought, we make a new query with the same system prompt but with the following
user message:
Gemini message:
You have previously given the following answer and explanation:
{COT}
Now based on your above explanation, output the option corresponding to the correct explanation. Only
say 'A', 'B', 'C', or 'D' without any other text. Do not say anything else.
Note that the last part is identical to User Message 2. When using translated prompts, the System Prompt,
User Message 2, and Gemini Message are translated to the target language. We manually made sure that
the formatting was maintained after the translation.

-- 15 of 20 --

A.3 Task: Tutoring
Student system prompt:
Student Persona: {STUDENT_PERSONA}
Math problem: {MATH_PROBLEM}
Student solution: {STUDENT_SOLUTION}
Context: You need to role-play the student, {STUDENT_NAME}, while the user roleplays the tutor. {
STUDENT_NAME} thinks their answer is correct. Only when the teacher provides several good reasoning
questions, {STUDENT_NAME} understands the problem and corrects the solution. {STUDENT_NAME} can use
calculator and thus makes no calculation errors. Send <EOM> tag at end of the student message.
Teacher system prompt:
A tutor and a student work together to solve the following math word problem.
Math problem: {MATH_PROBLEM}
The correct solution is as follows:
{GROUND_TRUTH}
You need to role-play the tutor while the user roleplays the student,

Chunk 37 · 1,975 chars

no calculation errors. Send <EOM> tag at end of the student message.
Teacher system prompt:
A tutor and a student work together to solve the following math word problem.
Math problem: {MATH_PROBLEM}
The correct solution is as follows:
{GROUND_TRUTH}
You need to role-play the tutor while the user roleplays the student, {STUDENT_NAME}. The tutor is a
soft-spoken empathetic man who dislikes giving out direct answers to students, and instead likes to
answer questions with other questions that would help the student understand the concepts, so that
she can solve the problem themselves.
{STUDENT_NAME} has come up with a solution, but it is incorrect. Please start the conversation, one
line at a time, aiming to figure out what is {STUDENT_NAME}'s solution and what is wrong with it.
Then try to get her to fix it.
The dialogue history was formatted as user-assistant message pairs for teacher and student roles. We
manually set the initial messages to initiate conversations in the target language.

-- 16 of 20 --

A.4 Task: Translation Grading
System prompt:
You are a language translation evaluator. Your task is to assess the quality of a translation from
English to {LANGUAGE}. You will be provided with two sentences:
1. An original English sentence.
2. A translated sentence in {LANGUAGE}.
Your goal is to rate the translation on a scale from 1 to 5 based on the following criteria:
1: The translation is incorrect, incomprehensible, or completely unrelated to the original English
sentence.
2: The translation has significant errors and distorts the meaning of the original English sentence.
3: The translation is understandable but contains notable errors or awkward phrasing.
4: The translation is mostly accurate with minor errors or slightly awkward phrasing.
5: The translation is fluent, natural, and accurately conveys the meaning of the original English
sentence without errors.
Explain your decision
User message 1:
English: {ENGLISH_SENTENCE}
{LANGUAGE}:

Chunk 38 · 1,995 chars

able errors or awkward phrasing.
4: The translation is mostly accurate with minor errors or slightly awkward phrasing.
5: The translation is fluent, natural, and accurately conveys the meaning of the original English
sentence without errors.
Explain your decision
User message 1:
English: {ENGLISH_SENTENCE}
{LANGUAGE}: {TRANSLATED_SENTENCE}
The subsequent assistant message is stored as the chain-of-thought. Thereafter, we sent the second user
message.
User message 2:
Now based on your above explanation, output the final score from 1 to 5. Only say '1', '2', '3', '4',
or '5' without any other text. Do not say anything else.
The response to this part is the final answer. We regenerate until an answer of ‘1’, ‘2’, ‘3’, ‘4’, or ‘5’ is
received, up to 20 times. If no answer is received, a response of ‘0’ is saved.
This method is used for all models except Gemini. In case of Gemini, we use the generate_content
method, which is recommended for non-chat tasks and allows for a single user message. In this case, after
obtaining the chain-of-thought, we make a new query with the same system prompt but with the following
user message:
Gemini message:
You have previously given the following answer and explanation:
{COT}
Now based on your above explanation, output the final score from 1 to 5. Only say '1', '2', '3', '4',
or '5' without any other text. Do not say anything else.
Note that the last part is identical to User Message 2
This sequence is repeated twice for each sentence, once with the original translation and once
with the perturbed translation. The scores are then compared. When using English prompts, the
LANGUAGE fields are set to their English exonyms, i.e., Mandarin, Hindi, Arabic, German, Farsi,
Telugu, Ukrainian, and Czech. When using translated prompts, the System Prompt, User Mes-
sage 2, and Gemini Message are translated to the target language. We manually made sure that
the formatting was maintained after the translation. We also use the language endonyms,

Chunk 39 · 1,999 chars

s, i.e., Mandarin, Hindi, Arabic, German, Farsi,
Telugu, Ukrainian, and Czech. When using translated prompts, the System Prompt, User Mes-
sage 2, and Gemini Message are translated to the target language. We manually made sure that
the formatting was maintained after the translation. We also use the language endonyms, namely
.

-- 17 of 20 --

B Translation Quality
As we mentioned in Limitations, an LLM performing poorly in a given language does not necessarily
mean that the LLM itself is bad. It could also mean that information was lost during translation. This is
particularly problematic because the machine translation systems likely suffer from the same resource
limitations that plague the LLMs in the first place. As such, we manually investigated a small subset of
translated questions for the languages they we are fluent in, namely Persian, Arabic, Czech, and Hindi.
For each language, we analysed 10 questions each for the Feedback and Misconception tasks, and 20
questions for the Translation Grading task.
In the case of Persian, the only recurring error was with mathematical notation, particularly that the
minus sign gets placed to the right of the numbers instead of the left, where it should be. This, however,
seems to be a rendering issue, which is a result of the fact that the minus sign (‘−’, U+2212) is often
replaced by the similar-looking hyphen (‘-’, U+002D), confusing the rendering program into believing
that it is rendering text. This should not be an issue since LLMs take raw Unicode encodings as input.
Beyond this, there were some minor tense errors, but the meanings were clear.
The issue with sign placement was also observed in Arabic. In addition, there seem to be some
translation errors. For example, the word ‘travel’ used here in the context of the movement of a graph was
translated to ‘liyusaafir’, which is more like ‘taking a trip’. We found no errors in the sentences for the
translation task. In Czech, the primary source of errors was improper

Chunk 40 · 1,995 chars

In addition, there seem to be some
translation errors. For example, the word ‘travel’ used here in the context of the movement of a graph was
translated to ‘liyusaafir’, which is more like ‘taking a trip’. We found no errors in the sentences for the
translation task. In Czech, the primary source of errors was improper context-dependent terminology. For
example, when translating the word ‘co-interior (angles)’, it missed the ‘co’ prefix and translated only the
‘interior’ part. While this is fine in regular speech, in Mathematical terminology, this can be confusing.
Despite making the translation harder to follow, the core meaning of the question is preserved.
In Hindi we found several cases where the Hindi sentence was difficult to follow for the Hindi speaking
author due to misinterpretation of polysemes by the translator e.g. the word ‘round’, which was being
used in the sense of ‘approximate’ was translated to the sense of ‘circle’ and ‘property’ which was being
used in the sense of ‘quality’, was translated as ‘possessions’. Also, the phrase ‘Not Quite’ was translated
to something like ‘Not Enough’, perhaps due to the word ‘quite’ not having a Hindi equivalent. However,
given the context, using the word for ’Almost’ would have been more tonally accurate. However, quite a
few translations were hard for the annotator to follow, but backtranslating them yielded reasonably good
results, meaning there was no information loss.
The translation exercises showed few errors, perhaps due to the sentences being easy to translate by
design. There were one or two mistranslations, but otherwise it worked well. One minor issue was
that word boundary detection, which was performed in Python using the regex ‘\b\w+\b’, sometimes
identified individual characters in Hindi rather than whole words. However, the resulting sentence still
had errors, just not the type of errors that we expected.

-- 18 of 20 --

Results for prompt in English
Task 1:
Misconception
identification
GPT4o

Chunk 41 · 1,995 chars

which was performed in Python using the regex ‘\b\w+\b’, sometimes
identified individual characters in Hindi rather than whole words. However, the resulting sentence still
had errors, just not the type of errors that we expected.

-- 18 of 20 --

Results for prompt in English
Task 1:
Misconception
identification
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Accuracy
97.6% English	
96.2% English	
95.1% English	
94.0% English	
95.0% English	
95.3% English	
95.8% Mandarin	
95.2% Mandarin	
92.5% Mandarin	
92.1% Mandarin	
92.8% Mandarin	
92.9% Mandarin	
94.5% Hindi	
93.2% Hindi	
91.9% Hindi	
89.8% Hindi	
91.8% Hindi	
93.2% Hindi	
95.9% Arabic	
93.0% Arabic	
92.0% Arabic	
86.0% Arabic	
92.6% Arabic	
93.4% Arabic	
96.0% German	
96.2% German	
94.6% German	
84.6% German	
95.1% German	
95.2% German	
94.8% Farsi	
93.3% Farsi	
93.0% Farsi	
87.5% Farsi	
92.7% Farsi	
93.1% Farsi	
95.2% Telugu	
92.2% Telugu	
89.9% Telugu	
86.9% Telugu	
89.7% Telugu	
85.5% Telugu	
95.7% Ukranian	
94.9% Ukranian	
92.9% Ukranian	
93.3% Ukranian	
94.4% Ukranian	
94.9% Ukranian	
96.9% Czech	
95.1% Czech	
94.5% Czech	
92.3% Czech	
94.5% Czech	
94.1% Czech
Task 2:
Feedback
selection
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.0
0.1
0.2
0.3
0.4
0.5
0.6
Accuracy
53.4% English	
38.2% English	
17.0% English	
51.1% English	
48.5% English	
39.7% English	
49.6% Mandarin	
29.7% Mandarin	
12.3% Mandarin	
43.0% Mandarin	
40.1% Mandarin	
31.8% Mandarin	
48.7% Hindi	
35.6% Hindi	
13.0% Hindi	
43.6% Hindi	
40.5% Hindi	
31.6% Hindi	
49.6% Arabic	
28.7% Arabic	
13.9% Arabic	
45.3% Arabic	
38.8% Arabic	
33.3% Arabic	
52.5% German	
32.1% German	
15.0% German	
46.4% German	
42.4% German	
32.8% German	
50.2% Farsi	
27.9% Farsi	
11.3% Farsi	
44.9% Farsi	
41.3% Farsi	
30.9% Farsi	
45.2% Telugu	
27.6% Telugu	
10.4% Telugu	
43.4% Telugu	
34.0% Telugu	
26.3% Telugu	
50.3% Ukranian	
33.2% Ukranian	
13.0%

Chunk 42 · 1,994 chars

Arabic	
38.8% Arabic	
33.3% Arabic	
52.5% German	
32.1% German	
15.0% German	
46.4% German	
42.4% German	
32.8% German	
50.2% Farsi	
27.9% Farsi	
11.3% Farsi	
44.9% Farsi	
41.3% Farsi	
30.9% Farsi	
45.2% Telugu	
27.6% Telugu	
10.4% Telugu	
43.4% Telugu	
34.0% Telugu	
26.3% Telugu	
50.3% Ukranian	
33.2% Ukranian	
13.0% Ukranian	
44.8% Ukranian	
41.3% Ukranian	
32.2% Ukranian	
49.9% Czech	
37.8% Czech	
14.1% Czech	
46.5% Czech	
41.6% Czech	
30.7% Czech
Task 3:
Tutoring
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.0
0.2
0.4
0.6
0.8
1.0
Tutoring Score
94.7% English	
97.0% English	
22.1% English	
93.0% English	
82.0% English	
95.5% English	
89.8% Mandarin	
89.0% Mandarin	
26.4% Mandarin	
79.7% Mandarin	
79.7% Mandarin	
88.2% Mandarin	
90.5% Hindi	
92.7% Hindi	
24.2% Hindi	
72.2% Hindi	
73.5% Hindi	
88.4% Hindi	
91.4% Arabic	
89.7% Arabic	
24.3% Arabic	
84.2% Arabic	
75.2% Arabic	
87.4% Arabic	
90.7% German	
91.2% German	
23.4% German	
84.2% German	
77.2% German	
86.3% German	
85.6% Farsi	
81.3% Farsi	
28.7% Farsi	
77.2% Farsi	
65.8% Farsi	
77.8% Farsi	
50.1% Telugu	
39.5% Telugu	
27.7% Telugu	
58.9% Telugu	
2.9% Telugu	
40.7% Telugu	
91.2% Ukranian	
91.5% Ukranian	
23.5% Ukranian	
81.2% Ukranian	
71.5% Ukranian	
90.9% Ukranian	
43.8% Czech	
44.1% Czech	
17.2% Czech	
70.2% Czech	
2.9% Czech	
21.5% Czech
Task 4:
Translation
grading
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of Non-perturbed Chosen	
100.0% Mandarin	
99.3% Mandarin	
98.9% Mandarin	
99.9% Mandarin	
99.5% Mandarin	
99.9% Mandarin	
91.5% Hindi	
74.1% Hindi	
92.1% Hindi	
77.6% Hindi	
82.4% Hindi	
77.9% Hindi	
98.6% Arabic	
97.9% Arabic	
99.2% Arabic	
98.8% Arabic	
97.5% Arabic	
99.0% Arabic	
98.2% German	
97.9% German	
97.9% German	
98.2% German	
98.2% German	
98.2% German	
95.3% Farsi	
93.5% Farsi	
96.0% Farsi	
96.4% Farsi	
92.3% Farsi	
96.6% Farsi	
77.2% Telugu	
33.7% Telugu	
81.0%

Chunk 43 · 1,998 chars

77.6% Hindi	
82.4% Hindi	
77.9% Hindi	
98.6% Arabic	
97.9% Arabic	
99.2% Arabic	
98.8% Arabic	
97.5% Arabic	
99.0% Arabic	
98.2% German	
97.9% German	
97.9% German	
98.2% German	
98.2% German	
98.2% German	
95.3% Farsi	
93.5% Farsi	
96.0% Farsi	
96.4% Farsi	
92.3% Farsi	
96.6% Farsi	
77.2% Telugu	
33.7% Telugu	
81.0% Telugu	
51.9% Telugu	
48.7% Telugu	
25.2% Telugu	
98.0% Ukranian	
97.3% Ukranian	
96.9% Ukranian	
96.5% Ukranian	
97.3% Ukranian	
98.3% Ukranian	
98.7% Czech	
98.3% Czech	
98.9% Czech	
98.3% Czech	
97.5% Czech	
98.8% Czech
Figure 2: Evaluation results of the four tasks across five lare language models. The error bars show a 95% confidence
interval (t-test). MathDial Graphs show tutoring score after five turns, most models flatline after 5 utterance pairs.
The English language column is absent because translation evaluation uses English as the source. All scores range
from 0.0 to 1.0, with higher being better, though they are not comparable with each other. Note the truncated y-axes
for better detail. Visualizes Tables 3 to 6.

-- 19 of 20 --

Results for prompt in target language
Task 1:
Misconception
identification
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Accuracy
97.6% English	
96.2% English	
95.1% English	
94.0% English	
95.0% English	
95.3% English	
96.5% Mandarin	
95.0% Mandarin	
91.7% Mandarin	
93.8% Mandarin	
92.9% Mandarin	
94.1% Mandarin	
95.5% Hindi	
93.6% Hindi	
89.6% Hindi	
90.4% Hindi	
90.6% Hindi	
91.4% Hindi	
95.9% Arabic	
93.0% Arabic	
92.8% Arabic	
90.9% Arabic	
92.0% Arabic	
94.0% Arabic	
95.9% German	
96.6% German	
94.0% German	
74.0% German	
94.9% German	
95.2% German	
95.1% Farsi	
94.4% Farsi	
68.0% Farsi	
88.3% Farsi	
66.9% Farsi	
93.6% Farsi	
94.2% Telugu	
90.8% Telugu	
68.6% Telugu	
83.6% Telugu	
35.5% Telugu	
77.9% Telugu	
95.6% Ukranian	
94.3% Ukranian	
56.6% Ukranian	
90.4% Ukranian	
94.2% Ukranian	
93.9% Ukranian	
96.6% Czech	
95.8% Czech	
70.2%

Chunk 44 · 1,994 chars

94.9% German	
95.2% German	
95.1% Farsi	
94.4% Farsi	
68.0% Farsi	
88.3% Farsi	
66.9% Farsi	
93.6% Farsi	
94.2% Telugu	
90.8% Telugu	
68.6% Telugu	
83.6% Telugu	
35.5% Telugu	
77.9% Telugu	
95.6% Ukranian	
94.3% Ukranian	
56.6% Ukranian	
90.4% Ukranian	
94.2% Ukranian	
93.9% Ukranian	
96.6% Czech	
95.8% Czech	
70.2% Czech	
81.6% Czech	
41.0% Czech	
94.5% Czech
Task 2:
Feedback
selection
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.0
0.1
0.2
0.3
0.4
0.5
0.6
Accuracy
53.4% English	
38.2% English	
17.0% English	
51.1% English	
48.5% English	
39.7% English	
41.1% Mandarin	
19.2% Mandarin	
5.8% Mandarin	
30.3% Mandarin	
30.3% Mandarin	
27.8% Mandarin	
32.1% Hindi	
13.4% Hindi	
6.2% Hindi	
44.3% Hindi	
18.6% Hindi	
18.8% Hindi	
48.8% Arabic	
10.7% Arabic	
16.3% Arabic	
48.1% Arabic	
27.8% Arabic	
28.9% Arabic	
50.6% German	
30.8% German	
15.6% German	
44.4% German	
39.4% German	
37.6% German	
45.9% Farsi	
31.6% Farsi	
16.3% Farsi	
44.0% Farsi	
33.5% Farsi	
35.5% Farsi	
13.9% Telugu	
12.7% Telugu	
6.1% Telugu	
37.7% Telugu	
15.5% Telugu	
9.5% Telugu	
35.9% Ukranian	
19.6% Ukranian	
8.1% Ukranian	
52.8% Ukranian	
31.0% Ukranian	
27.2% Ukranian	
42.7% Czech	
26.1% Czech	
19.2% Czech	
46.6% Czech	
35.5% Czech	
35.6% Czech
Task 4:
Translation
grading
GPT4o 	Llama V3.1
405B
Claude 3.7
Sonnet
Gemini 2.0
Flash
Mistral Large
Latest
Command A
0.0
0.2
0.4
0.6
0.8
1.0
Fraction of Non-perturbed Chosen	
99.9% Mandarin	
99.4% Mandarin	
24.8% Mandarin	
99.6% Mandarin	
99.4% Mandarin	
99.9% Mandarin	
93.8% Hindi	
88.5% Hindi	
56.5% Hindi	
86.5% Hindi	
87.6% Hindi	
81.3% Hindi	
98.8% Arabic	
98.3% Arabic	
67.2% Arabic	
98.6% Arabic	
97.8% Arabic	
97.9% Arabic	
98.5% German	
98.3% German	
29.9% German	
98.0% German	
98.3% German	
97.8% German	
96.8% Farsi	
96.0% Farsi	
67.0% Farsi	
96.4% Farsi	
94.1% Farsi	
96.2% Farsi	
82.8% Telugu	
46.8% Telugu	
40.7% Telugu	
82.1% Telugu	
67.1% Telugu	
15.6% Telugu	
98.1% Ukranian	
97.9% Ukranian	
85.3%

Chunk 45 · 982 chars

Arabic	
97.8% Arabic	
97.9% Arabic	
98.5% German	
98.3% German	
29.9% German	
98.0% German	
98.3% German	
97.8% German	
96.8% Farsi	
96.0% Farsi	
67.0% Farsi	
96.4% Farsi	
94.1% Farsi	
96.2% Farsi	
82.8% Telugu	
46.8% Telugu	
40.7% Telugu	
82.1% Telugu	
67.1% Telugu	
15.6% Telugu	
98.1% Ukranian	
97.9% Ukranian	
85.3% Ukranian	
97.7% Ukranian	
98.4% Ukranian	
98.2% Ukranian	
99.3% Czech	
98.8% Czech	
80.8% Czech	
98.7% Czech	
99.5% Czech	
99.2% Czech
Figure 3: Evaluation results of the four tasks across five large language models. The error bars show 95% confidence
interval (t-test). MathDial Graphs show tutoring score after five turns, most models flatline after 5 utterance pairs.
The English language column is absent because translation evaluation uses English as the source. All scores range
from 0.0 to 1.0, with higher being better, though they are not comparable with each other. Note the truncated y-axes
for better detail. Visualizes Tables 3 to 6.

-- 20 of 20 --