Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Summary

This paper addresses cultural and linguistic biases in multilingual evaluation datasets, focusing on the Massive Multitask Language Understanding (MMLU) benchmark. The authors find that 28% of MMLU questions require culturally sensitive knowledge, with 84.9% of geographic questions focusing on North America or Europe. This Western-centric bias affects model rankings, as performance varies significantly between culturally agnostic (CA) and culturally sensitive (CS) subsets. To address these issues, the authors introduce Global-MMLU, an improved multilingual version of MMLU available in 42 languages. It incorporates professional and community annotations to enhance translation quality and includes labeled CA and CS subsets for more nuanced evaluation. The study evaluates 14 state-of-the-art models, showing that rankings change considerably between CA and CS subsets, especially for low-resource languages. The authors recommend using Global-MMLU over translated MMLU and reporting performance on CA and CS subsets separately to better understand model capabilities and cultural biases.

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in
Multilingual Evaluation
Shivalika Singhα1, Angelika Romanou2, Clémentine Fourrier3, David I. Adelani4,
Jian Gang Ngui5,6, Daniel Vila-Suero3, Peerat Limkonchotiwat5,6,
Kelly Marchisio7, Wei Qi Leong5,6, Yosephine Susanto5,6, Raymond Ng5,6,
Shayne Longpre8, Sebastian Ruder15, Wei-Yin Ko7, Madeline Smith1,
Antoine Bosselut2, Alice Oh9, André F. T. Martins10,11, Leshem Choshen12,
Daphne Ippolito13, Enzo Ferrante14, Marzieh Fadaee1, Beyza Ermisβ 1,
and Sara Hookerβ 1
1Cohere For AI, 2EPFL, 3Hugging Face, 4Mila, McGill University & Canada CIFAR AI Chair, 5AI Singapore,
6National University of Singapore, 7Cohere, 8MIT, 9KAIST, 10Instituto de Telecomunicações, 11Instituto
Superior Técnico, Universidade de Lisboa, 12MIT, MIT-IBM Watson AI Lab, 13Carnegie Mellon University,
14CONICET & Universidad de Buenos Aires, 15Meta AI Research
Abstract
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global
benchmarks. These biases stem not only from differences in language but also from the cultural
knowledge required to interpret questions, reducing the practical utility of translated datasets
like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning
or clarity of questions in the target language. A common practice in multilingual evaluation is
to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to
address these challenges. In this work, we trace the impact of both of these issues on multilingual
evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open
and proprietary models illustrates that progress on MMLU depends heavily on learning Western-
centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover,
for questions requiring geographic knowledge, an astounding 84.9% focus on either North Amer-
ican or European regions. Rankings of model evaluations change depending on whether they
are evaluated on the full portion or the subset of questions annotated as culturally sensitive,
showing the distortion to model rankings when blindly relying on translated MMLU. We release
Global-MMLU, an improved MMLU with evaluation coverage across 42 languages – with
improved overall quality by engaging with compensated professional and community annotators
to verify translation quality while also rigorously evaluating cultural biases present in the original
dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as
culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.
Global-MMLU: https://hf.co/datasets/CohereForAI/Global-MMLU
Global-MMLU Lite: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
αFirst author. β Principal senior advisors.
Corresponding authors: {shivalika, beyza, sarahooker}@cohere.com
Released as a preprint on February 20, 2025
arXiv:2412.03304v2 [cs.CL] 19 Feb 2025

1 Introduction
I contain multitudes. – Walt Whitman, 1855
Language cannot be simply reduced to a utilitarian tool, otherwise there would be no reason to
have so many diverse ways for saying the same thing or referring to similar concepts. Indeed,
language is also a marker of belonging and a repository of cultural knowledge (Labov, 1963; 1986;
Karlık, 2023). Today, state-of-the-art generative AI is used around the world and yet evaluation
of these systems is primarily conducted using English benchmarks (Zellers et al., 2019; Hendrycks
et al., 2020; Suzgun et al., 2022; Zhang et al., 2023b). Where multilingual evaluations are relied
upon, these are often simply machine translations of widely adopted English benchmarks (Lai
et al., 2023; Üstün et al., 2024).
A pressing question arises: how can we develop large language models (LLMs) that perform effec-
tively and fairly across the full spectrum of languages and cultures? The lack of comprehensive
evaluation benchmarks for many languages poses a significant obstacle for researchers and prac-
titioners striving to create truly multilingual systems. Often, a common practice is to simply
translate English benchmarks into other languages. In this work, we consider the implications of
this given one of the most ubiquitous examples – the Massive Multitask Language Understand-
ing (MMLU) dataset (Hendrycks et al., 2020). Originally compiled using sources in the English
language across 57 diverse subject areas such as elementary mathematics, computer science, and
law, the dataset is often machine-translated into resources for multilingual assessment, which we
collectively term transMMLU (Lai et al., 2023; Üstün et al., 2024; OpenAI, 2024; Dubey et al.,
2024; Bendale et al., 2024). However, the growing adoption of automatically translated “as-is”
transMMLU as a barometer of global AI progress deserves closer inspection and reflection.
While widely adopted for multilingual evaluations, the multilinguality achieved through the
translation of English datasets does not guarantee multiculturality. Evaluating on blindly-
translated datasets risks overemphasizing Western-centric concepts and knowledge. Cultural bias
can reduce the dataset’s practical effectiveness (and conceptual relevance) as a global benchmark
when translated. For example, the original English MMLU dataset contains several subsets
which are US-specific, such as examinations in US History, US Accounting, and US Law.
Furthermore, as these translated datasets become adopted for multilingual evaluation and
developers optimize models for performance on transMMLU datasets, we risk overfitting to the
datasets’ cultural biases and incidentally setting multilingual evaluation standards to be aligned
with certain cultural paradigms. Beyond cultural bias, machine translation expands language
coverage but also introduces practical evaluation challenges. Translation artifacts
known as translationese (Bizzoni et al., 2020; Vanmassenhove et al., 2021; Koppel & Ordan, 2011)
can be introduced, which causes a breakdown in evaluation quality. Automatic data curation is
also known to often exacerbate common data quality issues (Luccioni & Viviano, 2021; Kreutzer
et al., 2022; Ferrara, 2023; Caswell et al., 2020).
Our effort to address the above is twofold. We conduct an extensive evaluation to quantify the
impact of cultural biases in MMLU on model evaluations to-date and contribute improvements
to the overall translation quality to solve linguistic qualms. We hire professional annotators to
verify translation quality and include improvements from rigorous per-question post-edits as well
as human translations. We release the comprehensive improved dataset Global-MMLU for
42 languages:
Amharic, Arabic, Bengali, Chinese, Czech, Dutch, English, Filipino, French, German, Greek,
Hausa, Hebrew, Hindi, Igbo, Indonesian, Italian, Japanese, Korean, Kyrgyz, Lithuanian, Mala-
gasy, Malay, Nepali, Nyanja, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Sinhala,
Somali, Shona, Spanish, Swahili, Swedish, Telugu, Turkish, Ukrainian, Vietnamese, and Yoruba.
To address regional and cultural biases, we systematically annotate a subset of the original
English MMLU to identify questions where correctly answering requires cultural, geographical,
or dialect-specific knowledge. We refer to such questions as being Culturally-Sensitive (CS), in
contrast to questions which do not require this prior knowledge, referred to as being
Culturally-Agnostic (CA). We evaluate 14 state-of-the-art open-weight and proprietary models from 9
model families, focusing on those known for their high multilingual performance. This enables
rigorous evaluation of how such models serve diverse language users and isolates how ranking
may be subverted by questions which require primarily Western-centric knowledge. Through
extensive evaluations, we consistently find that cultural sensitivity has a significant impact on
model rankings. Our core contributions can be enumerated as follows:
• Analysis of MMLU for cultural biases: We observe that progress on MMLU depends
heavily on learning Western-centric concepts. Out of the annotated sample, we found that
28% of questions require specific knowledge of Western cultures. Moreover, for questions
requiring geographic knowledge, an astounding 84.9% focus on either North American or
European regions.
• Introducing Global-MMLU: We release a new multilingual MMLU test set spanning
42 languages, including English. This dataset combines professional translations with post-
edits (14 languages), crowdsourced translations (11 languages), and machine translations
(16 languages). By integrating this dataset with our cultural bias study, evaluations can
now report on both the CS and CA subsets. Additionally, we introduce Global-MMLU Lite,
which provides a compact but high-quality alternative for multilingual
evaluation.
• Re-evaluation of state-of-the-art models: We evaluate the impact of the re-annotated
dataset on the relative performance of multilingual models. Among the 14 models tested,
rankings on CA datasets exhibited an average of 3.4 rank changes and 3.7 position shifts
compared to their performance on a uniform subsample of the MMLU dataset (MMLU
Annotated). However, CS datasets showed significantly greater variability, with an
average of 5.7 rank changes and 7.3 position shifts across all languages.
• Role of data quality improvements: Our analysis highlights notable performance dif-
ferences between human-translated and machine-translated datasets for both high-resource
and low-resource languages. Human-translated datasets are essential for accurately assess-
ing model performance, especially on low-resource languages, as relying solely on machine-
translated data may obscure the true capabilities of models in these contexts. Without
access to high-quality human-translated or in-language datasets, the evaluation of low-
resource language performance remains uncertain.
Figure 1: Overview of Global-MMLU preparation process. We engage with professional
and community annotators to improve the quality of translated MMLU. Additionally, we engage
in extensive annotation to provide rich metadata for which questions in MMLU require
Culturally-Sensitive (CS) knowledge such as 1) Cultural Knowledge, 2) Geographical
Knowledge or 3) Dialect Knowledge to answer correctly. We release this improved
Global-MMLU alongside extensive metadata annotations.
Stemming from our comprehensive results, we make the following recommendations for multilin-
gual evaluation of generative models:
• Report on Global-MMLU, instead of translated MMLU. We recommend prioritizing
Global-MMLU over translated versions of MMLU for multilingual evaluation.
With its extensive language coverage and improvements based on professional annotations
and post-edited translations, Global-MMLU provides a more reliable and accurate
benchmark for assessing model performance across diverse languages.
• Report performance on culturally-sensitive and culturally-agnostic subsets sep-
arately. Our analysis demonstrates significant variability in model rankings between CA
and CS datasets, with CS subsets showing greater variability. This variability, espe-
cially pronounced for low-resource languages and smaller models, highlights the importance
of evaluating these subsets independently. We recommend reporting performance on CA
and CS subsets separately to provide a clearer understanding of model capabilities and
better address the unique challenges posed by cultural and linguistic nuances in CS tasks.
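The second recommendation can be illustrated with a minimal Python sketch that scores a model per language and per subset rather than pooling everything into one number. The record layout `(language, subset, is_correct)` and the function name `accuracy_by_subset` are assumptions made for this illustration, and the toy records are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (language code, subset label "CS" or "CA", answered correctly?)
Record = Tuple[str, str, bool]


def accuracy_by_subset(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Return accuracy keyed by (language, subset) instead of one pooled score."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, subset, is_correct in records:
        key = (language, subset)
        total[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


# Toy data for illustration only.
toy_records = [
    ("sw", "CA", True), ("sw", "CA", True), ("sw", "CA", False), ("sw", "CA", True),
    ("sw", "CS", True), ("sw", "CS", False),
]
scores = accuracy_by_subset(toy_records)
for (language, subset), acc in sorted(scores.items()):
    print(f"{language} {subset}: {acc:.2f}")
```

Keeping the two subsets as separate keys makes a gap between CA and CS performance visible, which a single pooled accuracy would hide.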
2 Evaluating cultural bias in MMLU
2.1 Data Annotation Process
The goal of this work is to study how cultural biases in translated datasets influence the
performance of widely-used multilingual models. To achieve this, we worked with 200 compensated
professional and community annotators to review questions from the original English
MMLU dataset and assess their cultural sensitivity. Annotators were presented with a
Question: A person in the pseudoindependent stage of White racial identity is currently ___________.
1 Developing an awareness of the role of Whites in perpetrating racism
2 Unaware of race and racism
3 Exploring what it means to be White and confronting own biases
4 Attempting to resolve moral dilemmas associated with an awareness of race and racism
Question: Which of the following statements does NOT accurately describe voting behavior in the United States?
1 Registered voters between the ages of 35 and 45 are more likely to vote than are those under the age of 21.
2 A registered voter who has attained his or her GED is less likely to vote than a high school dropout.
3 Registered voters are more likely to vote in general elections than they are in primary elections.
4 More women than men have voted in every presidential election since 1980.
Question: Opportunity costs or implicit costs of a "Mom & Pop"-owned business are:
1 equal to accounting costs.
2 equal to accounting profits.
3 equal to earnings or profits that could have occurred using resources elsewhere.
4 equal to earnings or profits that occurred for Mom & Pop's business.
Figure 2: Examples of questions from MMLU dataset labelled as requiring cultural, regional or
dialectal knowledge.
representative random sample from each of the 57 exam subjects that compose MMLU (50 per
subject), totaling 2,850 samples. This annotated set is referred to as MMLU Annotated (MA)
throughout the paper. Annotators were asked to identify questions where correctly answering
depended upon 1) cultural knowledge, 2) geographic knowledge or 3) dialect knowledge.
We provide more context about each of these categories below:
• Cultural Knowledge. Annotators evaluated whether answering a question required
culture-specific knowledge. If so, they selected the relevant culture from a drop-down
menu with options: Western Culture, Eastern Asian Culture, Middle Eastern Culture,
South Asian Culture, African Culture, Latin American Culture, or Other. Cultural knowl-
edge encompasses recognizing and appreciating the beliefs, values, customs, and artistic
expressions of a particular group, shaped by shared traditions and heritage (Kipuri, 2009;
Liu et al., 2024; Mukherjee et al., 2024).
• Geographical or Regional Knowledge. Geographical knowledge refers to understanding
characteristics tied to specific regions, such as natural landmarks or environmental
features. Annotators determined whether answering correctly required region-specific
knowledge. If applicable, they identified the relevant region from a drop-down menu with
the following options: North America, South America, Europe, Asia, Africa, Australia and
Oceania, and Antarctica.
• Dialect Knowledge. This category involves recognizing distinctive language variations
or speech patterns used by people from specific regions or communities in English. It
includes slang terms, idiomatic expressions, and pronunciation differences that distinguish
regional speech from standardized forms of language. Notably, this assessment was con-
ducted on the original English sentences. Therefore, it specifically addresses variations in
English dialects or regional vocabulary, rather than any nuances that might arise during
the translation process.
Figure 20 in Appendix H illustrates the annotation interface used during this process. Annotators
were presented with questions one at a time from each of the 57 MMLU subjects and had to
analyze and label them for the presence of cultural, geographic, or dialect knowledge. Each data
point was reviewed by at least three annotators, and some data-points had a maximum of 10
annotators. 96.4% of all data points were reviewed by more than 3 human annotators. We
classify each question as presenting cultural, geographic and dialect sensitivity according to
[Figure 3 plot: percentage of samples (%) per MMLU subject, broken down by tag type (Cultural, Regional, Multi-label, Dialect).]
Figure 3: Proportion of samples containing cultural, regional, or dialect-specific references per
subject in the MMLU dataset. Notably, all samples in the World Religions and Moral Scenarios
subjects include at least one such reference. Note that 12 subjects did not contain any
Culturally-Sensitive (CS) samples and have been excluded from the figure.
majority vote among annotators who reviewed each data point (Feldman, 1980). If half or more
of the annotators apply the same tag to a question, it is categorized under that tag. Detailed
information about the annotators and the annotation process is available in Appendix H.
We also asked annotators to annotate for temporal knowledge to determine if answers to questions
change with time. We find that only 2.4% of annotated samples depend on temporal
knowledge. We provide more details about temporal analysis in Appendix D.
To understand the prevalence of these attributes at an aggregate level, we also assign a label of
Culturally-Sensitive (CS) if any of Dialect Knowledge, Cultural Knowledge or
Geographic Knowledge is positively attributed to an example. If none of these properties
are present, we deem an example to be Culturally-Agnostic (CA). This enables us to track
at an aggregate level the fraction of the entire MMLU that requires CS knowledge.
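The labeling rule described in this section (a tag is kept when half or more of a question's annotators apply it, and a question is CS when any of the three tags is kept) can be sketched in Python as follows. The tag strings and the function names `majority_tags` and `cs_or_ca` are hypothetical and chosen for this illustration; the authors' actual annotation tooling is not shown in the paper text.

```python
from typing import List, Set

# Hypothetical tag names standing in for the three annotated knowledge types.
TAGS = ("cultural", "geographic", "dialect")


def majority_tags(annotations: List[Set[str]]) -> Set[str]:
    """Keep each tag applied by half or more of the annotators of one question."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    n = len(annotations)
    # 2 * votes >= n is the integer form of votes / n >= 0.5.
    return {tag for tag in TAGS if 2 * sum(tag in a for a in annotations) >= n}


def cs_or_ca(annotations: List[Set[str]]) -> str:
    """Label a question CS if any tag survives the vote, otherwise CA."""
    return "CS" if majority_tags(annotations) else "CA"


# Three annotators: two tag geographic knowledge, one of them also tags cultural knowledge.
example = [{"geographic"}, {"geographic", "cultural"}, set()]
kept = majority_tags(example)
label = cs_or_ca(example)
```

In this example only the geographic tag reaches the threshold (two of three annotators), which is enough for the question to be labeled CS.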
2.2 Analysis of MMLU Cultural Biases
Figure 3 summarizes the results of this extensive annotation process. Our analysis reveals that
28% of MMLU requires CS knowledge – defined as requiring geographic knowledge,
cultural knowledge or dialect knowledge – to be answered correctly. Among
these, geographic knowledge emerges as the most frequently tagged bias, representing 54.7%
of all CS questions. Cultural knowledge follows at 32.7%, while dialect-specific knowledge
accounts for a mere 0.5% of all questions. Additionally, 10.6% of questions require both
cultural and geographic knowledge, and 1.5% involve a combination of all three types of nuanced
knowledge.
Western-centric culture dominates. Among the samples identified as requiring culturally
sensitive (CS) knowledge, a significant 86.5% were tagged as specific to Western cultural knowledge. In
contrast, the next closest category, South Asian cultural knowledge, accounted for only 4% of
the cultural tags. As Figure 4 shows, Latin American, African and Indigenous cultures are
represented by 1.3%, 1.1% and 0.7% of the tags, respectively. This shows that performing well on
MMLU heavily depends on mastering Western-centric cultural knowledge.
A similar trend is observed for geographic knowledge: 64.5% of CS samples were tagged as
needing regional knowledge of North America, followed by 20.4% tagged as requiring regional
knowledge of Europe. This concentration indicates that progress on MMLU predominantly re-
flects knowledge of Western concepts and regions.
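For reference, the 84.9% figure quoted in the abstract and introduction is the sum of the two regional shares reported above. The block below is only a restatement of that arithmetic (it assumes `amsmath` for `\text`):

```latex
\[
\underbrace{64.5\%}_{\text{North America}} + \underbrace{20.4\%}_{\text{Europe}} = 84.9\%
\]
```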
[Figure 4 plots: percentage of samples (%) by region (North America, Europe, Asia, Africa, South America, Australia & Oceania) and by culture (Western, South Asia, Eastern Asia, Middle Eastern, Latin American, African, Indigenous, Other).]
Figure 4: Distribution of region (left) and culture (right) categories found in CS dataset. The
majority of Region tags (64.5%) correspond to North America, while the majority of Culture
tags (86.5%) are classified as Western. We have excluded samples that do not contain any region
or culture tags or contain multiple region or culture tags from this figure.
Culture-specific knowledge is overfit to a few countries. Figure 5 illustrates the dis-
tribution of cultural and regional tags across countries within the CS dataset. Our analysis
reveals that 73.9% of questions related to Western culture require knowledge about the United
States, followed by the United Kingdom at 8%, with smaller contributions from countries like
France and Germany. In contrast, Asian culture tags are predominantly associated with India,
accounting for 59%, while China and Japan represent only 17.9% each of the questions requiring
knowledge of Asian culture. Moreover, the overall representation of Asian cultures remains
limited, with only 4.0% of questions pertaining to South Asia and 3.1% to East Asia in the
MMLU dataset. Similarly, Middle Eastern culture is largely represented by Iraq (37.5%) and
Turkey (25%), yet its overall presence in the dataset is minimal, with just 2.7% of questions
addressing Middle Eastern cultural knowledge. These findings highlight the dataset’s strong bias
toward the United States, with a significant portion of cultural tags tied to the U.S. For further
analysis of the culture–region relationship and detailed country-level insights, see Appendix G.
Cultural sensitivity varies considerably across subjects. The MMLU dataset, introduced
by Hendrycks et al. (2020), includes 57 subjects spanning four categories: STEM, Humanities,
Social Sciences, and Other. From the Other category, we selected relevant subjects and further
categorized them into Medical (Chen et al., 2023) and Business. Additional details about this
categorization are provided in Appendix B.
Figure 6 illustrates the data distribution for the CA subset, revealing significant variation
[Figure 5 plot: percentage of samples (%) per country (USA, UK, France, Germany, Greece, Italy, Spain, Russia, India, China, Japan, South Korea, Vietnam, Iraq, Turkey, Egypt, Iran, Israel, Others).]
Figure 5: Distribution of cultural and regional tags across countries in the CS dataset. The
percentages indicate the representation of each country within the dataset. We have excluded
samples that do not contain any country tags or contain multiple country tags from this figure.
in cultural and regional references between different MMLU subjects and subject categories.
Questions from categories in Humanities and Social Sciences frequently required cultural or
regional knowledge, while those from the STEM and Medical categories generally did not. Overall
for Humanities, 68% of all questions were tagged as CS. However, this bias was even more
pronounced for certain subjects within Humanities. Notably, more than 80% of samples for
subjects like Philosophy, Moral Scenarios¹, High School US History, and High School Government
and Politics were deemed CS. Within the STEM category, only 30 out of 950 samples (3.15%)
were identified as CS, and for subjects such as Clinical Knowledge, Computer Security, and
Econometrics all question examples were classified as CA. These findings, detailed in Figure 6,
unsurprisingly reveal that certain subjects inherently exhibit more cultural or regional biases.
We provide examples of MMLU questions annotated as CS (Culturally Sensitive) and CA
(Culturally Agnostic) in Appendix J.
Inter-annotator agreement. Each data point was reviewed by at least three annotators, and
some datapoints had a maximum of 10 annotators. 96.4% of all data points were reviewed
by more than 3 human annotators. Given this rich set of feedback on each data point, we
analyze the agreement between ratings from different annotators using Krippendorff’s Alpha
scores (Krippendorff, 2004). We observed high inter-annotator agreement across most subjects,
with a unanimous cultural sensitivity agreement in the Anatomy subject. Six subjects showed
disagreement, including High School US History, while Moral Scenarios showed the most
disagreement. Detailed results are presented in Figures 23 and 24 in Appendix H.2.
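The text names Krippendorff's Alpha but does not spell out the computation or the implementation used. As a rough illustration, the sketch below computes the nominal-level version of the statistic from a coincidence matrix, which handles a varying number of annotators per question. Treating the labels as nominal is an assumption, and the function name `krippendorff_alpha_nominal` is hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, List, Sequence


def krippendorff_alpha_nominal(units: List[Sequence[Hashable]]) -> float:
    """Nominal Krippendorff's alpha; each unit holds the labels given to one question."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating carries no agreement information
        # Every ordered pair of ratings from different annotators, weighted by 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        # Only one label value occurs overall, so alpha is undefined.
        return float("nan")
    return 1.0 - (n - 1) * observed / expected


perfect = krippendorff_alpha_nominal([["CS", "CS", "CS"], ["CA", "CA", "CA"]])
opposed = krippendorff_alpha_nominal([["CS", "CA"], ["CS", "CA"]])
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.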
Characteristics of CS versus CA subsets. Our extensive annotation process resulted
in two aggregated annotated subsets of MMLU: CS, which includes all questions labeled as
requiring dialect knowledge, cultural knowledge, or geographic knowledge to answer
¹ Morals might share universal truths and moral decisions may be well-defined given an underlying belief system,
but this does not seem to be the case in this scenario. That is, we observe that Moral Scenarios in MMLU are
geared towards Western Culture, and therefore CS knowledge, as it specifies “moral standards in the US” in
the instruction.
[Figure 6 plot: percentage of samples (%) retained per MMLU subject, with subjects grouped by category (STEM, Medical, Social Sciences, Humanities, Business, Other).]
Figure 6: Proportion of samples retained per subject, after excluding those requiring cultural,
geographic and dialectic knowledge (selected based on majority agreement).
                 Number of Subjects   Number of Samples    Data Proportion
Categories       MA    CS    CA       MA    CS    CA       MA      CS        CA
STEM             19    11    19       950   23    927      33.3%   2.9% ↓    45.0% ↑
Humanities       13    12    11       650   442   208      22.8%   55.8% ↑   10.1% ↓
Social Sciences  12    11    12       600   208   392      21.1%   26.3% ↑   19.1% ↓
Medical          7     5     7        350   19    331      12.3%   2.4% ↓    16.1% ↑
Business         4     4     4        200   36    164      7.0%    4.5% ↓    8.0% ↑
Other            2     2     2        100   64    36       3.5%    8.1% ↑    1.8% ↓
Table 1: Statistics for MA , CS , and CA datasets. The left column displays the number
of subjects included in each dataset, the middle column shows the total number of samples per
category, and the right column illustrates changes in subject category distributions relative to
MA , with arrows indicating increases or decreases in representation.
correctly, and CA , comprising questions that do not require knowledge from these categories.
Table 1 provides a detailed breakdown of the number of subjects and samples in the CS and
CA subsets.
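As a minimal sketch of the aggregation described above, the following Python snippet labels a question as CS when a strict majority of its annotators flag at least one of the three knowledge types (cultural, geographic, dialect), and as CA otherwise. The annotation format (one set of tags per annotator), the function name, and the tie-breaking rule are assumptions made for illustration; the text only states that labels are selected based on majority agreement.

```python
from typing import Iterable, Set

# Knowledge types that make a question culturally sensitive (from the text).
CS_TAGS = {"cultural", "geographic", "dialect"}


def label_sample(annotations: Iterable[Set[str]]) -> str:
    """Label one question as "CS" or "CA" by strict majority vote.

    `annotations` holds one set of tags per annotator (hypothetical format).
    """
    votes = [bool(tags & CS_TAGS) for tags in annotations]
    if not votes:
        raise ValueError("at least one annotation is required")
    # Assumption: a tie is not a majority, so the question stays CA.
    return "CS" if 2 * sum(votes) > len(votes) else "CA"


if __name__ == "__main__":
    print(label_sample([{"cultural"}, {"geographic"}, set()]))  # two of three flag it
    print(label_sample([set(), set(), {"dialect"}]))            # one of three flags it
```

Under this assumed rule, a question tagged only as requiring temporal knowledge stays in CA, which matches the CS definition given in the text.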
We observe notable differences in subject distribution between the CA and CS subsets, lead-
ing to shifts in category representation. For instance, while questions from the Social Sciences
category make up 21.1% of the MMLU Annotated , a uniformly balanced subsample of the
original MMLU, they are over-represented in CS , accounting for 26.3% of all questions requir-
ing CS knowledge. Conversely, questions from the STEM category, which contribute 33.3% of
the MMLU Annotated , are under-represented in CS , making up only 2.9% of all questions
identified as requiring CS knowledge. These shifts reflect how the nature of the CS subset
emphasizes cultural and contextual knowledge over technical or scientific content.
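The shifts in category representation can be recomputed from the sample counts in Table 1. The sketch below is illustrative only: the counts are copied from the table, while the function and variable names are our own. Recomputed percentages may differ from the printed table in the last digit because of rounding.

```python
# Sample counts per category from Table 1, as (MA, CS, CA).
COUNTS = {
    "STEM": (950, 23, 927),
    "Humanities": (650, 442, 208),
    "Social Sciences": (600, 208, 392),
    "Medical": (350, 19, 331),
    "Business": (200, 36, 164),
    "Other": (100, 64, 36),
}


def proportions(counts):
    """Return each category's share (in %) of the MA, CS and CA totals."""
    totals = [sum(row[i] for row in counts.values()) for i in range(3)]
    return {
        name: tuple(100.0 * row[i] / totals[i] for i in range(3))
        for name, row in counts.items()
    }


def direction(subset_share, ma_share):
    """Arrow showing whether a category gains or loses share relative to MA."""
    if subset_share > ma_share:
        return "↑"
    if subset_share < ma_share:
        return "↓"
    return "="


PROPS = proportions(COUNTS)

if __name__ == "__main__":
    for name, (ma, cs, ca) in PROPS.items():
        print(f"{name:16s} MA {ma:5.1f}%  "
              f"CS {cs:5.1f}% {direction(cs, ma)}  "
              f"CA {ca:5.1f}% {direction(ca, ma)}")
```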
Overall, the proportions of STEM, Medical, and Business categories increase in the CA subset
due to their globally relevant content. Conversely, Humanities and Social Sciences are over-
represented in the CS subset compared to the original MMLU, as these fields frequently include
cultural or regional references. These findings are critical to the model evaluations in Section 4,
illustrating how cultural references in MMLU influence dataset composition and, ultimately,
model performance.
3 Introducing Global-MMLU
To date, many multilingual evaluations have relied on translated versions of MMLU. The most widely
adopted of these is a dataset translated into 26 languages using
ChatGPT2 powered by GPT-3.5 (Lai et al., 2023). We release an improved Global-MMLU
benchmark that is of higher quality and supports analysis on both the CS and CA
subsets.
Here, we improve quality by incorporating professional edits and translations from native speakers
for a subset of languages and expanding coverage to 42 languages. We achieve this through a com-
bination of paid professional translations, community contributions, and higher-quality machine
translation. This effort involved professionally compensated annotators for four gold-standard
languages and a broader pool of community annotators who contributed to translations in 11
additional languages. Where available, we also included the professional human translations from
the MMMLU dataset3 for 14 languages. We rely as much as possible on human-verified transla-
tions to ensure that the translations are reliable and minimize the biases introduced, specifically
translationese which might be more pronounced in Machine Translation (Bizzoni et al., 2020;
Vanmassenhove et al., 2021; Koppel & Ordan, 2011). Alongside these quality improvements
through human verification, we include the metadata for the CS and CA annotations de-
veloped in the previous sections to allow for analysis on all subsets of data. Below, we provide
further details about our efforts to improve the quality of MMLU and engage compensated hu-
man annotators in translating and verifying quality as well as identifying the CS and CA
subsets.
3.1 Translation Process
Figure 7: ChrF++ scores for Google Translate and GPT-3.5-Turbo
2https://chat.openai.com/chat
3https://huggingface.co/datasets/openai/MMMLU

We first translated the English MMLU dataset into 41 languages using the Google Translate
API.4 Despite its cost, we chose to use Google Translate because comprehensive evaluations
spanning 102 languages (Zhu et al., 2024) demonstrate that Google Translate significantly outperforms
alternatives such as NLLB (NLLB-Team et al., 2022), GPT-4, and ChatGPT, on low-resource
languages (Robinson et al., 2023). Recent work (Kocmi et al., 2024) has shown that
LLMs have begun to surpass popular online translation tools like Google Translate for machine
translation on specific high-resource languages. However, given that there is a known tendency
for models to favor their own generations (Panickssery et al., 2024; Shimabucoro et al., 2024),
we decided to use Google Translate for every language in order to avoid introducing bias into
model evaluations. To empirically validate this choice, we compared Google Translate’s outputs
with translations performed by GPT-3.5-turbo, which had been previously used to translate the
MMLU dataset (Lai et al., 2023). As shown in Figure 7, we find that Google Translate achieved
higher ChrF++ scores (Popović, 2017) across all subjects and lower deviation in performance
across languages, consistent with the findings of previous research (Popović, 2017) about its
superiority in translation quality. Following the translation process, native speakers reviewed
and edited the translations to ensure accuracy and fluency, thereby enhancing global representa-
tion. These edits were performed by two types of annotators: professional annotators and native
community annotators.
Professional Annotators. We hired compensated professional annotators for four languages:
Arabic, French, Hindi, and Spanish. These annotators reviewed the machine translations to
ensure fluency and cultural appropriateness, making edits where necessary. We refer to this set
of translations as our “Gold Set”. We include more details about the compensated annotation process
in Appendix H.1.
Community Annotators. In addition to professional annotations for a subset of languages, we
also facilitated community contributions to verify translation quality across a broader range of
languages, focusing on fluency edits and correcting poor translations. This participatory research
approach (Birhane et al., 2022; Corbett et al., 2023; Delgado et al., 2023; Singh et al., 2024; Üstün
et al., 2024) involved collaboration across multiple institutions globally. Such cross-sectional ef-
forts are crucial for gathering linguistic data at scale and fostering community engagement—both
essential for developing inclusive language technologies (Joshi et al., 2019; Nekoto et al., 2020;
Singh et al., 2024; Romanou et al., 2024). We established a criterion requiring a minimum of
50 human-translated samples for each language before its inclusion in Global-MMLU . This
threshold was met by eleven languages: Amharic, Czech, Malay, Persian, Romanian, Russian,
Sinhala, Telugu, Turkish, Ukrainian, and Vietnamese. In the following sections, we refer to this
set of languages as “Community Translated”.
The participation of native speakers from diverse regions introduced logistical challenges in both
data selection and quality control. To overcome these, we adopted Argilla5 as our primary
annotation platform. In line with our community-based approach, Argilla’s collaborative features
and customizable workflows enabled us to efficiently manage contributions from various regions
while maintaining consistency in translation quality. Annotators were presented with both the
original and machine-translated questions and answers, and were asked to edit any translations
that did not accurately capture the intent of the original text. The translation interface is shown
4https://cloud.google.com/translate
5https://github.com/argilla-io/argilla
in Figure 21 in Appendix I.
[Figure 8 plot: bar chart; x-axis: languages (Chinese, German, Indonesian, Italian, Japanese, Korean, Portuguese, Yoruba, Bengali, Swahili, Arabic, French, Hindi, Russian, Spanish, Amharic, Telugu, Ukrainian, Vietnamese, Turkish, Sinhala, Czech, Persian, Romanian, Malay); y-axis: Percentage of Samples (%), 0 to 100; legend: Professionally Translated, Community Translated.]
Figure 8: Percentage of Human-Translated Samples in MMLU Annotated .
MMMLU Translations. As detailed in the OpenAI-o1 system card,6 MMMLU7 is a profes-
sionally human-translated dataset released by OpenAI in 14 languages. To maximize the inclu-
sion of human-translated content in Global-MMLU , we incorporated this dataset wherever
possible. Since MMMLU overlaps with our Gold Set, we utilized the remaining 10 languages:
Bengali, Chinese, German, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba
from this dataset.
Figure 8 highlights the number of samples edited by professional annotators and community
contributors. A total of 7,565 edits were made, accounting for 36.9% of the samples reviewed.
On average, professional annotators edited 789 samples per language (38.5% of the total) in the
Gold Set, while community contributors edited 362 samples per language (17.7% of the total). It
is important to note that the differences in edit rates likely reflect variations in time and resources
available to professional versus community annotators, and cannot be interpreted as differences
in translation quality across languages. Additional analyses of question and answer lengths, as
well as edit distances across subject categories, are presented in Appendix I.
3.2 Data Composition of Global-MMLU
Global-MMLU is our comprehensive test set encompassing all 14K samples from MMLU
across 42 languages (including English), resulting in a total of 589,764 samples, created by
integrating multiple data sources, including human-translated datasets, machine translations,
and the original English MMLU. Throughout the Model Evaluations section, we also report on
different subsets of Global-MMLU , described as follows:
MMLU Annotated . This subset consists of 2,850 question-answer pairs sampled uniformly
from the MMLU dataset (50 questions per subject), representing 20% of the original data and
serving as a representative random sample. These samples are annotated in English to determine
whether answering requires cultural, geographic, dialectal, or temporal knowledge. The anno-
tations are then applied to corresponding samples in 41 other languages, resulting in a total of
119,700 samples.
6https://openai.com/index/openai-o1-system-card/
7https://hf.co/datasets/openai/MMMLU

Culturally-Sensitive (CS) . This subset contains samples identified as requiring dialect
knowledge , cultural knowledge or geographic knowledge to answer correctly. It includes
792 annotated samples in English based on majority voting by annotators. These annotations
are extended to 41 additional languages, creating a dataset with 33,264 entries. This subset is
particularly useful for evaluating model performance on culturally contextual tasks.
Culturally-Agnostic (CA) . This subset includes samples that do not contain cultural,
regional, or dialectal references. It serves as a baseline for evaluating models on tasks that do
not require specific contextual knowledge. The subset consists of 2,058 annotated samples in
English, which are extended to 41 languages for a total of 86,436 entries.
Global-MMLU Lite . This is a “lite” version of Global-MMLU covering 15 languages
which are fully human translated or post-edited, along with English. It includes 200 CS and
200 CA samples per language, totaling 6,000 samples. Further details on its preparation are in
Appendix C.
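The subset sizes quoted above follow from simple arithmetic, which the short sketch below checks. All figures are taken from the text; the per-language count is derived here by dividing the stated total by 42, and the variable and function names are our own.

```python
NUM_LANGUAGES = 42        # including English
TOTAL_SAMPLES = 589_764   # full Global-MMLU, as stated in the text

# Derived: number of questions per language in the full test set.
PER_LANGUAGE, remainder = divmod(TOTAL_SAMPLES, NUM_LANGUAGES)
assert remainder == 0

ANNOTATED, CS, CA = 2_850, 792, 2_058  # English sample counts per subset


def across_languages(samples_in_english, num_languages=NUM_LANGUAGES):
    """Total entries once the English annotations are applied to every language."""
    return samples_in_english * num_languages


# Global-MMLU Lite: 200 CS + 200 CA samples in each of 15 language splits.
LITE_TOTAL = (200 + 200) * 15

if __name__ == "__main__":
    print("questions per language:", PER_LANGUAGE)
    print("annotated share of MMLU: "
          f"{100 * ANNOTATED / PER_LANGUAGE:.1f}%")
    for name, n in [("MMLU Annotated", ANNOTATED), ("CS", CS), ("CA", CA)]:
        print(f"{name}: {across_languages(n):,} entries")
    print("Global-MMLU Lite:", f"{LITE_TOTAL:,}", "samples")
```

The CS and CA counts sum to the 2,850 annotated questions, so the two subsets partition MMLU Annotated.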
4 Model Evaluations
One of the key findings from Section 2.2 is that MMLU presents severe biases towards CS
knowledge. In this section, we seek to understand how these biases may have impacted evaluation
of open-weights and closed models. To do so, we measure changes to model rankings on 3 subsets
of data: Global-MMLU Annotated , Global-MMLU Culturally-Agnostic (CA ) and
Global-MMLU Culturally-Sensitive (CS ). By comparing model performance across these
three subsets, we aim to address the following questions: (1) How do models perform on the
MMLU test set when it includes culturally-sensitive samples? and (2) How do models perform on
samples that do not require specific contextual knowledge, ensuring consistent and fair evaluations
across different languages and regions?
4.1 Experimental Setup
We evaluated 14 recent state-of-the-art language models from 9 model families, focusing on those
known for their high multilingual performance. These include small models like Aya Expanse
8B, Gemma2 9B, SEA-LION v3 (9B), Llama 3.1 8B, Mistral Nemo 12B, and Qwen 2.5 7B;
mid-size models, comprising Aya Expanse 32B, CommandR (34B), Gemma2 27B, and Qwen
2.5 32B; large models, such as Llama 3.1 70B and CommandR+; and closed-weight models,
specifically GPT-4o and Claude Sonnet 3.5. A more detailed description of the models covered
is provided in Appendix E. We note that not all of these models claim to support the same
set of languages, and none claim to support the full set of languages we cover.
Evaluation Setup. We use lm-evaluation-harness (Gao et al., 2024) to evaluate the open
multilingual models in a 5-shot setting. For closed models (i.e., GPT-4o and Claude-Sonnet
3.5), we also do 5-shot evaluation. However, since log probabilities are not accessible via API
for closed models, we send the 5-shot prompt via API and get the corresponding generation
from the model. We use a system preamble to make the model respond with only the correct
answer option and extract the answer from the output generation. For prompting, we follow the
same approach as Hendrycks et al. (2020) and use prompt instructions in the same
language as the sample.
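For the closed models, the answer has to be recovered from free-form generated text. The text does not specify the extraction logic, so the following is only a hypothetical sketch of one way to do it: take the first standalone option letter (A to D) in the generation and score it against the gold answer. The regular expression, function names, and example strings are all assumptions.

```python
import re
from typing import Iterable, Optional

# An uppercase option letter that is not part of a longer Latin-script word.
_OPTION = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def extract_choice(generation: str) -> Optional[str]:
    """Return the first standalone option letter, or None if there is none."""
    match = _OPTION.search(generation)
    return match.group(1) if match else None


def accuracy(generations: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of generations whose extracted letter equals the gold letter."""
    pairs = list(zip(generations, gold))
    if not pairs:
        raise ValueError("no examples to score")
    correct = sum(extract_choice(text) == answer for text, answer in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    outputs = ["A", "The answer is B", "none"]
    print(accuracy(outputs, ["A", "B", "C"]))
```

This heuristic relies on the system preamble keeping responses short: a reply such as "A possible answer is C" would be read as A, so a stricter pattern may be needed if models ignore the instruction.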

Languages. We categorize the languages into two main groups for reporting the results. The
first group consists of human-translated data only, which covers 10 languages from OpenAI’s
human-translated MMLU test set and 4 additional languages from our professionally translated
set. The second group contains all our data (combining professional, community and machine
translations), organized by language resource availability into high-resource, mid-resource, and
low-resource languages as defined by Joshi et al. (2019) and categorized in Singh et al. (2024).
We report results for each of these categories. The high-resource languages are Arabic, Chinese,
Czech, Dutch, English, French, German, Hindi, Italian, Japanese, Persian, Polish, Portuguese,
Russian, Spanish, Swedish, Turkish, and Vietnamese; the mid-resource languages are Bengali, Filipino,
Greek, Hebrew, Indonesian, Korean, Lithuanian, Malay, Romanian, Serbian, and Ukrainian; and the
low-resource languages are Amharic, Hausa, Igbo, Kyrgyz, Malagasy, Nepali, Nyanja, Shona,
Sinhala, Somali, Swahili, Telugu, Yoruba.
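The grouping above can be written down as a lookup table, which makes it easy to aggregate per-language scores by resource level. The language lists below are copied from the text; the helper name and the accuracy values in the usage example are placeholders for illustration, not reported results.

```python
RESOURCE_LEVELS = {
    "high": [
        "Arabic", "Chinese", "Czech", "Dutch", "English", "French", "German",
        "Hindi", "Italian", "Japanese", "Persian", "Polish", "Portuguese",
        "Russian", "Spanish", "Swedish", "Turkish", "Vietnamese",
    ],
    "mid": [
        "Bengali", "Filipino", "Greek", "Hebrew", "Indonesian", "Korean",
        "Lithuanian", "Malay", "Romanian", "Serbian", "Ukrainian",
    ],
    "low": [
        "Amharic", "Hausa", "Igbo", "Kyrgyz", "Malagasy", "Nepali", "Nyanja",
        "Shona", "Sinhala", "Somali", "Swahili", "Telugu", "Yoruba",
    ],
}

# Reverse index: language name -> resource level.
LEVEL_OF = {
    lang: level for level, langs in RESOURCE_LEVELS.items() for lang in langs
}


def mean_by_level(accuracy_by_language):
    """Average per-language accuracies within each resource level."""
    buckets = {level: [] for level in RESOURCE_LEVELS}
    for lang, acc in accuracy_by_language.items():
        buckets[LEVEL_OF[lang]].append(acc)
    return {level: sum(v) / len(v) for level, v in buckets.items() if v}


if __name__ == "__main__":
    # Placeholder scores, for illustration only.
    print(mean_by_level({"French": 80.0, "German": 70.0, "Yoruba": 40.0}))
```

The three lists contain 18, 11, and 13 languages respectively, which adds up to the 42 languages covered by Global-MMLU.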
4.2 Results
Evaluations on Human-Translated Data. To assess the performance of models on high-
quality, human-translated data, we conducted evaluations using the subset of 14 languages with
human-translated data. The analysis focuses on both the CA and CS subsets to explore
how models handle tasks with and without cultural context.
[Figure 9 plot: two panels, Culturally Agnostic (CA) and Culturally Sensitive (CS); y-axis: Accuracy (%), 50 to 90; x-axis: Aya Expanse-32B, CommandR+, Gemma2-27B, Llama-3.1-70B, Mistral-Nemo, Qwen2.5-32B, SEA-LION-v3, Claude Sonnet, GPT4o. Bar labels in extraction order: CA 10.78, 11.73, 7.98, 9.48, 8.60, 13.04, 7.55, 5.99, 6.17; CS 13.28, 14.20, 9.07, 9.66, 11.51, 13.10, 8.76, 6.06, 6.21.]
Figure 9: Model evaluations on CA and CS data samples for the 14 human-translated
languages. The error bars indicate the standard deviation across languages.
We evaluated 14 models from 9 different model families, including 2 closed-source models. Fig-
ure 9 presents the results aggregated across 14 languages. We note that the focus of this eval-
uation is not to compare model performances directly but to analyze their behaviors on CA
and CS datasets. Direct comparisons between proprietary models and open-weight models are
not feasible due to significant differences in model sizes (although we note that the parameter
sizes of proprietary models have not been officially disclosed) and different evaluation methods.
Nonetheless, the results show that closed-source proprietary models, such as GPT-4o and Claude
3.5 Sonnet, consistently outperform smaller open-source models. Interestingly, the performance
gap between these models is narrower on CS datasets than on CA datasets.
Additionally, we assess mid-size and large open-weight models on Global-MMLU Lite , a
fully human-translated (or post-edited) subset evenly balanced between CS and CA samples.
Unlike the full Global-MMLU , this balance enables clearer comparisons. Figure 10 shows
that overall, models perform better on the CA portion.
[Figure 10 plot: two panels, Culturally Agnostic (CA) and Culturally Sensitive (CS); y-axis ticks: 50 to 90; x-axis: Aya Expanse 32B, CommandR+, Qwen2.5 32B, SEA-LIONv3, Gemma2 27B, Llama-3.1 70B, Mistral Nemo. Bar labels in extraction order: CA 8.37, 9.29, 5.82, 7.85, 6.76, 9.73, 6.31; CS 10.88, 12.35, 11.93, 8.25, 8.77, 8.53, 12.30.]
Figure 10: Model evaluations on CA and CS samples in Global-MMLU Lite . Error
bars indicate standard deviation across languages.
Performance on CS is higher but presents more variance. Another key observation
is that the average accuracy across all models is higher on CS datasets compared to CA
datasets. This trend can be attributed to the nature of the CS samples, which are predom-
inantly drawn from Social Sciences and Humanities domains where models generally perform
better. In contrast, CA datasets include more challenging categories, such as Medical and
STEM, as illustrated in Figure 15.
However, the standard deviation in performance across languages is higher for CS data than
for CA data for all models. This can be attributed to several factors: culturally sensitive
tasks are inherently more challenging and require deeper contextual understanding, making them
more susceptible to variations in translation quality. Nuanced cultural, regional, or dialectal
references in CS tasks often amplify this sensitivity, as differences in how these references are
translated can affect model performance. Furthermore, many large language models are trained
predominantly on data from high-resource or Western cultures, leading to biases that favor these
contexts and cause inconsistencies when applied to less-represented cultures.
On Global-MMLU Lite , the pattern shifts: CS tasks have lower average accuracies and
greater variance than CA tasks. This highlights how cultural specificity increases performance
instability, when the CS and CA samples are balanced.
Evaluations Across High-, Mid-, and Low-Resource Languages. To analyze model
performance across languages with varying resource availability, we evaluated the models on
CA and CS subsets, categorized into high-, mid-, and low-resource languages. This
evaluation provides insights into how models handle linguistic diversity and cultural nuances
across different resource levels.
Performance degrades on low-resource languages with higher variability. For both
CA and CS datasets, high-resource languages consistently achieve the highest average
accuracy across all models. As expected, performance declines significantly for low-resource
languages due to the limited availability of high-quality training data, which hinders model
generalization. This decline is accompanied by an increase in performance variability, with
[Figure 11 plot: three rows of two panels each, Culturally Agnostic (CA) and Culturally Sensitive (CS); y-axis: Accuracy (%), 30 to 90; x-axis: Aya Expanse-32B, CommandR+, Gemma2-27B, Llama-3.1-70B, Mistral-Nemo, Qwen2.5-32B, SEA-LION-v3, Claude Sonnet, GPT4o. Bar labels in extraction order:
Top, CA: 2.85, 3.76, 3.44, 3.48, 4.49, 3.44, 3.96, 1.69, 1.63; CS: 3.45, 4.49, 4.26, 3.84, 5.78, 3.37, 5.03, 2.39, 2.15.
Mid, CA: 5.47, 6.38, 2.39, 2.63, 3.64, 4.60, 2.78, 1.65, 1.31; CS: 6.69, 8.17, 3.68, 3.01, 5.15, 5.13, 4.26, 2.67, 2.68.
Bottom, CA: 5.29, 4.90, 6.91, 8.73, 6.33, 7.88, 5.42, 5.14, 6.74; CS: 5.43, 5.54, 7.33, 9.77, 7.08, 7.95, 5.61, 5.27, 7.07.]
Figure 11: Model evaluations on (Top) high-resource, (Mid) mid-resource and (Bottom)
low-resource data samples for CA and CS subsets.
the standard deviation rising for mid-resource languages and even more so for low-resource
languages, particularly on CS datasets.
The average standard deviation for high-resource languages is 3.21 on CA datasets and
3.86 on CS datasets. For mid-resource languages, these values increase to 3.42 and 4.6,
respectively. Low-resource languages exhibit significantly higher standard deviations, with
averages rising to 6.37 on CA datasets and 6.78 on CS datasets. These represent increases
of 98% and 75% compared to high-resource languages, highlighting the greater variability and
sensitivity in low-resource settings. This increased variability in model performances highlights
the challenges of culturally sensitive tasks, which demand a nuanced understanding of regional
or dialectal references. Across all resource levels, performance on CS shows higher
variability than CA.
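The averages quoted above can be approximately reproduced from the standard deviations printed on the bars of Figure 11. The sketch below assumes that the per-model values for each panel are the nine bar labels as they appear in the extracted figure; the names are our own. The recomputed means closely match the reported ones, with small differences in the last digit, so the relative increases also differ slightly from the quoted percentages.

```python
from statistics import mean

# Bar labels from Figure 11 (standard deviation across languages, per model).
STD = {
    ("high", "CA"): [2.85, 3.76, 3.44, 3.48, 4.49, 3.44, 3.96, 1.69, 1.63],
    ("high", "CS"): [3.45, 4.49, 4.26, 3.84, 5.78, 3.37, 5.03, 2.39, 2.15],
    ("mid", "CA"): [5.47, 6.38, 2.39, 2.63, 3.64, 4.60, 2.78, 1.65, 1.31],
    ("mid", "CS"): [6.69, 8.17, 3.68, 3.01, 5.15, 5.13, 4.26, 2.67, 2.68],
    ("low", "CA"): [5.29, 4.90, 6.91, 8.73, 6.33, 7.88, 5.42, 5.14, 6.74],
    ("low", "CS"): [5.43, 5.54, 7.33, 9.77, 7.08, 7.95, 5.61, 5.27, 7.07],
}

# Average standard deviation per (resource level, subset).
AVG = {key: mean(values) for key, values in STD.items()}


def relative_increase(level, subset, baseline="high"):
    """Percentage increase of the average std over the baseline level."""
    return 100.0 * (AVG[(level, subset)] / AVG[(baseline, subset)] - 1.0)


if __name__ == "__main__":
    for (level, subset), value in AVG.items():
        print(f"{level:4s} {subset}: {value:.2f}")
    for subset in ("CA", "CS"):
        print(f"low vs high on {subset}: "
              f"+{relative_increase('low', subset):.0f}%")
```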
Model Rank Changes. This section explores how model performance rankings differ between
CA and CS datasets, calculated relative to their ranks on MA , across multiple languages.
Table 2 highlights rank changes for human-translated languages, organized by resource level:
high-resource, mid-resource, and low-resource. These rankings offer valuable insights into
how dataset type, resource availability and model size impact model performances. Comprehen-
sive rankings for all languages are available in Table 6 and Table 7 in Appendix F.1.
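To make the two summary statistics concrete, the sketch below counts them for individual rows of Table 2. It assumes that "rank changes" means the number of models whose rank differs from their rank on MA, and that "position changes" means the total number of positions moved; this reading is our interpretation, chosen because it is consistent with the description of Arabic on CA as two shifts of one position each. The function names are our own.

```python
def parse_shift(cell):
    """Convert a Table 2 cell such as '↑2', '↓1' or '-' to a signed integer."""
    cell = cell.strip()
    if cell == "-":
        return 0
    sign = {"↑": 1, "↓": -1}[cell[0]]
    return sign * int(cell[1:])


def summarize(row):
    """Return (rank changes, position changes) for one row of Table 2."""
    shifts = [parse_shift(cell) for cell in row]
    rank_changes = sum(1 for s in shifts if s != 0)
    position_changes = sum(abs(s) for s in shifts)
    return rank_changes, position_changes


# Rows copied from Table 2 (14 models each).
ARABIC_CA = "- - - - - - - - - ↑1 - ↓1 - -".split()
CHINESE_CS = "↑1 ↑1 ↑1 ↑2 ↑1 - ↓1 ↑1 - ↓3 ↓1 ↓2 ↑1 ↓1".split()

if __name__ == "__main__":
    print("Arabic, CA: ", summarize(ARABIC_CA))
    print("Chinese, CS:", summarize(CHINESE_CS))
```

Under this reading, Arabic on CA has 2 rank changes and 2 position changes, while Chinese on CS has 12 rank changes and 16 position changes, which illustrates the greater volatility on CS.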
The rank changes reveal three key findings:
1) Models perform differently across CA and CS datasets, with the latter show-
ing greater variation. Rankings on CA datasets exhibit minimal changes. For instance,
Italian, Japanese, and Portuguese show no rank changes, while Arabic and French each experi-
ence only two shifts, each by one position.
On the other hand, model performance varies significantly on CS datasets. Chinese and Hindi
emerge as the most sensitive languages to culture-specific knowledge, with models showing both
increases and decreases in rankings. Similar variations are evident in French, German, Italian,
Japanese, and Portuguese. Notably, models from the Aya Expanse and CommandR families tend
to show positive trends on CS datasets, particularly for these languages. On average, across all
languages, CA datasets see 3.4 rank changes and 3.7 position changes, whereas CS datasets
experience markedly higher volatility, with 5.7 rank changes and 7.3 position changes.
2) The difference between performance on CA and CS datasets is smaller on
low-resource languages. High-resource languages demonstrate relatively stable rankings on
CA datasets, with an average of 3.3 rank changes and a maximum shift of 3 positions. However,
on CS datasets, ranking changes are more pronounced, with an average of 6.8 rank changes and
9.1 position shifts. In contrast, mid-resource languages display moderate variability. While
small models face slightly greater fluctuations on CS datasets, their performance on CA
datasets remains more consistent. For mid-resource languages, the average rank changes are
3.7 on CA and 4.7 on CS , with corresponding position changes of 4.7 and 4.9. Among the
three resource groups, mid-resource languages show the smallest difference between CA and
CS performance.
Low-resource languages show an increase in the difference between CA and CS rank
changes compared to mid-resource. Average rank changes are 3.3 on CA datasets and 3.7
on CS , with position changes rising to 5.7 on CA and 7.9 on CS . Notably, this group
also experiences the largest rank changes. Table 3 highlights the most significant changes across
all languages, including rank shifts of up to 5 positions for Malagasy, and 13 ranking
changes for the models on Ukrainian. These findings underscore how resource levels amplify
rank changes, even within CA datasets.

Language    Dataset  Aya Exp.  Aya Exp.  CommandR  CommandR+ Gemma2    Gemma2    Llama-3.1 Llama-3.1 Mistral   Qwen2.5   Qwen2.5   SEA-LION  GPT4o     Claude
                     8B        32B                           9B        27B       8B        70B       Nemo      7B        32B       v3                  Sonnet
Arabic      CA       -         -         -         -         -         -         -         -         -         ↑1        -         ↓1        -         -
            CS       -         ↑1        -         -         -         ↓1        -         ↑1        -         -         ↓1        -         -         -
Chinese     CA       -         -         ↓1        -         ↑1        -         -         -         -         -         ↑1        -         ↓1        -
            CS       ↑1        ↑1        ↑1        ↑2        ↑1        -         ↓1        ↑1        -         ↓3        ↓1        ↓2        ↑1        ↓1
English     CA       -         -         -         -         -         ↓1        -         -         -         ↑1        ↑1        -         ↓1        -
            CS       -         ↑1        -         -         -         -         -         ↑1        -         ↓1        ↓1        -         -         -
French      CA       -         ↑1        -         -         -         -         -         -         -         ↓1        -         -         -         -
            CS       -         ↑2        ↑2        ↑1        -         ↓2        -         ↑1        -         ↓3        ↓1        ↑1        -         -
German      CA       -         ↓1        -         ↓1        -         ↑1        -         -         -         ↑1        -         -         -         -
            CS       -         -         ↓1        -         ↑2        -         -         ↑1        -         ↓3        ↓1        ↑2        -         -
Hindi       CA       -         ↑1        ↓2        ↓1        ↑1        -         -         -         -         -         -         ↑1        -         -
            CS       ↑1        ↓1        ↑1        ↑2        -         ↓1        ↑1        -         ↑1        ↓3        ↓1        -         ↑1        ↓1
Italian     CA       -         -         -         -         -         -         -         -         -         -         -         -         -         -
            CS       -         -         ↑1        ↑1        -         ↓1        -         ↑1        -         ↓2        ↓1        ↑1        -         -
Japanese    CA       -         -         -         -         -         -         -         -         -         -         -         -         -         -
            CS       -         ↑1        ↑1        ↑1        ↑1        ↓2        -         ↑1        -         ↓1        ↓1        ↓1        -         -
Portuguese  CA       -         -         -         -         -         -         -         -         -         -         -         -         -         -
            CS       -         ↑1        ↑1        ↑1        ↑1        ↓1        -         ↑1        -         ↓2        ↓1        ↓1        -         -
Spanish     CA       -         ↓1        -         ↓1        -         ↑1        -         -         -         ↑1        -         -         -         -
            CS       -         -         ↑1        -         ↑2        -         -         ↑1        -         ↓3        ↓1        -         -         -
Bengali     CA       -         ↑1        -         -         -         -         -         ↓1        ↓1        -         -         -         -         -
            CS       -         -         -         -         -         -         -         -         ↑1        ↓1        -         -         -         -
Indonesian  CA       -         -         ↓1        ↓1        ↓1        ↑1        -         -         -         ↑2        -         -         -         -
            CS       -         -         ↑1        -         -         -         ↓1        ↑1        ↑1        -         ↓1        ↓1        -         -
Korean      CA       ↓1        ↓1        ↓1        -         -         ↑1        ↑1        -         -         ↑1        -         -         -         -
            CS       -         ↑1        ↑1        ↓1        -         ↓1        -         ↑1        -         -         ↓1        -         -         -
Sinhala     CA       -         ↑1        -         -         -         -         ↓3        -         -         ↑2        -         -         -         -
            CS       -         ↓1        ↑1        ↑1        -         -         -         -         -         ↓1        -         -         -         -
Swahili     CA       -         ↓1        -         -         -         -         ↑1        -         -         -         -         -         -         -
            CS       -         -         ↑1        -         -         -         ↓1        -         -         -         -         -         ↓1        ↑1
Yoruba      CA       -         ↑1        ↓2        -         ↓1        -         -         -         -         ↑2        ↑1        ↓1        -         -
            CS       -         ↓1        ↑1        ↑1        ↑1        -         -         -         -         -         ↓2        -         -         -
Table 2: Changes in model rankings on CA and CS datasets, based on MA , across
human-translated languages, including English. Languages are categorized as high-, mid-,
and low-resource. Color-coded boxes indicate increases ( ↑ ) and decreases ( ↓ ) in rank.
3) Model size influences performance variations. We analyzed performance variations
across three model groups, as defined in the Model section (excluding closed-weight models due
to unknown sizes). Our findings highlight distinct trends for large, mid-size, and small models:
Large models demonstrate higher consistency across datasets and resource levels. The average
rank changes for large models are minimal, at 0.21 for CA and 0.67 for CS. The maximum
position shift for models in this group is 3, whereas it can reach 5 for small models. This consistency
reflects their robustness and higher capacity to generalize across diverse datasets.
Mid-size models, on the other hand, show much greater variability. Their average rank changes are
0.33 for CA and 1.97 for CS, indicating that they are more sensitive to dataset characteristics,
particularly in the CS datasets, which require cultural knowledge.
Small models exhibit the smallest difference in rank change between CA and CS (0.35 and
0.45, respectively). However, this apparent stability stems from their weaker overall performance
across both datasets. For instance, the average accuracy for small models is 51.3% on CA and
54.8% on CS, while mid-size models achieve 59.1% and 61.7%, and large models reach
61.6% and 66.8% on CA and CS, respectively.
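The text does not spell out how the average rank change is computed. One plausible reading, sketched below in Python, is the mean absolute shift per dataset. The shift values are the two Chinese rows of Table 2, read as the CA row followed by the CS row; that reading and the helper names mean_abs_shift and max_abs_shift are assumptions made for illustration. The group averages reported above would additionally restrict the columns to the models of one size group and pool over languages.

```python
# Rank shifts for Chinese, transcribed from Table 2 in model-column order.
# Positive = moved up, negative = moved down, 0 = unchanged ("-").
# Assumption: the first Chinese row is CA and the second is CS.
chinese_ca = [0, 0, -1, 0, 1, 0, 0, 0, 0, 0, 1, 0, -1, 0]
chinese_cs = [1, 1, 1, 2, 1, 0, -1, 1, 0, -3, -1, -2, 1, -1]


def mean_abs_shift(shifts):
    """Average number of positions moved, ignoring direction."""
    return sum(abs(s) for s in shifts) / len(shifts)


def max_abs_shift(shifts):
    """Largest single move in either direction."""
    return max(abs(s) for s in shifts)


print(round(mean_abs_shift(chinese_ca), 2))  # 0.29
print(round(mean_abs_shift(chinese_cs), 2))  # 1.14
print(max_abs_shift(chinese_ca), max_abs_shift(chinese_cs))  # 1 3
```

Under this reading, the Chinese CS ranking moves roughly four times as much as the CA ranking, which matches the direction of the group-level averages quoted above.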
Language	Dataset	Aya Exp. 8B	Aya Exp. 32B	CommandR	CommandR+	Gemma2 9B	Gemma2 27B	Llama-3.1 8B	Llama-3.1 70B	Mistral Nemo	Qwen2.5 7B	Qwen2.5 32B	SEA-LION-v3	GPT4o	Claude Sonnet
Greek	CA	↓1	↓1	-	-	-	↑1	-	-	↓1	↑2	-	-	-	-
Greek	CS	-	-	↑2	↑3	-	↓1	↑1	-	-	↓1	↓4	-	-	-
Ukrainian	CA	-	↑1	-	↓1	↓1	-	-	-	-	↑1	-	-	-	-
Ukrainian	CS	-	↑1	-	↑1	-	↓2	-	↑1	↑1	↓1	↓1	-	↑1	↓1
Malagasy	CA	-	↓1	-	-	-	-	-	-	-	↑1	-	-	-	-
Malagasy	CS	-	↑1	↑4	↑1	-	-	↓1	-	↑1	↓1	↓5	-	-	-
Shona	CA	-	-	-	-	↓1	-	-	-	-	-	↑1	-	↓1	↑1
Shona	CS	↑2	-	↑1	↑1	-	-	↑1	-	-	↓4	↓1	-	-	-
Table 3: Changes in model rankings on CA and CS datasets, based on MA on Greek,
Ukrainian, Malagasy, and Shona.
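As an illustration of how entries such as ↑1 or ↓2 in Tables 2 and 3 can be derived, the following Python sketch ranks models by accuracy on a reference set and on a subset, then reports the difference in rank. The model names and accuracies are made up, the function names are hypothetical, and breaking ties by input order is an assumption; the procedure actually used for the tables may differ.

```python
def ranks(scores):
    """Map each model to its rank (1 = best) by descending score.

    Ties are broken by input order because sorted() is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(reference, subset):
    """Positive values mean the model ranks higher on the subset."""
    ref, sub = ranks(reference), ranks(subset)
    return {model: ref[model] - sub[model] for model in reference}


def fmt(shift):
    """Render a shift the way the tables do: '-', '↑n', or '↓n'."""
    if shift == 0:
        return "-"
    return f"{'↑' if shift > 0 else '↓'}{abs(shift)}"


# Made-up accuracies for three hypothetical models (not from the paper).
ma = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60}
cs = {"model_a": 0.72, "model_b": 0.61, "model_c": 0.66}

shifts = rank_shifts(ma, cs)
print({model: fmt(s) for model, s in shifts.items()})
# {'model_a': '-', 'model_b': '↓1', 'model_c': '↑1'}
```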
Overall, we conclude that dataset characteristics significantly impact model performance
across all model sizes, though the magnitude of variability differs. Across all groups, models
are sensitive to the diverse cultural and linguistic nuances present in CS datasets,
with performance variations reflecting their capacity to adapt to dataset-specific nuances.
A similar trend appears in Global-MMLU Lite, where, despite the dataset being smaller and balanced,
performance volatility is still higher on CS datasets, particularly for low-resource languages, as
shown in Table 4.
Human Translated vs. Machine Translated. We compared models on Human-Translated
(HT) and Machine-Translated (MT) CS datasets to gain deeper insights into model behavior.
Figure 12 illustrates model performance for one high-resource language (French), one
mid-resource language (Korean), and one low-resource language (Yoruba).
The key finding is that models generally perform better on human-translated data for high-
resource languages. This is likely because these languages benefit from extensive in-language
training data. However, this trend shifts for mid-resource languages. The figure reveals that the
performance gap between HT and MT narrows for models such as Claude Sonnet and Qwen2.5
32B. Conversely, models like CommandR+ and Aya Expanse 32B continue to perform better
on HT data. Notably, these two models have strong Korean language support, which can be
attributed to a substantial amount of in-language training data.

Language	Dataset	Aya Exp. 32B	CommandR+	Gemma2 27B	Llama-3.1 70B	Mistral Nemo	Qwen2.5 32B	SEA-LION-v3
Arabic	CA	-	↓1	↑1	-	-	-	-
Arabic	CS	↑1	-	↓1	-	-	-	-
Chinese	CA	↑1	↓1	-	-	-	-	-
Chinese	CS	-	↑1	↓1	-	-	-	-
English	CA	↓1	↓1	↑1	↓1	-	↑1	↑1
English	CS	↑1	-	↓1	-	↑1	-	↓1
French	CA	↑1	↓1	-	-	-	-	-
French	CS	↓1	↑1	↓1	↑1	↑2	↓1	↓1
German	CA	-	↓1	-	↓1	-	↑2	-
German	CS	-	↑1	-	↓1	-	-	-
Hindi	CA	↓1	-	-	-	-	↓2	↑3
Hindi	CS	-	-	-	-	-	-	-
Italian	CA	↑2	↓3	-	-	-	-	↑1
Italian	CS	-	-	-	-	↑1	-	↓1
Japanese	CA	↑1	↓1	-	-	-	-	-
Japanese	CS	-	-	-	-	-	-	-
Portuguese	CA	↓1	↓2	↑1	↓1	-	↑1	↑2
Portuguese	CS	↑1	-	↓1	-	-	-	-
Spanish	CA	-	-	-	-	-	-	-
Spanish	CS	-	-	-	↑1	-	↓1	-
Bengali	CA	↑1	-	-	-	↓1	-	-
Bengali	CS	-	-	-	-	-	-	-
Indonesian	CA	-	-	-	-	-	-	-
Indonesian	CS	↑1	↑1	↓2	-	-	-	-
Korean	CA	↓1	↑1	-	-	-	-	-
Korean	CS	-	-	-	-	-	-	-
Swahili	CA	↓1	↑1	↑1	↓1	↑1	↓1	-
Swahili	CS	↑1	↓1	-	-	-	-	-
Yoruba	CA	-	↓2	-	↓2	-	↑1	↑3
Yoruba	CS	↑3	↑1	↓4	↑1	-	-	↓1
Table 4: Changes in model rankings on CA and CS datasets, based on total accuracy on
Global-MMLU Lite . Languages are categorized as high-, mid-, and low-resource.
Color-coded boxes indicate increases ( ↑ ) and decreases ( ↓ ) in rank.
For low-resource languages, a distinct pattern emerges. As shown in the figure, models such
as Claude Sonnet and GPT-4o perform significantly better on MT data than on HT data. Sim-
ilarly, CommandR+ and Qwen2.5 32B also show improved performance on MT data, albeit
with less pronounced differences. This behavior is likely because these models primarily rely on
machine-translated data for low-resource languages during training, and the distribution of the
machine-translated test set aligns more closely with their training data. Notably, the only model
demonstrating consistent performance across both HT and MT datasets is Aya Expanse 32B,
which can be attributed to its broad coverage and strong support for low-resource languages.
These results underscore the importance of in-language or human-translated datasets for eval-
uating low-resource languages. The Global-MMLU dataset provides a valuable tool for
assessing the in-language performance of large language models (LLMs) on low-resource lan-
guages, offering insights into their capabilities and limitations in such contexts.
5 Related Work
5.1 Multilingual Knowledge Evaluation
As the MMLU benchmark has become a standard for evaluating LLMs (Beeching et al., 2023;
OpenAI, 2024; Dubey et al., 2024; Üstün et al., 2024; Aryabumi et al., 2024), addressing its lim-
itations and introducing enhancements are essential to maintaining high evaluation standards.
For English, Gema et al. (2024) manually re-annotated 3K questions across 30 MMLU subjects
to identify quality issues or problematic questions and released the result as MMLU-redux. Wang et al.
(2024) introduced an extended version of this dataset, MMLU-Pro, which adds more challenging,
reasoning-focused questions and expands the answer choice set from four to ten options. MMLU-Pro+
extends the previous work by incorporating questions with multiple correct answers across
diverse domains and evaluating higher-order reasoning in LLMs (Taghanaki et al., 2024). While
these efforts enhance the difficulty and diversity of tasks, they remain restricted to English alone.
Figure 12: Comparison of model performance on human-translated and machine-translated CS
in French, Korean, and Yoruba.
Language-specific variants of comprehensive multiple-choice exam benchmarks are typically cen-
tered around a single language. Examples include ArabicMMLU (Koto et al., 2024), CMMLU (Li
et al., 2024a), IndoMMLU (Koto et al., 2023), ThaiExam (Pipatanakul et al., 2023), TurkishMMLU
(Yüksel et al., 2024), AfriMMLU (Adelani et al., 2024), Khayyam Challenge (Ghahroodi
et al., 2024), KMMLU (Son et al., 2024a), HAE-RAE (Son et al., 2024b) and VNHSGE (Dao
et al., 2023) covering Arabic, Chinese, Indonesian, Thai, Turkish, Persian, Korean, and Viet-
namese, respectively.
There have been multiple efforts to design and construct evaluation datasets that cater to mul-
tilingual settings. AGIEval is a compilation of human-centric standardized exams to assess lan-
guage model performance in English and Chinese (Zhong et al., 2023). BEnQ is similar but for
English and Bengali (Shafayat et al., 2024). EXAMS is a multilingual high school examination
collection covering 16 languages (Hardalov et al., 2020). M3EXAMS is a multimodal multi-
lingual benchmark supporting 9 languages with three educational levels (Zhang et al., 2023a).
Both evaluation sets process exams on various topics in different countries and build per-language
benchmarks. These initiatives strive to evaluate the performance of language models across var-
ious languages; however, they often support a small number of languages and lack a consistent,
standardized framework for direct comparison between languages. We note recent work IN-
CLUDE as an exception to this as one of the most extensive evaluation benchmarks, compiled
from local exams across various countries and languages, covering 44 languages (Romanou et al.,
2024).
To enable evaluation across a wider range of languages, efforts have also been made to translate
the MMLU dataset into multiple languages. Lai et al. (2023) use ChatGPT to translate the
English MMLU dataset into 26 languages. However, the quality of translations produced by
ChatGPT can vary significantly across different languages and is not always reliable (Robinson
et al., 2023). More recently OpenAI released MMMLU by translating MMLU into 14 lan-
guages using professional human translators, and we incorporate this high-quality dataset into
our benchmark.
5.2 Culturally-aware Evaluation
Recent research has increasingly focused on examining the cultural alignment of LLMs. Studies
such as Arora et al. (2022) and Cao et al. (2023) have explored LLMs’ ability to understand cross-
cultural differences in values and beliefs. To ensure accurate cross-cultural and cross-linguistic
representation, SEA-HELM8 (previously known as BHASA (Leong et al., 2023))9 is an evaluation
suite which emphasizes Southeast Asian languages and contains a variety of tasks, including
manually handcrafted linguistic diagnostics as well as manually translated and validated SEA-
IFEval and SEA-MTBench. Wang et al. (2023) and Masoud et al. (2024) demonstrate that
LLMs often reflect values and opinions aligned with Western culture, a trend that persists across
multiple languages. Additionally, benchmarks like those introduced by Naous et al. (2024) and
Rao et al. (2024) aim to measure cultural biases in LLMs, while Ventura et al. (2024) investigates
cultural biases within text-to-image diffusion models, proposing a comprehensive suite of cultural
evaluation techniques. Aakanksha et al. (2024) studied how to align language models while balancing dual
objectives: optimizing for a non-homogeneous set of languages and cultural
preferences, based on annotations from professional multilingual annotators, while minimizing
both global and local harms. Some studies focus on specific cultural aspects, such as Myung et al.
(2024), Magomere et al. (2024), and Montalan et al. (2024), which evaluate LLMs’ understanding
of everyday cultural knowledge across diverse cultures and regions.
In addition, several studies have explored evaluating multilingual visual language models (VLMs).
PangeaBench is a holistic evaluation suite encompassing 14 pre-existing datasets covering 47
languages (Yue et al., 2024). Romero et al. (2024) presents CVQA, a culturally diverse multilin-
gual Visual Question Answering benchmark that includes culturally-driven images and questions
across 30 countries and 31 languages. Vayani et al. (2024) introduces a multimodal benchmark
including culturally diverse images paired with text across 100 languages.
Numerous studies have also explored the role of pre-training in shaping the cultural biases present
in LLMs. For example, Chen et al. (2024) examines the impact of native versus translated data on
LLM instruction tuning and evaluation. Their findings reveal that models fine-tuned with native
instructions typically outperform those trained using translated data. Similarly, Choenni et al.
(2024) investigates the reliability of machine translation as a substitute for human translation in
large-scale multilingual evaluations, highlighting its effectiveness across a diverse set of languages.
Üstün et al. (2024) released the Aya-101 model and focused on in-language prompting and using
a comprehensive dataset of human-written data for instruction tuning large language models
across 114 languages to reflect local culture and preferences (Singh et al., 2024). Additionally,
significant efforts have been made to incorporate knowledge from various cultures into LLMs
to achieve broader cultural alignment. For instance, Li et al. (2024b) proposes a cost-effective
fine-tuning strategy to embed cultural differences into LLMs, facilitating better representation
and understanding of global cultural nuances. Meanwhile, AlKhamissi et al. (2024) introduces
“Anthropological Prompting,” a novel method that employs anthropological reasoning to enhance
the cultural alignment of LLMs.
8 An acronym for SouthEast Asian Holistic Evaluation of Language Models.
9 https://leaderboard.sea-lion.ai
5.3 Participatory Open Science Projects
Participatory research empowers diverse communities to actively contribute to the research pro-
cess, ensuring that outcomes are inclusive, contextually relevant, and address real-world needs.
Previous participatory research efforts have primarily focused on specific regions or tasks such as
translation, character recognition, audio segmentation, and transcription. For instance, Clanuwat
et al. (2018) addressed the challenge of reading and understanding Kuzushiji, an old cursive style
of Japanese writing no longer commonly used. Another notable example of culturally diverse
data collection is MaRVL (Multicultural Reasoning over Vision and Language; Liu et al., 2021),
where native speakers of five typologically, genealogically, and geographically diverse languages
(Indonesian, Swahili, Tamil, Turkish, and Mandarin Chinese) contributed images reflecting their
cultures. Professional linguists fluent in these languages then wrote captions for the images.
However, MaRVL’s dataset is relatively small, with fewer than 8,000 data points, limiting its
use to evaluation purposes. Similarly, Hernandez Mena & Meza Ruiz (2022) developed eight
open-access resources for Mexican and Latin American Spanish by establishing a social service
program where students voluntarily contributed to tasks like audio segmentation and transcrip-
tion. Notably, these efforts are largely concentrated on image and speech, unlike our work, which
focuses on text. Cañete et al. (2020) spearheaded the collection of a Latin American Spanish
dataset to train a language model. Guevara-Rukoz et al. (2020) explored the development of a
crowd-sourced corpus for Latin American Spanish dialects to address resource scarcity for these
languages. Masakhane utilized a participatory research framework to curate NLP datasets and
build models for several underrepresented African languages (∀ et al., 2020; Adelani et al., 2021;
2023). Aligned with the goals of having a participatory framework and open-access resources,
Project SEALD,10,11 a collaboration between AI Singapore and Google Research, pioneered mul-
tilingual data collection for Large Language Models (LLMs) in Southeast Asia (SEA). The output
of this project continues to contribute to the development of open-source multilingual models
in this region, namely SEA-LION12 and its derivatives, such as WangchanLion (Phatthiyaphai-
bun et al., 2024) and Sahabat-AI.13 Similarly, the NusaCrowd initiative by Cahyawijaya et al.
(2023) focused on aggregating and standardizing data sources for Indonesian languages. The
ongoing SEACrowd project14 represents a similar effort, aiming to standardize data resources
for all Southeast Asian languages (Lovenia et al., 2024). The Aya Initiative, through a global
community effort of 3,000 contributors, collected instruction data in 114 languages, fostering linguistic
diversity and inclusivity to create one of the largest multilingual datasets for advancing
state-of-the-art language models (Singh et al., 2024; Üstün et al., 2024).
10 An acronym for Southeast Asian Languages in One Network Data.
11 https://aisingapore.org/aiproducts/southeast-asian-languages-in-one-network-data-seald/
12 https://sea-lion.ai
13 https://sahabat-ai.com
14 https://github.com/SEACrowd
6 Conclusion
We evaluate the cultural biases present in MMLU and find that 28% of all questions require
culturally-sensitive knowledge. In particular, progress on MMLU depends heavily on learning
Western-centric concepts. For questions requiring geographic knowledge, the vast majority focus
on North America and Europe. This cultural bias remains in translated variants of MMLU
that are widely used for multilingual LLM evaluation, which reduces the dataset’s practical
effectiveness as a global benchmark and risks over-indexing evaluations on Western-centric idioms
and knowledge.
We examine the impact of translation artifacts and cultural bias on multilingual model rankings.
We introduce Global-MMLU and Global-MMLU Lite, multilingual multi-domain
datasets that distinguish between culturally-sensitive (CS) and culturally-agnostic (CA)
knowledge. By incorporating professional and crowd-sourced annotations, these subsets enable
rigorous multilingual model evaluation.
Finally, we evaluate a large group of state-of-the-art open-weight and proprietary models to
understand performance differences on both these subsets. We find that model rankings change
depending on whether models are assessed on culturally-sensitive or culturally-agnostic subsets,
highlighting that progress on translated MMLU is insufficient as an indicator of performance.
Instead, we recommend that multilingual evaluations report results on Global-MMLU, and on both the
CA and CS subsets, as part of a holistic evaluation of progress in multilingual LLM
capabilities. As part of our commitment to the research ecosystem, we release Global-MMLU
and Global-MMLU Lite under a fully permissive license for use in evaluations at
https://hf.co/datasets/CohereForAI/Global-MMLU and
https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite.
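The recommendation to report CA and CS results separately can be sketched in a few lines of Python. The snippet below is a minimal illustration that works on plain dictionaries rather than on the released dataset: the field names cultural_sensitivity_label, answer, and prediction are assumptions about how an evaluation harness might store its records, and the four records are invented for the example.

```python
from collections import defaultdict


def accuracy_by_subset(records, label_key="cultural_sensitivity_label"):
    """Return accuracy per subset label (for example 'CA' and 'CS').

    Each record is assumed to hold the gold answer, the model prediction,
    and a subset label under `label_key` (a hypothetical field name).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        label = record[label_key]
        total[label] += 1
        correct[label] += int(record["prediction"] == record["answer"])
    return {label: correct[label] / total[label] for label in total}


# Invented records for illustration only.
records = [
    {"cultural_sensitivity_label": "CA", "answer": "A", "prediction": "A"},
    {"cultural_sensitivity_label": "CA", "answer": "B", "prediction": "C"},
    {"cultural_sensitivity_label": "CS", "answer": "D", "prediction": "D"},
    {"cultural_sensitivity_label": "CS", "answer": "C", "prediction": "C"},
]

print(accuracy_by_subset(records))  # {'CA': 0.5, 'CS': 1.0}
```

Keeping the two numbers apart, rather than averaging them into one score, is what allows the rank differences discussed in this paper to surface.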
7 Limitations
Uneven distribution of contributions. Beyond the gold standard languages where we engaged
with compensated annotators, participation from community annotators was heavily skewed
across languages. Despite a large volume of community annotators, there was a ‘long tail’ of
annotators only contributing one or two annotations. Similarly, there is a huge gap between
languages with the highest number of contributions and ones with the lowest number of contri-
butions. Consequently, this suggests potential unevenness in dataset distributions across different
languages and a lack of annotator diversity within some languages dominated by one or two fre-
quent contributors.
Language and dialect coverage. We focus on 42 languages for Global-MMLU. However,
this is still only a tiny fraction of the world’s linguistic diversity. Of the world’s approximately
7,000 languages, only half of them are captured in any sort of written form (Adda et al., 2016).
Of this half, only a few hundred are included on the internet in machine readable corpora (Adda
et al., 2016). Future work is needed to continue to improve evaluations beyond these 42 languages
and to take into account how technology serves different dialects (a topic we do not address here).
Geo-cultural variation within a language often gives rise to new dialects or creoles over time
(Zampieri et al., 2020; Wolfram, 1997) and, as such, dialects can serve an important function in
establishing and maintaining cultural identity (Falck et al., 2012). Many different dialects that
are generally recognized as belonging to a single parent language are not represented in this
evaluation dataset.
Toxic or offensive speech. Our annotation interface does not contain specific flags for toxic,
harmful, or offensive speech, so it is possible that Global-MMLU contains some data that
could be considered harmful. We believe this is of relatively low risk because of the nature of the
original MMLU and the focus on examination material. However, we did not monitor or track
this explicitly during our cultural sensitivity annotations or translation post-edits.
Region Category Assignment: For the annotation of geographically sensitive questions, we
classified regions into six geographic regions (Africa, Asia, Europe, North America, Oceania,
and South America).15 However, based on subsequent discussions, we recommend that future work
switch to the taxonomy proposed by the World Bank, which is more granular and includes
separate designations for Central America and Sub-Saharan Africa.16
Identifying cultural sensitivity does not guarantee cultural inclusion. We acknowledge
that efforts like the proposed Global-MMLU highlight important limitations in current datasets
by identifying gaps in non-Western cultural representation. Identifying whether a dataset is
culturally agnostic or not is highly relevant, as mere translations may create the illusion that
datasets are more culturally inclusive, and that models are being validated accordingly, when this is
not actually the case. However, it must be noted that such efforts do not fully resolve the issue. Future
work must prioritize the integration of diverse culturally grounded knowledge to achieve true
inclusivity and fairness in multilingual AI evaluation.
8 Acknowledgments
We would like to thank members of the Cohere For AI community who championed this initia-
tive and helped with annotating samples for cultural sensitivity as well as improving translation
quality across many languages. In particular, we recognize Ashay Srivastava, Aurélien-Morgan
Claudon, Bevnm SaiAsrit, Danylo Boiko, Hanna Yukhymenko, Sai Vineetha Baddepudi Venkata
Naga Sri, Sangyeon Kim, Tadesse Destaw Belay, Alperen Ünlü, Mohammed Hamdy, Muham-
mad Rafi Sudrajat, Olusanya Joy Naomi, Vu Trong Ki, Yiyang Nan, Abdelmoneim Shahd, Arwa
ALaya, Bimasena Putra, Emad Alghamdi, Fabian Farestam, Mridul Sharma, Sayuru Bopitiya,
Surya Abhinai who contributed a significant amount to each of their languages. A special thank
you to Claire Cheng and Trisha Starostina for helping to coordinate the Cohere professional an-
notators who contributed to this project. We thank all these compensated experts who provided
their language knowledge to comprehensively improve quality over our gold languages.
15https://www.pewresearch.org/global/2013/06/04/regional-categorization/
16https://ourworldindata.org/world-region-map-definitions

References
Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer,
Marzieh Fadaee, and Sara Hooker. The multilingual alignment prism: Aligning global and
local preferences to reduce harm. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen
(eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process-
ing, pp. 12027–12049, Miami, Florida, USA, November 2024. Association for Computational
Linguistics. URL https://aclanthology.org/2024.emnlp-main.671.
Gilles Adda, Sebastian Stüker, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier,
David Blachon, Hélène Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov,
Guy-Noël Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Annie Rialland, Mark Van de
Velde, François Yvon, and Sabine Zerbian. Breaking the unwritten language barrier: The bulb
project. Procedia Computer Science, 81:8–14, 2016. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2016.04.023.
URL https://www.sciencedirect.com/science/article/pii/S1877050916300370.
SLTU-2016 5th Workshop on Spoken Language Technologies for
Under-resourced languages 09-12 May 2016 Yogyakarta, Indonesia.
David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Con-
stantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder,
Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue,
Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene
Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani,
Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David,
Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Em-
manuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde,
Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode,
Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dib-
ora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing
Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima
DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei.
MasakhaNER: Named entity recognition for African languages. Transactions of the Asso-
ciation for Computational Linguistics, 9:1116–1131, 2021. doi: 10.1162/tacl_a_00416. URL
https://aclanthology.org/2021.tacl-1.66.
David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo
Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo,
Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David,
Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abra-
ham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad,
Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen
Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolu-
lope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi,
Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu,
Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan,
Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf,
Mardiyyah Oduwole, Kanda Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos
Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed,
Evrard Ngabire, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp. MasakhaNEWS: News
topic classification for African languages. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu,
Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi (eds.), Proceedings of the 13th Inter-
national Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-
Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 144–159, Nusa Dua, Bali, November 2023. Association for Computational Linguistics. doi:
10.18653/v1/2023.ijcnlp-main.10. URL https://aclanthology.org/2023.ijcnlp-main.10.
David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Zhuang Yun Jian, Jesujoba Oluwadara
Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chia-
maka Chukwuneke, Happy Buzaaba, Blessing K. Sibanda, Godson Kalipe, Jonathan Mukiibi,
Salomon Kabongo KABENAMUALU, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela,
Nkiruka Bridget Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei,
Sokhar Samb, Tadesse Kebede Guge, and Pontus Stenetorp. Irokobench: A new benchmark
for african languages in the age of large language models. ArXiv, abs/2406.03368, 2024. URL
https://api.semanticscholar.org/CorpusID:270258352.
Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. Investigating
cultural alignment of large language models. arXiv preprint arXiv:2402.13231, 2024.
Arnav Arora, Lucie-Aimée Kaffee, and Isabelle Augenstein. Probing pre-trained language models
for cross-cultural differences in values. arXiv preprint arXiv:2203.13722, 2022.
Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin,
Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max
Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil
Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. Aya 23: Open weight releases to
further multilingual progress, 2024. URL https://arxiv.org/abs/2405.15032.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen
Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard.
https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, 2023.
Abhijit Bendale, Michael Sapienza, Steven Ripplinger, Simon Gibbs, Jaewon Lee, and Pranav
Mistry. Sutra: Scalable multilingual language model architecture, 2024. URL
https://arxiv.org/abs/2405.06694.
Steven Bird. Local languages, third spaces, and other high-resource scenarios. In Smaranda
Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7817–7829,
Dublin, Ireland, May 2022. Association for Computational Linguistics. doi:
10.18653/v1/2022.acl-long.539. URL https://aclanthology.org/2022.acl-long.539.
Abeba Birhane, William Isaac, Vinodkumar Prabhakaran, Mark Diaz, Madeleine Clare Elish,
Iason Gabriel, and Shakir Mohamed. Power to the people? opportunities and challenges for
participatory ai. In Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO
’22. ACM, October 2022. doi: 10.1145/3551624.3555290. URL
http://dx.doi.org/10.1145/3551624.3555290.
Yuri Bizzoni, Tom S Juzek, Cristina España-Bonet, Koel Dutta Chowdhury, Josef van Genabith,
and Elke Teich. How human is machine translationese? comparing human and machine
translations of text and speech. In Marcello Federico, Alex Waibel, Kevin Knight, Satoshi
Nakamura, Hermann Ney, Jan Niehues, Sebastian Stüker, Dekai Wu, Joseph Mariani, and
Francois Yvon (eds.), Proceedings of the 17th International Conference on Spoken Language
Translation, pp. 280–290, Online, July 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.iwslt-1.34. URL https://aclanthology.org/2020.iwslt-1.34.
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto,
Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer San-
toso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan
Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali
Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith
Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito
Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo
Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timo-
thy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Pur-
warianti. NusaCrowd: Open source initiative for Indonesian NLP resources. In Anna
Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for
Computational Linguistics: ACL 2023, pp. 13745–13818, Toronto, Canada, July 2023.
Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.868. URL
https://aclanthology.org/2023.findings-acl.868.
Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. Assessing
cross-cultural alignment between chatgpt and human societies: An empirical study. arXiv
preprint arXiv:2303.17466, 2023.
Isaac Caswell, Theresa Breiner, Daan van Esch, and Ankur Bapna. Language ID in the wild:
Unexpected challenges on the path to a thousand-language web text corpus. In Donia Scott,
Nuria Bel, and Chengqing Zong (eds.), Proceedings of the 28th International Conference on
Computational Linguistics, pp. 6588–6608, Barcelona, Spain (Online), December 2020. Inter-
national Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.579.
URL https://aclanthology.org/2020.coling-main.579.
José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez.
Spanish pre-trained bert model and evaluation data. In PML4DC at ICLR 2020, 2020.
Pinzhen Chen, Simon Yu, Zhicheng Guo, and Barry Haddow. Is it good data for multilingual
instruction tuning or just bad multilingual evaluation for large language models?, 2024. URL
https://arxiv.org/abs/2406.12822.
Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba,
Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami,
Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit,
Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, and Antoine Bosse-
lut. Meditron-70b: Scaling medical pretraining for large language models, 2023. URL
https://arxiv.org/abs/2311.16079.
Rochelle Choenni, Sara Rajaee, Christof Monz, and Ekaterina Shutova. On the evaluation prac-
tices in multilingual nlp: Can machine translation offer an alternative to human translations?,
2024. URL https://arxiv.org/abs/2406.14267.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and
David Ha. Deep learning for classical japanese literature, 2018.
Eric Corbett, Emily Denton, and Sheena Erete. Power and public participation in ai. In Pro-
ceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and
Optimization, EAAMO ’23, New York, NY, USA, 2023. Association for Computing Machinery.
ISBN 9798400703812. doi: 10.1145/3617694.3623228. URL
https://doi.org/10.1145/3617694.3623228.
Xuan-Quy Dao, Ngoc-Bich Le, The-Duy Vo, Xuan-Dung Phan, Bac-Bien Ngo, Van-Tien Nguyen,
Thi-My-Thanh Nguyen, and Hong-Phuoc Nguyen. Vnhsge: Vietnamese high school graduation
examination dataset for large language models, 2023. URL
https://arxiv.org/abs/2305.12199.
Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. The participatory turn in ai
design: Theoretical foundations and the current state of practice. Proceedings of the 3rd ACM
Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 2023. URL
https://api.semanticscholar.org/CorpusID:263605822.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
Oliver Falck, Stephan Heblich, Alfred Lameli, and Jens Südekum. Dialects, cultural identity,
and economic exchange. Journal of urban economics, 72(2-3):225–239, 2012.
Allan M. Feldman. Majority Voting, pp. 161–177. Springer US, Boston, MA, 1980. ISBN
978-1-4615-8141-3. doi: 10.1007/978-1-4615-8141-3_10. URL
https://doi.org/10.1007/978-1-4615-8141-3_10.
Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models.
First Monday, November 2023. ISSN 1396-0466. doi: 10.5210/fm.v28i11.13346. URL
http://dx.doi.org/10.5210/fm.v28i11.13346.
∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbo-
hungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabena-
mualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez
Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-
Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi
Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro
Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness
Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher
Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing
Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Ab-
dallah Bashir. Participatory research for low-resourced machine translation: A case study
in African languages. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the
Association for Computational Linguistics: EMNLP 2020, Online, November 2020. Association
for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.195. URL
https://aclanthology.org/2020.findings-emnlp.195.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles
Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas
Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron,
Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework
for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria
Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi
Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken,
and Pasquale Minervini. Are we done with mmlu?, 2024. URL
https://arxiv.org/abs/2406.04127.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya
Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti,
Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya
Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti,
Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christo-
pher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena
Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru,
Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin,
James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy
Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican,
Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth,
Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar
Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu,
Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian
Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted
Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed,
Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals,
Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Bar-
ral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev,
and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024.
URL https://arxiv.org/abs/2403.08295.
Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dast-
gheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban.
Khayyam challenge (persianmmlu): Is your llm truly wise to the persian language?, 2024.
URL https://arxiv.org/abs/2404.06644.
Adriana Guevara-Rukoz, Isin Demirsahin, Fei He, Shan-Hui Cathy Chu, Supheakmungkol Sarin,
Knot Pipatsrisawat, Alexander Gutkin, Alena Butryna, and Oddur Kjartansson. Crowdsourc-
ing Latin American Spanish for low-resource text-to-speech. In Nicoletta Calzolari, Frédéric
Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hi-
toshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk,
and Stelios Piperidis (eds.), Proceedings of the Twelfth Language Resources and Evaluation
Conference, pp. 6504–6513, Marseille, France, May 2020. European Language Resources As-
sociation. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.801.
Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov, Ivan Koychev, and
Preslav Nakov. EXAMS: A multi-subject high school examinations dataset for cross-lingual
and multilingual question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang
Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 5427–5444, Online, November 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.emnlp-main.438. URL
https://aclanthology.org/2020.emnlp-main.438.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Carlos Daniel Hernandez Mena and Ivan Vladimir Meza Ruiz. Creating Mexican Spanish lan-
guage resources through the social service program. In Chris Callison-Burch, Christopher
Cieri, James Fiumara, and Mark Liberman (eds.), Proceedings of the 2nd Workshop on Novel
Incentives in Data Collection from People: models, implementations, challenges and results
within LREC 2022, pp. 20–24, Marseille, France, June 2022. European Language Resources
Association. URL https://aclanthology.org/2022.nidcp-1.4.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark,
AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv
preprint arXiv:2410.21276, 2024.
Mika Hämäläinen. Endangered Languages are not Low-Resourced!, pp. 1–11. University of
Helsinki, March 2021. doi: 10.31885/9789515150257.1. URL
http://dx.doi.org/10.31885/9789515150257.1.
Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srini-
vasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. Unsung
challenges of building and deploying language technologies for low resource language commu-
nities. In Dipti Misra Sharma and Pushpak Bhattacharya (eds.), Proceedings of the 16th
International Conference on Natural Language Processing, pp. 211–219, International Insti-
tute of Information Technology, Hyderabad, India, December 2019. NLP Association of India.
URL https://aclanthology.org/2019.icon-1.25.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state
and fate of linguistic diversity and inclusion in the nlp world. arXiv preprint arXiv:2004.09095,
2020.
Meryem Karlık. Exploring the impact of culture on language learning: How understanding cul-
tural context and values can deepen language acquisition. International Journal of Language,
Linguistics, Literature and Culture, 2:5–11, 09 2023. doi: 10.59009/ijlllc.2023.0035.
Naomi Kipuri. Chapter ii: culture. In UN, Department of Economic and Social Affairs, Divi-
sion for Social Policy and Development, Secretariat of the Permanent Forum on Indigenous
Issues (ed.), State of the world’s indigenous peoples: ST/ESA/328, New York: United Nations
publication, pp. 51–81, 2009.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Chris-
tian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, et al.
Findings of the wmt24 general machine translation shared task: the llm era is here but mt
is not solved yet. In Proceedings of the Ninth Conference on Machine Translation, pp. 1–46,
2024.
Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th
annual meeting of the association for computational linguistics: Human language technologies,
pp. 1318–1326, 2011.
Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. Large language models only
pass primary school exams in indonesia: A comprehensive test on indommlu, 2023. URL
https://arxiv.org/abs/2310.04928.
Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha
Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash,
Preslav Nakov, and Timothy Baldwin. Arabicmmlu: Assessing massive multitask language
understanding in arabic, 2024. URL https://arxiv.org/abs/2402.12840.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-
Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang
Setyawan, et al. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions
of the Association for Computational Linguistics, 10:50–72, 2022. doi: 10.1162/tacl_a_00447.
URL https://aclanthology.org/2022.tacl-1.4.
Klaus Krippendorff. Reliability in content analysis: Some common misconceptions and recom-
mendations. Human Communication Research, 30(3):411–433, 2004.
William Labov. The social motivation of a sound change. Word, 19(3):273–309, 1963.
William Labov. The social stratification of (r) in new york city department stores. In Dialect
and language variation, pp. 304–329. Elsevier, 1986.
Viet Lai, Chien Nguyen, Nghia Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan Rossi, and Thien
Nguyen. Okapi: Instruction-tuned large language models in multiple languages with reinforce-
ment learning from human feedback. In Yansong Feng and Els Lefever (eds.), Proceedings of the
2023 Conference on Empirical Methods in Natural Language Processing: System Demonstra-
tions, pp. 318–327, Singapore, December 2023. Association for Computational Linguistics. doi:
10.18653/v1/2023.emnlp-demo.28. URL https://aclanthology.org/2023.emnlp-demo.28.
Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer
Sarveswaran, and William Chandra Tjhi. Bhasa: A holistic southeast asian linguistic and
cultural evaluation suite for large language models. arXiv preprint arXiv:2309.06085, 2023.
Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals.
Soviet Physics Doklady, 10(8):707–710, 1966.
Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and
Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese,
2024a. URL https://arxiv.org/abs/2306.09212.
Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen,
Bill Yuchen Lin, Nouha Dziri, Xiang Ren, and Yejin Choi. Culture-gen: Revealing global
cultural perception in language models through natural language prompting, 2024b. URL
https://arxiv.org/abs/2404.10199.
Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. Culturally aware and adapted nlp: A
taxonomy and a survey of the state of the art. arXiv preprint arXiv:2406.03930, 2024.
Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond
Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, pp. 10467–10485,
Online and Punta Cana, Dominican Republic, November 2021. Association for Computational
Linguistics. URL https://aclanthology.org/2021.emnlp-main.818.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jen-
nifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial,
Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Fred-
erikus Hudi, Jann Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William
Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus
Irawan, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonan-
gan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi
Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin
Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Daman-
huri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V.
Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Tai Ngee
Chia, Ayu Purwarianti, Sebastian Ruder, William Chandra Tjhi, Peerat Limkonchotiwat, Al-
ham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng Xin
Yong, and Samuel Cahyawijaya. SEACrowd: A multilingual multimodal data hub and
benchmark suite for Southeast Asian languages. In Yaser Al-Onaizan, Mohit Bansal, and
Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural
Language Processing, pp. 5155–5203, Miami, Florida, USA, November 2024. Association
for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.296. URL
https://aclanthology.org/2024.emnlp-main.296.
Alexandra Luccioni and Joseph Viviano. What’s in the box? an analysis of undesirable content in
the Common Crawl corpus. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.),
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and
the 11th International Joint Conference on Natural Language Processing (Volume 2: Short
Papers), pp. 182–189, Online, August 2021. Association for Computational Linguistics. doi:
10.18653/v1/2021.acl-short.24. URL https://aclanthology.org/2021.acl-short.24.
Jabez Magomere, Shu Ishida, Tejumade Afonja, Aya Salama, Daniel Kochin, Foutse Yuehgoh,
Imane Hamzaoui, Raesetje Sefala, Aisha Alaagib, Elizaveta Semenova, et al. You are what you
eat? feeding foundation models a regionally diverse food dataset of world wide dishes. arXiv
preprint arXiv:2406.09496, 2024.
Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, and Miguel Rodrigues. Cultural
alignment in large language models: An explanatory analysis based on hofstede’s cultural
dimensions, 2024. URL https://arxiv.org/abs/2309.12342.
Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini
Rengarajan, William Chandra Tjhi, and Alham Fikri Aji. Kalahi: A handcrafted, grassroots
cultural llm evaluation suite for filipino, 2024. URL https://arxiv.org/abs/2409.15380.
Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri
Aji, and Monojit Choudhury. Cultural conditioning or placebo? on the effectiveness of socio-
demographic prompting. arXiv preprint arXiv:2406.11661, 2024.
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsu-
vas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, et al. Blend: A
benchmark for llms on everyday knowledge in diverse cultures and languages. arXiv preprint
arXiv:2406.09948, 2024.
Tarek Naous, Michael J. Ryan, Alan Ritter, and Wei Xu. Having beer after prayer? measuring
cultural bias in large language models, 2024. URL https://arxiv.org/abs/2305.14456.
Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole,
Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon
Kabongo, Salomey Osei, et al. Participatory research for low-resourced machine translation:
A case study in african languages. arXiv preprint arXiv:2010.02353, 2020.
NLLB-Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield,
Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler
Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gon-
zalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk
Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey
Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn,
Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
No language left behind: Scaling human-centered machine translation, 2022.
OpenAI. GPT-4 Technical Report. 2024. URL https://arxiv.org/abs/2303.08774.
Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM Evaluators Recognize and Favor
Their Own Generations, 2024. URL https://arxiv.org/abs/2404.13076.
Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat
Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchai-
chot, Ekapol Chuangsuwanich, and Sarana Nutanong. Wangchanlion and wangchanx mrc eval.
arXiv preprint arXiv:2403.16127, 2024.
Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol,
Ruangsak Patomwong, Pathomporn Chokchainant, and Kasima Tharnpipitchai. Typhoon:
Thai large language models, 2023. URL https://arxiv.org/abs/2312.13951.
Maja Popović. chrf++: words helping character n-grams. In Proceedings of the Second Conference
on Machine Translation, pp. 612–618, Copenhagen, Denmark, September 2017. Association
for Computational Linguistics. doi: 10.18653/v1/W17-4770. URL
https://aclanthology.org/W17-4770.
Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, and Maarten Sap. Normad:
A framework for measuring the cultural adaptability of large language models, 2024. URL
https://arxiv.org/abs/2404.12464.
Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. Chatgpt mt:
Competitive for high- (but not low-) resource languages, 2023. URL
https://arxiv.org/abs/2309.07423.
Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shiv-
alika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso
Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny
Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou,
Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Is-
lam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher
Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang
Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther
Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh,
Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar,
Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan
Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, and Antoine Bosselut.
Include: Evaluating multilingual language understanding with regional knowledge, 2024. URL
https://arxiv.org/abs/2411.19799.
David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed,
Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo
Tonja, et al. Cvqa: Culturally-diverse multilingual visual question answering benchmark. 38th
Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and
Benchmarks, 2024.
Sheikh Shafayat, H Hasan, Minhajur Mahim, Rifki Putri, James Thorne, and Alice Oh. Benqa:
A question answering benchmark for bengali and english. In ACL Findings, 2024.
Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, and Sara Hooker. LLM
see, LLM do: Leveraging active inheritance to target non-differentiable objectives. In Yaser
Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference
on Empirical Methods in Natural Language Processing, pp. 9243–9267, Miami, Florida, USA,
November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.521.
URL https://aclanthology.org/2024.emnlp-main.521.
Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje Karlsson, Abinaya Mahendiran, Wei-
Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, Mike
Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Moura, Dominik
Krzemiński, Hakimeh Fadaei, Irem Ergun, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake,
Zaid Alyafeai, Vu Chien, Sebastian Ruder, Surya Guthikonda, Emad Alghamdi, Sebastian
Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee,
and Sara Hooker. Aya dataset: An open-access collection for multilingual instruction tun-
ing. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 11521–11567, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
doi: 10.18653/v1/2024.acl-long.620. URL https://aclanthology.org/2024.acl-long.620.
Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi,
Cheonbok Park, Kang Min Yoo, and Stella Biderman. Kmmlu: Measuring massive multitask
language understanding in korean, 2024a. URL https://arxiv.org/abs/2402.11548.
Guijin Son, Hanwool Lee, Suwan Kim, Huiseo Kim, Jaecheol Lee, Je Won Yeom, Jihyu Jung,
Jung Woo Kim, and Songseong Kim. Hae-rae bench: Evaluation of korean knowledge in
language models, 2024b. URL https://arxiv.org/abs/2309.02706.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won
Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-
bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,
2022.
Saeid Asgari Taghanaki, Aliasgahr Khani, and Amir Khasahmadi. Mmlu-pro+: Evaluating
higher-order reasoning and shortcut learning in llms, 2024. URL
https://arxiv.org/abs/2409.02257.
Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. Machine translationese: Effects
of algorithmic bias on linguistic complexity in machine translation. In Paola Merlo, Jorg
Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European
Chapter of the Association for Computational Linguistics: Main Volume, pp. 2203–2213, Online,
April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.188.
URL https://aclanthology.org/2021.eacl-main.188.
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar,
Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuck-
reja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker,
Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh
Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani,
Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova,
Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ke-
tan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei,
Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabr-
era, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoo-
jan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Am-
rin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha,
Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar
Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, and Fahad Khan.
All languages matter: Evaluating lmms on culturally diverse 100 languages, 2024. URL
https://arxiv.org/abs/2411.16508.
Mor Ventura, Eyal Ben-David, Anna Korhonen, and Roi Reichart. Navigating cultural chasms:
Exploring and unlocking the cultural pov of text-to-image models, 2024. URL
https://arxiv.org/abs/2310.01929.
Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and
Michael R Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large
language models. arXiv preprint arXiv:2310.12481, 2023.
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weim-
ing Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang,
Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-
task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574.
Walt Wolfram. Issues in dialect obsolescence: An introduction. American speech, 72(1):3–11,
1997.
Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja,
Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, and Graham Neu-
big. Pangea: A fully open multilingual multimodal llm for 39 languages. arXiv preprint
arXiv:2410.16153, 2024. URL https://arxiv.org/abs/2410.16153.
Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, and Hinrich Schütze.
Turkishmmlu: Measuring massive multitask language understanding in turkish, 2024. URL
https://arxiv.org/abs/2407.12402.
Marcos Zampieri, Preslav Nakov, and Yves Scherrer. Natural language processing for similar
languages, varieties, and dialects: A survey. Natural Language Engineering, 26(6):595–612,
2020.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing.
M3exam: A multilingual, multimodal, multilevel benchmark for examining large language
models, 2023a. URL https://arxiv.org/abs/2306.05179.
Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the
performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474,
2023b.
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied,
Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation
models, 2023. URL https://arxiv.org/abs/2304.06364.
Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun
Chen, and Lei Li. Multilingual machine translation with large language models: Empirical
results and analysis. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of
the Association for Computational Linguistics: NAACL 2024, pp. 2765–2781, Mexico City,
Mexico, June 2024. Association for Computational Linguistics. doi:
10.18653/v1/2024.findings-naacl.176. URL https://aclanthology.org/2024.findings-naacl.176.
Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke
Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blun-
som, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker.
Aya model: An instruction finetuned open-access multilingual language model, 2024. URL
https://arxiv.org/abs/2402.07827.
A Global-MMLU Languages
In this work, we refer to groups of languages as “lower-”, “mid-” or “higher-resourced” according
to their recorded, written, and catalogued NLP resources (Joshi et al., 2020). Following the
groupings in Singh et al. (2024), we group these 5 distinct clusters into a rough taxonomy
of lower-resourced (LR), mid-resourced (MR) and higher-resourced (HR) languages. We note
that this grouping is inevitably imperfect; languages and their varieties cannot be classified
absolutely or universally along this single dimension (Hämäläinen, 2021; Bird, 2022). In our
case, the categorization serves to aggregate languages in our analysis of the data distribution.
ISO Code  Language    Script      Resource Type  Translation
am        Amharic     Ge’ez       Low            MT + HT
ar        Arabic      Arabic      High           HT
bn        Bengali     Bengali     Mid            HT
cs        Czech       Latin       High           MT + HT
de        German      Latin       High           HT
el        Greek       Greek       Mid            MT
en        English     Latin       High           MT + HT
fil       Filipino    Latin       Mid            MT
fr        French      Latin       High           HT
ha        Hausa       Latin       Low            MT
he        Hebrew      Hebrew      Mid            MT
hi        Hindi       Devanagari  High           HT
ig        Igbo        Latin       Low            MT
id        Indonesian  Latin       Mid            HT
it        Italian     Latin       High           HT
ja        Japanese    Japanese    High           HT
ky        Kyrgyz      Cyrillic    Low            MT
ko        Korean      Hangul      Mid            HT
lt        Lithuanian  Latin       Mid            MT
mg        Malagasy    Latin       Low            MT
ms        Malay       Latin       Mid            MT + HT
ne        Nepali      Devanagari  Low            MT
nl        Dutch       Latin       High           MT
ny        Nyanja      Latin       Low            MT
fa        Persian     Arabic      High           MT + HT
pl        Polish      Latin       High           MT
pt        Portuguese  Latin       High           HT
ro        Romanian    Latin       Mid            MT + HT
ru        Russian     Cyrillic    High           MT + HT
sin       Sinhala     Sinhala     Low            MT + HT
sn        Shona       Latin       Low            MT
som       Somali      Latin       Low            MT
es        Spanish     Latin       High           HT
sr        Serbian     Cyrillic    High           MT
sw        Swahili     Latin       Low            HT
sv        Swedish     Latin       High           MT
te        Telugu      Telugu      Low            MT + HT
tr        Turkish     Latin       High           MT + HT
uk        Ukrainian   Cyrillic    Mid            MT + HT
vi        Vietnamese  Latin       High           MT + HT
yo        Yorùbá      Latin       Low            HT
zh        Chinese     Hans        High           HT
Table 5: 42 languages in Global-MMLU, along with each language’s script and resource
category. We followed Singh et al. (2024) and categorized languages as low, mid and high
resource based on language classes proposed by Joshi et al. (2020) (low: [0, 1, 2], mid: [3],
high: [4, 5]). In Global-MMLU, the language is either fully machine translated (MT), fully
human translated (HT), or contains both machine and human translated data (MT + HT).
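The class-to-category rule stated in the Table 5 caption can be written as a small lookup. The sketch below is illustrative only: the function name `resource_category` is hypothetical, and it assumes the input is an integer Joshi et al. (2020) class between 0 and 5. It does not assign classes to specific languages.

```python
def resource_category(joshi_class: int) -> str:
    """Map a Joshi et al. (2020) language class to a resource category.

    Follows the grouping in the Table 5 caption:
    low: [0, 1, 2], mid: [3], high: [4, 5].
    """
    if joshi_class in (0, 1, 2):
        return "Low"
    if joshi_class == 3:
        return "Mid"
    if joshi_class in (4, 5):
        return "High"
    # Anything outside 0..5 is not a valid class under this scheme.
    raise ValueError(f"Joshi class must be in 0..5, got {joshi_class}")
```

For example, `resource_category(3)` returns `"Mid"`, matching the single-class mid-resource group.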
B Global-MMLU Subject Categories
Global-MMLU covers six diverse subject categories: STEM, Humanities, Social Sciences,
Medical, Business, and Other. For a consistent approach, we adopt the classification proposed
by Hendrycks et al. (2020) for the MMLU dataset to categorize subjects as STEM, Humanities,
and Social Sciences. However, we further refine the ’Other’ category from the original MMLU
dataset by breaking it down into two distinct categories: Medical and Business. Within the
’Other’ category, subjects such as clinical knowledge, college medicine, human aging, medical
genetics, nutrition, professional medicine, and virology are classified under the Medical category.
Meanwhile, business ethics, management, marketing, and professional accounting fall under the
Business category. Note that the ’Other’ category in Global-MMLU, sometimes
referred to as ’General Knowledge’, includes the remaining two subjects from the original MMLU
’Other’ category: global facts and miscellaneous.
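The refinement of the original MMLU ’Other’ category described above can be sketched as a lookup. The snake_case subject identifiers and the function name `refine_other_subject` are assumptions made for illustration; only the subject-to-category assignments come from the text.

```python
# Subjects from the original MMLU 'Other' category, regrouped as described above.
# The snake_case identifiers are an assumed naming convention.
MEDICAL_SUBJECTS = {
    "clinical_knowledge", "college_medicine", "human_aging",
    "medical_genetics", "nutrition", "professional_medicine", "virology",
}
BUSINESS_SUBJECTS = {
    "business_ethics", "management", "marketing", "professional_accounting",
}
GENERAL_KNOWLEDGE_SUBJECTS = {"global_facts", "miscellaneous"}


def refine_other_subject(subject: str) -> str:
    """Return the Global-MMLU category for a subject from MMLU's 'Other' group."""
    if subject in MEDICAL_SUBJECTS:
        return "Medical"
    if subject in BUSINESS_SUBJECTS:
        return "Business"
    if subject in GENERAL_KNOWLEDGE_SUBJECTS:
        return "Other"
    # STEM, Humanities and Social Sciences subjects keep their original
    # category and are not covered by this lookup.
    raise KeyError(subject)
```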
C Global-MMLU Lite
[Figure 13 plot. Y-axis: Percentage of samples (%). Subject categories: Social Sciences, Humanities, General Knowledge (Other), Business, STEM, Medical. Bar labels: 25.00%, 25.00%, 14.50%, 14.00%, 11.50%, 9.50%.]
Figure 13: Distribution of samples across subject categories in Global-MMLU Lite
As mentioned in Section 3.2, Global-MMLU Lite is a lighter version of Global-MMLU
containing 200 CS and 200 CA samples per language for 15 human-translated or post-edited
languages, including English.
To prepare Global-MMLU Lite, we took the MA subset of Global-MMLU, which contains
50 samples per subject, and looked at the proportion of CS and CA samples available per subject.
Subjects exclusively tagged as CS or CA (14 in total) were excluded to ensure both categories
were represented within each subject. Consequently, Social Sciences and Humanities subjects
are more prevalent in Global-MMLU Lite, as shown in Figure 13.
However, we aimed for a balanced distribution across subject categories. Social Sciences subjects
like High School Geography and Sociology had a higher proportion of CS samples, whereas STEM
subjects like Abstract Algebra had a higher number of CA samples. To maintain balance, we
sampled five CS and five CA samples per subject where available. A few subjects, such as Anatomy
and High School Mathematics, had only one CS sample available, so for these subjects only one CS
and one CA sample were taken. Samples from a few subjects in the Business and Medical categories
were slightly upsampled to ensure adequate representation.
The General Knowledge category, comprising only Miscellaneous and Global Facts, was also
upsampled, with 22 samples from Miscellaneous and 8 from Global Facts per category. This
adjustment ensures sufficient coverage for evaluating general knowledge capabilities. The overall
goal of Global-MMLU Lite is to provide a balanced dataset for efficient evaluation across
multiple languages.
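The per-subject balancing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields `subject` and `label`, the function name `sample_balanced`, and the rule of taking `min(5, #CS, #CA)` from each label are hypothetical simplifications, and the upsampling of Business, Medical and General Knowledge subjects is not modelled.

```python
import random
from collections import defaultdict


def sample_balanced(samples, per_label=5, seed=0):
    """Pick an equal number of CS and CA samples from every subject.

    `samples` is a list of dicts with (assumed) keys "subject" and "label",
    where "label" is either "CS" or "CA".
    """
    rng = random.Random(seed)
    by_subject = defaultdict(lambda: {"CS": [], "CA": []})
    for sample in samples:
        by_subject[sample["subject"]][sample["label"]].append(sample)

    selected = []
    for subject in sorted(by_subject):
        groups = by_subject[subject]
        # Take the same count from both labels, capped by the scarcer one.
        k = min(per_label, len(groups["CS"]), len(groups["CA"]))
        if k == 0:
            # Subject is tagged exclusively CS or exclusively CA: exclude it.
            continue
        selected.extend(rng.sample(groups["CS"], k))
        selected.extend(rng.sample(groups["CA"], k))
    return selected
```

With this rule, a subject that has a single CS sample contributes one CS and one CA sample, mirroring the Anatomy and High School Mathematics cases described above.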
D Temporal Knowledge
As part of the annotation process, annotators were also asked to label samples for temporal or
time-sensitive knowledge. This applies to questions where the correct answer may change over
time due to factors such as current political leaders or economic statistics. Figure 14 shows the
distribution of time-sensitive samples in MMLU Annotated. Overall, only 2.4% of the dataset is
tagged as time-sensitive, and the majority of these samples fall under the Social Sciences,
Humanities, Medical and Other categories. STEM is the only category with no time-sensitive
samples at all.
[Figure 14 plot. Y-axis: Percentage of Samples (%). Legend: Time Sensitive, Not Time Sensitive.]
Figure 14: Distribution of time-sensitive samples across subject categories. Note that STEM
subjects do not include any temporal knowledge.
E Models Covered
Details of each model are described below:
• The Aya Expanse [17] family of models includes 8B [18] and 32B [19] parameter models. Aya Expanse
models support 23 languages: Arabic, Chinese (simplified & traditional), Czech,
Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese,
Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian,
and Vietnamese. Aya Expanse builds on the Aya initiative, which includes multilingual-first
releases like Aya 101 (Üstün et al., 2024) and Aya 23 (Aryabumi et al., 2024), and extensive
multilingual datasets such as the Aya collection (Singh et al., 2024).
• Command R and R+ are open-weight models of size 34B [20] and 104B [21] respectively,
which both support 10 languages: English, French, Spanish, Italian, German, Brazilian
Portuguese, Japanese, Korean, Arabic, and Simplified Chinese. We use Command-R 08-2024
and Command-R+ 08-2024 for evaluation.
[17] https://hf.co/blog/aya-expanse
[18] https://hf.co/CohereForAI/aya-expanse-8b
[19] https://hf.co/CohereForAI/aya-expanse-32b
[20] https://hf.co/CohereForAI/c4ai-command-r-08-2024
[21] https://hf.co/CohereForAI/c4ai-command-r-plus-08-2024
• Gemma2 (Gemma Team et al., 2024) is part of the Gemma model family. The languages
targeted are not explicitly reported. We evaluate the instruct-tuned 9B (gemma-2-9b-it)
and 27B (gemma-2-27b-it) variants.
• Gemma2-9B-CPT-SEA-LIONv3 [22] is part of the SEA-LION [23, 24] collection of models
trained for Southeast Asian (SEA) languages, including Burmese, Chinese, English, Filipino,
Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tamil, Thai, and Vietnamese.
We use Gemma2-9B-CPT-SEA-LIONv3-Instruct for evaluation.
• Llama 3.1 (Dubey et al., 2024) is a series of open LLMs that come in
three sizes: 8B, 70B, and 405B parameters. All variants support 8 languages:
English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. We use
Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct for evaluation.
• Mistral Nemo [25] is a 12B model which supports 11 languages: English, French,
German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
• Qwen 2.5 [26] models support up to 29 languages, including Chinese, English, French, Spanish,
and Portuguese. We evaluate the Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct variants
of Qwen 2.5.
• GPT-4o (Hurst et al., 2024) is a multilingual, multimodal closed model and is part of the
GPT-4 family. The languages targeted are not explicitly reported.
• Claude Sonnet 3.5 is also a multilingual, multimodal closed model from the Claude 3.5
family. The languages supported by this model are also unknown.
F Additional Results
F.1 Model Rank Changes
Table 6 presents the rank changes and corresponding position shifts (indicated next to the ar-
rows) for high-resource languages, while Table 7 provides similar data for mid- and low-resource
languages. The rightmost columns in each table summarize the total number of models that
changed ranks (Total Rank Change) and the total number of position shifts in the rankings
(Total Position Change). A detailed analysis of these results is provided in Section 4.2.
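The two summary columns can be computed from a pair of rankings as sketched below. This is an illustrative sketch, not the authors' evaluation code: the function names and the tie-breaking rule (alphabetical by model name) are assumptions. A positive shift means the model moved up relative to the MA ranking.

```python
def ranks_from_scores(scores):
    """Rank models by accuracy, 1 = best. Ties broken by name (an assumption)."""
    ordered = sorted(scores, key=lambda model: (-scores[model], model))
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_changes(reference_rank, subset_rank):
    """Compare a subset ranking (CA or CS) against the reference (MA) ranking.

    Returns (shifts, total_rank_change, total_position_change) where
    shifts[model] > 0 means the model moved up on the subset.
    """
    shifts = {m: reference_rank[m] - subset_rank[m] for m in reference_rank}
    # Number of models whose rank differs from the reference.
    total_rank_change = sum(1 for delta in shifts.values() if delta != 0)
    # Sum of the absolute number of positions moved.
    total_position_change = sum(abs(delta) for delta in shifts.values())
    return shifts, total_rank_change, total_position_change
```

For instance, if two adjacent models swap places and every other model stays put, both totals equal 2, which is the pattern of a row containing one ↑1 and one ↓1.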
[22] https://hf.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct
[23] An acronym for Southeast Asian Languages in One Network.
[24] https://github.com/aisingapore/sealion
[25] https://hf.co/mistralai/Mistral-Nemo-Instruct-2407
[26] https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e
Language	Dataset	Aya Exp. 8B	Aya Exp. 32B	CommandR	CommandR+	Gemma2 9B	Gemma2 27B	Llama-3.1 8B	Llama-3.1 70B	Mistral Nemo	Qwen2.5 7B	Qwen2.5 32B	SEA-LION-v3	GPT4o	Claude Sonnet	Total rank change	Total position change
Arabic 	- 	- 	- 	- 	- 	- 	- 	- 	- 	↑1 	- 	↓1 	- 	- 	2 2
- 	↑1 	- 	- 	- 	↓1 	- 	↑1 	- 	- 	↓1 	- 	- 	- 	4 4
Chinese 	- 	- 	↓1 	- 	↑1 	- 	- 	- 	- 	- 	↑1 	- 	↓1 	- 	4 4
↑1 	↑1 	↑1 	↑2 	↑1 	- 	↓1 	↑1 	- 	↓3 	↓1 	↓2 	↑1 	↓1 12 16
Czech 	- 	- 	- 	- 	- 	- 	- 	↓1 	- 	- 	↑1 	- 	- 	- 	2 2
↑2 	↓1 	- 	↑3 	- 	↓1 	↓2 	- 	- 	- 	↓1 	- 	- 	- 	6 10
Dutch 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
- 	- 	- 	↑1 	↑2 	↓1 	- 	↑1 	- 	↓2 	↓1 	- 	- 	- 	6 8
English 	- 	- 	- 	- 	- 	↓1 	- 	- 	- 	↑1 	↑1 	- 	↓1 	- 	4 4
- 	↑1 	- 	- 	- 	- 	- 	↑1 	- 	↓1 	↓1 	- 	- 	- 	4 4
French 	- 	↑1 	- 	- 	- 	- 	- 	- 	- 	↓1 	- 	- 	- 	- 	2 2
- 	↑2 	↑2 	↑1 	- 	↓2 	- 	↑1 	- 	↓3 	↓1 	↑1 	- 	- 	8 13
German 	- 	↓1 	- 	↓1 	- 	↑1 	- 	- 	- 	↑1 	- 	- 	- 	- 	4 4
- 	- 	↓1 	- 	↑2 	- 	- 	↑1 	- 	↓3 	↓1 	↑2 	- 	- 	6 10
Hindi 	- 	↑1 	↓2 	↓1 	↑1 	- 	- 	- 	- 	- 	- 	↑1 	- 	- 	5 6
↑1 	↓1 	↑1 	↑2 	- 	↓1 	↑1 	- 	↑1 	↓3 	↓1 	- 	↑1 	↓1 11 14
Italian 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
- 	- 	↑1 	↑1 	- 	↓1 	- 	↑1 	- 	↓2 	↓1 	↑1 	- 	- 	7 8
Japanese 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
- 	↑1 	↑1 	↑1 	↑1 	↓2 	- 	↑1 	- 	↓1 	↓1 	↓1 	- 	- 	9 10
Persian 	↑1 	↑1 	- 	↓1 	- 	- 	↑1 	- 	↓2 	- 	- 	- 	- 	- 	5 6
- 	- 	- 	↑2 	↑1 	↓2 	- 	- 	↑1 	↑1 	↓1 	↑1 	- 	- 	7 9
Polish 	↑2 	↑1 	↑2 	↓1 	↓1 	- 	↓1 	- 	↓1 	↑2 	- 	↓1 	- 	- 	9 12
- 	- 	↑2 	↑2 	- 	↓1 	- 	↑1 	↑1 	↓1 	↓1 	- 	- 	- 	7 9
Portuguese 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
- 	↑1 	↑1 	↑1 	↑1 	↓1 	- 	↑1 	- 	↓2 	↓1 	↓1 	- 	- 	9 10
Russian 	- 	↓1 	↓1 	↓1 	↑1 	- 	- 	- 	- 	↑2 	- 	- 	- 	- 	5 6
↑1 	- 	- 	↑2 	↓1 	↓1 	↑1 	- 	- 	↓2 	↓1 	↑3 	- 	- 	8 12
Serbian 	- 	↓1 	- 	↑1 	- 	- 	- 	↓1 	- 	↓1 	↑1 	- 	- 	- 	5 5
- 	↑2 	↑1 	↓1 	↑1 	- 	- 	- 	- 	- 	- 	- 	- 	- 	4 5
Spanish 	- 	↓1 	- 	↓1 	- 	↑1 	- 	- 	- 	↑1 	- 	- 	- 	- 	4 4
- 	- 	↑1 	- 	↑2 	- 	- 	↑1 	- 	↓3 	↓1 	- 	- 	- 	5 8
Swedish 	- 	↓1 	- 	↓1 	- 	↑1 	- 	- 	- 	↓1 	- 	- 	- 	- 	4 4
- 	- 	↑1 	- 	↑2 	- 	- 	↑1 	- 	↓3 	↓1 	- 	- 	- 	5 8
Turkish 	- 	- 	- 	↓1 	↑1 	- 	↓1 	- 	- 	↑1 	- 	- 	- 	4 4
- 	↑2 	↓1 	↑1 	- 	↓1 	- 	- 	- 	- 	↓2 	- 	- 	- 	5 7
Vietnamese 	- 	- 	- 	↓1 	- 	↑1 	↓1 	- 	- ↑1 	- 	- 	- 	- 	- 	4 4
- 	↓1 	↑3 	- 	- 	↓1 	↓1 	- 	↑1 	↓1 	- 	- 	- 	- 	6 8
Table 6: Model rankings with MA rank as the reference for high-resource languages. First
row indicates changes in CA ranks, while second row shows the changes in CS ranks relative
to MA. Color-coded boxes highlight increases (↑) and decreases (↓).
Language	Dataset	Aya Exp. 8B	Aya Exp. 32B	CommandR	CommandR+	Gemma2 9B	Gemma2 27B	Llama-3.1 8B	Llama-3.1 70B	Mistral Nemo	Qwen2.5 7B	Qwen2.5 32B	SEA-LION-v3	GPT4o	Claude Sonnet	Total rank change	Total position change
Bengali 	- 	↑1 	- 	- 	- 	- 	- 	↓1 	↓1 	- 	- 	- 	- 	- 	3 3
- 	- 	- 	- 	- 	- 	- 	- 	↑1 	↓1 	- 	- 	- 	- 	2 2
Filipino 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
- 	- 	- 	- 	- 	↑1 	↑1 	- 	↓1 	- 	↓1 	- 	↓1 	↑1 6 6
Greek 	↓1 	↓1 	- 	- 	- 	↑1 	- 	- 	↓1 	↑2 	- 	- 	- 	- 	5 6
- 	- 	↑2 	↑3 	- 	↓1 	↑1 	- 	- 	↓1 	↓4 	- 	- 	- 	6 12
Hebrew 	↓1 	↑1 	- 	↓1 	- 	- 	- 	- 	↑1 	- 	- 	- 	- 	- 	4 4
- 	↑2 	- 	↑2 	- 	↓2 	- 	- 	- 	- 	↓2 	- 	- 	- 	4 8
Indonesian 	- 	- 	↓1 	↓1 	↓1 	↑1 	- 	- 	- 	↑2 	- 	- 	- 	- 	5 6
- 	- 	↑1 	- 	- 	- 	↓1 	↑1 	↑1 	- 	↓1 	↓1 	- 	- 	6 6
Korean 	↓1 	↓1 	↓1 	- 	- 	↑1 	↑1 	- 	- 	↑1 	- 	- 	- 	- 	6 6
- 	↑1 	↑1 	↓1 	- 	↓1 	- 	↑1 	- 	- 	↓1 	- 	- 	- 	6 6
Malay 	- 	- 	- 	- 	↓1 	- 	- 	↓1 	- 	↑1 	↑1 	- 	- 	- 	4 4
- 	↑1 	↑1 	↓1 	- 	- 	- 	- 	- 	↓1 	- 	- 	- 	- 	4 4
Lithuanian 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
- 	- 	- 	↑2 	- 	- 	- 	- 	- 	- 	- 	↓2 	- 	- 	2 4
Romanian 	- 	↑1 	- 	↓1 	- 	- 	↑1 	↓1 	↓1 	- 	↑1 	- 	- 	- 	6 6
- 	- 	- 	↑2 	- 	↓1 	- 	- 	- 	- 	- 	- 	- 	- 	2 3
Ukrainian 	- 	↑1 	- 	↓1 	↓1 	- 	- 	- 	- 	↑1 	- 	- 	- 	- 	4 4
- 	↑1 	- 	↑1 	- 	↓2 	- 	↑1 	↑1 	↓1 	↓1 	- 	↑1 	↓1 9 10
Amharic 	- 	- 	↓1 	↑1 	↓1 	- 	- 	- 	- 	- 	↑1 	- 	- 	- 	4 4
↓1 	↑2 	↑2 	↓1 	- 	- 	- 	- 	↑1 	↓3 	- 	- 	- 	- 	6 10
Hausa 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	0 0
↑1 	↓1 	↑3 	↓1 	- 	- 	↓1 	- 	↓1 	↓1 	↑1 	- 	- 	- 	8 10
Igbo 	- 	- 	↓1 	- 	- 	- 	↓1 	- 	- 	↑1 	↑1 	- 	- 	- 	4 4
- 	↑1 	- 	- 	↑1 	- 	- 	- 	↑2 	↓3 	- 	↓1 	- 	- 	5 8
Kyrgyz 	- 	- 	- 	- 	- 	↓1 	- 	- 	- 	- 	↑1 	- 	- 	- 	2 2
- 	↓1 	↑1 	↑1 	- 	- 	↑1 	- 	- 	↓2 	- 	- 	- 	- 	5 6
Malagasy 	- 	↓1 	- 	- 	- 	- 	- 	- 	- 	↑1 	- 	- 	- 	- 	2 2
- 	↑1 	↑4 	↑1 	- 	- 	↓1 	- 	↑1 	↓1 	↓5 	- 	- 	- 	7 14
Nepali 	- 	- 	- 	- 	- 	- 	- 	- 	↓1 	↑1 	- 	- 	- 	- 	2 2
- 	- 	- 	- 	- 	↑1 	↓1 	- 	↑1 	- 	↓1 	- 	↑1 	↓1 6 6
Nyanja 	- 	- 	- 	↓1 	↓1 	- 	- 	- 	- 	- 	↑2 	- 	↑1 	↓1 5 6
- 	↓1 	↑1 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	- 	2 2
Shona 	- 	- 	- 	- 	↓1 	- 	- 	- 	- 	- 	↑1 	- 	↓1 	↑1 4 4
↑2 	- 	↑1 	↑1 	- 	- 	↑1 	- 	- 	↓4 	↓1 	- 	- 	- 	6 10
Sinhala 	- 	↑1 	- 	- 	- 	- 	↓3 	- 	- 	↑2 	- 	- 	- 	- 	3 6
- 	↓1 	↑1 	↑1 	- 	- 	- 	- 	- 	↓1 	- 	- 	- 	- 	4 4
Somali 	- 	↓2 	- 	↑1 	- 	- 	- 	- 	- 	↑1 	- 	- 	- 	- 	3 4
- 	↑1 	↑2 	↓2 	- 	- 	↑2 	- 	- 	↓2 	↓1 	- 	↓1 	↑1 8 12
Swahili 	- 	↓1 	- 	- 	- 	- 	↑1 	- 	- 	- 	- 	- 	- 	- 	2 2
- 	- 	↑1 	- 	- 	- 	↓1 	- 	- 	- 	- 	- 	↓1 	↑1 4 4
Telugu 	- 	↓1 	- 	- 	- 	- 	- 	- 	- 	↑1 	↑1 	↓1 	- 	- 	4 4
- 	↓1 	↑2 	↑1 	↑1 	- 	↑1 	- 	↓1 	↓2 	↓1 	- 	- 	- 	8 10
Yoruba 	- 	↑1 	↓2 	- 	↓1 	- 	- 	- 	- 	↑2 	↑1 	↓1 	- 	- 	6 8
- 	↓1 	↑1 	↑1 	↑1 	- 	- 	- 	- 	- 	↓2 	- 	- 	- 	5 6
Table 7: Model rankings with MA rank as the reference for mid- and low-resource
languages. First row indicates changes in CA ranks, while second row shows the changes in
CS ranks relative to MA. Color-coded boxes highlight increases (↑) and decreases (↓).
F.2 Subject-level Performance
Figure 15 illustrates the performance of the Aya Expanse 32B model across various subjects, with
an average accuracy of 66.4%. Notably, most STEM subjects fall below this average, whereas
the majority of Social Sciences and Humanities subjects exceed it.
Figure 15: Aya Expanse 32B performance on each subject.
G Relationship between cultural and geographical tags
G.1 Culture–Region Relations
We analyzed the samples in the CS dataset. Figure 16 illustrates the relationship between
Western and Asian cultures and their associated regions. Among the samples labeled with a
Western culture tag, 73.3% are also tagged with North America, followed by 25.5% with Europe.
Similarly, 97.2% of samples labeled with Asian cultures are associated with the Asia region.
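Percentages of this kind are conditional shares: among samples carrying a given culture tag, the fraction that also carry each region tag. The sketch below assumes, for simplicity, that each sample has exactly one `culture` and one `region` field; the field names and the function name are hypothetical, and the actual annotations may allow multiple tags per sample.

```python
from collections import Counter


def region_share_given_culture(samples, culture):
    """Percentage of each region tag among samples with the given culture tag."""
    counts = Counter(s["region"] for s in samples if s["culture"] == culture)
    total = sum(counts.values())
    if total == 0:
        # No sample carries this culture tag.
        return {}
    return {region: 100.0 * n / total for region, n in counts.items()}
```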
G.2 Culture–Country Relations
Figure 17 shows the relationship between culture and country. For the Latin American culture, the
distribution is balanced, with Bolivia and Mexico each comprising 33.3% of the tags, followed by
Honduras and Peru with 16.7% of the tags each. For Indigenous culture, the tags are shared
between two countries, with the USA at the top with 66.7%, followed by Micronesia at 33.3%. The Other
culture category was added to represent cultures that did not fall under the other pre-existing
categories. We find that all samples in the Other category fall under Russia.
[Figure 16 plot. Y-axis: Percentage (%). Region labels (shown twice): Asia, North America, Europe, Africa, South America, Australia.]
Figure 16: Relationship between Western and Asian cultures and region tags.
[Figure 17 plot. Y-axis: Percentage of Samples (%). Country labels: Bolivia, Mexico, Honduras, Peru, Others; USA, Micronesia, Others; Russia, Others.]
Figure 17: Relationship between culture and country tags, focusing on Latin American and
Indigenous cultures.
G.3 Region–Country Relations
Figures 18 and 19 present country-specific information for each region: North America, Europe,
and Africa. The United States accounts for the largest proportion of regional tags, representing
89.6% of the tags for the North America region, followed by Canada and the United Kingdom,
each with only 0.8% of the tags. For the Europe region, the distribution is more balanced, with
the United Kingdom comprising 20.1% of the tags, followed by France at 10.1%. In the Africa
region, the distribution is even more balanced, with Egypt and South Africa sharing the top
position at 33.3% of the tags each.
Figure 18: Relationship between region and country tags, focusing on North America, Europe
and Africa regions. [Bar charts, one panel per region; y-axis: Percentage of Samples (%);
x-axis: country tags.]
Figure 19: Relationship between region and country tags, focusing on Asia, South America and
Australia. [Bar charts, one panel per region; y-axis: Percentage of Samples (%); x-axis: country
tags.]
H Annotation Process
Communication. For both annotation tasks, annotators were briefed by one of the authors
in a virtual introduction session and were able to ask questions and raise issues throughout
the annotation task in a Discord channel. For both tasks, they were also encouraged to share
frequent error patterns or artifacts that they observed throughout the tasks with the authors and
capture difficult decisions and their rationales in comments for individual ratings. Similarly, they
discussed ambiguous cases and questions. This helped calibrate annotations across annotators
and languages.
Schedule. Each of the annotation tasks was conducted as a 2–3-week sprint in collaboration
with contributors from the community. There was no fixed time schedule for the annotations,
and annotators contributed varying hours, depending on their availability and speed.
For the cultural sensitivity evaluation task, 100% of the selected samples were labeled, whereas
for the translation quality evaluation task, 37% of the provided samples were fully reviewed and
12.3% of the samples were edited in total.
Interface. The annotation interface for both tasks was built using Argilla.27 Argilla is an
open-source tool that can be used for data labeling. Using Argilla’s Python SDK, it was quick
and easy to set up an annotation interface that could be deployed on Hugging Face Spaces. We
also set up SSO so annotators could log in and easily access the UI using their Hugging Face
accounts.
27 https://argilla.io/
For cultural sensitivity evaluation, annotators were shown questions one by one from each of the
57 MMLU subjects and were asked to analyze and label the questions for the presence of cultural,
geographic, dialect or regional knowledge, as explained in Section 2.1 and shown in Figure 20.
Figure 20: Cultural Sensitivity evaluation annotation interface.
As shown in Figure 21, for translation quality evaluation, annotators were shown the translated
question and corresponding options in their chosen language on the UI. Annotators were also
shown the original question and answer options in English for reference. If the translation was
of good quality and correctly represented the original English text, the annotators could
mark it as acceptable and proceed to the next question; otherwise, they could edit the
provided translation to improve its quality.
Figure 21: Translation evaluation annotation interface.
H.1 Compensated Annotator Pool for Gold Standard Languages
Annotator Selection. Participants in the evaluations were recruited primarily based on their
proficiency in the relevant languages. The proficiency was self-reported,
and the primary requirement was native or professional proficiency in the specific languages
needed for the project.
Socio-Demographics. The annotator pool comprises people from diverse backgrounds,
spanning socioeconomic backgrounds, careers, levels of education, and self-reported
gender and sexual identities. We do not ask any annotators to share or report any of these
statistical pieces of information in a formal way; any insights into this are gathered organically
and through self-reporting by the annotators.
Quality Considerations. We do not believe that any socio-demographic characteristics have
led to any impact on the data that has been annotated. Through every part of the project,
we have reiterated the importance of this work and the fact that it is helping support a global-
scale research project. We are confident in the trust we have built with the annotators in this
project, and they care greatly about the overall outcome and, therefore, have been diligent in
completing the task with a high degree of accuracy. Where possible, we have done our best to
have annotators work on this project and be representatives of the communities that the project
aims to support.
Figure 22: Demographics of annotators who registered using our annotation interface for cultural
sensitivity as well as translation quality evaluation. [Bar chart; y-axis: Percentage (%);
x-axis groups: gender (Male, Female, Prefer not to say), age (Under 18, 18-24, 25-34, 35-44,
45-54, 55-64) and region (Asia, Europe, North America, Africa, Oceania, South America); series:
Cultural Sensitivity Evaluation, Translation Quality Evaluation.]
H.2 Agreement between Annotators
For the first phase of annotations to identify culturally sensitive samples, we ensured that each
sample was annotated by at least 3 annotators. We used the ratings for each sample from different
annotators and aggregated them per subject to analyze the agreement among annotators. We report
the corresponding Krippendorff’s Alpha scores depicting annotator agreement in Figures 23 and
24. Krippendorff’s Alpha values range between -1 and 1, where 1 denotes that all annotators
agree unanimously and -1 denotes that the annotators are making opposite ratings. We observe
reasonable disagreement among samples for moral scenarios for both cultural sensitivity as well
as time-sensitivity annotations. 12 subjects have complete unanimous agreement regarding time-
sensitivity annotations between annotators.
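To make the agreement statistic concrete, the following is a minimal sketch of Krippendorff’s Alpha for nominal labels, computed from a coincidence matrix. It assumes the textbook nominal formulation and that the ratings of one subject are passed as one list of labels per sample; the text does not state which variant or tooling was actually used, and the function name and toy ratings are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    units: list of lists, one inner list of labels per sample (missing ratings omitted).
    Returns NaN when the expected disagreement is zero (e.g. only one label was ever used),
    because alpha is undefined in that case.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a sample with a single rating cannot be paired
        # Every ordered pair of ratings from different annotators, weighted by 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    marginals = Counter()
    for (a, _), weight in coincidences.items():
        marginals[a] += weight
    n = sum(marginals.values())

    # Observed and expected disagreement, both left unnormalized by n (it cancels in the ratio).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        marginals[a] * marginals[b] for a in marginals for b in marginals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected


# Hypothetical ratings for one subject: three samples, three annotators each.
example_units = [["yes", "yes", "yes"], ["no", "no", "no"], ["yes", "yes", "no"]]
alpha = krippendorff_alpha_nominal(example_units)
print(round(alpha, 3))
```

With few samples the attainable minimum is above -1, so strongly negative values in practice require systematic disagreement over many samples.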
Figure 23: Krippendorff’s Alpha Scores for checking annotator agreement regarding the presence
of cultural or regional knowledge of samples. [Bar chart; y-axis: Alpha Krippendorff Score
(-0.2 to 1); x-axis: MMLU subjects, labeled with the short names listed in Table 10.]
Figure 24: Krippendorff’s Alpha Scores for checking annotator agreement regarding the presence
of the time-sensitive nature of samples. [Bar chart; y-axis: Alpha Krippendorff Score
(-0.2 to 1); x-axis: MMLU subjects, labeled with the short names listed in Table 10.]
I Translation Analysis
I.1 Translation Quality
Figure 7 shows the translation quality comparison between Google Translate, which is used to
translate Global-MMLU, and GPT-3.5-turbo, which is used for translating the multilingual MMLU
released by Lai et al. (2023). We see that Google Translate is significantly better across
different MMLU subject categories. For this analysis, we considered samples from the MMMLU
dataset28 as the human reference and only considered languages that overlapped between the two
machine-translated sets and the human-translated MMMLU.
I.2 Translation Edits
Figure 25 illustrates the edit distance, averaged over all samples within each subject category,
for edits made by professional and community annotators. The edit distance, calculated using
the “Levenshtein Distance” (Levenshtein, 1966), measures the differences between two strings.
In this analysis, the machine translations were compared to their edited versions to compute the
scores.
The results reveal that the Humanities category exhibits the largest edit distances, with higher
values observed for questions compared to answers.
Given that longer text may inherently require more edits, we hypothesized that the observed
large edit distances could be influenced by the length of the questions and answers. To account
for this, we analyzed the length of each question-answer pair and computed the Normalized Edit
Distance (NED), where the edit distance is divided by the text length, shown in Figure 26.
The analysis reveals that questions in the Humanities category have the greatest average length,
whereas answers in the STEM category exhibit the highest NED. These findings suggest that
while raw edit distances are influenced by text length, normalized measures provide additional
insights into the complexity of edits across categories.
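The two measures discussed above can be sketched as follows. This is an illustrative implementation of the classic dynamic-programming Levenshtein distance, not the authors' code. The text says only that the edit distance is divided by the text length, so normalizing by the length of the longer string is an assumption made here to keep the value in [0, 1]; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(machine_translation: str, edited: str) -> float:
    """Edit distance divided by text length (assumption: length of the longer string)."""
    longest = max(len(machine_translation), len(edited))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(machine_translation, edited) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_edit_distance("kitten", "sitting"))
```

Averaging these per-sample values within each subject category, separately for questions and answers, would give aggregates of the kind plotted in Figures 25 and 26.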
28https://openai.com/index/openai-o1-system-card/
Figure 25: Average edit distance across different subject categories in MMLU. Each sample
comprises a question-and-answer pair, with the left column showing edit distances for questions
and the right column for answers.
Figure 26: (Top) Average normalized edit distance and (Bottom) average question and answer
lengths across different subject categories. The left column represents questions, while the right
column represents answers.
J MMLU Annotated Examples
Dataset Subject Question Choices
Subject: US Hist. (HS)
Question: This question refers to the following information:
“Some men look at constitutions with sanctimonious reverence, and deem them like the ark of the
covenant, too sacred to be touched. They ascribe to the men of the preceding age a wisdom more
than human, and suppose what they did to be beyond amendment . . . . But I know also, that laws
and institutions must go hand in hand with the progress of the human mind. As that becomes more
developed, more enlightened, as new discoveries are made, new truths disclosed, and manners and
opinions change with the change of circumstances, institutions must advance also, and keep pace
with the times.”
—Thomas Jefferson, 1816
Which of the following best describes a contributing factor in the crafting of the United States
Constitution?
Choices:
(A) Individual state constitutions written at the time of the Revolution tended to cede too much
power to the federal government, leading to a call for reform on the part of Anti-Federalists.
(B) The weaknesses of the Articles of Confederation led James Madison to question their efficacy
and prompted a formation of the Constitutional Congress in 1787.
(C) Difficulties over trade and foreign relations led to a repeal of overly restrictive tariffs
required by the Articles of Confederation.
(D) Washington’s embarrassing failure at the Whiskey Rebellion led to Federalist demands for a
new framework for federal power.

Subject: Accounting (Pro)
Question: Under the Sales Article of the UCC, which of the following circumstances best describes
how the implied warranty of fitness for a particular purpose arises in a sale of goods
transaction?
Choices:
(A) The buyer is purchasing the goods for a particular purpose and is relying on the seller’s
skill or judgment to select suitable goods.
(B) The buyer is purchasing the goods for a particular purpose and the seller is a merchant in
such goods.
(C) The seller knows the particular purpose for which the buyer will use the goods and knows the
buyer is relying on the seller’s skill or judgment to select suitable goods.
(D) The seller knows the particular purpose for which the buyer will use the goods and the seller
is a merchant in such goods.

Subject: Jurisprudence
Question: Which of the following criticisms of Llewellyn’s distinction between the grand and
formal styles of legal reasoning is the most compelling?
Choices:
(A) There is no distinction between the two forms of legal reasoning.
(B) Judges are appointed to interpret the law, not to make it.
(C) It is misleading to pigeon-hole judges in this way.
(D) Judicial reasoning is always formal.

Subject: Prehistory
Question: What is the name of the lithic technology seen in the Arctic and consisting of
wedge-shaped cores, microblades, bifacial knives, and burins?
Choices:
(A) Clovis Complex
(B) Denali Complex
(C) Folsom Complex
(D) Nenana Complex

Subject: US Foreign Policy
Question: What was the key difference between US expansion pre- and post- 1865?
Choices:
(A) US expansion was based on territory rather than markets post-1865
(B) US expansion was based on markets rather than territory post-1865
(C) US expansion was limited to Latin America post-1865
(D) US expansion ended after 1865

Subject: Econometrics
Question: Which of the following statements will be true if the number of replications used in a
Monte Carlo study is small? i) The statistic of interest may be estimated imprecisely ii) The
results may be affected by unrepresentative combinations of random draws iii) The standard errors
on the estimated quantities may be unacceptably large iv) Variance reduction techniques can be
used to reduce the standard errors
Choices:
(A) (ii) and (iv) only
(B) (i) and (iii) only
(C) (i), (ii), and (iv) only
(D) (i), (ii), (iii), and (iv)

Subject: Stats (HS)
Question: An assembly line machine is supposed to turn out ball bearings with a diameter of 1.25
centimeters. Each morning the first 30 bearings produced are pulled and measured. If their mean
diameter is under 1.23 centimeters or over 1.27 centimeters, the machinery is stopped and an
engineer is called to make adjustments before production is resumed. The quality control
procedure may be viewed as a hypothesis test with the null hypothesis H0 : μ = 1.25 and the
alternative hypothesis Ha : μ ≠ 1.25. The engineer is asked to make adjustments when the null
hypothesis is rejected. In test terminology, what would a Type II error result in?
Choices:
(A) A warranted halt in production to adjust the machinery
(B) An unnecessary stoppage of the production process
(C) Continued production of wrong size ball bearings
(D) Continued production of proper size ball bearings

Subject: Formal Logic
Question: Construct a complete truth table for the following argument. Then, using the truth
table, determine whether the argument is valid or invalid. If the argument is invalid, choose an
option which presents a counterexample. (There may be other counterexamples as well.)
M ∨ N, ¬M ∧ O ∴ N
Choices:
(A) Valid
(B) Invalid. Counterexample when M and O are true and N is false
(C) Invalid. Counterexample when M is true and O and N are false
(D) Invalid. Counterexample when O is true and M and N are false

Subject: Geography (HS)
Question: Which of the following is MOST likely to experience population pressure?
Choices:
(A) An industrial society with abundant natural resources and large imports of food
(B) A society with a highly mechanized agricultural sector
(C) A non-ecumene
(D) A slash-and-burn agricultural society

Subject: Nutrition
Question: Why might some biochemical (eg plasma or serum) indices of micronutrient status give
misleading results in people with infections or inflammatory states?
Choices:
(A) Because people who are sick often alter their diets, and may eat less food.
(B) Because the accuracy of some laboratory assays may be compromised in samples from people who
are sick.
(C) Because some metabolic pathways are altered in sick people, which changes their micronutrient
requirements.
(D) Because an acute phase reaction results in changes in inter-tissue distributions of certain
micro-nutrients.

K Examples of Cultural, Geographical and Dialect Knowledge
This section lists some examples of cultural, geographical (or regional) and dialect knowledge
that was shared with the annotators to guide them during the annotation process.
Knowledge Applicable Examples Non-Applicable Examples
Cultural
Applicable examples:
(A) Understanding religious customs: For instance, the significance of colored powder during Holi
in Hindu culture.
(B) Awareness of traditional arts: For instance, the unique styles and techniques of Indigenous
Australian art, often featuring dot painting and storytelling.
(C) References to liberal/conservative attitudes: We can’t assume the notion of liberal is
specific to a certain culture or region but it inevitably involves social values and culture.
(D) References to philosophy and philosophical concepts, including philosophy of law: Some
familiar philosophical concepts fall within critical cultural contexts. Hume’s conception of
practical reason is a familiar philosophical concept in western culture. Logical fallacies also
fall under this category.
Non-applicable examples:
(A) Universal scientific principles: Knowledge of gravity or evolution is not exclusive to any
particular culture.
(B) Principles from the social sciences: The principle of social exchange, that posits that social
behavior is the result of an exchange process, is used worldwide.
(C) Standardized international sports: The rules and practices of soccer (football) are consistent
worldwide.
(D) Math questions which do not rely on local references: For example, the formula for the radius
of a circle.

Geographical
Applicable examples:
(A) Natural Landmark Identification: Recognizing and knowing the significance of regional natural
wonders like the Grand Canyon in the Southwestern United States or the Great Barrier Reef in
Australia.
(B) Environmental Awareness: Understanding the impact and importance of regional weather patterns,
such as the monsoons in South Asian regions or the hurricanes in the Caribbean.
(C) Historical Event Memory: Knowledge of region-specific historical occurrences, such as the Gold
Rush in California during the 1850s, which transformed the region’s economy and demographics.
(D) Awareness of a region-specific natural phenomenon: The Northern Lights, visible in the night
skies of Alaska and northern regions.
(E) Systems of measurement that are specific to a geographic area: Imperial units are used to
measure distance (e.g., miles), volume (e.g., gallons) and weight (e.g., pounds).
(F) Laws and regulations: A programmer uses code published online under a Creative Commons
Attribution (CC BY) license in a commercial product. This license is specific to the regional
geographic area it was created in.
(G) Behaviors and preferences of groups in specified areas: These can be noted as both “cultural”
and “geographic”, as in the exam “Which of the following statements does NOT accurately describe
voting behavior in the United States?”, where voting practices are cultural and the US is
specified as a geographic area.
Non-applicable examples:
(A) Global Climate Patterns: Understanding El Niño and La Niña weather phenomena, which occur
worldwide and are not specific to any single region.
(B) Universal Celestial Bodies: The Sun and the Moon are visible worldwide and do not possess
regional specificity.
(C) Standardized Geography Terms: Understanding the definition of a peninsula or archipelago is
applicable to geographic features globally, not tied to regional knowledge.

Dialect
Applicable examples:
(A) Regional slang: Using the word “wicked” to mean “very good” in parts of New England, USA.
Using the phrase “boot of the car” to mean “trunk” in the UK.
(B) Unique idiomatic expressions: The phrase “Bob’s your uncle” in British English, meaning “there
you have it” or “that’s all there is to it.”
(C) Knowledge of social greetings: The customary handshake and verbal greeting of “Konnichiwa”
when meeting someone in Japanese culture.
(D) Words or phrases from other languages that are brought into English: as in the sentence “he
has that je ne sais quoi” in which je ne sais quoi is borrowed from French.
Non-applicable examples:
(A) Standardized technical jargon: Medical or legal terminology used internationally within
professional fields.
(B) Formal literary language: The writings of Shakespeare or Dickens utilize sophisticated
language but are not tied to specific dialects.
(C) Global brand names: Companies like Nike or Adidas use consistent branding worldwide, avoiding
regional vocabulary.
L MMLU Subject Name Mapping
Original Name Short Name
abstract_algebra Algebra
anatomy Anatomy
astronomy Astronomy
business_ethics Business Ethics
clinical_knowledge Clinical
college_biology Bio (Uni.)
college_chemistry Chem (Uni.)
college_computer_science CS (Uni.)
college_mathematics Math (Uni.)
college_medicine Medicine (Uni.)
college_physics Physics (Uni.)
computer_security Computer Sec
conceptual_physics Conc. Physics
econometrics Econometrics
electrical_engineering Electrical Eng.
elementary_mathematics Math (El.)
formal_logic Formal Logic
global_facts Facts
high_school_biology Bio (HS)
high_school_chemistry Chemistry (HS)
high_school_computer_science CS (HS)
high_school_european_history EU Hist. (HS)
high_school_geography Geography (HS)
high_school_government_and_politics Gov. Politics (HS)
high_school_macroeconomics Macro econ. (HS)
high_school_mathematics Math (HS)
high_school_microeconomics Micro econ. (HS)
high_school_physics Physics (HS)
high_school_psychology Psychology (HS)
high_school_statistics Stats (HS)
high_school_us_history US Hist. (HS)
high_school_world_history World Hist. (HS)
human_aging Human Aging
human_sexuality Sexuality
international_law Int. Law
jurisprudence Jurisprudence
logical_fallacies Fallacies
machine_learning ML
management Management
marketing Marketing
medical_genetics Genetics
miscellaneous Misc.
moral_disputes Disputes
moral_scenarios Moral Scenarios
nutrition Nutrition
philosophy Philosophy
prehistory Prehistory
professional_accounting Accounting (Pro)
professional_law Law (Pro)
professional_medicine Medicine (Pro)
professional_psychology Psychology (Pro)
public_relations Public Rel.
security_studies Security
sociology Sociology
us_foreign_policy US Foreign Policy
virology Virology
world_religions World Religions
Table 10: Short names assigned to the MMLU subjects proposed by Hendrycks et al. (2020), as used
in Figures 3, 6, 23, and 24.