Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Summary
This paper examines the cultural alignment of large language models (LLMs) with public opinion in Asia, focusing on religion as a sensitive domain. The study evaluates models like GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral, and Gemma 3 across India, East Asia, and Southeast Asia. Using nationally representative Pew survey data and culturally aware bias benchmarks (CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ), the researchers compare model-generated opinion distributions with human responses. They find that while LLMs generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially for minority groups, often amplifying negative stereotypes. Lightweight interventions like demographic priming and native language prompting partially mitigate but do not eliminate these gaps. The study reveals persistent biases in downstream evaluations, highlighting the need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs. The findings underscore the risks of propagating Western-centric cultural norms and the importance of addressing representational harms in multilingual and multicultural contexts.
Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Hari Shankar¹, Vedanta S P², Sriharini Margapuri¹, Debjani Mazumder¹, Ponnurangam Kumaraguru¹, Abhijnan Chakraborty³
¹IIIT Hyderabad  ²IIIT Kottayam  ³IIT Kharagpur

Abstract
Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM's internal representations, using log-probs/logits, to compare the model's opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.

Warning: This paper contains content that may be potentially offensive or upsetting.

Introduction
Large language models (LLMs) have become essential tools for accessing information and generating content, with platforms such as ChatGPT handling billions of
prompts from a global user base (Backlinko 2025; TechCrunch 2025). As of December 2025, ChatGPT was ranked 5th among the world's most visited websites according to Similarweb (Similarweb 2025). Additionally, on social media platforms such as LinkedIn, recent surveys suggest that over 50% of long-form posts may be written or influenced by generative AI tools (Elad 2025). Under the hood, LLMs are now being proposed as means to scale activities such as content moderation, detecting hate speech, etc. (Kumar, Yousef, and Durumeric 2024; Singh, Bhattacharjee, and Chakraborty 2025). However, this widespread adoption comes with critical challenges. The probabilistic nature of LLMs leads to models preferentially generating viewpoints that are highly represented, and consequently, a biased world-view is derived from the training corpus (Bender et al. 2021a; Seth et al. 2025). Since Internet corpora are heavily skewed toward English, models may disproportionately reflect Western cultural sensibilities (Joshi et al. 2020). This risks marginalizing non-Western perspectives and may also lead to the dissemination of harmful stereotypes, such as linking specific religions to violence (Abid, Farooqi, and Zou 2021). As these models are increasingly integrated into education, research, and other everyday tasks, their potential to shape public discourse in ways that reinforce existing prejudices becomes a significant concern (Weidinger, Mellor, et al. 2021; Bender et al. 2021b; Santurkar et al. 2023). While efforts to mitigate these
linguistic and cultural biases are ongoing, research on cultural alignment has largely centred on American citizens and has been conducted almost exclusively in English (Santurkar et al. 2023; Durmus et al. 2023). This approach not only overlooks the majority of the world's population but also ignores the fact that an LLM's responses can vary significantly depending on the language of the prompt (Kang and Kim 2025). This linguistic disparity is especially problematic for the vast multilingual populations of Asian nations. For instance, while religion's role has declined in many Western nations (Pew Research Center 2025, 2018; Australian Bureau of Statistics 2022), it remains a central and politically significant aspect of society across much of Asia (The Print 2025; Elvia Muthiariny 2024; Pew Research Center 2023). For billions of multilingual and non-English speaking users, ensuring that LLMs are culturally and linguistically representative is a critical challenge that must be addressed (Santurkar et al. 2023; Green et al. 2023).

Given the scale of LLM adoption, a lack of alignment risks profound social consequences. These risks are already evident on social media, where the proliferation of AI-generated content has contributed to polarised discourse and marginalised certain groups, such as the LGBTQ+ community (Kerwin 2024; Bakshy, Messing, and Adamic 2015). In this work, we perform an in-depth, multilingual analysis of LLM cultural alignment across several Asian nations, using religion as a critical lens. We aim to answer the following research questions:
1. How accurately do contemporary LLMs represent public opinion on sensitive religious topics, relative to their performance on broader social issues?
2. Does prompting in a local language mitigate or worsen existing representational biases towards specific demographic groups within a country?
3. How do high-level distributional gaps translate to concrete representational harms on region-specific bias benchmarks?

Figure 1: Evaluation framework for assessing LLMs, where human opinion distributions from Pew surveys (India, Sri Lanka, East Asia, and South East Asia) are compared with model-generated distributions to measure representativeness across various categories.

We adapt the methodology of Santurkar et al. (2023) to answer the aforementioned questions, measuring alignment through a quantitative "representativeness" metric, based on the divergence between the model's logit-induced probability distribution and nationally representative survey data. We use Jensen-Shannon Divergence (Lin 1991) and Hellinger Distance (Hellinger 1909) as our primary evaluative metrics to conduct a robust and multi-faceted analysis, enabling us to pinpoint specific linguistic and demographic biases. Responses are evaluated in both English and local languages across diverse Asian nations, providing a nuanced evaluation of the global cultural alignment of LLMs. Figure 1 provides a summary of our methodology.

To demonstrate how high-level distributional gaps manifest as concrete representational harms in downstream tasks, we evaluate the models using a suite of culturally aware bias benchmarks that offer broad geographic and typological coverage: CrowS-Pairs (Nangia et al. 2020),
IndiBias (Sahoo et al. 2024), ThaiCLI (Kim et al. 2025), and KoBBQ (Jin et al. 2024). Our evaluation reveals that while the contemporary models are generally representative of different Asian populations, they consistently struggle to generate truly representative opinions on religion and identity-related topics. At the same time, on our bias benchmarks, LLMs consistently rate negative framings of religious communities, such as Sunni and Shia Muslims, as more plausible than positive ones. This pattern likely reflects both uneven community representation and the influence of negative stereotypes embedded in online discourse.

In summary, our work underscores the need to systematically evaluate how AI models represent religious and cultural identities worldwide before their widespread adoption. To enable further research in this direction, we have made our codebase and other resources publicly available on GitHub¹.

Related Work
Numerous studies demonstrate that Large Language Models (LLMs) often reflect the cultural values of English-speaking and Protestant European nations (Tao et al. 2024; Huang et al. 2023). This has led to models frequently aligning more closely with United States-centric viewpoints and failing to capture community-specific knowledge (Sukiennik et al. 2025; Etxaniz et al. 2024). Recent comparative audits further show that LLMs manifest regionally variable degrees of alignment, with notable misrepresentations persisting in Asian, African, and Latin American contexts (Bentley, Evans, and Bull 2025; AlKhamissi et al. 2024). Complementary work has also used LLMs to study how users engage with
harmful or misleading content at scale, for example, classifying whether audiences express support or skepticism toward mental-health misinformation and revealing platform-specific amplification patterns and annotation reliability gaps (Nguyen et al. 2025).

The lack of culturally representative data can lead to large gaps in societal and religious viewpoints, a particularly critical issue in multilingual and plural societies (Qin et al. 2025; Gamboa, Feng, and Lee 2024; Chhikara, Kumar, and Chakraborty 2025). A cross-lingual evaluation by del Arco, Pelloni, and Zampieri (2024) found persistent religious stereotyping and refusals among LLMs, particularly for minority faith groups, underscoring the scarcity of systematic approaches to religion-focused NLP bias.

However, measuring and mitigating these biases can be challenging. Alignment scores can flip entirely based on methodological choices like prompt formatting and question selection (Khan, Casper, and Hadfield-Menell 2025). Popular alignment techniques like Reinforcement Learning from Human Feedback often perpetuate existing biases from base models, including those related to gender (Ovalle et al. 2024; Zhang et al. 2024). LLM-based judges sometimes favor reward style over actual accuracy (Feuer et al. 2025), while using LLMs as annotators introduces systematic labeling biases that flow into downstream systems. In hate speech detection, for example, LLM-generated labels show demographic and dialect-linked disparities that prompting and ensembling strategies don't fully address (Okpala and Cheng 2025). Studies comparing human and LLM annotators find their bias profiles differ
substantially, with LLMs potentially amplifying under-detection for minority targets (Giorgi et al. 2025).

¹ https://github.com/HariShankar08/LLMOpinions

Various strategies have been proposed to improve cultural alignment. Simple interventions like local-language prompting and demographic priming show both promise and clear limits in reducing bias (AlKhamissi et al. 2024; Bentley, Evans, and Bull 2025; Chhikara et al. 2024). Data-centric approaches use LLMs to generate semantic augmentations, such as denoising rewrites or contextual explanations, that strengthen small harmful-content datasets and improve detection even in low-resource settings (Meguellati et al. 2025). More fundamental methods involve pre-training on targeted local data to help models acquire specific cultural knowledge (Etxaniz et al. 2024). Some techniques work at deeper levels, like D2O, which uses human-labeled negative examples during training (Duan et al. 2024), or FairSteer, which applies corrective adjustments to model activations at inference time without retraining (Li et al. 2025). Despite this progress, the limits of these methods for comprehensive cultural adaptation remain unclear (Liu, Korhonen, and Gurevych 2025; Qin et al. 2025), and researchers continue developing new metrics to measure representational harms more precisely (Shin et al. 2024; Hida, Yamaguchi, and Hanawa 2024). A synthesis of current approaches suggests that scalable, region-specific audits and the curation of native survey data are necessary to ensure LLMs are deployed equitably worldwide (Qin et al. 2025; del Arco, Pelloni, and Zampieri 2024; Bentley, Evans,
and Bull 2025). Foundational work by Santurkar et al. (2023) evaluates whether model outputs reflect nationally representative opinion data, providing a methodology for such audits. Our research advances this paradigm by introducing multilinguality into existing datasets and extending evaluation beyond Western-centric benchmarks. Specifically, we augment standard resources with data in multiple languages and systematically test LLMs on tasks that foreground both religion and multilingual alignment, with particular emphasis on India and East/Southeast Asia. Leveraging large-scale, nationally representative Pew surveys and regionally salient cultural datasets (Maguire 2017), we address a critical gap: evaluating how LLMs align with local public opinion on religion across diverse linguistic contexts, especially in societies where religion remains deeply intertwined with social and political identity.

Establishing Ground Truth: Survey Data and Bias Benchmarks
A key challenge in auditing LLMs for cultural alignment is the scarcity of high-quality, large-scale data that reflects public opinion outside Western contexts. To address this gap, our study is built upon a robust foundation of survey data from the Pew Research Center. We utilise data from three major surveys conducted under the Pew-Templeton Global Religious Futures Project, which together provide a comprehensive view of societal attitudes and religious beliefs across 12 countries and territories in Asia. These surveys are: Religion in India: Tolerance and Segregation
(IND) (Sahgal and Evans 2021), Religion and Views of an Afterlife in East Asia (EA) (Evans 2024a), and Buddhism, Islam and Religious Pluralism in South and Southeast Asia (SEA) (Evans 2024b). Figure 2 summarises the respondent counts and regional coverage for each survey.

Figure 2: The treemap shows respondent counts from 12 countries/territories across India (green), East Asia (blue), and Southeast Asia (orange), with block area proportional to sample size. These nationally representative surveys (Pew Research Center 2021, 2024a, 2024b) form the empirical ground truth for measuring LLM representativeness, enabling robust cross-country comparisons on religion and social attitudes.

The methodological rigour of these surveys makes them an ideal benchmark for our analysis. Data collection for the IND survey was completed in 2021, while the EA and SEA surveys were completed in 2024. To ensure the samples were nationally representative, the Pew Research Center employed a multi-stage, stratified random sampling design. This involved segmenting each country into primary sampling units (e.g., states or provinces), randomly selecting locations within those units, and then systematically selecting households from those locations. To ensure that our ground-truth human response distributions accurately reflect the national populations, all our analyses use the statistical weights provided by Pew Research, which correct for sampling design and non-response biases.

Translation of Survey Questionnaires
While the original surveys were administered in local languages, the publicly available metadata provides the survey questions and response options only in English. To effectively test the multilingual capabilities of LLMs and align our prompts with the
original survey context, a comprehensive translation of the survey instruments was required. To avoid the common pitfalls of machine translation systems, such as the failure to capture specific socio-cultural contexts, we opted for a high-fidelity, crowd-sourced manual translation pipeline prior to our experiments. Experienced translators were recruited for each target language via crowdsourcing to ensure semantic accuracy and cultural relevance, and were selected based on their prior work experience in similar translation tasks. Approximately 70% of the annotators were native speakers of the target language, while the remaining participants were proficient speakers who had learnt the language. To ensure high-quality outputs, translators were explicitly instructed to:
1. Preserve the original meaning and intent of the survey questions and response options.
2. Maintain cultural nuance and context, especially concerning sensitive topics.
3. Ensure the resulting translations sound natural and as human-like as possible.

To evaluate the reliability of the translation process and ensure consistency, overlapping subsets of the survey questions were assigned to multiple annotators to measure inter-annotator agreement, where we noted a strong average agreement score (Cohen's κ = 0.82). Any discrepancies or conflicting interpretations were resolved through consensus discussions among the translators or adjudicated by a third, senior bilingual reviewer to establish the final phrasing.
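The agreement check itself uses standard tooling. Below is a minimal sketch, assuming each of two translators assigned a categorical adequacy label to every item in an overlapping subset; the labelling scheme behind the reported κ is not specified in the paper, so the labels here are purely illustrative.

```python
# Minimal sketch of the inter-annotator agreement check, assuming two annotators
# assign categorical adequacy labels to the same overlapping items. Illustrative only.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["adequate", "adequate", "minor-issue", "adequate", "inadequate"]
annotator_b = ["adequate", "minor-issue", "minor-issue", "adequate", "inadequate"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa on the overlapping subset: {kappa:.2f}")
```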
As an additional validation measure, a subset of the translated survey questions was back-translated into English using machine translation tools and compared against the original text to identify potential inconsistencies. We ensured that important terms (e.g., religion-related concepts) were translated consistently across all questions. Translations were further reviewed for clarity, formatting, and alignment with response options. Together, these steps helped ensure that the translated survey items remained faithful to the original meaning while preserving cultural appropriateness and nuanced interpretation.

All translators were paid in accordance with their preferred compensation rates, with a total cost incurred of USD 125. The task involved minimal risk, as translators worked only with publicly available survey questions, and no personal or sensitive data were collected. Only the translated text outputs were retained for analysis, with no identifiers linking translations to individual contributors. The specific languages targeted for each country are detailed in Table 1.

Country   Prompt Languages          Country   Prompt Languages
IND       en, hi                    KHM       en, km
HKG       en, zh-Hant               IDN       en, id
JPN       en, ja                    MYS       en, ms, zh-Hans
KOR       en, ko                    SGP       en, ms, zh-Hans, ta
TWN       en, zh-Hant               LKA       en, si, ta
VNM       en, vi                    THA       en, th

Table 1: Language coverage by country: For every country, prompts were issued in both English and one or more local languages to facilitate a within-country analysis of language effects. Country names are represented by three-letter ISO codes, and languages by their corresponding two-letter ISO codes.
Dataset       Primary format
CrowS-Pairs   Pairwise sentences
IndiBias      Pairwise / judgment (English + Hindi)
ThaiCLI       Question / chosen / rejected
KoBBQ         QA / templates (template-expanded)

Table 2: Cross-cultural benchmarks used to evaluate bias and representational harms in LLMs. These corpora span multiple regions and task formats: pairwise judgments (CrowS-Pairs, IndiBias), culturally grounded question-answering (KoBBQ), and culturally sensitive response scoring (ThaiCLI).

Bias Evaluation
To comprehensively evaluate the models' representativeness and misrepresentation, we use four complementary, culturally aware bias benchmarks: CrowS-Pairs (Nangia et al. 2020), IndiBias (Sahoo et al. 2024), ThaiCLI (UpstageAI 2025; Kim et al. 2025), and KoBBQ (Jin et al. 2024). Together, these corpora allow (i) measuring pairwise stereotyping tendencies in both masked and generative language, (ii) targeted probing of representational harms in Indian contexts (English/Hindi), (iii) evaluating Thai cultural/pragmatic alignment, and (iv) QA-style bias assessment in Korean. These benchmarks provide broader geographic and typological coverage and enable cross-cultural comparison of representational performance. We provide a summary of the benchmarks in Table 2.

CrowS-Pairs
We use CrowS-Pairs as a foundational benchmark to measure general stereotyping preferences. The benchmark (Nangia et al. 2020) consists of sentence pairs that contrast stereotypical and non-stereotypical statements. Its pairwise format is ideal for calculating plausibility comparisons and directional metrics (e.g., pairwise win rates or ∆ELO-style plausibility differences), offering a clear, language-neutral baseline for misrepresentation.

IndiBias
IndiBias is a benchmark designed specifically for the South Asian context (Sahoo et al. 2024). It is uniquely
designed to test for biases along India-relevant identity axes (e.g., religion, caste, region, gender, occupation) in both English and Hindi. While smaller than some Western-centric corpora, IndiBias fills a crucial gap by explicitly testing representational harms related to South Asian identities and evaluating multilingual model behaviour in this domain.

ThaiCLI
The ThaiCLI benchmark (Kim et al. 2025) evaluates the alignment of large language models (LLMs) with Thai cultural norms using a set of Question, Chosen, Rejected triplets, where each question is paired with both a culturally appropriate (Chosen) and inappropriate (Rejected) answer. The benchmark covers seven thematic domains (royal family, religion, culture, economy, humanity, lifestyle, and politics) in two distinct formats. The majority are 1,790 factoid questions designed to assess cultural sensitivity and factual accuracy in a conversational context. The remaining 100 samples are instruction-based prompts that challenge the model to perform a task, such as summarisation, testing its ability to follow directions while generating a culturally aware output.

KoBBQ
The Korean Bias Benchmark for QA (KoBBQ) adapts the BBQ methodology (Parrish et al. 2021) to Korean QA settings using template expansions and culturally localised target lists (Jin et al. 2024). The benchmark is particularly useful for analysing how biases manifest differently after translation versus native localisation, allowing us to assess the QA-style behaviour of multilingual models.

Measuring Cultural Alignment and Bias
To measure how well an LLM aligns with the cultural
views of a specific country, we adapt the methodology proposed in Santurkar et al. (2023), which compares the model's "opinions" against real-world public opinion data. This approach involves two distinct components: analyzing the model's probabilistic outputs, and aggregating weighted human survey responses.

The Model Opinion Distribution
Each question in our selected surveys consists of a multiple-choice question with at most one selected answer. These questions are passed to the model as a prompt, which in turn predicts the answer. The LLM assigns a probability to every possible next step (or "token"), and we extract the probability the model assigns to each answer option. For example, the model might assign a 70% probability to option A, 20% to B, 8% to C and 2% to D. This resulting set of probabilities for a given question is defined as the Model Opinion Distribution, denoted as D_M.

At a technical level, these probability values are derived either through the model's log-probabilities (GPT-4o, Gemini, accessed through model APIs) or through the internal logits (Llama, Mistral, Gemma).² To ensure consistency of model outputs, we set the model's temperature and all random seeds to zero. As we are not directly generating text, this setup effectively removes randomness from the generation procedure.

² Local experiments were conducted on a server equipped with three NVIDIA RTX 5000 GPUs.
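For the open-weight models, the option probabilities can be read directly off the next-token logits. The sketch below illustrates the idea with Hugging Face transformers; the prompt template and the assumption that every option label ("A"-"D") maps to a single token are ours, not necessarily the exact setup of the released pipeline.

```python
# Illustrative sketch: derive a Model Opinion Distribution D_M for one survey
# question from an open-weight model's next-token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"  # any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = (
    "How important is religion in your life?\n"
    "A. Very important\nB. Somewhat important\n"
    "C. Not too important\nD. Not at all important\n"
    "Answer: "
)
option_labels = ["A", "B", "C", "D"]

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]  # next-token logits

# Keep only the logits of the option tokens and renormalise them into D_M.
option_ids = [tok.encode(label, add_special_tokens=False)[0] for label in option_labels]
d_m = torch.softmax(logits[option_ids], dim=-1)
print(dict(zip(option_labels, [round(p, 3) for p in d_m.tolist()])))
```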
The Human Opinion Distribution
To compare the model's output for each survey question against human respondents, we first create a probability distribution to encode and aggregate all respondents. Each response is treated as a definitive selection, with a probability of 1 for the chosen option and 0 for all others. Subsequently, we aggregate the responses using the demographic weights provided in the survey data to ensure that the human responses form a more representative view of the overall survey public. We define this as the Human Opinion Distribution, denoted as D_O. By comparing D_M and D_O for each survey question, we can quantitatively assess how closely the model's internal likelihoods mirror the prevailing opinions of the human population.

Computing Alignment
We employ three metrics, each converted into a score bounded between 0 and 1. As our primary metrics, we use the Jensen-Shannon Divergence (JSD) and Hellinger Distance (HD) to compare D_M and D_O. The corresponding alignment scores are defined as follows:

$A_{\mathrm{JSD}}(D_M, D_O; Q) = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{JSD}\left(D_M(q), D_O(q)\right)$  (1)

$A_{\mathrm{HD}}(D_M, D_O; Q) = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{HD}\left(D_M(q), D_O(q)\right)$  (2)

To support these metrics, we also use the formulation of "representativeness" proposed by Santurkar et al. (2023). The metric employs the Wasserstein Distance (WD), which effectively accounts for the ordinal structure of the answer options. Alignment between two distributions D_1 and D_2 over a set of questions Q is defined as:

$A_{\mathrm{WD}}(D_1, D_2; Q) = \frac{1}{|Q|} \sum_{q \in Q} \left[1 - \frac{\mathrm{WD}\left(D_1(q), D_2(q)\right)}{N - 1}\right]$  (3)

where N is the number of answer choices for each question. Based on this equation, the representativeness of a language model M with respect to the population O on a set of questions Q is therefore:

$R_M(Q) = A_{\mathrm{WD}}(D_M, D_O; Q)$  (4)

Each metric discussed above is bounded between 0 and 1. For R_M, which is formulated as an alignment score, a higher value indicates better representativeness. Conversely, the Jensen-Shannon Divergence and Hellinger Distance are measures of dissimilarity, where lower values signify a closer match between the model and human opinion distributions.
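The per-question scores in Equations (1)-(4) can be computed with off-the-shelf SciPy routines. The sketch below is a minimal illustration under our own assumptions about data layout (a weight-normalised one-hot aggregate for D_O and ordinal option positions 0..N-1); it is not the released evaluation code.

```python
# Sketch of the alignment scores in Eqs. (1)-(4) for a single question, assuming
# D_O is built by weight-normalising one-hot respondent choices.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def human_distribution(choices, weights, n_options):
    """Aggregate weighted one-hot survey responses into D_O(q)."""
    d_o = np.zeros(n_options)
    for choice, weight in zip(choices, weights):
        d_o[choice] += weight
    return d_o / d_o.sum()

def alignment_scores(d_m, d_o):
    """Return (JSD, Hellinger distance, Wasserstein-based representativeness)."""
    jsd = jensenshannon(d_m, d_o, base=2) ** 2          # squared JS distance = divergence, in [0, 1]
    hd = np.sqrt(0.5 * np.sum((np.sqrt(d_m) - np.sqrt(d_o)) ** 2))
    n = len(d_m)
    positions = np.arange(n)                            # ordinal positions of the answer options
    wd = wasserstein_distance(positions, positions, d_m, d_o)
    representativeness = 1.0 - wd / (n - 1)             # per-question term of Eq. (3)
    return jsd, hd, representativeness

d_m = np.array([0.70, 0.20, 0.08, 0.02])                # model opinion distribution
d_o = human_distribution([0, 1, 0, 3, 1], [1.2, 0.8, 1.0, 0.5, 1.1], n_options=4)
print(alignment_scores(d_m, d_o))
```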
Evaluation on Bias Benchmarks

Figure 3: Qualitative examples for two religion-related bias benchmarks. CrowS-Pairs operationalizes bias via minimal pairs scored with pseudo-log-likelihood, while ThaiCLI uses instruction-preference judgments with explicit chosen vs. rejected responses.

CrowS-Pairs
We evaluate bias using the religion subset of CrowS-Pairs by converting each minimal pair into a binary-choice prompt that asks the model to "Choose the more socially acceptable sentence" and to reply with only a single option code (1 or 2). We run both OpenAI and Gemini models via API with deterministic settings (low temperature, short max tokens) and request token-level log probabilities. When a model returns a text response, we parse the first numeric choice; if absent, we fall back to the first generation step, where the logprobs indicate emission of '1' or '2'. Each pair is aligned to a ground-truth mapping of the less-biased option, and we report the anti-stereotype preference rate as the percentage of pairs where the model selects that option (higher is better). We discard malformed responses that do not yield a recoverable choice.
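A rough sketch of this choice-extraction and scoring logic is given below. The field names mirror an OpenAI chat-completions response with logprobs enabled; they are assumptions about plumbing, not the paper's released code.

```python
# Sketch: parse the first '1' or '2' from the reply text; otherwise fall back to
# the log-probabilities of the first generated token. Responses with no
# recoverable choice are treated as malformed and discarded.
import re

def extract_choice(reply_text, first_step_top_logprobs):
    """Return '1', '2', or None if no choice is recoverable."""
    match = re.search(r"[12]", reply_text or "")
    if match:
        return match.group(0)
    scores = {}
    for entry in first_step_top_logprobs:   # e.g. choices[0].logprobs.content[0].top_logprobs
        token = entry.token.strip()
        if token in ("1", "2"):
            scores[token] = max(scores.get(token, float("-inf")), entry.logprob)
    return max(scores, key=scores.get) if scores else None

def anti_stereotype_rate(predictions, less_biased_options):
    """Percentage of pairs where the model picked the less-biased sentence."""
    valid = [(p, g) for p, g in zip(predictions, less_biased_options) if p is not None]
    return 100.0 * sum(p == g for p, g in valid) / max(len(valid), 1)
```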
IndiBias
We evaluated large language models (LLMs) on the IndiBias benchmark's religion plausibility task, which presents pairs of positive (pro-identity) and negative (anti-identity) scenarios for Indian religious identities and asks the model to select the more plausible scenario. Using the official pipeline, we generated prompts with GPT-4o-Mini and then ran both GPT-4o-Mini (via the OpenAI API) and Gemini-2.5-Flash (via Google's Vertex AI API) on these prompts, ensuring robust batch processing and rate-limit handling. For each model, we computed ELO scores (Sahoo et al. 2024) for every identity in both positive and negative splits, and defined a misrepresentation score as the difference between negative and positive ELOs (∆ELO), where higher values indicate a greater tendency to normalise negative framings for that identity. This methodology enables a direct, quantitative comparison of representational asymmetries across models and religious identities.

ThaiCLI
To assess model outputs, an LLM-as-a-Judge paradigm is used: a strong LLM (GPT-4o) rates a model's generated answer for each question, given the Chosen/Rejected examples, on a scale from 1 to 10, along with an explanation. The final ThaiCLI score per model is computed by averaging over the two question formats (Factoid and Instruction). If score extraction fails (via regular-expression matching) in the Judge's response, the judgement is re-generated up to a fixed number of attempts, with zero assigned only if it still fails.
KoBBQ
We evaluate our models on the test split of the KoBBQ benchmark, constructing multiple-choice prompts from the evaluation templates. The models are subsequently queried deterministically by setting the temperature to zero. To extract the answer, we parse the first instance of 'A', 'B', or 'C' from OpenAI responses. For Gemini, we use structured output constrained to the enum {A, B, C}. We report overall accuracy and also break down performance by BBQ category and by label annotation (ambiguous vs. disambiguated).
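The answer-extraction step for the OpenAI responses reduces to a short regular expression; a small illustrative sketch (not the released code) is shown below.

```python
# Sketch of KoBBQ answer extraction: take the first occurrence of 'A', 'B', or
# 'C' in a free-text reply. Enum-constrained (structured) Gemini outputs can be
# used verbatim and need no parsing.
import re

def parse_kobbq_answer(reply_text):
    match = re.search(r"[ABC]", reply_text)
    return match.group(0) if match else None

assert parse_kobbq_answer("The answer is B, because ...") == "B"
```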
Experimental Evaluation
We evaluate GPT-4o-Mini and Gemini-2.5-Flash. Gemini-2.5-Flash attains a high representativeness of 94.6% on non-religious items but dips to ≈89.9% on religious prompts (questions whose text contains "religion" or "religious"); GPT-4o-Mini shows a similar pattern (95.2% non-religious vs ≈90.2% religious). Divergence shifts on the religion subset are small: GPT-4o-Mini's JSD is essentially flat (ΔA_JSD = −0.004) with a slight Hellinger increase (ΔA_HD = +0.008), while Gemini-2.5-Flash shows modest decreases (ΔA_JSD = −0.018; ΔA_HD = −0.019). For reference, both metrics range from 0 (identical distributions) to 1 (maximally different). Notably, we find that simple prompt-based steering, such as prefixing prompts with demographic context like "You are a citizen of ...", can shift model outputs toward the target distribution and reduce measured distributional divergence on religion-related queries, as shown in Figure 4.

Figure 4: Representativeness scores (R_M) of GPT-4o-Mini and Gemini-2.5-Flash on non-religious versus religious items. While both models achieve high representativeness on non-religious prompts (>94%), their scores dip on religious items.
To contrast religion with other question types, we group question text with a simple keyword taxonomy (religion/religious; demographics such as age, gender, education, income, region, language; and governance/politics such as government, elections, law).
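A minimal sketch of such a keyword grouping is shown below; the exact keyword lists are assumptions beyond the examples quoted in the text.

```python
# Illustrative keyword taxonomy for grouping survey questions into the
# categories compared in this section.
TAXONOMY = {
    "religion": ["religion", "religious"],
    "demographics": ["age", "gender", "education", "income", "region", "language"],
    "governance/politics": ["government", "election", "law"],
}

def categorize(question_text):
    text = question_text.lower()
    for category, keywords in TAXONOMY.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"

print(categorize("How often do you attend religious services?"))  # -> religion
```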
Representativeness is highest on governance/politics items (95.2% for GPT-4o-Mini; 94.6% for Gemini-2.5-Flash), followed by other non-religion items (92.8% / 89.5%) and demographic questions (88.8% / 90.8%), while religion-related items remain lowest (90.2% / 89.9%).

Figure 5: Change in Hellinger distance (ΔH = H_local − H_EN) when switching from English to local-language prompts for Gemma-3-12B across multiple locales. Negative values (bars below zero) indicate that local-language prompting reduces the divergence between model and human distributions.

The open-weight models (Gemma-3, Llama-3.2, Mistral-7B) (Google Research 2025; Meta AI 2024; Mistral AI 2024) mirror the behavior of GPT-4o-Mini and Gemini-2.5-Flash in achieving high representativeness (R_M > 0.91) on non-religious prompts but exhibit significant misrepresentation of religion and identity, which is particularly acute in East and Southeast Asia. However, we find that prompting in the local language consistently mitigates this issue. As detailed in Table 3, switching from English to local languages reduces divergence (A_JSD) across all tested models. The effect is most pronounced for Gemma-3 in Sri Lanka, where Sinhala prompts yield a ∼31% reduction in A_JSD. Despite these improvements in divergence, the Hellinger distance (A_HD) remains largely resistant to language changes (Figure 5), suggesting that while local languages improve distributional overlap, fundamental probability shifts remain difficult to correct.

Figure 6 demonstrates these results, contrasting Jensen-Shannon distances under local-language versus English prompts for each model-country pair. Across all three models, local-language prompting lowers divergence, indicating better alignment of
predicted distributions with human responses. These results suggest that native-language cueing helps models focus probability mass more accurately on the correct response, rather than diffusing it across plausible alternatives.

Figure 6: Effect of prompt language on religion-related items across distinct model-country pairs (Kr = Korea, Tw = Taiwan, SL = Sri Lanka). Lower Jensen-Shannon distance indicates better alignment.

Model                       Region (Lang)         R_M    A_JSD (Eng → Loc)   A_HD (Eng → Loc)
Gemma-3 12B-IT              Sri Lanka (Sinhala)   0.96   0.47 → 0.32         0.49 → 0.47
Llama-3.2 1B-Instruct       Taiwan (Chinese)      0.95   0.88 → 0.81         0.86 → 0.86
Mistral-7B Instruct-v0.3    Korea (Korean)        0.91   0.53 → 0.48         0.48 → 0.48

Table 3: Impact of local-language prompting: switching to local languages consistently reduces divergence (A_JSD), while Hellinger Distances (A_HD) remain stable.

CrowS-Pairs: Cross-Lingual Stereotype Probing
Our results (see Figure 7) reveal that GPT-4o-Mini is consistently robust, selecting the anti-stereotype option in ∼92% of cases (bias rate ∼8%), with zero invalids across all languages. In contrast, Gemini-2.5-Flash exhibits higher bias rates (∼16%), lower anti-stereotype accuracy (∼68%), and a notable fraction of invalid responses (15–19/105), especially in Vietnamese. These findings indicate that while GPT-4o-Mini robustly resists religious stereotyping across languages, Gemini-2.5-Flash is both more prone to stereotype selections and more likely to abstain or produce off-format outputs, raising concerns about cross-lingual consistency and safety filtering.

Figure 7: Cross-lingual bias rates on the religion-only subset of CrowS-Pairs across six languages (105 items/locale). GPT-4o-Mini shows low bias (≈8%) and high anti-stereotype accuracy (≈92%) consistently across languages. Gemini-2.5-Flash exhibits higher bias (≈16%), lower anti-stereotype accuracy (≈68%), and more invalid responses, indicating weaker cross-lingual stereotype resistance.
IndiBias: Plausibility and Misrepresentation Analysis
We find that GPT-4o-Mini exhibits clear calibration gaps across identities. The most misrepresented groups, as indicated by high ∆ELO, are Shia (+28.9), Sunni (+23.3), Jain (+16.8), and Parsi (+16.5), with smaller effects for Buddhist (+4.3) and Bahai (+4.1). Conversely, Hindu (−13.0), Sufi (−10.1), Sikh (−9.3), Christian (−5.0), and Bohra Muslim (−1.2) exhibit negative or minimal misrepresentation, indicating higher plausibility for positive framings. This pattern suggests that negative descriptions are disproportionately normalized for certain identities, evidencing persistent group-specific miscalibration. Gemini-2.5-Flash shows broadly convergent trends; e.g., Sunni also exhibits elevated negative plausibility (∆ELO > 0). These results (see Figure 8) highlight the need for careful evaluation of demographic representativeness and fairness in LLM outputs.

Figure 8: Misrepresentation of Indian religious identities on IndiBias using GPT-4o-Mini. The plot shows ∆ELO = ELO_neg − ELO_pos, where positive values indicate that negative descriptions are judged more plausible than positive ones. Several minority identities (Shia, Sunni, Jain, Parsi) show strong misrepresentation, while others (Hindu, Sikh, Sufi) show the opposite trend, highlighting systematic group-specific calibration gaps.
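The ∆ELO score can be thought of as the gap between two Elo tables, one fitted on the positive split and one on the negative split. The sketch below uses a textbook Elo update over pairwise plausibility outcomes; the exact tournament construction follows the official IndiBias pipeline (Sahoo et al. 2024) and is only approximated here.

```python
# Rough sketch: identities accumulate Elo ratings from pairwise plausibility
# "matches" separately on the positive and negative splits, and
# delta_elo = ELO_neg - ELO_pos (positive => negative framings look more plausible).
def update_elo(rating_a, rating_b, a_wins, k=32):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return rating_a + k * (score_a - expected_a), rating_b - k * (score_a - expected_a)

def elo_table(matches, start=1000.0):
    """matches: iterable of (identity_a, identity_b, a_wins) plausibility outcomes."""
    ratings = {}
    for a, b, a_wins in matches:
        ratings[a], ratings[b] = update_elo(ratings.get(a, start), ratings.get(b, start), a_wins)
    return ratings

pos = elo_table([("Shia", "Hindu", False), ("Sunni", "Sikh", False)])
neg = elo_table([("Shia", "Hindu", True), ("Sunni", "Sikh", True)])
delta_elo = {i: neg.get(i, 1000.0) - pos.get(i, 1000.0) for i in set(pos) | set(neg)}
print(delta_elo)
```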
ThaiCLI: Cultural Sensitivity in Thai
Results show that GPT-4o-Mini is highly aligned with Thai cultural norms, achieving average scores above 8.3 across both factoid and instruction prompts, scoring 8.10 on 10, and maintaining consistently high performance across sensitive themes such as religion and the royal family. Gemini-2.5-Flash also demonstrates strong cultural sensitivity, with a score of 7.52 on 10, but lags behind OpenAI in both absolute score and consistency.

KoBBQ: Disambiguation and Calibration on Korean Identity Benchmarks
On GPT-4o-Mini, we observe substantial gains in model calibration with disambiguation: overall accuracy rises from 0.611 (ambiguous) to 0.961 (disambiguated), religion-related accuracy improves from 0.625 to 0.950, and differential bias (religion) decreases sharply, from 0.275 (ambiguous) to 0.1 (disambiguated). All other demographic axes (e.g., race, gender, education) exhibit similar improvements with disambiguation (see Figure 9).

Figure 9: Effect of prompt disambiguation on GPT-4o-Mini performance on the KoBBQ Korean identity benchmark. Disambiguating prompts improves overall accuracy (0.611→0.961) and religion-related accuracy (0.625→0.950), while sharply reducing bias (0.275→0.100). These results highlight the critical role of prompt specificity in mitigating group-level calibration failures in LLM outputs.

Discussion
This study reveals significant disparities in the cultural alignment of Large Language Models (LLMs) across diverse Asian nations. While models like GPT-4o-Mini and Gemini-2.5-Flash demonstrate high overall representativeness on general social topics, they consistently falter
when representing public opinion on the sensitive domain of religion. This observed misalignment does not seem to be limited to interactions in English. The study indicates that these representational gaps persist, and in some cases are amplified, when the models are prompted in various local languages. This pattern suggests that the challenge may be deeply rooted in the models' predominantly English-centric training data and subsequent alignment processes, rather than being a simple matter of translation.

The persistence of these gaps across multiple languages raises important considerations for the global deployment of these technologies. It points to a potential risk of propagating a specific cultural viewpoint, often one that is more aligned with Western contexts, even when users are interacting in their native tongue. This challenges the notion that multilingual capability alone is sufficient for equitable performance across different cultural settings.

At the same time, this research introduces a layer of complexity to this narrative. We found that lightweight interventions, such as using a local language or providing demographic context in the prompt, can sometimes lead to partial improvements in alignment scores. This may indicate that the models possess some latent cultural knowledge that is not always activated by default, hinting at potential avenues for developing more effective steering and fine-tuning methods in the future.

Drivers of Cultural Misalignment
To fully address these disparities, it is necessary to examine the structural mechanisms that entrench them. Misalignment primarily stems from
imbalances in training data, where demographic groups like ethnic minorities, low-income classes, or speakers of non-dominant languages are underrepresented or stereotyped in vast internet-sourced corpora. This leads models to encode dominant cultural norms, such as Western or English-centric values, resulting in poor cultural alignment for other personas (AlKhamissi et al. 2024). Spatial, temporal, and collection biases exacerbate this, with data skewed toward high-resource regions and outdated societal views, causing models to default to majority stereotypes in tasks like sentiment analysis or coreference resolution.

Furthermore, post-training alignment techniques, including instruction-tuning and reinforcement learning from human feedback (RLHF), amplify these issues rather than resolve them. Feedback data typically reflects majority preferences and fails to generalize to minority moral norms or dialects. Safety alignments can create demographic hierarchies, with higher refusal rates for prominent groups but vulnerabilities for long-tail minorities like those with disabilities (Guo et al. 2024). As scaling worsens disparities without targeted mitigation, linguistic ambiguities and extrinsic biases in downstream tasks further entrench misalignment, where models misinterpret regional variants or generate homogeneous representations of subordinate groups.

Finally, fundamental model limitations, such as poor cross-lingual transfer and the curse of multilinguality in training, may hinder equitable semantic encoding across cultures. While prompting in native languages improves performance, it does not fully bridge gaps for
digitally underrepresented personas. Architecture choices and tokenization strategies often favor high-resource languages, perpetuating epistemic gaps in low-resource contexts (Gallegos et al. 2024). Although the literature emphasizes diverse pretraining data and persona-specific fine-tuning to address these issues, ethical concerns regarding deployment persist.

It is important to note that cultural alignment varies across model architectures, and multilinguality does not guarantee cultural representativeness. We see this in Figure 6, where A_JSD for Llama 3.2 in Taiwan is very high (> 0.8) regardless of the language of the prompt, indicating an overall failure to represent the opinions of the population. Models may demonstrate fluency in a target language while still reflecting the values of its dominant training data. Addressing this requires fine-tuning on corpora that genuinely capture the target population's perspective. This may include integrating native-authored narratives, hyper-local journalism, informal vernacular and regional civic texts to represent local norms and viewpoints accurately.

Alternative Steering Methods
While this study centers on prompt-based steering for evaluating cultural alignment, deeper interventions from recent literature offer promising avenues for more profound model adaptation. Activation engineering, such as Activation Addition, enables inference-time steering by adding vectors derived from contrasting activations (e.g., positive vs. negative sentiment prompts), achieving state-of-the-art control over outputs like toxicity reduction without retraining (Turner et al. 2023). Representation engineering
further refines this by mean-centring steering vectors to enhance steerability across tasks, including genre shifts or function triggering, as demonstrated in benchmarks on models like LLaMA (Zou et al. 2024; Jorgensen et al. 2023). Feedback-driven approaches, including RLHF or DPO, have been shown to amplify instruction-following while embedding local norms, though they risk overfitting or cross-cultural interference without diverse data (Sharma et al. 2024a,b).

These alternatives hold potential to reshape models' internal representations for robust handling of cultural and religious diversity, surpassing prompt-level guidance. However, their use is often constrained in production settings. Leading models like GPT-4o and Gemini-2.5-Flash operate as black-box APIs, denying access to weights or activations, and thereby making prompt engineering the primary lever for most users and applications. Thus, emphasizing prompting provides a pragmatic assessment of publicly available tools, underscoring that true deep alignment demands shifts in model training and access paradigms (Guo et al. 2025).
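For context, the sketch below illustrates the Activation Addition idea on a small open-weight stand-in: a steering vector is taken as the difference of hidden activations for two contrasting prompts and added back at the same layer during generation. The layer index, scaling factor, and prompts are illustrative choices, not the configuration of Turner et al. (2023).

```python
# Conceptual sketch of Activation Addition on GPT-2 (a small stand-in model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = model.transformer.h[6]  # a mid-depth GPT-2 block; other architectures differ

def hidden_at(prompt):
    captured = {}
    def grab(_module, _inputs, output):
        captured["h"] = output[0][:, -1, :].detach()  # last-token hidden state
    handle = layer.register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]

steering_vector = hidden_at("Love") - hidden_at("Hate")  # contrasting prompt pair

def add_steering(_module, _inputs, output):
    # Add the scaled steering vector to the block's hidden states at every step.
    return (output[0] + 4.0 * steering_vector,) + output[1:]

handle = layer.register_forward_hook(add_steering)
ids = model.generate(**tok("I think that", return_tensors="pt"), max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(ids[0], skip_special_tokens=True))
```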
Limitations and Future Work
This work provides a broad analysis, yet its methodology entails certain constraints that highlight opportunities for future work.

Religious opinion as a primary lens for investigating cultural values offers a critical but not exhaustive view. The cultural fabric of any society is woven from many threads, and other dimensions, such as political ideologies, regional identities, and social hierarchies, are also important areas that would benefit from similar in-depth, multilingual analysis.

Future experiments could extend this analysis to evaluate different white-box steering methods, such as activation steering or fine-tuning on culturally specific data. Additionally, research should focus on developing benchmarks that capture the complex, multi-dimensional nature of cultural and religious diversity, moving beyond the simple binary traits targeted by current steering techniques. This line of inquiry could help foster the development of LLMs that are not just multilingual in their textual output but are more multicultural in their underlying understanding, thereby addressing the kinds of representational gaps that this work has brought to light.

Acknowledgements
The authors gratefully acknowledge the financial support provided by the EkStep Foundation. We also thank the members of the Precog Research Group of IIIT Hyderabad for their help and guidance during the experimental design phase. Finally, we thank the anonymous reviewers whose feedback substantially improved the quality of the paper.

References
Abid, A.; Farooqi, M.; and Zou, J. 2021. Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6): 461–463.
AlKhamissi, B.; ElNokrashy, M.; AlKhamissi, M.; and Diab, M. 2024. Investigating cultural alignment of large language models. arXiv preprint arXiv:2402.13231.
Australian Bureau of Statistics. 2022. Census of Population and Housing: Reflecting Australia – Stories from the Census, 2021.
Backlinko. 2025. ChatGPT Users: ChatGPT Usage Statistics (2025). Accessed: 2025-08-21.
Bakshy, E.; Messing, S.; and Adamic, L. A. 2015. Exposure to ideologically diverse news and opinion on Facebook. Science, 348(6239): 1130–1132.
Bender, E. M.; Gebru, T.; McMillan-Major, A.; and Shmitchell, S. 2021a. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
Bender, E. M.; Gebru, T.; McMillan-Major, A.; and Shmitchell, S. 2021b. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT).
Bentley, S. V.; Evans, D.; and Bull, P. E. 2025. What social stratifications in bias blind spot can tell us about implicit social bias in both LLMs and humans. Scientific Reports, 15: 14875.
Chhikara, G.; Kumar, A.; and Chakraborty, A. 2025. Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions. arXiv preprint arXiv:2501.16748.
Chhikara, G.; Sharma, A.; Ghosh, K.; and Chakraborty, A. 2024. Few-shot fairness: Unveiling LLM's potential for fairness-aware classification. arXiv preprint arXiv:2402.18502.
del Arco, F. P.; Pelloni, T.; and Zampieri, M. 2024. Divine LLaMAs: Bias, Stereotypes, Stigmatization, and Refusal Behaviors of Language Models for Judaism and Islam. In Findings of the Association for Computational Linguistics: EMNLP 2024, 12300–12313. Association for Computational Linguistics.
Duan, S.; Yi, X.; Zhang, P.; Liu, Y.; Liu, Z.; Lu, T.; Xie, X.; and Gu, N. 2024. Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Durmus, E.; Nyugen, K.; Liao, T. I.; Schiefer, N.; Askell, A.; Bakhtin, A.; Chen, C.; Hatfield-Dodds, Z.; Hernandez, D.; Joseph, N.; et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.
Elad, B. 2025. AI in Social Media Tools Statistics 2025: Uncover What’s Shaping the Future.
Elvia Muthiariny, D. 2024. Indonesia Ranks Highest in Global Religious Devotion. Tempo.co. Based on Pew Research Center survey (2008–2023).
Etxaniz, J.; Azkune, G.; Soroa, A.; de Lacalle, O. L.; and Artetxe, M. 2024. BertaQA: How Much Do Language Models Know About Local Culture? In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS).
Evans, J. 2024a. East Asian Societies Survey Dataset.
Evans, J. 2024b. South and Southeast Asia Survey Dataset.
Feuer, B.; Goldblum, M.; Datta, T.; Nambiar, S.; Besaleli, R.; Dooley, S.; Cembalest, M.; and Dickerson, J. P. 2025. Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking. In The Thirteenth International Conference on Learning Representations (ICLR).
Gallegos, I. O.; Rossi, R. A.; Barrow, J.; Tanjim, M. M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; and Ahmed, N. K. 2024. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3): 1097–1179.
Gamboa, L. C. L.; Feng, Y.; and Lee, M. 2024. Social Bias in Multilingual Language Models: A Survey. arXiv preprint arXiv:2508.20201.
Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J. W.; Wallach, H.; Daumé III, H.; and Crawford, K. 2021. Datasheets for Datasets. Communications of the ACM, 64(12): 86–92.
Giorgi, T.; Cima, L.; Fagni, T.; Avvenuti, M.; and Cresci, S. 2025. Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), volume 19, 653–670. Copenhagen, Denmark: AAAI Press.
Google Research. 2025. Gemma 3 12B-It. https://huggingface.co/google/gemma-3-12b-it.
Green, J.; et al. 2023. ChatGPT Has Been Sucked Into India’s Culture Wars. Wired. News account of public controversy documenting asymmetric ChatGPT responses to jokes about religious figures. Accessed 2025-09-01.
Guo, Y.; Guo, M.; Su, J.; Yang, Z.; Zhu, M.; Li, H.; Qiu, M.; and Liu, S. S. 2024. Bias in large language models: Origin, evaluation, and mitigation. arXiv preprint arXiv:2411.10915.
Guo, Z.; et al. 2025. The Unreliability of Evaluating Cultural Alignment in LLMs. arXiv:2503.08688.
Hellinger, E. 1909. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136: 210–271.
Hida, N.; Yamaguchi, K.; and Hanawa, K. 2024. Social Bias Evaluation for Large Language Models Requires Prompt Variations. arXiv preprint arXiv:2407.18376.
Huang, H.; Yu, F.; Zhu, J.; Sun, X.; Cheng, H.; Song, D.; Chen, Z.; Alharthi, A.; An, B.; Liu, Z.; Zhang, Z.; Chen, J.; Li, J.; Wang, B.; Zhang, L.; Sun, R.; Wan, X.; Li, H.; and Xu, J. 2023. AceGPT, Localizing Large Language Models in Arabic. In North American Chapter of the Association for Computational Linguistics (NAACL).
Jin, J.; Kim, J.; Lee, N.; Yoo, H.; Oh, A.; and Lee, H. 2024. KoBBQ: Korean Bias Benchmark for Question Answering. Transactions of the Association for Computational Linguistics, 12: 507–524.
Jorgensen, O.; Cope, D.; Schoots, N.; and Shanahan, M. 2023. Improving Activation Steering in Language Models with Mean-Centring. arXiv:2312.03813.
Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; and Choudhury, M. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6282–6293. Association for Computational Linguistics. Highlights lack of representation for many languages in NLP resources.
Kang, E.; and Kim, J. 2025. LLMs Are Globally Multilingual Yet Locally Monolingual: Exploring Knowledge Transfer via Language and Thought Theory. arXiv preprint arXiv:2505.24409.
Kerwin, P. 2024. How Should AI Depict Marginalized Communities? CMU Technologists Look to a More Inclusive Future. Accessed September 12, 2025.
Khan, A.; Casper, S.; and Hadfield-Menell, D. 2025. Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT).
Kim, D.; Lee, S.; Kim, Y.; Rutherford, A.; and Park, C. 2025. Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics. International Committee on Computational Linguistics.
Kumar, D.; Yousef, A.; and Durumeric, Z. 2024. Watch Your Language: Investigating Content Moderation with Large Language Models. arXiv:2309.14517.
Li, Y.; Fan, Z.; Chen, R.; Gai, X.; Gong, L.; Zhang, Y.; and Liu, Z. 2025. FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering.
Lin, J. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1): 145–151.
Liu, C. C.; Korhonen, A.; and Gurevych, I. 2025. Cultural Learning-Based Culture Adaptation of Language Models. In Che, W.; Nabende, J.; Shutova, E.; and Pilehvar, M. T., eds., Proceedings of the Association for Computational Linguistics (ACL).
Maguire, E. 2017. How East and West think in profoundly different ways. BBC Future. BBC Future Series.
Meguellati, E.; Zeghina, A. O.; Sadiq, S.; and Demartini, G. 2025. LLM-Based Semantic Augmentation for Harmful Content Detection. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), volume 19, 1190–1209. Copenhagen, Denmark: AAAI Press.
Meta AI. 2024. Llama 3.2 1B Instruct. https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct.
Mistral AI. 2024. Mistral 7B Instruct v0.3. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3.
Nangia, N.; Vania, C.; Bhalerao, R.; and Bowman, S. R. 2020. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Webber, B.; Cohn, T.; He, Y.; and Liu, Y., eds., Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Nguyen, V. C.; Jain, M.; Chauhan, A.; Soled, H. J.; Alvarez Lesmes, S.; Li, Z.; Birnbaum, M. L.; Tang, S. X.; Kumar, S.; and De Choudhury, M. 2025. Supporters and Skeptics: LLM-Based Analysis of Engagement with Mental Health (Mis)Information Content on Video-Sharing Platforms. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), volume 19, 1329–1345. Copenhagen, Denmark: AAAI Press.
Okpala, E.; and Cheng, L. 2025. Large Language Model Annotation Bias in Hate Speech Detection. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM), volume 19, 1389–1418. Copenhagen, Denmark: AAAI Press.
Ovalle, A.; Pavasovic, K. L.; Martin, L.; Zettlemoyer, L.; Smith, E. M.; Chang, K.-W.; Williams, A.; and Sagun, L. 2024. The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT).
Parrish, A.; Chen, A.; Nangia, N.; Padmakumar, V.; Phang, J.; Thompson, J.; Htut, P. M.; and Bowman, S. R. 2021. BBQ: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193.
Pew Research Center. 2018. Being Christian in Western Europe.
Pew Research Center. 2023. 5 facts about religion in South and Southeast Asia.
Pew Research Center. 2025. Modeling the Future of Religion in America: Recent Trends and Projections.
Qin, Y.; Wang, L.; Tan, Z.; and Li, H. 2025. A Survey on Large Language Models with Multilingualism. arXiv preprint arXiv:2405.10936.
Sahgal, N.; and Evans, J. 2021. India Survey Dataset.
Sahoo, N.; Kulkarni, P.; Ahmad, A.; Goyal, T.; Asad, N.; Garimella, A.; and Bhattacharyya, P. 2024. IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context. In Duh, K.; Gomez, H.; and Bethard, S., eds., Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
Santurkar, S.; Durmus, E.; Ladhak, F.; Lee, C.; Liang, P.; and Hashimoto, T. 2023. Whose opinions do language models reflect? In International Conference on Machine Learning, 29971–30004. PMLR.
Seth, A.; Choudhary, M.; Sitaram, S.; Toyama, K.; Vashistha, A.; and Bali, K. 2025. How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion. arXiv preprint arXiv:2508.03712.
Sharma, P.; et al. 2024a. Rethinking Cultural Value Adaptation in LLMs. arXiv:2505.16408.
Sharma, P.; et al. 2024b. Teaching Norms to Large Language Models.
Shin, J.; Song, H.; Lee, H.; Jeong, S.; and Park, J. 2024. Ask LLMs Directly, “What shapes your bias?”: Measuring Social Bias in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, 16122–16143. Bangkok, Thailand: Association for Computational Linguistics.
Similarweb. 2025. Top Websites Ranking - Most Visited Websites In The World. https://www.similarweb.com/top-websites/. Accessed January 12, 2026.
Singh, D. D.; Bhattacharjee, R.; and Chakraborty, A. 2025. Rethinking hate speech detection on social media: Can LLMs replace traditional models? arXiv preprint arXiv:2506.12744.
Sukiennik, N.; Gao, C.; Xu, F.; and Li, Y. 2025. An Evaluation of Cultural Value Alignment in LLM. arXiv.
Tao, Y.; Viberg, O.; Baker, R. S.; and Kizilcec, R. F. 2024. Cultural bias and cultural alignment of large language models. PNAS Nexus, 3(9): pgae346.
TechCrunch. 2025. ChatGPT Users Send 2.5 Billion Prompts a Day, OpenAI Tells Axios. Accessed: 2025-08-21.
The Print. 2025. 24% Indians identify as religious nationalists; 57% Hindus feel religious texts should shape laws: Pew. The Print.
Turner, A. M.; Thiergart, L.; Leech, G.; Udell, D.; Vazquez, J. J.; Mini, U.; and MacDiarmid, M. 2023. Steering Language Models With Activation Engineering. arXiv:2308.10248.
UpstageAI. 2025. ThaiCLI and Thai-H6 Benchmarks. https://github.com/UpstageAI/ThaiCLI H6. GitHub repository.
Weidinger, L.; Mellor, J. F. J.; et al. 2021. Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359.
Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L. B.; Bourne, P. E.; Bouwman, J.; Brookes, A. J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, S.; Evelo, C. T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A. J.; Groth, P.; Goble, C.; Grethe, J. S.; Heringa, J.; ’t Hoen, P. A.; Hooft, R.; Kuhn, T.; Kok, R.; Kok, J.; Lusher, S. J.; Martone, M. E.; Mons, A.; Packer, A. L.; Persson, B.; Rocca-Serra, P.; Roos, M.; van Schaik, R.; Sansone, S.-A.; Schultes, E.; Sengstag, T.; Slater, T.; Strawn, G.; Swertz, M. A.; Thompson, M.; van der Lei, J.; van Mulligen, E.; Velterop, J.; Waagmeester, A.; Wittenburg, P.; Wolstencroft, K.; Zhao, J.; and Mons, B. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3: 160018.
Zhang, D.; Yu, Y.; Li, C.; Dong, J.; Su, D.; Chu, C.; and Yu, D. 2024. MM-LLMs: Recent Advances in MultiModal Large Language Models. In Annual Meeting of the Association for Computational Linguistics (ACL).
Zou, A.; Phute, M.; Golding, L.; and Shah, R. 2024. Representation Engineering for Large-Language Models. arXiv:2502.17601.
Paper Checklist
1. For most authors...
(a) Would answering this research question advance science without violating social contracts, such as violating privacy norms, perpetuating unfair profiling, exacerbating the socio-economic divide, or implying disrespect to societies or cultures? Yes
(b) Do your main claims in the abstract and introduction accurately reflect the paper’s contributions and scope? Yes
(c) Do you clarify how the proposed methodological approach is appropriate for the claims made? Yes
(d) Do you clarify what are possible artifacts in the data used, given population-specific distributions? Yes
(e) Did you describe the limitations of your work? Yes
(f) Did you discuss any potential negative societal impacts of your work? NA
(g) Did you discuss any potential misuse of your work? Yes
(h) Did you describe steps taken to prevent or mitigate potential negative outcomes of the research, such as data and model documentation, data anonymization, responsible release, access control, and the reproducibility of findings? Yes
(i) Have you read the ethics review guidelines and ensured that your paper conforms to them? Yes
2. Additionally, if your study involves hypotheses testing...
(a) Did you clearly state the assumptions underlying all theoretical results? Yes
(b) Have you provided justifications for all theoretical results? Yes
(c) Did you discuss competing hypotheses or theories that might challenge or complement your theoretical results? Yes
(d) Have you considered alternative mechanisms or explanations that might account for the same outcomes observed in your study? Yes
(e) Did you address potential biases or limitations in your theoretical framework? Yes
(f) Have you related your theoretical results to the existing literature in social science? Yes
(g) Did you discuss the implications of your theoretical results for policy, practice, or further research in the social science domain? Yes
3. Additionally, if you are including theoretical proofs...
(a) Did you state the full set of assumptions of all theoretical results? NA
(b) Did you include complete proofs of all theoretical results? NA
4. Additionally, if you ran machine learning experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Yes
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Yes
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? NA
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Yes
(e) Do you justify how the proposed evaluation is sufficient and appropriate to the claims made? Yes
(f) Do you discuss what is “the cost” of misclassification and fault (in)tolerance? NA
5. Additionally, if you are using existing assets (e.g., code, data, models) or curating/releasing new assets, without compromising anonymity...
(a) If your work uses existing assets, did you cite the creators? Yes
(b) Did you mention the license of the assets? NA
(c) Did you include any new assets in the supplemental material or as a URL? NA
(d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? Yes
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? Yes
(f) If you are curating or releasing new datasets, did you discuss how you intend to make your datasets FAIR (see Wilkinson et al. (2016))? NA
(g) If you are curating or releasing new datasets, did you create a Datasheet for the Dataset (see Gebru et al. (2021))? NA
6. Additionally, if you used crowdsourcing or conducted research with human subjects, without compromising anonymity...
(a) Did you include the full text of instructions given to participants and screenshots? Yes
(b) Did you describe any potential participant risks, with mentions of Institutional Review Board (IRB) approvals? Yes
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? Yes
(d) Did you discuss how data is stored, shared, and deidentified? Yes