Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs
Summary
This study investigates whether improved multilingual capabilities in large language models (LLMs) lead to better cultural alignment. Using a novel methodology that compares LLM-generated responses to population-level data from the World Values Survey, the researchers evaluated three model families (Google's Gemma, AI2's OLMo, and OpenAI's turbo-series) across four languages: Danish, Dutch, English, and Portuguese. The findings reveal no consistent relationship between multilingual capability and cultural alignment. While Gemma models showed positive correlations across all languages, OpenAI and OLMo models exhibited inconsistent or negative relationships, depending on the language. The study also addresses US-centric bias, finding that LLMs align more with US values in languages spoken across several countries, such as English and Portuguese, but less so in languages tied mainly to a single country, such as Danish and Dutch. These results highlight that enhancing cultural alignment requires more than just improving language capabilities; it demands dedicated efforts to address cultural representation. The researchers emphasize the need for diverse participation in LLM development and further investigation into alignment techniques that account for cultural value distributions.
Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs

Jonathan Rystrøm, Oxford Internet Institute, University of Oxford, UK
Hannah Rose Kirk, Oxford Internet Institute, University of Oxford, UK
Scott A. Hale, Oxford Internet Institute, University of Oxford, UK
Correspondence: jonathan.rystrom@oii.ox.ac.uk

Abstract

Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations. A key concern is US-centric bias, where LLMs reflect US rather than local cultural values. We propose a novel methodology that compares LLM-generated response distributions against population-level opinion data from the World Values Survey across four languages (Danish, Dutch, English, and Portuguese). Using a rigorous linear mixed-effects regression framework, we compare three families of models: Google's Gemma models (2B–27B parameters), AI2's OLMo models (7B–32B parameters), and successive iterations of OpenAI's turbo-series. Across the families of models, we find no consistent relationships between language capabilities and cultural alignment. While the Gemma models have a positive correlation between language capability and cultural alignment across all languages, the OpenAI and OLMo models are inconsistent. Our results demonstrate that achieving meaningful cultural alignment requires dedicated effort beyond improving general language capabilities.

1 Introduction

Spearheaded by accessible chat interfaces to powerful models like ChatGPT (OpenAI, 2022), LLMs are reaching hundreds of millions of users (Milmo, 2023). These models are deployed across diverse contexts: from tutoring mathematics (Khan, 2023) to building software applications (Peng et al., 2023) to assisting in legal cases (Tan et al., 2023).
While most LLMs demonstrate multilingual abilities (Üstün et al., 2024), the ability to communicate across languages does not necessarily translate into appropriate cultural representations. Disentangling language capabilities and cultural alignment is crucial for understanding how LLMs should be examined and audited (Mökander et al., 2024) and for ensuring these technologies work for diverse people (D'ignazio and Klein, 2023; Weidinger et al., 2022).

Figure 1: The relationship between multilingual capability and cultural alignment is inconsistent across LLM families, as shown by coefficients from our linear mixed-effects model (β_multilingual = β_flm; Eq. 3; §3.2). OpenAI and OLMo models show negative or insignificant relationships outside of Danish and Dutch, while Gemma models show positive relationships throughout (p < .05).

Given the Silicon Valley origins of many frontier AI labs and the prevalence of American English training data, we might expect LLMs to exhibit US-centric cultural biases despite their multilingual capabilities. These companies comprise a narrow slice of human experience, limiting the voices that contribute to critical design decisions in LLMs (D'ignazio and Klein, 2023). They typically train LLMs on massive amounts of predominantly English text and employ American crowd workers to rate and evaluate the LLMs' responses (Johnson et al., 2022; Kirk et al., 2023). Far too often, the benefits and harms of data technologies are unequally distributed, reinforcing biases and harming already minoritized groups (Birhane, 2020; Milan and Treré, 2019; Khandelwal et al., 2024).

arXiv:2502.16534v2 [cs.CL] 30 Aug 2025
Understanding how LLMs represent different cultures is thus paramount to establishing risks of representational harm (Rauh et al., 2022) and ensuring the technology's utility is shared across diverse communities.

Increasing diversity and cross-cultural understanding is stymied by unchecked assumptions in both alignment techniques and evaluation methodologies. First, there is an assumption that bigger and more capable LLMs trained on more data will be inherently easier to align (Zhou et al., 2023; Kundu et al., 2023), but this sidesteps the thorny question of pluralistic variation and cultural representations (Kirk et al., 2024b). Thus, it is unclear whether improvements in architecture (Fedus et al., 2022) and post-training methods (Kirk et al., 2023; Rafailov et al., 2023) translate into improvements in cultural alignment.

Although studies like the World Values Survey (WVS) have documented how values vary across cultures (EVS/WVS, 2022), it remains unclear whether more capable LLMs (through scaling or improved training) better align with these cultural differences (Bai et al., 2022; Kirk et al., 2023). While the WVS has been used in prior research on values in LLMs, these studies have focused predominantly on individual models' performance within an English-language context (Cao et al., 2023; Arora et al., 2023; AlKhamissi et al., 2024). This paper addresses this gap by developing a methodology for assessing how well families of LLMs represent different cultural contexts across multiple languages.
We compare two distinct paths to model improvement: systematic scaling of instruction-tuned models and commercial product development comprising scaling and innovation in post-training to accommodate pressures from capabilities, cost, and preferences (OpenAI et al., 2024b).

Given these considerations, we investigate the following research questions:

RQ1 Multilingual Cultural Alignment: Does improved multilingual capability increase LLM alignment with population-specific value distributions?
RQ2 US-centric Bias: When using different languages, do LLMs align more with US values or with values from the countries where these languages are native?

We operationalise multilingual capability as an LLM's performance on a range of multilingual benchmarks across languages (see, e.g., Nielsen, 2023). We describe the specific benchmarks and performances in the supplementary materials.

This work makes several key contributions. First, we introduce a novel distribution-based methodology for probing cultural alignment across languages, moving beyond direct survey approaches to better capture latent cultural values (Sorensen et al., 2024). Second, we provide the first systematic comparison of how improvements in scale and post-training affect cultural alignment and US-centric bias across English, Danish, Dutch, and Portuguese through a series of robust statistical models. Third, we release a dataset of model-generated responses across multiple languages and cultural contexts as well as our code, enabling future research into cultural alignment and bias.¹ Together, these contributions advance our understanding of how LLM development choices influence cultural representation while providing tools for ongoing investigation of these critical issues.
¹ See github.com/jhrystrom/multicultural-alignment for code, data, and supplementary materials.

2 Measuring Cultural Alignment

Figure 2: Pearson correlations in value polarity scores across studied countries from the World Values Survey. Value polarity scores are the fraction of the population in favour of a given topic. All correlations are positive, with most being between 0.7 and 0.95.

This section defines "cultural alignment" and how to measure it in LLMs. We conceptualise cultural alignment as reproducing distributions of values in a particular population. Then we show how to a) get a ground-truth distribution of values using the World Values Survey (§2.1) and b) elicit value distributions from LLMs (§2.2).

Cultural alignment as value reproduction: Within a culture there will be a variety of stances on any particular topic. However, the distribution of stances will be characteristic of each culture. For instance, only around 8% of Danes are opposed to abortion, making it a much less contentious topic than in the US, where the share is close to 40% (EVS/WVS, 2022). We posit that cultural alignment for a specific group of people can be operationalised as how well an LLM reproduces the distribution of values over a wide range of topics (Sorensen et al., 2024). Investigating distributions of responses differs from previous work that directly surveys the LLMs as regular participants (e.g., Cao et al., 2023).
This approach also addresses concerns raised by Khan et al. (2025) about the instability of survey-based evaluations by focusing on aggregate distributions rather than individual responses and incorporating explicit controls for response consistency. Our goal is to get more naturalistic elicitations of the underlying values whilst avoiding sycophancy and response bias (Sharma et al., 2023).

We operationalise reproduction as high correlations between value polarity scores: the fraction of people (or LLM responses) in favour of a topic in the population. Note that we binarise issues to allow for simpler operationalisation. Below, we describe how we empirically estimate the value polarity score for the ground truth (§2.1) and LLMs (§2.2).

2.1 Ground Truth: World Values Survey

To get a "ground truth" distribution of cultural values, we use the joint World Values Survey and European Values Survey (EVS; EVS/WVS, 2022). These surveys cover adults across 92 countries with samples that are nationally representative for gender, age, education, and religion. The surveys' broad coverage enables cross-cultural comparability for the many countries covered by the surveys, though some scholars note challenges in ensuring response comparability across countries (Alemán and Woods, 2016). The WVS provides both country and language identifiers for each respondent, allowing us to define populations either as citizens of a country or speakers of a language using the same underlying respondent-level data.

We select questions with binary agree/disagree or rating scale formats that allow clear classification of positive vs. negative stances, excluding questions with multiple categorical response options (see the supplementary materials for the full list of questions). These questions span environment, work, family, politics, religion, and security.
We convert responses to binary indicators by determining whether each response indicates support for the measured construct, with custom coding to handle the various question formats and reverse-scored items. Finally, we calculate the value polarity score as the demographically weighted proportion of respondents with affirmative stances. Formally, we can define the value polarity score for a given population, P (e.g., citizens in a country or speakers of a language) and topic, q (i.e., question within the EVS/WVS), as shown in Eq. 1:

$$\mathrm{VPS}_{P,q} = \sum_{i \in P} \frac{w_i}{\sum_{j \in P_q} w_j} A_{i,q} \qquad (1)$$

Here, A_{i,q} is a binary indicator of whether participant i has a positive stance on topic q, w_i represents the survey-provided demographic weights, and P_q denotes respondents in population P who answered question q. The first term normalises the weights to account for missing responses and enables aggregation across any definition of a population (e.g., residents in a country, speakers of a language, etc.).

For example, if 80% of Danish respondents who answered the same-sex marriage question expressed support (after demographic reweighting), Denmark's value polarity score for this topic would be 0.8. Thus, a culture's values can be represented as a vector, where each element corresponds to a value polarity score for a specific topic.

2.2 Ecologically valid LLM responses

Testing cultural alignment effectively requires embedding contextual and cultural elements in ways that maintain ecological validity. At a high level, eliciting values from an LLM consists of two steps: 1) iteratively prompting the model with the selected topics and 2) extracting the stances from each model response.
Setting prompt context: Developing ecologically valid prompts requires careful consideration. When evaluating LLM responses to value-laden topics, simply asking questions like "What proportion of people support topic X?" or "Do you support topic X?" proves inadequate (e.g., Rozado, 2024). Such direct approaches suffer from three key limitations: they generate false positives through excessive agreement, fail to reflect realistic usage patterns, and provide insufficient variation to assess cultural alignment (Röttger et al., 2024). They also struggle to capture instance-specific harms that emerge when systems misalign with users' cultural contexts (Rauh et al., 2022).

Instead, we adopt an implicit approach by asking the model to generate responses from hypothetical respondents. For example, we prompt: "imagine surveying 10 random people on topic X. What are their responses?" This method reveals the model's latent opinion distribution while avoiding the limitations of direct questioning. Details for prompt construction are provided in the supplementary materials.

Seeding cultural responses: Having a method for eliciting distributions of values, the next step is to seed culture. One typical way of seeding a specific culture is to explicitly instruct the LLM either by mentioning a specific country ("imagine surveying 10 random Americans") or through describing specific personas ("Imagine surveying an 85-year-old Danish woman..."; AlKhamissi et al., 2024). The problem with these demographic prompting approaches is that they stray from actual uses of LLMs. Users are unlikely to explicitly mention their demographic information or nationality (Zheng et al., 2023a).
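The difference between the implicit survey prompt and explicit demographic seeding can be made concrete with a small sketch. The function `build_survey_prompt` and its parameters are hypothetical names introduced here for illustration; the wording only mirrors the example phrases quoted in the text, and the actual prompt templates are in the paper's supplementary materials.

```python
from typing import Optional


def build_survey_prompt(topic: str, n_respondents: int = 10,
                        group: Optional[str] = None) -> str:
    """Build a hypothetical-survey prompt (illustrative wording only).

    With group=None the prompt names no nationality or persona, so any
    cultural signal has to come from the language the prompt is written in.
    Passing a group (e.g. "Americans") gives the explicit demographic
    seeding that the text argues strays from real usage.
    """
    who = f"{n_respondents} random {group}" if group else f"{n_respondents} random people"
    return f"Imagine surveying {who} on {topic}. What are their responses?"


print(build_survey_prompt("topic X"))
print(build_survey_prompt("topic X", group="Americans"))
```

In the language-as-proxy setup, only the first form would be used, translated into each target language.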
Instead, we use language as a proxy for cultural origin. For instance, a prompt in Danish is assumed to come from a Dane. This approach creates an intentional distinction in our analysis: we can compare "language-level" alignment (all speakers of a language globally) with "country-level" alignment (all people from specific nations where that language is native). As argued by Havaldar et al. (2023), users speaking a particular language would expect culturally appropriate responses in that language. For languages spoken in multiple countries, this approach is intentionally ambiguous. The ambiguity allows us to elicit the underlying "default" alignment rather than the general ability to emulate cultures (Tao et al., 2024). We validate this approach by showing that LLM responses exhibit significantly lower self-consistency between languages compared to within languages, demonstrating that language impacts output (see the supplementary materials). To create prompts across languages, we use gpt-3.5-turbo to translate our original English prompts. Although previous literature has shown strong translation capabilities in LLMs (Yan et al., 2024), we nonetheless manually verify the translations.

Annotating and aggregating responses: Finally, to transform the LLMs' hypothetical survey responses into vectors of stances, we use an LLM-as-a-judge approach (Zheng et al., 2023b; Guerdan et al., 2025).
Specifically, we use gpt-4.1-mini (OpenAI et al., 2025) to label each substatement as either "pro", "con", or "null" given the context of the topic and a representative pro and con statement (generated with an LLM and validated by the authors). We then calculate the proportion of "pro" labels among the "pro" and "con" responses as the LLM's value polarity score for the given statement. For instance, a response with seven "pro", one "con", and two "null" statements would yield a value polarity score of 0.875 (7/8). A complete, unabridged example can be found in the supplementary materials. Formally, we denote each substatement from the full set of hypothetical statements, G_{q,g}, for topic q and generation g as r, and the classifier's label for it as ℓ(r). We then formalise the value polarity score for a given instance of a generation for a topic (VPS^LLM_{q,g}) as shown in Eq. 2:
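$$\mathrm{VPS}^{\mathrm{LLM}}_{q,g} = \frac{\sum_{r \in G_{q,g}} [\, \ell(r) = \text{pro} \,]}{\sum_{r \in G_{q,g}} [\, \ell(r) \in \{\text{pro}, \text{con}\} \,]} \qquad (2)$$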
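As a minimal sketch, Eq. 2 and the survey-side score from Eq. 1 can be computed as below. The function names are hypothetical, the label strings follow the "pro"/"con"/"null" scheme described above, and the survey function assumes it is given only respondents who answered the question (the set P_q); returning `None` when every label is "null" is an assumption, since the text does not say how such responses are handled.

```python
from typing import Iterable, Optional, Sequence


def llm_value_polarity(labels: Iterable[str]) -> Optional[float]:
    """Eq. 2: share of 'pro' among substatements labelled 'pro' or 'con'."""
    labels = list(labels)
    pro = sum(1 for label in labels if label == "pro")
    con = sum(1 for label in labels if label == "con")
    if pro + con == 0:
        return None  # only 'null' labels: the ratio is undefined
    return pro / (pro + con)


def survey_value_polarity(stances: Sequence[int], weights: Sequence[float]) -> float:
    """Eq. 1: weighted share of affirmative stances (1 = in favour, 0 = not).

    Weights are normalised over the respondents passed in, i.e. those who
    answered the question.
    """
    total_weight = sum(weights)
    return sum(w * a for w, a in zip(weights, stances)) / total_weight


# Worked example from the text: seven 'pro', one 'con', two 'null'.
example = ["pro"] * 7 + ["con"] + ["null"] * 2
print(llm_value_polarity(example))  # 0.875
```

Dropping the "null" labels from the denominator is what makes the example come out as 7/8 rather than 7/10.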
These scores are then compared against the value polarity scores from the WVS. Specifically, we calculate the Spearman rank correlation to obtain a measure of similarity between the LLMs' responses and the value distributions of a given population.
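The comparison step can be sketched with a small, dependency-free Spearman implementation (Pearson correlation of average ranks). The two score vectors are made-up illustrative values, not data from the paper, and the helper names are hypothetical; in practice a statistics library would normally be used instead.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two equal-length score vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical value polarity scores over four topics (one entry per topic).
llm_scores = [0.9, 0.4, 0.7, 0.2]
wvs_scores = [0.8, 0.3, 0.5, 0.6]
print(round(spearman(llm_scores, wvs_scores), 3))  # 0.4
```

Because only ranks matter, the measure rewards an LLM for ordering topics by support the way the population does, even if its absolute proportions differ.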
To validate the LLM-as-judge, we manually annotate 200 statements. We iteratively refine the prompts and the LLM used until we reach satisfactory performance. We find a 91% agreement and a mean absolute error for value polarity of 4.5% over the dataset, ensuring consistent statistics between LLM and human annotation (Guerdan et al., 2025).
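The two validation statistics can be sketched as follows. This assumes that agreement is the share of statements where the judge and the human assign the same label, and that the error is the mean absolute difference between value polarity scores derived from each annotation; the paper does not spell out the exact computation, so both definitions and the toy inputs are assumptions.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of statements where judge and human labels match exactly."""
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(human_labels)


def mean_absolute_error(judge_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Mean absolute gap between judge-derived and human-derived polarity scores."""
    gaps = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return sum(gaps) / len(gaps)


# Toy inputs for illustration only.
judge = ["pro", "con", "null", "pro"]
human = ["pro", "con", "pro", "pro"]
print(agreement_rate(judge, human))  # 0.75
print(mean_absolute_error([0.8, 0.5], [0.7, 0.5]))
```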
3 Experimental Setup
To investigate whether improving the multilingual capabilities of LLMs improves cultural alignment, we set up an experiment using a carefully chosen set of models and languages.
We examine two different kinds of model improvements: scaling and commercial product development. These cases provide complementary perspectives on the effects of multilingual capabilities on cultural alignment. Scaling is the most well-studied path to improving LLMs (Kaplan et al., 2020; Ganguli et al., 2022). Commercial product development, on the other hand, comprises both scale and innovation in post-training to accommodate different pressures from capabilities, cost, and preferences (Kirk et al., 2024a). For scaling, we use the instruction-tuned Gemma models (Gemma et al., 2024) and OLMo-2 models (OLMo et al., 2025), while for product development, we use OpenAI's turbo-series models (OpenAI, 2022; OpenAI et al., 2024a,b). We provide details of these model families in §3.1. A breakdown of the computational cost is in the supplementary materials.

Figure 3: Self-consistency in responses for LLMs and WVS countries. LLMs have lower self-consistency than resampled WVS responses (shown by the dashed lines), particularly in non-English languages.

Languages: For the languages, we compare English with Danish, Dutch, and Portuguese. This set allows us to test multiple assumptions about cultural alignment. English represents a widely used case: it is a global language with speakers across many countries represented in the WVS (see Fig. 2). This diversity allows us to assess whether LLMs align more strongly with US values or those of other English-speaking nations.

Danish and Dutch serve as controlled test cases since they are primarily used in a single country. If cultural alignment stems from pre-training data, models should show strong Danish/Dutch cultural alignment when using these languages, despite their small share of training data (Kreutzer et al., 2022).
Alternatively, if alignment emerges from post-training processes, which are predominantly English-based (Blevins and Zettlemoyer, 2022), responses in these languages should align more with US values.

Portuguese presents an interesting case since it is an official language in several countries. We investigate whether the LLM responses are more aligned to Portugal or Brazil, two countries that show distinct value patterns in relation to each other and the US (see Fig. 2). This allows us to test whether an LLM aligns more strongly with one country's values, the aggregate values of all language users, or US values.

For each language-model pair, we collect 300 prompt-response pairs to power our statistical analysis sufficiently (see §3.2). After filtering out responses that either lacked the required hypothetical survey format or were in a language other than the prompt, we obtained between 111 and 299 valid responses per combination. We calculate the correlation in value polarity scores at three levels: country (e.g., US or Denmark), language (pooling all speakers of a given language), and global (weighted values from all WVS/EVS participants).

3.1 Models

We examine three model families representing different development approaches: Gemma (Gemma et al., 2024) and OLMo (OLMo et al., 2025) for improvements through scaling and OpenAI's turbo series for commercial product development, combining scaling with post-training improvements (OpenAI, 2022; OpenAI et al., 2024a,b).
Other preliminary experiments included different versions of LLaMA models (Touvron et al., 2023) and Mistral models (Jiang et al., 2023). However, these models either failed to consistently follow instructions or always answered in English regardless of the prompt language. See the supplementary materials for a more thorough description of the LLMs.

3.2 RQ1: Multilingual Cultural Alignment

To statistically assess whether improving the multilingual capabilities of LLMs improves cultural alignment, we construct a linear mixed-effects regression (LMER; Luke, 2017) based on the experimental setup described above. Our LMER follows standard practices and has three core components:

• Core coefficient: The coefficient of interest is the three-way interaction between model family, language, and multilingual capability. This tests whether the multilingual capability–alignment relationship differs by model family and response language, directly addressing RQ1.
• Random effects: We include a model-specific random intercept α_j to account for repeated measures of cultural alignment for the same LLM. This models variation between LLMs and can improve efficiency over standard linear regressions (Luke, 2017).
• Control for self-consistency: We include a consistency-by-language term to help ensure that higher alignment scores reflect genuine cultural adaptation rather than reduced response noise, which can inflate scores (Kahneman et al., 2021).

Figure 4: Language capability (x-axis) vs cultural alignment scores (y-axis) across languages. Stars indicate significance (p < .05) in our linear mixed-effects regression of multiple runs (see §3.2). OpenAI models (blue) and OLMo models (red) show negative/insignificant relationships outside of English, while the Gemma models (green) show positive relationships throughout (p < .05).
We calculate self-consistency as the Spearman correlation between value polarity scores (defined in §2) of repeated responses to identical topics, adjusted by the reliability of the LLM annotation (see §2.2; Charles, 2005). A score of 1.0 indicates perfect consistency; 0.0 indicates random responses. Population-level resampling of the human WVS responses yields values between 0.66 and 0.84 (see Fig. 3 and the supplementary materials).

Formally, the model is specified in Eq. 3:

$$
\begin{aligned}
CA_i &\sim \mathcal{N}(\mu_i, \sigma^2), \\
\mu_i &= \alpha_{j[i]} + \beta_{1l} X_{\mathrm{cons},i} X_{l,i} + \beta_{flm} X_{m,i} X_{f,i} X_{l,i}, \\
\alpha_j &\sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2), \quad j = 1, \dots, J.
\end{aligned}
\qquad (3)
$$

where i indexes responses and j[i] denotes the LLM producing response i. Here X_{cons,i} is the self-consistency score for response i, X_{l,i} is the set of language indicators, X_{f,i} is the set of model-family indicators, and X_{m,i} is the multilingual capability score. The residual variance σ² represents within-LLM variation in alignment scores not explained by the fixed effects or model-specific intercept, while σ²_α represents between-LLM variation in average alignment.

The above statistical model allows us to analyse the relationship between multilingual capabilities and cultural alignment in model families at the level of individual languages. For example, we might find that multilingual capabilities improve cultural alignment for Gemma models for Danish but not for Dutch or vice versa.
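To illustrate how the indicator terms in Eq. 3 work, the sketch below evaluates the mean alignment μ_i for a single response. Because the language and model-family indicators are one-hot, each interaction reduces to looking up the coefficient for that response's language (and family). The function name and all numeric values are hypothetical and are not estimates from the paper; fitting the model itself would require a mixed-effects library.

```python
from typing import Dict, Tuple


def expected_alignment(
    alpha_j: float,
    beta_consistency: Dict[str, float],
    beta_capability: Dict[Tuple[str, str], float],
    family: str,
    language: str,
    consistency: float,
    capability: float,
) -> float:
    """Mean alignment mu_i from Eq. 3 for one response.

    alpha_j is the random intercept of the LLM that produced the response,
    beta_consistency[language] plays the role of beta_{1l}, and
    beta_capability[(family, language)] plays the role of beta_{flm}.
    """
    return (
        alpha_j
        + beta_consistency[language] * consistency
        + beta_capability[(family, language)] * capability
    )


# Hypothetical coefficients and inputs, for illustration only.
mu = expected_alignment(
    alpha_j=0.10,
    beta_consistency={"da": 0.50},
    beta_capability={("gemma", "da"): 0.30},
    family="gemma",
    language="da",
    consistency=0.60,
    capability=0.50,
)
print(round(mu, 2))  # 0.55
```

A positive entry in `beta_capability` for a given family and language corresponds to the case where higher multilingual capability goes with higher cultural alignment, holding self-consistency fixed.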
3.3 RQ2: US-Centric Bias

We analyse model bias by comparing cultural alignment between US and local values, where "local" refers to values in the country or countries where a given language is natively spoken. We define US-centric bias as an LLM showing higher cultural alignment with US value distributions compared to local ones. To quantify this bias, we use a linear regression model that measures the differential effect of US versus local value alignment:

$$CA = \beta_0 + \beta_1 (\mathrm{US}) + \sum_{m \in M} \sum_{l \in L} \beta_{ml} (m \times l) + \sum_{m \in M} \sum_{l \in L} \beta^{\mathrm{US}}_{ml} (\mathrm{US} \times m \times l) + \epsilon \qquad (4)$$

The regression's intercept (β_0, i.e., the base case) is a baseline that produces uniformly random value polarity scores. M is the set of models and L is the set of languages. US is a boolean feature denoting whether the cultural alignment is to the US (if 1) or the local values (if 0). We primarily analyse the coefficients with US (β^US_ml) since these provide the partial effect of US-centric bias, i.e., how much more/less a given LLM is aligned to US rather than local values. Assumption checks for the regression can be seen in the supplementary materials.

4 Results

4.1 Multilingual Cultural Alignment (RQ1)

We first examine the stability of LLMs' cultural values. For LLMs lacking stable internal values, apparent improvements in cultural alignment may reflect reduced response variance rather than genuine advances (Röttger et al., 2024; Kahneman et al., 2021). We therefore analyse both the self-consistency of LLM responses and how alignment changes with model improvements.

LLMs have low self-consistency: We find low self-consistency scores across all models and languages compared to human responses in the WVS data (Fig. 3). LLMs show generally lower self-consistency than the human responses even in English, where instruction-following capabilities are strongest due to English-dominated training data (OpenAI et al., 2024a; Gemma et al., 2024; OLMo et al., 2025).
This lower self-consistency complicates our cultural alignment analysis (Wright et al., 2024). Drawing on Kahneman et al. (2021)'s noise framework, we recognise that inconsistent responses can be as detrimental as bias with respect to the accuracy of the analysis. To address the noise, we employ larger sample sizes and incorporate consistency controls in our regression analyses.

Multilinguality does not imply cultural alignment: The relationship between model improvements and cultural alignment varies substantially across languages and model families (Fig. 1). For Gemma, there is a strong and significant positive relationship between multilingual capabilities and cultural alignment for all languages. In contrast, the relationships for the GPT-Turbo models are either insignificant or negative. For Dutch and Danish the relationships are insignificant (β_gpt,nl = 0.049, p = 0.589; β_gpt,da = 0.053, p = 0.522), and for Portuguese and English the effect is significant and negative (β_gpt,en = −0.24, p = 0.009; β_gpt,pt = −0.30, p < 0.001). Similarly for OLMo, the relationship is positive for Danish and Dutch (β_OLMo,da = 0.44, p < 0.001; β_OLMo,nl = 0.29, p < 0.001) and insignificant for English and Portuguese (β_OLMo,en = 0.068, p = 0.115; β_OLMo,pt = 0.008, p = 0.825).

The mismatch between multilingual performance and cultural alignment could suggest a capability threshold: multilingual improvements might provide rudimentary instruction-following skills (Nie et al., 2024), but beyond a point, other factors, such as the preferences of developers and annotators, dominate (Kirk et al., 2024b).
coefficients than the gpt-turbo models (see Fig. 4 or Fig. 1). Further work is needed to understand alignment at the sub-national level.

Furthermore, the strong effect of self-consistency (0.405 < β_{consistency} < 0.723, p ≪ 0.001) compared to multilingual capability suggests that noise remains a major limiting factor in analysing cultural alignment. This aligns with broader findings about the instability of LLM value elicitation (Röttger et al., 2024; Khan et al., 2025). Moreover, even the highest observed alignment scores (around 0.7; see Fig. 4) indicate substantial room for improvement in how well LLMs match human cultural values and behaviours.

In conclusion, our analysis reveals a complex relationship between model improvements and cultural alignment. Although some languages show progressive improvements in cultural alignment from model scaling or iterative commercial development, others show minimal or inconsistent improvements. These findings, combined with the relatively low self-consistency of LLM responses, demonstrate that improved multilingual capability does not guarantee better cultural alignment.

4.2 US-centric Bias (RQ2)

Here, we answer RQ2 by examining US bias across languages. Specifically, we investigate relative alignment between local and US values (Fig. 5).

Our analysis reveals distinct patterns of US-centric bias across both languages and model families (Fig. 5). Languages show different susceptibilities to US bias: only one of nine LLMs exhibits US-centric bias in Danish, all in English, all in Portuguese, and none in Dutch. Note that for English, these results mean that the LLM, on average, is relatively more
aligned to US values compared to other English-speaking countries like Kenya or the United Kingdom. See the supplementary materials for detailed results.

Figure 5: US-centric bias coefficients across LLMs and languages (β_{BiasUS}; see Eq. 4). Error bars are standard errors from the regression. Positive values indicate the presence of US-centric bias.

The overarching pattern is that languages spoken across countries (English and Portuguese) show US-centric bias, whereas languages spoken predominantly in one country (Danish and Dutch) show less US-centric bias. This supports the hypothesis that homogeneity in the training data can counteract US-centric bias, at least for medium-resourced, Western-European languages.

Some specific LLMs seem more prone to bias across languages. Specifically, the small gemma-2-2b-it exhibits higher US-centric bias across every language except Dutch. Beyond that, we see no clear progressions in US-centric bias within any family.

In conclusion, language seems a stronger indicator of US-centric bias in LLMs than LLM development. Monocultural languages show insignificant to negative bias, while English and Portuguese show significant US-centric bias. Within each LLM family, we find no consistent or significant change in US-centric bias across LLM versions. These findings underscore the complex relationship between multilingual capability and alignment.

5 Related Work

Recent work emphasizes the need for systematic auditing of LLMs' cultural alignment, particularly as these models are deployed globally (Kirk et al., 2024a; Mökander et al., 2024; Kirk et al., 2024b). Prior empirical
approaches have primarily taken two paths: using transformations based on Hofstede's cultural dimensions framework or directly comparing against survey responses. Studies using Hofstede's dimensions (Masoud et al., 2025; Cao et al., 2023) provide structured cross-cultural comparisons through latent variable analysis. However, these studies assume that LLMs' latent dimensions map directly onto human dimensions, since they use formulas calibrated for humans, an assumption that warrants scrutiny (Shanahan, 2024; Schröder et al., 2025).

Recent work has explored using LLMs to simulate responses for assessing cultural alignment (Tao et al., 2024; AlKhamissi et al., 2024; Havaldar et al., 2023). Similarly to our work, these works show that LLMs struggle to represent underrepresented personas (AlKhamissi et al., 2024) and emotions (Havaldar et al., 2023) for non-English languages. Prior approaches focused on individual-level responses. In contrast, our method generates distributions of opinions across hypothetical survey participants, enabling direct comparison with population-level statistics. This distribution-based approach offers three key advantages. First, it better captures the inherent variation in cultural values within populations, paving the way for investigating distributional alignment (Sorensen et al., 2024). Second, it enables principled statistical comparison against large-scale survey data like the World Values Survey (EVS/WVS, 2022). Finally, the framework is easy to extend to new languages by automatically translating the prompts. We detail our quantitative framework for measuring alignment with observed population
distributions in §2.

There is also an increasing body of work investigating political biases in LLMs (Röttger et al., 2024, 2025; Rozado, 2024). Much of this work also relies on human political surveys like the Political Compass Test. However, recent work has called for increased attention to how the randomness inherent in LLM decoding at non-zero temperatures can create instability in measured attributes (Röttger et al., 2024; Wright et al., 2024; Khan et al., 2025). We expand on this work by including multilingual perspectives and constructing prompts with a wide range of variations (see §2). These prompt variations, combined with controlling for self-consistency in our statistical analysis (see §3.2), allow us to get a more robust measure of cultural alignment.

The relationship between model capabilities and cultural alignment remains understudied. Unlike general performance metrics that follow predictable scaling laws (Kaplan et al., 2020), cultural alignment may not improve systematically with model capabilities. This aligns with research showing micro-level capabilities can be discontinuous with scale (Ganguli et al., 2022). The challenge is compounded in multilingual settings (Hoffmann et al., 2022), where static benchmarks with single correct answers fail to capture how cultural values are distributed across different topics and contexts. Previous work has focused primarily on English-language performance (Tao et al., 2024) or individual LLMs (Arora et al., 2023; Cao et al., 2023). Our work extends this by examining how cultural alignment systematically varies within model families and across languages,
providing insight into how different development approaches (scaling and commercial product development) influence cultural representation capabilities.

There is already progress on improving cross-cultural participation in alignment data. Two notable projects are PRISM and AYA (Kirk et al., 2024b; Üstün et al., 2024). PRISM is a large dataset of conversational preferences from a diverse participant pool. While the data is predominantly in English, it could be an important resource for better understanding and modelling diverse cultural preferences. The AYA dataset is a massively multilingual instruction fine-tuning dataset. AYA could provide further means of realising the demonstrated benefits of multilingual training (Nie et al., 2024).

6 Conclusion

Increased multilingual capabilities do not guarantee improved cultural alignment in Large Language Models. Through systematic comparison of three model families (Gemma, OLMo, and OpenAI's GPTs), we find that the relationship between improvements in multilingual capability and cultural alignment is complex. While some languages show clear improvements in alignment with increased model capabilities (e.g., Danish), others exhibit inconsistent patterns, suggesting that cultural alignment does not automatically follow gains in multilingual capabilities. Our distribution-matching methodology using World Values Survey data enabled the detection of these nuanced patterns across languages and cultural contexts. We also find that, contrary to popular discourse, LLMs do not exhibit US-centric bias across all languages; in Danish and Dutch, they align more closely with the values of Denmark and the
Netherlands, respectively, than with the US. This fits with the hypothesis that more culturally uniform data leads to less US-centric bias. Both English and Portuguese are spoken in multiple countries, whereas Dutch and Danish are predominantly spoken in one. To further validate this claim, future work could include other multicultural languages (like Spanish or Swahili) and monocultural languages (like Japanese), especially with a wider geographical reach to preclude European bias.

Our findings highlight that improving cultural alignment requires dedicated effort beyond general capability scaling. Future work should focus on developing techniques that can better handle alignment with distributions of cultural values rather than single points, while ensuring meaningful participation from diverse communities in LLM development. As these models continue to reach wider audiences spanning many geographic and cultural regions, achieving robust cultural alignment becomes increasingly crucial for equitable deployment.

Acknowledgements

We are thankful for the helpful feedback from the anonymous reviewers. We also thank Shiri Dori-Hacohen, Daniel Hershcovich, and others for helpful discussions throughout the project. For compute support, the project used the Microsoft Azure Accelerating Foundation Model Research Grant. This work was supported in part by the Engineering and Physical Sciences Research Council [grant number EP/X028909/1].

References

José Alemán and Dwayne Woods. 2016. Value Orientations From the World Values Survey: How Comparable Are They Cross-Nationally? Comparative Political Studies, 49(8):1039–1067.
Badr AlKhamissi,
Muhammad ElNokrashy, Mai Alkhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12404–12422, Bangkok, Thailand. Association for Computational Linguistics.
Arnav Arora, Lucie-Aimée Kaffee, and Isabelle Augenstein. 2023. Probing pre-trained language models for cross-cultural differences in values. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 114–130, Dubrovnik, Croatia. Association for Computational Linguistics.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, and 22 others. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
Abeba Birhane. 2020. Algorithmic colonization of Africa. SCRIPTed, 17:389.
Terra Blevins and Luke Zettlemoyer. 2022. Language contamination helps explains the cross-lingual capabilities of English pretrained models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563–3574, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. 2023. Assessing cross-cultural alignment between ChatGPT and human societies: An empirical study. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 53–67, Dubrovnik, Croatia. Association for Computational Linguistics.
Eric P. Charles. 2005. The correction for
attenuation due to measurement error: Clarifying concepts and creating confidence sets. Psychological Methods, 10(2):206–226.
Catherine D'Ignazio and Lauren F. Klein. 2023. Data Feminism. MIT Press.
EVS/WVS. 2022. Joint EVS/WVS 2017-2022 Dataset.
William Fedus, Jeff Dean, and Barret Zoph. 2022. A review of sparse expert models in deep learning.
Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, and Nelson Elhage. 2022. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764.
Team Gemma, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Bobak Shahriari, Alexandre Ramé, Johan Ferret, and 187 others. 2024. Gemma 2: Improving open language models at a practical size.
Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, and Alexandra Chouldechova. 2025. Validating LLM-as-a-judge systems in the absence of gold labels.
Shreya Havaldar, Bhumika Singhal, Sunny Rai, Langchen Liu, Sharath Chandra Guntuku, and Lyle Ungar. 2023. Multilingual language models are not multicultural: A case study in emotion. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 202–214, Toronto, Canada. Association for Computational Linguistics.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, and 13 others. 2022. An empirical analysis of compute-optimal large language model
training. Advances in Neural Information Processing Systems, 35:30016–30030.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, and 9 others. 2023. Mistral 7B.
Rebecca L. Johnson, Giada Pistilli, Natalia Menédez-González, Leslye Denisse Dias Duran, Enrico Panai, Julija Kalpokiene, and Donald Jay Bertulfo. 2022. The Ghost in the Machine has an American accent: Value conflict in GPT-3.
Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein. 2021. Noise: A Flaw in Human Judgment. Little, Brown.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs, stat].
Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. 2025. Randomness, not representation: The unreliability of evaluating cultural alignment in LLMs. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT '25, pages 2151–2165, New York, NY, USA. Association for Computing Machinery.
Sal Khan. 2023. Harnessing GPT-4 so that all students benefit. A nonprofit approach for equal access!
Khyati Khandelwal, Manuel Tonneau, Andrew M. Bean, Hannah Rose Kirk, and Scott A. Hale. 2024. Indian-BhED: A dataset for measuring India-centric biases in large language models. In Proceedings of the 2024 International Conference on Information Technology for Social Good, pages 231–239, Bremen, Germany. ACM.
Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, and Scott A. Hale. 2023. The
past, present and better future of feedback learning in large language models for subjective human preferences and values. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2409–2430, Singapore. Association for Computational Linguistics.
Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A. Hale. 2024a. The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, 6(4):383–392.
Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Michael Bean, Katerina Margatina, Rafael Mosquera, Juan Manuel Ciro, Max Bartolo, Adina Williams, and 3 others. 2024b. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, and 43 others. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, and 27 others. 2023. Specific versus General Principles for Constitutional AI.
Steven G. Luke. 2017. Evaluating significance in linear mixed-effects models in R. Behavior Research Methods, 49(4):1494–1502.
Reem Masoud, Ziquan Liu, Martin
Ferianc, Philip C. Treleaven, and Miguel Rodrigues Rodrigues. 2025. Cultural alignment in large language models: An explanatory analysis based on Hofstede's cultural dimensions. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8474–8503, Abu Dhabi, UAE. Association for Computational Linguistics.
Stefania Milan and Emiliano Treré. 2019. Big data from the south(s): Beyond data universalism. Television & New Media, 20(4):319–335.
Dan Milmo. 2023. ChatGPT reaches 100 million users two months after launch. The Guardian.
Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, and Luciano Floridi. 2024. Auditing large language models: A three-layered approach. AI and Ethics, 4(4):1085–1115.
Shangrui Nie, Michael Fromm, Charles Welch, Rebekka Görge, Akbar Karimi, Joan Plepi, Nazia Mowmita, Nicolas Flores-Herr, Mehdi Ali, and Lucie Flek. 2024. Do multilingual large language models mitigate stereotype bias? In Proceedings of the 2nd Workshop on Cross-cultural Considerations in NLP, pages 65–83, Bangkok, Thailand. Association for Computational Linguistics.
Dan Nielsen. 2023. ScandEval: A benchmark for Scandinavian natural language processing. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 185–201, Tórshavn, Faroe Islands. University of Tartu Library.
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, and 31 others. 2025. 2 OLMo 2 furious.
OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, and 272 others. 2024a. GPT-4 technical report.
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, A. J. Ostrow, Akila Welihinda, and 410 others. 2024b. GPT-4o system card.
OpenAI, Ananya Kumar, Jiahui Yu, John Hallman, and Michelle Pokrass. 2025. Introducing GPT-4.1.
Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The impact of AI on developer productivity: Evidence from GitHub Copilot.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.
Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, and 3 others. 2022. Characteristics of harmful text: Towards rigorous benchmarking of language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, pages 24720–24739, Red Hook, NY, USA. Curran Associates Inc.
Paul Röttger, Musashi Hinck, Valentin Hofmann, Kobi Hackenburg, Valentina Pyatkin, Faeze Brahman, and Dirk Hovy. 2025. IssueBench: Millions of realistic prompts for measuring issue bias in LLM writing assistance.
Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schuetze, and Dirk Hovy. 2024. Political compass or spinning arrow? Towards more meaningful evaluations for values and opinions in large language models. In Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 15295–15311, Bangkok, Thailand. Association for Computational Linguistics.
David Rozado. 2024. The political preferences of LLMs. PLOS One, 19(7):e0306621.
Sarah Schröder, Thekla Morgenroth, Ulrike Kuhl, Valerie Vaquet, and Benjamin Paaßen. 2025. Large language models do not simulate human psychology.
Murray Shanahan. 2024. Talking about large language models. Commun. ACM, 67(2):68–79.
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, and 9 others. 2023. Towards understanding sycophancy in language models. In The Twelfth International Conference on Learning Representations.
Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell L. Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, and 3 others. 2024. Position: A roadmap to pluralistic alignment. In Proceedings of the 41st International Conference on Machine Learning, pages 46280–46302. PMLR.
Jinzhe Tan, Hannes Westermann, and Karim Benyekhlef. 2023. ChatGPT as an artificial lawyer? In AI4AJ@ICAIL.
Yan Tao, Olga Viberg, Ryan S Baker, and René F Kizilcec. 2024. Cultural bias and cultural alignment of large language models. PNAS Nexus, 3(9):pgae346.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, and 5 others. 2023. LLaMA: Open and Efficient Foundation Language Models.
Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari,
Shivalika Singh, Hui-Lee Ooi, and 8 others. 2024. Aya model: An instruction finetuned open-access multilingual language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15894–15939, Bangkok, Thailand. Association for Computational Linguistics.
Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, and 14 others. 2022. Taxonomy of Risks posed by Language Models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229, Seoul, Republic of Korea. ACM.
Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, and Isabelle Augenstein. 2024. LLM tropes: Revealing fine-grained values and opinions in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17085–17112, Miami, Florida, USA. Association for Computational Linguistics.
Jianhao Yan, Pingchuan Yan, Yulong Chen, Judy Li, Xianchao Zhu, and Yue Zhang. 2024. GPT-4 vs. Human translators: A comprehensive evaluation of translation quality across languages, domains, and expertise levels. CoRR.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, and 4 others. 2023a. LMSYS-chat-1M: A large-scale real-world LLM conversation dataset. In The Twelfth International Conference on Learning Representations.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, and 4 others. 2023b. Judging LLM-as-a-judge with MT-bench and
chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, and 6 others. 2023. LIMA: Less is more for alignment. In Thirty-Seventh Conference on Neural Information Processing Systems.
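Illustrative note on the noise argument in Section 4.1. The discussion there, following Kahneman et al. (2021), states that inconsistent responses can be as detrimental to accuracy as bias. The sketch below shows the standard decomposition of mean squared error into squared bias plus variance on toy numbers. It is a minimal illustration and not the authors' analysis code: the function name `mse_decomposition` and all values are hypothetical, and the "estimates" simply stand in for repeated measurements of one survey quantity.

```python
from statistics import fmean, pvariance


def mse_decomposition(estimates, truth):
    """Split the mean squared error of repeated estimates into bias^2 and noise.

    `estimates` are repeated measurements of one quantity (hypothetical here,
    e.g. a model answering the same survey item several times); `truth` is the
    reference value (e.g. a population mean from survey data).
    """
    mean_estimate = fmean(estimates)
    bias = mean_estimate - truth
    # Noise is the population variance of the estimates around their own mean.
    noise = pvariance(estimates, mu=mean_estimate)
    mse = fmean([(e - truth) ** 2 for e in estimates])
    return {"bias_sq": bias ** 2, "noise": noise, "mse": mse}


# Toy data (assumed values): one responder is consistent but biased,
# the other is unbiased on average but inconsistent.
truth = 3.0
biased = [4.0, 4.0, 4.0, 4.0]  # always off by one, never varies
noisy = [2.0, 4.0, 2.0, 4.0]   # correct on average, off by one each time

for name, sample in [("biased", biased), ("noisy", noisy)]:
    print(name, mse_decomposition(sample, truth))
```

Both toy responders end up with the same mean squared error: for the first it comes entirely from bias, for the second entirely from noise. This is the sense in which low self-consistency can limit an alignment analysis as much as a systematic offset does, and why larger samples and consistency controls are a natural response.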