The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It
Summary
This paper analyzes the linguistic diversity of Large Language Model (LLM) safety research, revealing a significant and widening English-centric bias. Through a systematic review of nearly 300 publications from 2020 to 2024 across major *ACL venues, the authors find that English-only research dominates, with even high-resource non-English languages like Mandarin receiving minimal attention. Non-English languages are rarely studied in isolation, often appearing only in broad multilingual evaluations that lack cultural depth. Furthermore, half of English safety studies fail to explicitly document their language coverage, implying a false universality. The authors argue that safety mechanisms do not generalize across languages due to cultural nuances, such as varying definitions of toxicity or taboo. Current evaluation metrics relying on averages obscure critical safety failures in specific languages, potentially deploying unsafe models globally. To address these gaps, the paper proposes three future directions: developing culturally grounded evaluation benchmarks that account for code-switching and local linguistic patterns; generating diverse, culturally contextualized synthetic training data using frameworks like Constitutional AI; and investigating crosslingual safety generalization through mechanistic interpretability and influence analysis. The study concludes that bridging this linguistic divide is essential for creating robust, inclusive AI safety practices for diverse global populations.
arXiv:2505.24119v1 [cs.CL] 30 May 2025
The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It
Zheng-Xin Yong1, Beyza Ermis2, Marzieh Fadaee2, Stephen H. Bach1, and Julia Kreutzer2
1Brown University, 2Cohere Labs
Corresponding authors: Zheng-Xin Yong (contact.yong@brown.edu), Julia Kreutzer (juliakreutzer@cohere.com)
Abstract
This paper presents a comprehensive analysis of the linguistic diversity of LLM safety research, highlighting the English-centric nature of the field. Through a systematic review of nearly 300 publications from 2020–2024 across major NLP conferences and workshops at *ACL, we identify a significant and growing language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention. We further observe that non-English languages are rarely studied as a standalone language and that English safety research exhibits poor language documentation practice. To motivate future research into multilingual safety, we make several recommendations based on our survey, and we then pose three concrete future directions on safety evaluation, training data generation, and crosslingual safety generalization. Based on our survey and proposed directions, the field can develop more robust, inclusive AI safety practices for diverse global populations.
Content Warning: This paper contains examples of harmful language.
1 Introduction
The rapid advancement of large language models (LLMs) has transformed the artificial intelligence landscape, enabling increasingly sophisticated capabilities across domains including healthcare [Singhal et al., 2023; Nazi & Peng, 2024; Singhal et al., 2025], education [Neumann et al., 2024; Zhang et al., 2024b; Wen et al., 2024], and media content generation [Wang et al., 2023; Zhang et al., 2024a; Barman et al., 2024]. As these powerful systems are deployed globally and used across different linguistic communities [Tamkin et al., 2024], ensuring their
safe and secure operation across diverse linguistic and cultural contexts has emerged as a critical research imperative. While significant progress has been made in developing safety mechanisms for high-resource languages [Shi et al., 2024a; Dong et al., 2024], particularly English, the multilingual dimensions of LLM safety remain considerably underexplored. For example, all the public safety evaluation datasets reviewed by Dong et al. [2024] include English content, with only two datasets being bilingual (English and Chinese). This gap creates potentially dangerous blind spots in our safety frameworks and raises fundamental questions about the equitable distribution of AI benefits and risks [Yong et al., 2023a; Ermis et al., 2024; Aakanksha et al., 2024; Kanepajs et al., 2024; Bengio et al., 2025; Peppin et al., 2025].

Released as a preprint on June 10, 2025

Table 1: Taxonomy for our LLM safety survey study.
| Categories | Definitions | Examples |
|---|---|---|
| Jailbreaking attacks | Work on designing adversarial prompts to bypass refusal safety guardrails or detecting jailbreaking attacks | Zeng et al. [2024], Wang et al. [2024c] |
| Toxicity and bias | Work on toxic content and stereotypical bias in training data and output generations | Zhu et al. [2024], Kim et al. [2024] |
| Factuality and hallucination | Work on nonsensical, unfaithful, and factually incorrect content generated by LLMs | Pal & Sankarasubbu [2024] |
| AI privacy | Work on memorization, private data leakage, and unlearning | Dou et al. [2024], Shi et al. [2024b] |
| Policy | Work on governance frameworks, regulatory approaches, and ethical guidelines for responsible AI deployment | Goanta et al. [2023] |
| LLM alignment | Work that spans multiple subtopics above or is related to other LLM safety subtopics such as RLHF alignment algorithms | Wang et al. [2024d], Yang et al. [2024b] |
| Not related to safety | Work that does not belong to any of the topics above | Manino et al. [2022] |

Multilingual LLM safety encompasses challenges that extend well beyond the simple translation of existing safety techniques. Languages differ not only in their vocabulary and grammatical structures but also in their cultural connotations [Hoijer, 1954; Jiang, 2000; Everett, 2012; Kramsch, 2014; Mazari & Derraz, 2015], metaphorical expressions [Saygin, 2001; Khoshtab et al., 2025], taboos [Dewaele, 2004], and social norms [Sridhar, 1996; Baquedano-López & Kattan, 2007; Fasya & Sari, 2021]. Therefore, content that is harmless in one cultural context may be deeply offensive or harmful in another [Keipi et al., 2016; Ermis et al., 2024; Aakanksha et al., 2024; Korre et al., 2025], or vice versa. For instance, in South-East Asia, the term "banana" (which connotes "yellow on the outside, white on the inside") is used to disparage people of Asian descent who are perceived as forgoing their cultural identity and having adopted Western cultural values and behaviors [Khoo, 2003; Trieu, 2019]. On the other hand, the Chinese word 屌, which literally translates as "dick", can be used in both offensive (i.e., swear words) and non-offensive (i.e., an adjective to praise someone who possesses a remarkable talent) settings [Carson & Jiang, 2021]. The wide disparity in language resources [Joshi et al., 2020; Nigatu et al., 2024], from high-resource languages like English, Mandarin, and Spanish to
thousands of low-resource languages, creates uneven safety landscapes with potentially severe consequences for marginalized linguistic communities. Several commercial LLMs have demonstrated significantly weaker safety performance when prompted in non-English languages, producing harmful content and undesirable outputs that would be filtered in English contexts [Yong et al., 2023a; Deng et al., 2024; Wang et al., 2024a; Al Ghanim et al., 2024; Yoo et al., 2024; He et al., 2024; Shen et al., 2024; Nigatu & Raji, 2024; Poppi et al., 2025; Aakanksha et al., 2024; Jain et al., 2024; Chan et al., 2025]. These disparities in safety protections, combined with increasingly capable LLMs, risk magnifying societal harms within multilingual communities. While companies behind frontier LLMs have taken concerted efforts to perform multilingual safety alignment training and red-teaming [Grattafiori et al., 2024; Cohere et al., 2025; OpenAI, 2025], these initiatives remain limited in scope. For instance, among the top-ranking LLMs on Chatbot Arena (a widely used leaderboard platform for evaluating LLMs through user-submitted preference), 20 of 24 of those that provide a system report have wide multilingual support, but only 5 reported multilingual safety alignment training and red-teaming efforts. This gap between multilingual deployment capabilities and safety alignment calls for further participation from both private enterprises and academia on multilingual safety alignment.

Figure 1: Trends of English-only and multilingual LLM safety publications in *ACL conferences and workshops over the past five years: the language gap in LLM safety research widens. (Bar chart; x-axis: Year, y-axis: Number of Publications. Bar values are listed below.)
| Year | English Only | Monolingual Non-English + Multilingual |
|---|---|---|
| 2020 | 6 | 1 |
| 2021 | 8 | 3 |
| 2022 | 11 | 2 |
| 2023 | 26 | 8 |
| 2024 | 118 | 35 |

We perform a systematic review of nearly 300 LLM safety publications over the past five years in ACL proceedings (Section 2), and we uncover a concerning trend: the vast majority of safety research is centered on English-language models, while comparatively little work addresses safety in non-English or multilingual contexts. This imbalance has become more pronounced over time. Even Mandarin Chinese, the second most studied language, still has about ten times less research than English. This disparity persists across multiple subdomains of safety research. Furthermore, non-English languages are rarely studied as a standalone language but rather as part of broader multilingual evaluations, which often lack the nuance and depth necessary to address language-specific safety challenges and cultural contexts. Lastly, we discover that only half of English safety research publications document the limitations of their language coverage. These findings highlight critical gaps in the current landscape of LLM safety research and motivate the need for more targeted efforts to address multilingual safety concerns.
To help close this gap, we outline three tractable directions for future multilingual safety work: (1) developing culturally grounded evaluation benchmarks, (2) curating diverse multilingual safety training data, and (3) deepening our understanding of alignment challenges across languages.
2 The Language Gap in LLM Safety Research
To understand the language gap in LLM safety
research, we systematically survey relevant papers and analyze how safety research is distributed across languages and subtopics, as well as how non-English language research is conducted and reported.
2.1 Methodology
We collect work related to LLM safety and manually annotate the languages studied in each paper, along with their safety subtopic. To reduce human annotation efforts while ensuring that our findings reflect the overall trends in the field, we adopt the following strategies:
1. Venue selection: We focus on all *ACL venues such as ACL and EMNLP, including both conferences and workshops, as we believe they are the venues with the most linguistically diverse NLP works compared to other venues such as ICLR, NeurIPS, and ICML.
2. Keyword filter: We filter the safety-related publications through keyword matching with the words "safe" and "safety" in paper abstracts. Using these two terms, we get a good proxy for the distribution of diverse LLM safety literature.
3. Manual categorization: We adopt a simplified taxonomy following Cui et al. [2024], which is representative of the type of safety work published at *ACL, and we manually categorize publications into seven different subtopics as shown in Table 1.
4. Language Documentation: We annotate the languages that each work addresses [1], and we indicate if the language(s) studied are mentioned in the work. We group them into three categories: monolingual English, monolingual non-English, and multilingual (covering two or more languages).
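The keyword filter (step 2) and the three-way language grouping (step 4) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the `Paper` record, the function names, and the sample abstracts are hypothetical, and whole-word, case-insensitive matching is an assumption, since the text only says that abstracts were matched against the words "safe" and "safety".

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumption: whole-word, case-insensitive matching on "safe" / "safety".
SAFETY_KEYWORDS = re.compile(r"\b(?:safe|safety)\b", re.IGNORECASE)


@dataclass
class Paper:
    """Hypothetical record for one surveyed publication."""
    title: str
    abstract: str
    # Language codes filled in by manual annotation (e.g. "eng", "zho").
    languages: List[str] = field(default_factory=list)


def matches_keyword_filter(paper: Paper) -> bool:
    """Step 2: keep a paper if its abstract mentions 'safe' or 'safety'."""
    return SAFETY_KEYWORDS.search(paper.abstract) is not None


def language_category(languages: List[str]) -> str:
    """Step 4: group annotated languages into the three survey categories."""
    unique = set(languages)
    if not unique:
        raise ValueError("languages must be annotated before grouping")
    if len(unique) >= 2:
        return "multilingual"
    return "monolingual English" if unique == {"eng"} else "monolingual non-English"


# Made-up sample records, for illustration only.
papers = [
    Paper("A", "We study the safety of dialogue models.", ["eng"]),
    Paper("B", "A parser for low-resource treebanks.", ["swa"]),
    Paper("C", "Are multilingual chatbots safe to deploy?", ["eng", "zho", "ara"]),
]
candidates = [p for p in papers if matches_keyword_filter(p)]
for p in candidates:
    print(p.title, language_category(p.languages))
```

With whole-word matching, an abstract that only contains "unsafe" would not pass the filter; a substring match would be the more permissive alternative. Either way, the keyword match is only a first pass, which is why the manual categorization in step 3 is still needed to discard papers unrelated to LLM safety.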
Table 2: Average and standard deviation of agreement between four pairs of annotators. Agreement on "language coverage" is measured with Jaccard similarity, and all other categories are measured with Cohen's κ.
| Annotation Task | Type | Avg | Std |
|---|---|---|---|
| Safety topic | Categorical | 0.83 | 0.19 |
| Has non-English? | Binary | 0.81 | 0.15 |
| Specifies languages? | Binary | 0.80 | 0.04 |
| Covered languages | List | 0.96 | 0.05 |

Annotations were manually performed by the authors. In total, we annotated nearly 300 publications from 2020 to 2024. Of these, 28% were false positives from our keyword matching process (i.e., unrelated to LLM safety), and were filtered out before we performed further analysis [2].
Table 2 reports the mean and standard deviation of pairwise inter-annotator agreement scores on subsets of 20 repeated annotations. We perform a 4 × 20 pairwise agreement study across distinct subsets to maximize the representativeness of our survey corpus and ensure robust assessment of annotation consistency. We find that inter-annotator agreement is consistently high, between 0.80 and 0.96 on average per category, but we note that the annotations may still contain imperfections.
2.2 Findings
English-centricity of LLM safety research. Figure 1 highlights a stark language imbalance in LLM safety research published at *ACL conferences and workshops over the past five years. The data reveals a clear English-centric pattern that has persisted throughout this period. English-only research dominates across all years, with a particularly dramatic increase in recent publications.

[1] If the languages studied were not explicitly mentioned, we followed up on their training and evaluation datasets to identify the language coverage of the work.
[2] We release our annotations on
https://huggingface.co/CohereLabsCommunity/multilingual_safety_survey2025.

Figure 2: Measure of how often a language is studied ("Frequency") and the average number of languages covered by all papers in which the language appears ("Paper's Average Multilinguality"). (Scatter plot; x-axis: Paper's Average Multilinguality, y-axis: Frequency. Labeled points: eng, zho, tha, ara, guj, kan, vie, swe, ita, fra, por, spa, rus, hin, kor, tur, ukr, zul, gla, heb, ben, deu, jpn, ind, mar, nld, swa, tel, afr, ast, cym, ell, fas, hrv, lit, oci, tgl, fin, pol, cat, bul, nep, egy.)

The trend shows consistent underrepresentation of multilingual non-English research, with the gap widening significantly over time. While both categories have grown as LLM safety has gained prominence, the proportional imbalance remains. English-only publications have consistently outnumbered multilingual and non-English work, and this absolute gap has widened over time, from 5 in 2020 to 83 in 2024. While both categories have grown, the increase is disproportionately concentrated in English-only research.
Non-English languages are studied in herds. Another aspect of the marginalization of non-English languages is that they are often addressed as part of large multilingual evaluations, rather than studied in depth on their own. In many cases, breadth is prioritized over depth, and multilingual studies are preferred over focused analyses of monolingual ones [3]. This is shown in Figure 2, which provides a detailed breakdown of how frequently a language is studied (y-axis) and how often it is studied alongside other languages (x-axis). English (eng) exhibits overwhelming dominance with a frequency nearly ten
times higher than Chinese (zho), the second most studied language. However, English is primarily studied in isolation, resulting in a low average multilinguality score. In contrast, languages with moderate representation like Chinese (zho), Arabic (ara) and Spanish (spa) appear primarily in multilingual studies, suggesting that deeper, language-specific safety analyses remain limited even for widely spoken languages. This trend is even more noticeable for under-resourced languages such as Swahili (swa) and Telugu (tel), and especially for languages at the extreme end of the multilingualism spectrum such as Afrikaans (afr), which appears only in a single paper that covers approximately 30 languages [Guerreiro et al., 2023]. Such inclusion severely limits the possibility for language-specific safety analysis and gaining meaningful insights. We commend focused analysis on individual lower-resource languages such as Nakov et al. [2021] and Niraula et al. [2021], who specifically study disinformation and offensive language detection in Bulgarian and Nepali social media, respectively.

[3] Since our study only captures published papers, we might be missing out on rejected works. There may be a reviewer preference for multilingual over monolingual non-English papers.

Figure 3: Distribution of LLM safety publications by (a) safety subtopics and (b) publication venues. (Panel (a), topic distribution: bar chart of Count per subtopic (LLM alignment, Jailbreaking attacks, Toxicity and bias, Hallucination and factuality, Privacy, Policy), split into Monolingual (English), Monolingual (Non-English), and Multilingual. Panel (b), venue distribution: Proportion (%) of conference and workshop papers per language category, listed below in the order the values appear in the extracted figure.)
| Category | Conferences | Workshops |
|---|---|---|
| Multilingual | 80.8% | 19.2% |
| Monolingual (Non-English) | 75.0% | 25.0% |
| Monolingual (English) | 83.0% | 17.0% |

Disparities in subtopics of safety. Breaking down LLM safety publications by specific safety subtopics in Figure 3(a), we find that English-centricity persists across all domains, with English-only publications substantially outnumbering multilingual work in every category. LLM alignment and jailbreaking attacks demonstrate the most pronounced disparities, suggesting that these critical safety areas receive particularly limited cross-linguistic attention. In particular, LLM alignment work involving evaluation [Yuan et al., 2024; Hua et al., 2024; Hammoud et al., 2024; Gabriel et al., 2024] and algorithmic improvement [Zhou et al., 2024; Hassan et al., 2024] would benefit from further research with expanded language coverage. Toxicity and bias research shows a similar pattern despite being a domain where cultural and linguistic variations are especially relevant [Costa-jussà et al., 2023b; Tao et al., 2024; Devinney et al., 2024; Bhutani et al., 2024]. The near absence of multilingual work in privacy and policy domains indicates these emerging safety concerns are being conceptualized almost exclusively through an English-language framework, potentially overlooking important cultural and legal variations that exist across different linguistic contexts [Larsen & Dignum, 2024].
Valuable role of workshops. Figure 3 (b) reveals an interesting pattern in the distribution of LLM safety publications across venue types. While conferences dominate across all language categories, monolingual non-English safety papers are 46% relatively more likely
to appear in workshops than English-only papers, highlighting the valuable accessibility that workshops offer for this line of work. This suggests that non-English safety research faces a higher barrier to entry at prestigious conferences, whereas workshops, such as Workshop on Gender Bias in Natural Language Processing (GeBNLP) and Workshop on Safety for Conversational AI (Safety4ConvAI), serve as more accessible venues for disseminating non-English safety research. The pattern indicates that, beyond the overall English-centricity of safety research documented in previous figures, additional structural factors may be affecting how non-English safety work is evaluated and disseminated within the community.
Language documentation practice differs for English-only research. We argue that it is important for LLM safety research to explicitly document the languages studied (also known as Bender's rule [Bender, 2011; 2019]) for two key reasons. (1) Safety alignment does not necessarily generalize across languages [Yong et al., 2023a; Wang et al., 2024b; Yoo et al., 2024; Al Ghanim et al., 2024]. Clearly stating which languages were included enables future researchers to understand the specific linguistic contexts in which safety findings have been validated. (2) By explicitly acknowledging language limitations, the field can more accurately measure progress in expanding safety coverage across languages, thus encouraging a more equitable distribution of safety research to serve a broader range of global populations.

Table 3: Proportion of language documentation practice among LLM safety publications. The "No" and "Yes" columns answer the question "Does the paper mention languages studied?"
| Category | No | Yes |
|---|---|---|
| Mono. English | 50.6% | 49.4% |
| Mono. Non-English | 0.0% | 100.0% |
| Multilingual | 0.0% | 100.0% |

Based on the data presented in Table 3, we observe substantially different patterns in language documentation practices across LLM safety publications. English-only publications show a concerning trend, with 50.6% failing to explicitly name the language studied; in other words, "English" is not mentioned anywhere in the paper. In contrast, both non-English monolingual and multilingual publications demonstrate full compliance, with 100% explicitly documenting the languages studied. This disparity highlights a systematic bias in reporting practices, where English-centered research often proceeds under an implicit assumption of universality, whereas non-English research demonstrates greater methodological transparency.
2.3 Moving Forward for *ACL Venues
Our survey reveals that English safety research remains overwhelmingly dominant in nearly every dimension: publication volume, topical coverage, methodological reporting, and conference visibility. Nonetheless, Figure 1 shows an encouraging trend of growing multilingual safety research over time.
One concrete and low-effort step toward improving documentation is integrating language coverage reporting into *ACL proceedings. OpenReview submissions already include a metadata field where authors can indicate the languages studied, but this information is currently private. Making this metadata public would allow for more transparent tracking of linguistic representation and support future meta-analyses of multilingual research, particularly in the context of LLM safety. Addressing
the deeper structural imbalance in language and topic representation will require long-term efforts. We believe that conference and workshop organizers can provide incentive structures to address this systemic imbalance, such as special conference theme tracks dedicated to multilingual safety subtopics and/or creating shared workshop tasks on multilingual safety benchmarks. These initiatives could meaningfully expand the scope and visibility of research beyond English, helping the community better serve diverse user populations.
3 Future Research Directions for Multilingual LLM Safety
In addition to providing recommendations to *ACL organizers, we propose several key research priorities for researchers and model developers to advance multilingual LLM safety alignment.
3.1 Safety Evaluation for Multilingual Models
Moving beyond average safety criterion. Traditional evaluation metrics focus on average performance across languages, for which the model that maximizes the uniformly weighted average across tasks and languages is considered best. However, this criterion is susceptible to outliers (e.g., due to unsupported languages) and not suitable for comparing models with different language and task support [Kreutzer et al., 2025]. In the context of multilingual safety, where reporting average scores is the norm [Guan et al., 2024], this matters even more since averaging might obscure critical safety failures.

Table 4: Harmlessness scores of different models across 10 languages, based on the results from [Wang et al., 2024a]. We augment the original table with a new "Worst Case" column for the lowest harmlessness score. We use bold text to indicate the cases where average score is not necessarily aligned with worst-case score, and we use red text and exclamation mark to indicate how not reporting Worst-Case score can create a false sense of safety.
| Models | en | zh | fr | ru | de | ar | hi | es | ja | bn | Average ↑ | Worst Case ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT [OpenAI, 2022] | 99.0 | 91.9 | 86.3 | 87.5 | 85.3 | 90.8 | 81.7 | 91.5 | 79.0 | 62.6 | 85.56 | 62.6 |
| PaLM-2 [Anil et al., 2023] | 89.7 | 78.4 | 84.6 | 85.9 | 83.6 | 82.6 | 83.0 | 85.7 | 70.1 | 78.1 | 82.17 | 70.1 |
| Llama-2 [Touvron et al., 2023] | 85.4 | 73.5 | 83.2 | 82.3 | 82.0 | - | 63.5 | 79.3 | 71.0 | - | 77.53 | 63.5 |
| Vicuna [Chiang et al., 2023] | 94.0 | 89.4 | 90.6 | 83.3 | 88.3 | 43.4 | 36.8 | 88.8 | 60.2 | 18.4 | 69.32 | 18.4 (!) |

To illustrate this blind spot, we add a worst-case harmlessness score metric to the results of an ACL 2024 paper [Wang et al., 2024a] and report the results in Table 4. The table reveals two findings. First, if the winner were chosen based solely on the highest average harmlessness score, it would be ChatGPT [OpenAI, 2022], with a score of 85.56. However, its worst-case score (i.e., the lowest harmlessness score across languages) is only 62.6. This is notably lower than the worst-case score of PaLM-2 (70.1), despite PaLM-2 [Anil et al., 2023] having a lower average score (82.17). This discrepancy highlights that strong average performance does not necessarily reflect robustness in the worst-case scenarios. Second, and more importantly, despite a high average harmlessness score, Vicuna's [Chiang et al., 2023] worst-case harmlessness score is just 18.4 due to unsafe behaviour in Bengali (bn). This suggests that relying on average metrics alone may create a false sense of safety, potentially leading to the deployment of models like
Vicuna in languages where they produce harmful content. In future work, we believe that, in addition to reporting worst-case performance to ensure that models meet fundamental safety thresholds across all languages, researchers should explore designing adaptive thresholding mechanisms that establish language-specific safety baselines according to their unique cultural contexts and user groups.
Wider language coverage in evaluation. We observe that current multilingual red-teaming practice mostly focuses on languages that models are finetuned on during post-pretraining processes, such as instruction-following and alignment finetuning [Üstün et al., 2024; Grattafiori et al., 2024]. Given that language contamination in pretraining can facilitate crosslingual transfer [Blevins & Zettlemoyer, 2022], it raises valid concerns about whether exempting certain languages from the safety evaluation of multilingual LLMs is justified. Language exemptions risk creating blind spots in safety assessments precisely where they might be most needed, as models can bypass safety guardrails when prompted in languages underrepresented in pretraining [Shen et al., 2024]. For instance, the Llama-3 model report presents red-teaming results for only eight languages (six of which are high-resource) [Grattafiori et al., 2024]. Yet the strong multilingual model has been adapted for languages not covered in its safety evaluation, such as Indonesian [Huang et al., 2024b]. We urge researchers to develop more sophisticated evaluation protocols that can detect and account for potential contamination and to issue disclaimers when safety alignment has not been conducted in certain
languages. This would help ensure that speakers of those languages are aware of potential risks. Such transparency would allow communities to make informed decisions about model deployment while encouraging greater accountability from developers to expand alignment efforts to underserved languages.
Incorporate diverse and natural linguistic patterns. We believe evaluating multilingual safety requires a fundamental shift away from treating evaluation as merely adding more languages to existing benchmarks; evaluations should instead incorporate linguistic patterns used by real-life speakers. One case study is code-switching, the communication pattern of alternating between languages within a single utterance [Nilep, 2006; Gardner-Chloros, 2009; Winata et al., 2023], which is shown to be able to jailbreak multilingual safety guardrails [Yoo et al., 2024; Yang et al., 2024a; Song et al., 2025]. Another example is Al Ghanim et al.'s [2024] discovery that while LLMs remain safe in standardized Arabic scripts, they are jailbroken when Arabic inputs are written in Arabizi form, a system of writing Arabic using English characters that is commonly used among native speakers communicating digitally [Yaghan, 2008]. These examples show that current safety evaluation frameworks that predominantly evaluate languages in a monolingual setting fail to capture the complex reality of multilingual communication. Future work on multilingual red-teaming should develop a methodology that systematically accounts for diverse multilingual multi-turn interactions among users [Li et al., 2025] to ensure that models remain safe across the full spectrum of real-world
Future work on multilingual red-teaming should develop a methodology that systematically accounts for diverse multilingual multi-turn interactions among users [Li et al., 2025] to ensure that models remain safe across the full spectrum of real-world usage patterns rather than just in artificial monolingual test scenarios.

3.2 Culturally-Contextualized Synthetic Training Data

Collecting labeled training data for LLM safety alignment can be resource-intensive, and much English-centric research has turned to synthetic data generation [Bai et al., 2022; Kruschwitz & Schmidhuber, 2024; Samvelyan et al., 2024]. However, multilingual synthetic safety data remains relatively underexplored. Here, we propose two viable future research directions based on the constitutional AI framework [Bai et al., 2022; Kundu et al., 2023] for cultural contextualization [Qiu et al., 2025; Guo et al., 2025; Qiu et al., 2024; Yin et al., 2024].

LLM Generation. Under the constitutional AI framework, LLMs are first prompted to generate harmful (or harmless) texts. They are then presented with a set of human-written principles that capture culture-specific harms so that they can engage in a multi-turn process of critiquing and revising the originally harmful generations into harmless ones (or vice versa), creating culture-specific preference pairs for alignment training. Enabling constitutional AI for multilingual and multicultural alignment data generation requires close collaboration among linguists, cultural anthropologists, and AI researchers to co-create three key components:
(1) culturally-informed constitutional principles that reflect diverse value systems and ethical frameworks across different societies [Kirk et al., 2024; Pistilli et al., 2025];
(2) sufficiently capable multilingual LLMs that can both understand these principles and generate high-quality content in target languages [Qin et al., 2024; Huang et al., 2024a]; and
(3) evaluation protocols involving native speakers and cultural experts to validate both the constitutional principles and the resulting synthetic data [Kyrychenko et al., 2025].
This direction offers a pathway toward scalable, culturally grounded alignment practices that make LLM safety more inclusive and globally relevant.

Machine Translation. Machine translation (MT) often fails to capture or preserve culture-specific harms and may introduce undesirable societal biases such as gender stereotyping [Savoldi et al., 2021; Ahn et al., 2022; Wang et al., 2022; Costa-jussà et al., 2023a;c]. The iterative refinement process from the constitutional AI framework can detect and mitigate translation artifacts that might inadvertently encode harmful content or lose important cultural nuances. Unlike direct LLM generation, this approach can take advantage of decades of MT research, especially studies on cross-cultural adaptation [Maxwell et al., 1996; de Lima Barroso et al., 2018; Gorecki et al., 2014; Mbada et al., 2015; Pilz et al., 2014]. Future work should focus on developing automated methods to identify culture-specific safety issues that might be lost in translation, especially for languages with limited digital presence and linguistic resources.

3.3 Towards Understanding Crosslingual Safety Generalization

Most existing safety alignment data are centered on English or Chinese [Röttger et al., 2025; Costa-jussà et al., 2024; Plaza-del Arco et al., 2024].
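Returning briefly to the critique-and-revise procedure described under "LLM Generation" in Section 3.2, the following is a minimal Python sketch of how one preference pair could be assembled. It is an illustration under stated assumptions, not the pipeline of Bai et al. [2022]: the `LLM` callable interface, the `critique_and_revise` function, the prompt wording, the `chosen`/`rejected` keys, and the deterministic `stub_llm` stand-in are all hypothetical, and the principle text is a placeholder for principles that would be written with cultural experts.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def critique_and_revise(prompt: str, principles: List[str], llm: LLM) -> Dict[str, str]:
    """Build one preference pair by critiquing and revising an initial draft."""
    draft = llm(prompt)
    revised = draft
    for principle in principles:
        # Ask the model to critique the current response against one principle.
        critique = llm(
            f"Principle: {principle}\nResponse: {revised}\n"
            "Critique the response against the principle."
        )
        # Ask the model to revise the response in light of that critique.
        revised = llm(
            f"Response: {revised}\nCritique: {critique}\n"
            "Rewrite the response so that it follows the principle."
        )
    # The original draft is the dispreferred side; the revision is preferred.
    return {"prompt": prompt, "rejected": draft, "chosen": revised}


def stub_llm(text: str) -> str:
    """Deterministic stand-in so the sketch runs without any real model."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique"
    return "initial draft"


pair = critique_and_revise(
    "example user request",
    ["placeholder culture-specific principle"],
    stub_llm,
)
```

In practice `stub_llm` would be replaced by a call to a sufficiently capable multilingual model, and the resulting pairs would still need validation by native speakers, as the three components listed in Section 3.2 require.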
It is important to understand how safety alignment generalizes across languages so that model developers can anticipate potential failure modes when alignment training data lack language coverage.

Mechanistic interpretability. This scientific approach of reverse-engineering neural networks to understand precisely how they process information at the circuit and component levels allows researchers to characterize the mechanisms that enable or prevent the transfer of safety alignment knowledge. We believe this research direction is particularly helpful in explaining several phenomena, such as why detoxification and debiasing can transfer effectively across languages [Li et al., 2024; Reusens et al., 2023] but refusal training does not [Shen et al., 2024; Aakanksha et al., 2024; Wang et al., 2025], or to what extent safety alignment is preserved after language adaptation to underrepresented languages [Yong et al., 2023b; Lin et al., 2024; Ji et al., 2024]. Insights from this research direction can inspire novel training techniques that facilitate zero-shot crosslingual generalization of alignment training and maintain safety consistency as language coverage expands.

Training Data Influence Analysis. We also recommend exploring the use of influence functions [Grosse et al., 2023; Ruis et al., 2025] to study crosslingual alignment. This technique enables researchers to trace how specific training examples causally affect model behavior during generation. Training data influence analysis offers a valuable complement to mechanistic approaches for investigating two key open questions. For crosslingual generalization, it can help quantify how safety-relevant examples, especially those from high-resource versus low-resource languages, contribute to harmful or aligned outputs.
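As a rough illustration of quantifying per-language contributions, the sketch below aggregates a simplified influence score by training language. It does not implement the influence-function estimator of Grosse et al. [2023], which involves an inverse-Hessian term; it swaps in a first-order gradient dot product on a toy logistic model as a stand-in. The function names, the language tags, and the data are hypothetical and made up for illustration.

```python
import math
from collections import defaultdict


def loss_gradient(weights, features, label):
    """Gradient of the logistic loss for one example: (sigmoid(w.x) - y) * x."""
    logit = sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return [(prob - label) * x for x in features]


def influence_by_language(weights, train_examples, test_features, test_label):
    """Sum gradient dot products between each training example and one test example.

    A positive score means a gradient step on the training example would,
    to first order, reduce the loss on the test example.
    """
    test_grad = loss_gradient(weights, test_features, test_label)
    scores = defaultdict(float)
    for language, features, label in train_examples:
        train_grad = loss_gradient(weights, features, label)
        scores[language] += sum(a * b for a, b in zip(train_grad, test_grad))
    return dict(scores)


# Toy setup (assumption): two features, zero-initialized weights, made-up examples.
weights = [0.0, 0.0]
train_examples = [
    ("en", [1.0, 0.0], 1),
    ("en", [1.0, 1.0], 1),
    ("id", [0.0, 1.0], 1),
]
scores = influence_by_language(weights, train_examples, [1.0, 0.0], 1)
```

Here the two toy `en` examples share a feature with the test example and receive a combined score of 0.5, while the `id` example shares none and scores 0.0; with a real model the same aggregation would be applied to per-example influence estimates instead.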
For language adaptation, influence functions can identify problematic documents within the continued pretraining corpus, enabling more targeted curation of safer language-specific data. To our knowledge, there is currently very limited work on analyzing training-example-to-output relationships for multilingual safety-relevant behaviors. This presents a promising and underexplored direction for improving alignment practices across languages.

4 Related Work and Discussion

Our work contrasts with prior survey literature on multilingual NLP [Joshi et al., 2020; Pamungkas et al., 2023; Yadav & Sitaram, 2022; Winata et al., 2023; Huang et al., 2024a; Qin et al., 2024; Wu et al., 2025] by focusing on LLM safety. The limitations we identify align with concerns raised by Blasi et al. [2022] regarding systematic inequalities in language technology, which privileges certain sociolinguistic groups through choices in data collection, annotation protocols, and evaluation. Our findings suggest these inequalities may be even more pronounced in safety research, where cultural and linguistic nuances significantly impact harm and mitigation strategies.

Recent efforts to catalog LLM safety research challenges [Barez et al., 2025; Debar et al., 2024; Anwar et al., 2024] have primarily centered on threats identified through English-language models, often overlooking multilingual aspects.
This gap, along with our survey findings, echoes the "square-one bias" phenomenon [Ruder et al., 2022]: when NLP researchers move beyond optimizing for usefulness (e.g., accuracy), their studies are often conducted along only a single dimension: safety, interpretability, or multilinguality. This siloed approach means that progress in one dimension rarely informs the others, resulting in a fragmented research landscape where multilingual LLM safety research remains underdeveloped.

5 Conclusion

Our analysis of nearly 300 publications (2020-2024) reveals a significant language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention and typically appearing only in multi-language studies that lack the depth of English-focused work. This linguistic imbalance potentially leaves language-specific risks undetected as LLMs are deployed globally. To address these disparities, we make recommendations to future conferences and highlight several critical future research directions.

Limitations

Coverage of venues. Due to the focus on *ACL venues, we might have missed relevant multilingual safety works that are either not (yet) peer reviewed or published in other venues, such as ML conferences and workshops. Since this is a very fast-moving field, the state of the field described in this paper represents a snapshot in time. We hope that an analysis like this, run a year from now, would paint a more optimistic picture.

Annotation accuracy. Inaccuracies in our annotations might have introduced imprecision in our measurements of the language gap. From our analysis of the inter-annotator agreement, we suspect that this would foremost affect the categorization of safety research topics, as the labels for these categories carry the most ambiguity.
When papers do not state language coverage prominently, such as in the abstract, introduction, or experimental setup, it might lead to oversight in the annotation (reducing recall), depending on how deeply an annotator reads the paper. However, we observe that works investigating multilinguality in LLM safety as a primary angle do state it explicitly, so we are confident we did not miss these.

Research directions. We highlight three prominent future directions for multilingual safety research in our work, but we believe there are many other directions that are equally important for advancing the safety and security of LLMs in global deployment. These include work on AI governance [Reuel et al., 2024], AI auditing [Birhane et al., 2024; Ojewale et al., 2025], hate speech detection [Nozza, 2021], multimodal AI safety [Dash et al., 2025; Ji et al., 2025], algorithmic designs [Zhao et al., 2025], and reasoning models [Yong et al., 2025; Guan et al., 2024], among others. Fundamentally, our work illuminates the substantial language disparity within current LLM safety research. Therefore, as researchers pursue diverse research directions on LLM safety, efforts on bridging this linguistic divide must remain central to ensuring equitable safeguards across the world's languages.

Acknowledgements and Disclosure

We are grateful for feedback from Hellina Hailu Nigatu, M Saiful Bari, Cristina Menghini, Alham Fikri Aji, Pedro Ortiz Suarez, Victor Ojewale, Simran Khanuja, Arianna Muti, Debora Nozza, Srishti Yadav, Jonas Kgomo, and Catherine Arnett on the early draft of our work. Disclosure: Stephen Bach is an advisor to Snorkel AI, a company that provides software and services for data-centric artificial intelligence.
References
Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, and Sara Hooker. The multilingual alignment prism: Aligning global and local preferences to reduce harm. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 12027–12049, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.671. URL https://aclanthology.org/2024.emnlp-main.671/.
Jaimeen Ahn, Hwaran Lee, Jinhwa Kim, and Alice Oh. Why knowledge distillation amplifies gender bias and how to mitigate from the perspective of DistilBERT. In Christian Hardmeier, Christine Basta, Marta R. Costa-jussà, Gabriel Stanovsky, and Hila Gonen (eds.), Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp. 266–272, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.gebnlp-1.27. URL https://aclanthology.org/2022.gebnlp-1.27/.
Mansour Al Ghanim, Saleh Almohaimeed, Mengxin Zheng, Yan Solihin, and Qian Lou. Jailbreaking LLMs with Arabic transliteration and Arabizi. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18584–18600, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1034. URL https://aclanthology.org/2024.emnlp-main.1034/.
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932, 2024.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
Patricia Baquedano-López and Shlomy Kattan. Growing up in a multilingual community: Insights from language socialization. Handbook of multilingualism and multilingual communication, 5:69–99, 2007.
Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, et al. Open problems in machine unlearning for ai safety. arXiv preprint arXiv:2501.04952, 2025.
Dipto Barman, Ziyi Guo, and Owen Conlan. The dark side of language models: Exploring the potential of llms in multimedia disinformation generation and dissemination. Machine Learning with Applications, pp. 100545, 2024.
Emily Bender. The #benderrule: On naming the languages we study and why it matters. The Gradient, 2019.
Emily M Bender. On achieving and evaluating language-independence in nlp. Linguistic Issues in Language Technology, 6, 2011.
Yoshua Bengio, Sören Mindermann, and Daniel Privitera. International ai safety report 2025. 2025.
Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, and Sunipa Dev. SeeGULL multilingual: a dataset of geo-culturally situated stereotypes. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 842–854, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-short.75. URL https://aclanthology.org/2024.acl-short.75/.
Abeba Birhane, Ryan Steed, Victor Ojewale, Briana Vecchione, and Inioluwa Deborah Raji. Ai auditing: The broken bus on the road to ai accountability. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 612–643. IEEE, 2024.
Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology performance across the world's languages. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5486–5505, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.376. URL https://aclanthology.org/2022.acl-long.376/.
Terra Blevins and Luke Zettlemoyer. Language contamination helps explains the cross-lingual capabilities of English pretrained models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3563–3574, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.233. URL https://aclanthology.org/2022.emnlp-main.233/.
Lorna Carson and Ning Jiang. Offensive words in chinese dialects. In An Anatomy of Chinese Offensive Words: A Lexical and Semantic Analysis, pp. 99–143. Springer, 2021.
Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, and Marzyeh Ghassemi. Speak easy: Eliciting harmful jailbreaks from llms with simple interactions. arXiv preprint arXiv:2502.04322, 2025.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, et al. Command a: An enterprise-ready large language model. arXiv preprint arXiv:2504.00698, 2025.
Marta Costa-jussà, Pierre Andrews, Eric Smith, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Daniel Licht, and Carleigh Wood. Multilingual holistic bias: Extending descriptors and patterns to unveil demographic biases in languages at scale. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14141–14156, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.874. URL https://aclanthology.org/2023.emnlp-main.874/.
Marta Costa-jussà, Pierre Andrews, Eric Smith, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Daniel Licht, and Carleigh Wood. Multilingual holistic bias: Extending descriptors and patterns to unveil demographic biases in languages at scale. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14141–14156, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.874. URL https://aclanthology.org/2023.emnlp-main.874/.
Marta Costa-jussà, Eric Smith, Christophe Ropers, Daniel Licht, Jean Maillard, Javier Ferrando, and Carlos Escolano. Toxicity in multilingual machine translation at scale. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9570–9586, Singapore, December 2023c. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.642. URL https://aclanthology.org/2023.findings-emnlp.642/.
Marta Costa-jussà, Pierre Andrews, Christine Basta, Juan Ciro, Agnieszka Falenska, Seraphina Goldfarb-Tarrant, Rafael Mosquera, Debora Nozza, and Eduardo Sánchez. Overview of the shared task on machine translation gender bias evaluation with multilingual holistic bias. In Agnieszka Faleńska, Christine Basta, Marta Costa-jussà, Seraphina Goldfarb-Tarrant, and Debora Nozza (eds.), Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp. 399–404, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.gebnlp-1.26. URL https://aclanthology.org/2024.gebnlp-1.26.
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, et al. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems. arXiv preprint arXiv:2401.05778, 2024.
Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, et al. Aya vision: Advancing the frontier of multilingual multimodality. arXiv preprint arXiv:2505.08751, 2025.
Bárbara Iansã de Lima Barroso, Cláudia Regina Cabral Galvão, Luiz Bueno da Silva, and Selma Lancman. A systematic review of translation and cross-cultural adaptation of instruments for the selection of assistive technologies. Occupational Therapy International, 2018(1):4984170, 2018.
Herve Debar, Sven Dietrich, Pavel Laskov, Emil C Lupu, and Eirini Ntoutsi. Emerging security challenges of large language models. arXiv preprint arXiv:2412.17614, 2024.
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=vESNKdEMGp.
Hannah Devinney, Jenny Björklund, and Henrik Björklund. We don't talk about that: Case studies on intersectional analysis of social bias in large language models. In Agnieszka Faleńska, Christine Basta, Marta Costa-jussà, Seraphina Goldfarb-Tarrant, and Debora Nozza (eds.), Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp. 33–44, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.gebnlp-1.3. URL https://aclanthology.org/2024.gebnlp-1.3/.
Jean-Marc Dewaele. The emotional force of swearwords and taboo words in the speech of multilinguals. Journal of multilingual and multicultural development, 25(2-3):204–222, 2004.
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for LLM conversation safety: A survey. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6734–6747, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.375. URL https://aclanthology.org/2024.naacl-long.375/.
Yao Dou, Isadora Krsek, Tarek Naous, Anubha Kabra, Sauvik Das, Alan Ritter, and Wei Xu. Reducing privacy risks in online self-disclosures with language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13732–13754, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.741. URL https://aclanthology.org/2024.acl-long.741/.
Beyza Ermis, Luiza Pozzobon, Sara Hooker, and Patrick Lewis. From one to many: Expanding the scope of toxicity mitigation in language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 15041–15058, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.893. URL https://aclanthology.org/2024.findings-acl.893/.
Daniel L Everett. Language: The cultural tool. Vintage, 2012.
Mahmud Fasya and Dini Gilang Sari. Sociocultural factors that determine language choice in a multilingual society. In Fifth International Conference on Language, Literature, Culture, and Education (ICOLLITE 2021), pp. 412–418. Atlantis Press, 2021.
Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, and Marzyeh Ghassemi. Can AI relate: Testing large language model response for mental health support. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 2206–2221, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.120. URL https://aclanthology.org/2024.findings-emnlp.120/.
Penelope Gardner-Chloros. Code-switching. Cambridge university press, 2009.
Catalina Goanta, Nikolaos Aletras, Ilias Chalkidis, Sofia Ranchordás, and Gerasimos Spanakis. Regulation and NLP (RegNLP): Taming large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8712–8724, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.539. URL https://aclanthology.org/2023.emnlp-main.539/.
Claudia Gorecki, Julia M Brown, Michelle Briggs, Suzanne Coleman, Carol Dealey, Elizabeth McGinnis, E Andrea Nelson, Nikki Stubbs, Lyn Wilson, and Jane Nixon. Language translation & cross-cultural adaptation guideline. Recommendations for language translation and cross-cultural adaption of the PU-QOL questionnaire, 2014.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296, 2023.
Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339, 2024.
Nuno M. Guerreiro, Duarte M. Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, and André F. T. Martins. Hallucinations in large multilingual translation models. Transactions of the Association for Computational Linguistics, 11:1500–1517, 2023. doi: 10.1162/tacl_a_00615. URL https://aclanthology.org/2023.tacl-1.85/.
Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, and Wei Xu. Care: Aligning language models for regional cultural awareness. arXiv preprint arXiv:2504.05154, 2025.
Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, and Mete Ozay. Model merging and safety alignment: One bad model spoils the bunch.
In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13033–13046, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.762. URL https://aclanthology.org/2024.findings-emnlp.762/.
Sabit Hassan, Anthony Sicilia, and Malihe Alikhani. Active learning for robust and representative LLM generation in safety-critical scenarios. In Sachin Kumar, Vidhisha Balachandran, Chan Young Park, Weijia Shi, Shirley Anugrah Hayati, Yulia Tsvetkov, Noah Smith, Hannaneh Hajishirzi, Dongyeop Kang, and David Jurgens (eds.), Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U), pp. 113–123, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.customnlp4u-1.10. URL https://aclanthology.org/2024.customnlp4u-1.10/.
Xuanli He, Jun Wang, Qiongkai Xu, Pasquale Minervini, Pontus Stenetorp, Benjamin IP Rubinstein, and Trevor Cohn. Tuba: Cross-lingual transferability of backdoor attacks in llms with instruction tuning. arXiv preprint arXiv:2404.19597, 2024.
Harry Ed Hoijer. Language in culture; conference on the interrelations of language and other aspects of culture. 1954.
Wenyue Hua, Xianjun Yang, Mingyu Jin, Zelong Li, Wei Cheng, Ruixiang Tang, and Yongfeng Zhang. TrustAgent: Towards safe and trustworthy LLM-based agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10000–10016, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.585. URL https://aclanthology.org/2024.findings-emnlp.585/.
Kaiyu Huang, Fengran Mo, Xinyu Zhang, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, et al. A survey on large language models with multilingualism: Recent advances and new frontiers. arXiv preprint arXiv:2405.10936, 2024a.
Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, and Ai Ti Aw. Meralion-textllm: Cross-lingual understanding of large language models in chinese, indonesian, malay, and singlish. arXiv preprint arXiv:2501.08335, 2024b.
Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, and Maarten Sap. Polyglotoxicityprompts: Multilingual evaluation of neural toxic degeneration in large language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=ootI3ZO6TJ.
Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models. arXiv preprint arXiv:2503.17682, 2025.
Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O’Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, et al. Emma-500: Enhancing massively multilingual adaptation of large language models. arXiv preprint arXiv:2409.17892, 2024.
Wenying Jiang. The relationship between culture and language. ELT journal, 54(4):328–334, 2000.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6282–6293, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.560. URL https://aclanthology.org/2020.acl-main.560/.
Arturs Kanepajs, Vladimir Ivanov, and Richard Moulange. Towards safe multilingual frontier AI. In Workshop on Socially Responsible Language Modelling Research, 2024. URL https://openreview.net/forum?id=iFHsnIkj4q.
Teo Keipi, Matti Näsi, Atte Oksanen, and Pekka Räsänen. Online hate and harmful content: Cross-national perspectives. Taylor & Francis, 2016.
Tseen-Ling Khoo. Banana Bending: Asian-Australian and Asian-Canadian Literatures. Hong Kong University Press, 2003.
Paria Khoshtab, Danial Namazifard, Mostafa Masoudi, Ali Akhgary, Samin Mahdizadeh Sani, and Yadollah Yaghoobzadeh. Comparative study of multilingual idioms and similes in large language models. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (eds.), Proceedings of the 31st International Conference on Computational Linguistics, pp. 8680–8698, Abu Dhabi, UAE, January 2025. Association for Computational Linguistics. URL https://aclanthology.org/2025.coling-main.580/.
Minbeom Kim, Jahyun Koo, Hwanhee Lee, Joonsuk Park, Hwaran Lee, and Kyomin Jung. LifeTox: Unveiling implicit toxicity in life advice. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 688–698, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-short.60. URL https://aclanthology.org/2024.naacl-short.60/.
Hannah Rose Kirk, Alexander Whitefield, Paul Rottger, Andrew M Bean, Katerina Margatina, Rafael Mosquera-Gomez, Juan Ciro, Max Bartolo, Adina Williams, He He, et al. The prism alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models. Advances in Neural Information Processing Systems, 37:105236–105344, 2024.
Katerina Korre, Arianna Muti, Federico Ruggeri, and Alberto Barrón-Cedeño. Untangling hate speech definitions: A semantic componential analysis across cultures and domains. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3184–3198, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025.findings-naacl.175/.
Claire Kramsch. Language and culture. AILA review, 27(1):30–55, 2014.
Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, and Kocmi Tom. Déjà vu: Multilingual llm evaluation through the lens of machine translation evaluation. arXiv preprint arXiv:2504.11829, 2025.
Udo Kruschwitz and Maximilian Schmidhuber. LLM-based synthetic datasets: Applications and limitations in toxicity detection. In Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, Bharathi Raja Chakravarthi, Bornini Lahiri, Siddharth Singh, and Shyam Ratan (eds.), Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024, pp. 37–51, Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.trac-1.6/.
Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, et al. Specific versus general principles for constitutional ai. arXiv preprint arXiv:2310.13798, 2023.
Yara Kyrychenko, Ke Zhou, Edyta Bogucka, and Daniele Quercia. C3ai: Crafting and evaluating constitutions for constitutional ai. In Proceedings of the ACM on Web Conference 2025, WWW ’25, pp. 3204–3218, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712746. doi: 10.1145/3696410.3714705. URL https://doi.org/10.1145/3696410.3714705.
Benjamin Larsen and Virginia Dignum. Ai value alignment: How we can align artificial intelligence with human values. World Economic Forum, 10 2024. URL https://weforum.org/stories/2024/10/ai-value-alignment-how-we-can-align-artificial-intelligence-with-human-values/.
Xiaochen Li, Zheng Xin Yong, and Stephen Bach. Preference tuning for toxicity mitigation generalizes across languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13422–13440, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.784. URL https://aclanthology.org/2024.findings-emnlp.784/.
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025.
Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André FT Martins, and Hinrich Schütze. Mala-500: Massive language adaptation of large language models. arXiv preprint arXiv:2401.13303, 2024.
Edoardo Manino, Julia Rozanova, Danilo Carvalho, Andre Freitas, and Lucas Cordeiro. Systematicity, compositionality and transitivity of deep NLP models: a metamorphic testing perspective. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2355–2366, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.185. URL https://aclanthology.org/2022.findings-acl.185/.
Beverley Maxwell, MO Martin, and DL Kelly. Translation and cultural adaptation of the survey instruments. Third international mathematics and science study (TIMSS) technical report, 1:159–169, 1996.
Abdelfattah Mazari and Naoual Derraz. Language and culture. International Journal of Humanities and Cultural Studies, 2(2):350–359, 2015.
Chidozie Emmanuel Mbada, Gafar Atanda Adeogun, Michael Opeoluwa Ogunlana, Rufus Adesoji Adedoyin, Adesanmi Akinsulore, Taofeek Oluwole Awotidebe, Opeyemi Ayodiipo Idowu, and Olumide Ayoola Olaoye. Translation, cross-cultural adaptation and psychometric evaluation of yoruba version of the short-form 36 health survey. Health and quality of life outcomes, 13:1–12, 2015.
Preslav Nakov, Firoj Alam, Shaden Shaar, Giovanni Da San Martino, and Yifan Zhang. COVID-19 in Bulgarian social media: Factuality, harmfulness, propaganda, and framing. In Ruslan Mitkov and Galia Angelova (eds.), Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 997–1009, Held Online, September 2021. INCOMA Ltd. URL https://aclanthology.org/2021.ranlp-1.113/.
Zabir Al Nazi and Wei Peng. Large language models in healthcare and medical domain: A review. In Informatics, volume 11, pp. 57. MDPI, 2024.
Alexander Tobias Neumann, Yue Yin, Sulayman Sowe, Stefan Decker, and Matthias Jarke. An llm-driven chatbot in higher education for databases and information systems. IEEE Transactions on Education, 2024.
Hellina Hailu Nigatu and Inioluwa Deborah Raji. “i searched for a religious song in amharic and got sexual content instead”: Investigating online harm in low-resourced languages on youtube. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 141–160, 2024.
Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Benjamin Rosman, Thamar Solorio, and Monojit Choudhury. The zeno’s paradox of ‘low-resource’ languages. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 17753–17774, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.983. URL https://aclanthology.org/2024.emnlp-main.983/.
Chad Nilep. “code switching” in sociocultural linguistics. Colorado research in linguistics, 2006.
Nobal B. Niraula, Saurab Dulal, and Diwa Koirala. Offensive language detection in Nepali social media. In Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, and Zeerak Waseem (eds.), Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.woah-1.7. URL https://aclanthology.org/2021.woah-1.7/.
Debora Nozza. Exposing the limits of zero-shot cross-lingual hate speech detection. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 907–914, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.114. URL https://aclanthology.org/2021.acl-short.114/.
Victor Ojewale, Ryan Steed, Briana Vecchione, Abeba Birhane, and Inioluwa Deborah Raji. Towards ai accountability infrastructure: Gaps and opportunities in ai audit tooling. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–29, 2025.
OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt/, November 2022. Accessed [Insert Date You Accessed This Page].
OpenAI. Openai gpt-4.5 system card. Technical report, OpenAI, 2 2025.
Ankit Pal and Malaikannan Sankarasubbu. Gemini goes to Med school: Exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations. In Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, and Danielle Bitterman (eds.), Proceedings of the 6th Clinical Natural Language Processing Workshop, pp. 21–46, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.clinicalnlp-1.3. URL https://aclanthology.org/2024.clinicalnlp-1.3/.
Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. Towards multidomain and multilingual abusive language detection: a survey. Personal and Ubiquitous Computing, 27(1):17–43, 2023.
Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, and Sara Hooker. The multilingual divide and its impact on global ai safety. arXiv preprint arXiv:2505.21344, 2025.
Bruna Pilz, Rodrigo A Vasconcelos, Freddy B Marcondes, Samuel S Lodovichi, Wilson Mello, and Débora B Grossi. The brazilian version of start back screening tool-translation, cross-cultural adaptation and reliability. Brazilian journal of physical therapy, 18:453–461, 2014.
Giada Pistilli, Alina Leidinger, Yacine Jernite, Atoosa Kasirzadeh, Alexandra Sasha Luccioni, and Margaret Mitchell. Civics: Building a dataset for examining culturally-informed values in large language models. In Proceedings of the 2024 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’24, pp. 1132–1144. AAAI Press, 2025.
Flor Miriam Plaza-del Arco, Debora Nozza, Marco Guerini, Jeffrey Sorensen, and Marcos Zampieri. Countering hateful and offensive speech online - open challenges. In Jessy Li and Fei Liu (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pp. 11–16, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-tutorials.2. URL https://aclanthology.org/2024.emnlp-tutorials.2/.
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, and Jianfeng Chi. Towards understanding the fragility of multilingual llms against fine-tuning attacks. Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025.
Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, and Philip S Yu. Multilingual large language model: A survey of resources, taxonomy and frontiers. arXiv preprint arXiv:2404.04925, 2024.
Haoyi Qiu, Alexander R Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, and Chien-Sheng Wu. Evaluating cultural and social awareness of llm web agents. arXiv preprint arXiv:2410.23252, 2024.
Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, and Nanyun Peng. Multimodal cultural safety: Evaluation frameworks and alignment strategies. arXiv preprint arXiv:2505.14972, 2025.
Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, et al. Open problems in technical ai governance. arXiv preprint arXiv:2407.14981, 2024.
Manon Reusens, Philipp Borchert, Margot Mieskes, Jochen De Weerdt, and Bart Baesens. Investigating bias in multilingual language models: Cross-lingual transfer of debiasing techniques. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2887–2896, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.175. URL https://aclanthology.org/2023.emnlp-main.175/.
Paul Röttger, Fabio Pernisi, Bertie Vidgen, and Dirk Hovy. Safetyprompts: a systematic review of open datasets for evaluating and improving large language model safety. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 27617–27627, 2025.
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2340–2354, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.184. URL https://aclanthology.org/2022.findings-acl.184.
Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwaraknath Gnaneshwar, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, and Max Bartolo. Procedural knowledge in pretraining drives reasoning in large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=1hQKHHUsMx.
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems, 37:69747–69786, 2024.
Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9:845–874, 2021.
Ayse Pinar Saygin. Processing figurative language in a multi-lingual task: Translation, transfer and metaphor. In Proceedings of the Workshop on Corpus-based and Processing Approaches to Figurative Language. Citeseer, 2001.
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of LLMs in multilingual contexts. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 2668–2680, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.156. URL https://aclanthology.org/2024.findings-acl.156/.
Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, and Deyi Xiong. Large language model safety: A holistic survey, 2024a. URL https://arxiv.org/abs/2412.17686.
Shaojie Shi, Xiaoyu Tan, Xihe Qiu, Chao Qu, Kexin Nie, Yuan Cheng, Wei Chu, Xu Yinghui, and Yuan Qi. ULMR: Unlearning large language models via negative response and model parameter average. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 755–762, Miami, Florida, US, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-industry.57. URL https://aclanthology.org/2024.emnlp-industry.57/.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, pp. 1–8, 2025.
Jiayang Song, Yuheng Huang, Zhehua Zhou, and Lei Ma. Multilingual blending: Large language model safety alignment evaluation with language mixture. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3433–3449, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025.findings-naacl.191/.
Kamal K Sridhar. Societal multilingualism. Sociolinguistics and language teaching, 47:70, 1996.
Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, et al. Clio: Privacy-preserving insights into real-world ai use. arXiv preprint arXiv:2412.13678, 2024.
Yan Tao, Olga Viberg, Ryan S Baker, and René F Kizilcec. Cultural bias and cultural alignment of large language models. PNAS nexus, 3(9):pgae346, 2024.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Monica M Trieu. Understanding the use of “twinkie,” “banana,” and “fob”: Identifying the origin, role, and consequences of internalized racism within asian america. Sociology Compass, 13(5):e12679, 2019.
Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15894–15939, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.845. URL https://aclanthology.org/2024.acl-long.845/.
Jun Wang, Benjamin Rubinstein, and Trevor Cohn. Measuring and mitigating name biases in neural machine translation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2576–2590, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.184. URL https://aclanthology.org/2022.acl-long.184/.
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All languages matter: On the multilingual safety of LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 5865–5877, Bangkok, Thailand, August 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.349. URL https://aclanthology.org/2024.findings-acl.349/.
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All languages matter: On the multilingual safety of LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 5865–5877, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.349. URL https://aclanthology.org/2024.findings-acl.349/.
Xiangwen Wang, Jie Peng, Kaidi Xu, Huaxiu Yao, and Tianlong Chen. Reinforcement learning-driven LLM agent for automated attacks on LLMs. In Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, Vijayanta Jain, Patricia Thaine, Timour Igamberdiev, Niloofar Mireshghallah, and Oluwaseyi Feyisetan (eds.), Proceedings of the Fifth Workshop on Privacy in Natural Language Processing, pp. 170–177, Bangkok, Thailand, August 2024c. Association for Computational Linguistics. URL https://aclanthology.org/2024.privatenlp-1.17/.
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, and Barbara Plank. Refusal direction is universal across safety-aligned languages. arXiv preprint arXiv:2505.17306, 2025.
Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, and Yingchun Wang. Fake alignment: Are LLMs really aligned well? In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4696–4712, Mexico City, Mexico, June 2024d. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.263. URL https://aclanthology.org/2024.naacl-long.263/.
Zhonghao Wang, Zijia Lu, Bo Jin, and Haiying Deng. Mediagpt: A large language model for chinese media. arXiv preprint arXiv:2307.10930, 2023.
Qingsong Wen, Jing Liang, Carles Sierra, Rose Luckin, Richard Tong, Zitao Liu, Peng Cui, and Jiliang Tang. Ai for education (ai4edu): Advancing personalized education with llm and adaptive learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 6743–6744, 2024.
Genta Winata, Alham Fikri Aji, Zheng Xin Yong, and Thamar Solorio. The decades progress on code-switching research in NLP: A systematic survey on trends and challenges. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 2936–2978, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.185. URL https://aclanthology.org/2023.findings-acl.185/.
Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. The bitter lesson learned from 2,000+ multilingual benchmarks. arXiv preprint arXiv:2504.15521, 2025.
Hemant Yadav and Sunayana Sitaram. A survey of multilingual models for automatic speech recognition. arXiv preprint arXiv:2202.12576, 2022.
Mohammad Ali Yaghan. "arabizi": A contemporary style of arabic slang. Design issues, 24(2):39–52, 2008.
Yahan Yang, Soham Dan, Dan Roth, and Insup Lee. Benchmarking llm guardrails in handling multilingual toxicity. arXiv preprint arXiv:2410.22153, 2024a.
Zhaorui Yang, Tianyu Pang, Haozhe Feng, Han Wang, Wei Chen, Minfeng Zhu, and Qian Liu. Self-distillation bridges distribution gap in language model fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1028–1043, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.58. URL https://aclanthology.org/2024.acl-long.58/.
Da Yin, Haoyi Qiu, Kung-Hsiang Huang, Kai-Wei Chang, and Nanyun Peng. Safeworld: Geo-diverse safety alignment. Advances in Neural Information Processing Systems, 37:128734–128768, 2024.
Zheng Xin Yong, Cristina Menghini, and Stephen Bach. Low-resource languages jailbreak GPT-4. In Socially Responsible Language Modelling Research, 2023a. URL https://openreview.net/forum?id=pn83r8V2sv.
Zheng Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Winata, Stella Biderman, Edward Raff, Dragomir Radev, and Vassilina Nikoulina. BLOOM+1: Adding language support to BLOOM for zero-shot prompting. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11682–11703, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.653. URL https://aclanthology.org/2023.acl-long.653/.
Zheng-Xin Yong, M Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H Bach, and Alham Fikri Aji. Crosslingual reasoning through test-time scaling. arXiv preprint arXiv:2505.05408, 2025.
Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: Llm evaluation for safety and multilingual understanding. arXiv preprint arXiv:2406.15481, 2024.
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. R-judge: Benchmarking safety risk awareness for LLM agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1467–1490, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.79. URL https://aclanthology.org/2024.findings-emnlp.79/.
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14322–14350, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.773. URL https://aclanthology.org/2024.acl-long.773/.
Yizhou Zhang, Karishma Sharma, Lun Du, and Yan Liu. Toward mitigating misinformation and social media manipulation in llm era. In Companion Proceedings of the ACM Web Conference 2024, WWW ’24, pp. 1302–1305, New York, NY, USA, 2024a. Association for Computing Machinery. ISBN 9798400701726. doi: 10.1145/3589335.3641256. URL https://doi.org/10.1145/3589335.3641256.
Zheyuan Zhang, Daniel Zhang-Li, Jifan Yu, Linlu Gong, Jinchang Zhou, Zhanxin Hao, Jianxiao Jiang, Jie Cao, Huiqin Liu, Zhiyuan Liu, et al. Simulating classroom education with llm-empowered agents. arXiv preprint arXiv:2406.19226, 2024b.
Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, et al. Mpo: Multilingual safety alignment via reward gap optimization. arXiv preprint arXiv:2505.16869, 2025.
Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 10586–10613, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.630. URL https://aclanthology.org/2024.findings-acl.630/.
Shucheng Zhu, Bingjie Du, Jishun Zhao, Ying Liu, and Pengyuan Liu. Do PLMs and annotators share the same gender bias? definition, dataset, and framework of contextualized gender bias. In Agnieszka Faleńska, Christine Basta, Marta Costa-jussà, Seraphina Goldfarb-Tarrant, and Debora Nozza (eds.), Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pp. 20–32, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.gebnlp-1.2. URL https://aclanthology.org/2024.gebnlp-1.2/.