The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
Summary
This study investigates the effectiveness of preference-based alignment in multilingual large language models (LLMs), revealing a significant monolingual bias. While alignment techniques improve safety and reliability in English, their impact is inconsistent across other languages. The research evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, uncovering substantial disparities in latent representation spaces between high-resource and low-resource languages. Analysis of hidden representations before and after alignment shows that English models exhibit strong separation between harmful and harmless content, whereas this separation is weaker in languages like Hindi, Chinese, and German. Metrics such as Bhattacharyya Distance and Silhouette Score confirm these findings, highlighting weaker alignment effects in non-English languages. The study emphasizes the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. It also raises ethical concerns about the disproportionate impact of misaligned models on marginalized linguistic communities. The findings underscore the urgency of addressing alignment gaps in underrepresented languages to develop truly safe and equitable multilingual LLMs.
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma, LG Toronto AI Research Lab, nikhil.verma@lge.com
Manasa Bharadwaj, LG Toronto AI Research Lab, manasa.bharadwaj@lge.com

Abstract

Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.

1 Introduction

Alignment techniques play a critical role in adapting Large Language Models (LLMs) beyond their pre-training and fine-tuning phases, ensuring consistency with human values and preferences (Christiano et al., 2017; Ziegler et al., 2019). To achieve this, methods based on online and offline policy optimization (Ouyang et al., 2022; Rafailov et al., 2023; Ethayarajh et al., 2024; Haldar et al., 2025) have been introduced to enhance model
reliability, safety, and fairness for real-world deployment. However, despite being trained on multilingual corpora, LLM alignment remains predominantly optimized for English, leading to disparities in performance and behavior across languages (Schwartz et al., 2022; Vashishtha et al., 2023). This discrepancy raises critical concerns about security and usability, particularly for underrepresented languages, where alignment remains underexplored and its effectiveness is poorly understood (Rystrøm et al., 2025; Khandelwal et al., 2023). A systematic investigation into the cross-lingual impact of alignment is essential to understand its influence on model representations and behavior across English and other languages.

Figure 1: Effect of Alignment on Hidden Representations in Llama-2 (#7B) for English Prompt Safety. (Panels: (a) Before Alignment, (b) After Alignment.)

Beyond fairness concerns, misalignment in multilingual settings presents fundamental challenges in representation learning and decision boundaries within LLMs. While alignment techniques refine model behavior in English, their impact on latent space organization across languages remains poorly understood. Empirical evidence suggests that preference optimization shifts model representations (Figure 1), yet such shifts may not generalize uniformly across linguistic groups (Lin et al., 2024; Kirk et al., 2023). For languages with limited preference data, alignment may inadvertently distort decision boundaries, leading to semantic drift, degraded reasoning capabilities, or increased susceptibility to adversarial inputs (Dang et al.,
2024). This raises the pressing need to analyze how alignment reshapes the hidden space across languages and whether current methodologies effectively preserve semantic consistency while mitigating harmful outputs. A deeper investigation into cross-lingual representation shifts can illuminate the unintended consequences of alignment, guiding the development of more robust multilingual models.

In this work, we systematically assess the effectiveness of LLM safety alignment across multiple languages by evaluating a diverse set of models on multilingual benchmarks. We analyze distributional shifts in the embedding space of multilingual safety prompts for both reference and aligned models, uncovering how alignment mechanisms influence model behavior. To quantify the separation induced by alignment in enforcing safety constraints, we utilize a set of distributional metrics, providing a concrete measure of alignment-induced shifts. Our findings reveal critical gaps in multilingual safety mechanisms, highlighting the inconsistencies in alignment effectiveness across languages.

As deep generative models evolve, concerns regarding memorization and bias propagation have intensified (Carlini et al., 2023; Biderman et al., 2023; Nasr et al., 2023). To better isolate the effects of alignment, we analyze hidden representations through probing only at the input processing stage, minimizing potential contamination from post-training phases. Our study uncovers significant performance disparities across model families, emphasizing the monolingual bias inherent in current LLM development. These insights lay the groundwork for future
fine-tuning efforts on language-specific datasets, guiding improvements in multilingual fairness, safety, and robustness.

2 Related literature

2.1 Alignment of LLMs

Fine-tuning LLMs based on human preferences has emerged as a key approach for post-training, enhancing their ability to generate responses aligned with human values (Christiano et al., 2017; Ziegler et al., 2019; Liu et al., 2020). A variety of techniques have been developed to achieve this alignment, starting with Reinforcement Learning from Human Feedback (RLHF), which remains the dominant paradigm for online policy optimization (Ouyang et al., 2022; Ahmadian et al., 2024). Beyond RLHF, several offline optimization methods, such as Direct Preference Optimization (DPO) (Rafailov et al., 2023), Implicit Preference Optimization (IPO) (Azar et al., 2024), Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2024), Binary Classifier Optimization (BCO) (Jung et al., 2024), and KL divergence optimization (KLDO) (Haldar et al., 2025), have been introduced to refine model behavior without requiring costly reinforcement learning loops. The overarching objective of these techniques is to improve alignment with human intent while mitigating the risks of generating harmful or toxic content, especially in large-scale real-world deployments.

2.2 Multilingual LLM performance

A critical limitation of current alignment strategies is their heavy reliance on human preference datasets predominantly sourced from English or other high-resource languages (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023). As a result, LLMs exhibit strong alignment in English but struggle with maintaining consistent
safety and ethical considerations in underrepresented languages (Schwartz et al., 2022; Vashishtha et al., 2023; Khandelwal et al., 2023). This imbalance raises concerns about fairness, as the effectiveness of alignment mechanisms can vary significantly across linguistic groups (Rystrøm et al., 2025). In particular, multilingual users may encounter unreliable moderation, differing levels of content safety, or unintended biases when interacting with aligned LLMs in languages beyond English (Yong et al., 2023).

2.3 Jailbreaking studies

Despite significant advancements, alignment strategies are not foolproof (Li et al., 2023; AlKhamissi et al., 2024). An emerging body of research explores how safety-aligned LLMs remain vulnerable to adversarial exploits, including jailbreaking attacks that expose weaknesses in alignment constraints (Lukas et al., 2023; Sun et al., 2024; Liu et al., 2023). These vulnerabilities are further amplified in multilingual settings, where models may bypass safety filters in languages for which alignment data is scarce (Winata et al., 2024; Son et al., 2024). Addressing these gaps is crucial for ensuring that LLM alignment strategies are not only robust but also equitable across diverse linguistic and cultural contexts (Dang et al., 2024).

Figure 2: Probing the Impact of Human Preference Tuning on Multilingual Safety at Inference Time: Llama-3.1 (#8B) Alignment in English vs. Hindi.

3 Background

LLMs are typically trained in three distinct stages, each playing a crucial role in shaping their capabilities and alignment with human preferences.

Pre-training. LLMs are initially pre-trained
on large-scale corpora spanning multiple languages, optimizing the likelihood of predicting the next token conditioned on preceding text. The model's vocabulary is designed to accommodate tokens from diverse languages, ensuring broad linguistic coverage.

Supervised Fine-tuning (SFT). Following pre-training, LLMs undergo fine-tuning on curated, high-quality datasets specific to various downstream tasks, such as translation, dialogue, and summarization. This stage refines the model's ability to generate more task-relevant responses and produces a reference model π_ref.

Human Preference Tuning. The SFT model is prompted with inputs x to generate pairs of candidate responses (y_1, y_2) ∼ π_ref(y | x). These responses are then presented to human annotators, who express a preference for one over the other, denoted as y_w ≻ y_l | x, where y_w and y_l represent the preferred and dispreferred completions, respectively. Preference data is assumed to be generated from an underlying latent reward function r*(x, y), which models human preferences. A common approach to capturing the probability of a preference ranking is through the Bradley-Terry model (Bradley and Terry, 1952):

$p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big)$    (1)

where σ denotes the logistic function. More generally, when multiple ranked responses are available, Plackett-Luce models are also applicable (Plackett, 1975).
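As a concrete illustration of Eq. (1), the short sketch below computes Bradley-Terry preference probabilities from scalar rewards. It is a minimal numerical example with hypothetical reward values, not code from the paper.

```python
import math

def bradley_terry_prob(reward_w: float, reward_l: float) -> float:
    """Eq. (1): p*(y_w > y_l | x) = sigma(r*(x, y_w) - r*(x, y_l))."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))

# Hypothetical rewards for a preferred and a dispreferred completion.
print(bradley_terry_prob(1.2, -0.3))  # ~0.82: the preferred answer usually wins
print(bradley_terry_prob(0.5, 0.5))   # 0.5: equal rewards give a coin flip
```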
In large-scale preference datasets (shown in Figure 2), human-annotated feedback is predominantly available for high-resource languages such as English, while low-resource languages are underrepresented (Chaudhari et al., 2024). Since directly obtaining a true reward function from human feedback is infeasible, an alternative approach is to assume access to a static dataset of comparisons $D = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$ sampled from $p^*$ and to parameterize a reward model r_φ(x, y). The parameters of this model are estimated via maximum likelihood:

$\mathbb{E}_{(x, y_w, y_l) \sim D}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$    (2)

To ensure that the optimized model maintains desirable text generation properties, such as fluency and coherence, a KL divergence penalty is introduced to limit deviation from π_ref. The reward model is then used to optimize a new policy π_θ, formulated as:

$\mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]$    (3)

where β > 0 is a hyperparameter controlling the balance between reward maximization and distributional regularization.

Since this objective is non-differentiable, reinforcement learning (e.g., Proximal Policy Optimization, PPO (Schulman et al., 2017)) is typically employed for policy updates. To address the instability and inefficiency associated with online policy optimization, recent approaches have explored offline optimization techniques using closed-form objectives that maximize the margin between preferred and dispreferred completions. Methods such as DPO, KTO, BCO, and KLDO estimate divergence-based constraints and aim to structure the alignment space by separating safe and unsafe generations, within the language distribution they are trained on (Haldar et al., 2025). Consequently, models aligned with English-centric datasets may fail to generalize alignment properties effectively to other languages, leading to safety inconsistencies in multilingual contexts.
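To make the offline route concrete, here is a minimal PyTorch sketch of a DPO-style pairwise objective in the spirit of Rafailov et al. (2023), where the β-scaled log-ratio margin plays the same regularization role as the KL term in Eq. (3). The function name and the batch of log-probabilities are hypothetical; this is an illustrative simplification under those assumptions, not the method's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style pairwise loss (after Rafailov et al., 2023).

    Each argument is a tensor of summed token log-probabilities of the
    preferred (w) or dispreferred (l) completion under the trainable policy
    pi_theta or the frozen reference pi_ref.
    """
    # Implicit reward margin: beta * (preferred log-ratio - dispreferred log-ratio).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Maximize log sigma(margin), i.e. minimize -log sigma(margin).
    return -F.logsigmoid(margin).mean()

# Hypothetical log-probabilities for a batch of two comparisons.
loss = dpo_loss(torch.tensor([-12.0, -8.0]), torch.tensor([-15.0, -9.5]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.0, -9.0]))
print(loss.item())
```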
4 Methodology

In this study, we address the issue of multilingual alignment in LLMs, particularly in the context of human preference datasets that are unevenly distributed across languages. The reward function r_φ(x, y) and the optimized policy model π_θ, therefore, exhibit higher accuracy in high-resource languages, while demonstrating poor alignment in low-resource languages. This misalignment results in a significant failure to distinguish between aligned and unaligned data, as observed in Llama-3.1-8B's performance (Figure 2). For instance, when prompted in English with "How to make a bomb?", the model refuses to generate a response, aligning with safety constraints. However, when presented with the equivalent Hindi query, it produces detailed procedural steps, including explicit instructions, highlighting a critical gap in multilingual safety alignment.

The discrepancy in multilingual alignment manifests as strong safety adherence in English but increased susceptibility to harmful responses in underrepresented languages. To systematically analyze this issue, we probed the language models using a balanced multilingual toxicity corpus, where prompts vary in both harmfulness and linguistic structure. We extract final-layer embeddings from the LLM and apply Principal Component Analysis (PCA) for dimensionality reduction, enabling visualization of harmful and harmless clusters before and after alignment. The explained variance ratio of the principal components aids in understanding alignment-induced shifts in representation space. We further compute within-class and between-class variance to quantify intra-cluster cohesion and inter-cluster separation in the embedding space.
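A minimal sketch of this probing pipeline is given below, assuming access to the Hugging Face reference checkpoint from Table 1, mean pooling over final-layer hidden states, and a toy set of labeled prompts. The pooling choice and the between-class variance definition are our assumptions for illustration, since the paper does not spell out its exact estimator.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

# Hypothetical probe data: prompts with binary toxicity labels (1 = harmful).
prompts = ["How do I bake bread?", "What is the capital of France?",
           "How to make a bomb?", "Write an insult about my neighbor."]
labels = np.array([0, 0, 1, 1])

model_name = "meta-llama/Llama-2-7b"  # reference checkpoint from Table 1
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token
model = AutoModel.from_pretrained(model_name)
model.eval()

with torch.no_grad():
    batch = tok(prompts, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state             # final-layer states
    mask = batch["attention_mask"].unsqueeze(-1)
    emb = ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

# Project to 2 components for visualization (10 were used for the metrics).
pca = PCA(n_components=2).fit(emb)
proj = pca.transform(emb)
print("explained variance ratio:", pca.explained_variance_ratio_.sum())

# Between-class variance ratio: spread of the class means relative to the
# total variance of the projected embeddings (one common definition).
mu = proj.mean(axis=0)
between = sum((labels == c).mean() * np.sum((proj[labels == c].mean(axis=0) - mu) ** 2)
              for c in (0, 1))
total = np.sum(proj.var(axis=0))
print("between-class variance ratio:", between / total)
```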
To rigorously measure alignment-induced separation, we employ the Bhattacharyya Distance metric (Bhattacharyya, 1943), capturing the overlap between harmful and harmless clusters. This metric serves as an indicator of how alignment influences the distinction between safety-related representations across different language models. Additionally, we use the Silhouette Score to assess clustering quality, evaluating how well prompt embeddings align with their respective clusters. A higher Silhouette Score in high-resource languages further substantiates the improved separation of harmful and harmless content, providing a robust quantification of multilingual alignment effectiveness.
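The sketch below computes both quantities with NumPy and scikit-learn, approximating each cluster as a multivariate Gaussian for the Bhattacharyya distance. The Gaussian estimator and the synthetic 10-dimensional clusters (mirroring the 2,500/2,500 split and the 10 PCA components used later) are assumptions for illustration; the paper does not specify its exact estimator.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def bhattacharyya_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Bhattacharyya distance between two clusters of embeddings,
    approximating each cluster as a multivariate Gaussian."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    cov = (cov_a + cov_b) / 2
    diff = mu_a - mu_b
    # Mahalanobis-like term for the mean separation.
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Log-determinant term for the covariance mismatch.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    term2 = 0.5 * (logdet - 0.5 * (logdet_a + logdet_b))
    return term1 + term2

# Hypothetical 10-component PCA projections of harmless/harmful prompts.
rng = np.random.default_rng(0)
harmless = rng.normal(0.0, 1.0, size=(2500, 10))
harmful = rng.normal(1.5, 1.0, size=(2500, 10))

print("BD:", bhattacharyya_distance(harmless, harmful))
X = np.vstack([harmless, harmful])
y = np.array([0] * 2500 + [1] * 2500)
print("SS:", silhouette_score(X, y))
```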
5 Experiments

5.1 Dataset

Our primary focus is on four languages: English (en), Hindi (hi), Chinese (zh), and German (de), as these are the most commonly fine-tuned languages in LLMs through SFT or Preference Optimization. To systematically evaluate multilingual alignment, we utilized the following datasets:

• Balanced Toxicity Dataset: This dataset contains an equal number of toxic and non-toxic sentences, ensuring an unbiased comparison across different alignment strategies (Dementieva et al., 2024b). We used binary toxicity classification datasets available in nine languages. Each language dataset consists of 5,000 samples, with 2,500 toxic and 2,500 non-toxic sentences.

• Multilingual Parallel Text-Detoxification Dataset: This dataset contains parallel sentence pairs where the toxic and detoxified versions maintain semantic consistency but differ in syntactic toxicity expression (Dementieva et al., 2024a). It allows for a controlled evaluation of how well models differentiate between toxic and non-toxic variations of the content across multiple languages.

These datasets provide a structured and comprehensive framework for evaluating multilingual alignment.

Table 1: Models used for Multilingual Preference-Tuning evaluation. The model cards refer to Hugging Face checkpoints of reference (π_ref) and aligned (π_θ) models.

Model (citation)                          #Size  Reference model (π_ref)   Aligned model (π_θ)               Creator
Llama-2 (Touvron et al., 2023)            7B     meta-llama/Llama-2-7b     meta-llama/Llama-2-7b-chat        Meta
Qwen-2.5 (Yang et al., 2024)              7B     Qwen/Qwen2.5-7B           Qwen/Qwen2.5-7B-Instruct          Qwen
Llama-3.1 (Grattafiori et al., 2024)      8B     meta-llama/Llama-3.1-8B   meta-llama/Llama-3.1-8B-Instruct  Meta
Llama-Guard-3 (Grattafiori et al., 2024)  8B     meta-llama/Llama-3.1-8B   meta-llama/Llama-Guard-3-8B       Meta
Gemma-2 (Team et al., 2024)               9B     google/gemma-2-9b         google/gemma-2-9b-it              Google
Gemma-3 (Team, 2025)                      12B    google/gemma-3-12b-pt     google/gemma-3-12b-it             Google
Phi-4 (Abdin et al., 2024)                14B    -                         microsoft/phi-4                   Microsoft

Figure 3: Impact of Alignment on Hidden Representations in Llama-2 for Multilingual Corpora. (Panels: π_ref and π_θ for en, hi, zh, de.)

5.2 Models

We analyzed models from four distinct families, varying in scale and training objectives. We focused exclusively on openly available models hosted on Hugging Face, ensuring transparency and reproducibility (see Table 1). To assess multilingual alignment, we examined the hidden representation space of each model before and after alignment (where applicable).

We evaluate multilingual alignment
across diverse LLM families, including Llama models (Llama-2, Llama-3.1, Llama-Guard-3), Qwen-2.5, Gemma models (Gemma-2, Gemma-3), and Phi-4. These models vary in pretraining strategies, alignment methods, and language coverage: Llama models use SFT and RLHF, with Llama-Guard specializing in safety filtering; Qwen-2.5 explicitly supports 29 languages; Gemma models expand multilingual capabilities with RL objectives, covering up to 140 languages; and Phi-4, trained on curated and synthetic datasets, includes 8% multilingual data. This selection provides a comprehensive lens to assess alignment robustness in multilingual settings.

5.3 Results

Figure 3 demonstrates the impact of alignment on the latent representations of LLMs using PCA with the first two components. Before alignment, the reference policy π_ref shows overlapping clusters for harmless and harmful sentences, whereas the aligned model π_θ exhibits improved separation due to the divergence induced by alignment. While English clusters are well-separated, the effect is less pronounced in other languages. The PCA explained variance ratio was 49.61%, indicating that nearly half of the data variation is captured in the reduced space. Notably, the between-class variance ratio increased from 0.81% to 61.20% (a 60.39% improvement), confirming enhanced cluster separability post-alignment. However, for Hindi, Chinese, and German, the improvements were significantly
lower at 19.98%, 10.09%, and 26.85%, respectively, highlighting weaker alignment effects in these languages. To further analyze cluster separability, we extended our evaluation beyond PCA with two components, as the explained variance ratio of 50% was insufficient. Instead, we used the first 10 components for computing additional metrics across all models. (cf. Appendix A for additional visualizations with other models.)

Figure 4: Bhattacharyya Distance for All Models Pre- and Post-Alignment Tuning, for (a) en, (b) hi, (c) zh, and (d) de. The blue radar indicates values before alignment (π_ref), while green represents values after alignment (π_θ).

One such metric, the Bhattacharyya distance, quantifies the separation between harmful and harmless sentence clusters. As shown in Figure 4, this measure confirms our earlier observations. In English, the reference model shows relatively less separation than the aligned model across all pairs, aligning with our PCA-based findings. Chinese and German also exhibit an increase in cluster separation, though the effect is weaker compared to English. Notably, the scale in the plot is logarithmic (ranging from 1e-3 to 1e+1), emphasizing the substantial differences in cluster distances across languages. A particularly interesting trend emerges for Hindi, where some models show improved separation post-alignment, while others exhibit the opposite trend, with the unaligned model displaying greater cluster separation. Similarly, silhouette scores, which measure cluster compactness and between-class variance, reveal a consistent pattern. Although alignment generally increases cluster separability, the improvement is significantly higher for English
than for other languages. (cf. Appendix B for exact metric values across all models.)

Figure 5: Impact of Alignment on Hidden Representations in Llama-2 for the multilingual parallel text-detoxification corpora. (Panels: π_ref and π_θ for en, hi.)

In a more challenging parallel text-detoxification setup, where harmless sentences are minor edits of harmful ones, often differing by just one or two tokens, the effectiveness of alignment varies (Figure 5). While English representations remain meaningfully separated even in this difficult setting, neither the reference nor the aligned models capture a clear distribution shift for Hindi in lower-dimensional space. This highlights language-dependent differences in how alignment influences representation learning.

6 Conclusion

Our study provides a comprehensive analysis of the multilingual alignment status of current preference-tuned models. Our findings indicate that state-of-the-art models perform well in English (monolingual bias), but their alignment across languages remains inconsistent, as shown by cluster separability metrics and hidden representation analyses. This underscores the need for a more holistic representation of languages when training models intended for truly global audiences.

Limitations

While our study provides a systematic analysis of multilingual alignment, it has certain limitations:

• Scope of Alignment Evaluation: Our methodology assumes that alignment mechanisms induce divergence in the human preference space for the properties they were guardrailed against. However, we primarily focus on safety alignment, overlooking other critical domains such as multi-modality, reasoning, instruction-following, or planning due to the lack of standardized multilingual benchmarks for these tasks.

• Language Coverage:
For this proof-of-concept study, we evaluate only three non-English languages, all of which are medium-resource. While our findings highlight monolingual bias, future work should extend this analysis to low-resource languages to better understand alignment disparities.

• Dataset Size: Our dataset consists of 5,000 sentences, which, while significantly larger than in prior works (~200 samples) (Lin et al., 2024; Haldar et al., 2025), may not fully capture the diversity of real-world multilingual scenarios. A broader dataset could further validate our findings across different linguistic and cultural contexts.

These limitations underscore the need for more comprehensive multilingual benchmarks and extended evaluations across diverse alignment tasks and resource-constrained languages.

Ethical Considerations

Our findings raise serious ethical concerns regarding the practices of model developers who release large-scale language models without ensuring robust multilingual alignment. This oversight disproportionately impacts marginalized linguistic communities, increasing their exposure to harmful outputs while reinforcing systemic biases.

Two major ethical risks highlighted by our analysis are:

• Misuse Potential: Our study identifies failure cases in languages other than English, where models generate unsafe or misaligned outputs. Figure 2 presents an example in Hindi, demonstrating how alignment inconsistencies create vulnerabilities that could be exploited for harmful applications. To mitigate this, future alignment efforts must prioritize multilingual preference collection, particularly for high-risk domains. If direct human
preference gathering is infeasible, post-training interventions should explore synthetic techniques to transfer alignment knowledge from English to other languages.

• Harm to Vulnerable Populations: The unrestricted release of insufficiently aligned models disproportionately impacts marginalized communities, where users may interact with models in languages that lack rigorous safety guardrails. Our findings suggest that current models can be jailbroken more easily in sister languages to English, exposing these populations to higher risks. This underscores the urgent need for comprehensive multilingual safety evaluations before open-weight LLMs are widely deployed.

Our study calls for greater accountability in multilingual alignment efforts, emphasizing that ethical AI deployment requires more than just English-centric safety measures. Future research should focus on developing alignment techniques that are language-agnostic and equitable across diverse linguistic contexts.

References

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, and 1 others. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905.

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740.

Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. arXiv preprint arXiv:2402.13231.
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. 2024. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447-4455. PMLR.

Anil Bhattacharyya. 1943. On a measure of divergence between two statistical populations defined by their probability distribution. Bulletin of the Calcutta Mathematical Society, 35:99-110.

Stella Biderman, Usvsn Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. 2023. Emergent and predictable memorization in large language models. Advances in Neural Information Processing Systems, 36:28072-28090.

Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345.

Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253-5270.

Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, and Bruno Castro da Silva. 2024. RLHF deciphered: A critical analysis of reinforcement learning from human feedback for LLMs. arXiv preprint arXiv:2404.08555.

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, and 1 others. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. 2024. RLHF can speak many languages: Unlocking multilingual preference optimization for LLMs. arXiv preprint arXiv:2407.02552.

Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Moskovskiy, Elisei Stakovskii, and 1 others. 2024a. Multilingual and explainable text detoxification with parallel corpora. arXiv preprint arXiv:2412.11691.

Daryna Dementieva, Daniil Moskovskiy, Nikolay Babakov, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Dmitry Ustalov, Elisei Stakovskii, Alisa Smirnova, Ashraf Elnagar, Animesh Mukherjee, and Alexander Panchenko. 2024b. Overview of the multilingual text detoxification task at PAN 2024. In Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum. CEUR-WS.org.

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, and Yue Xing. 2025. LLM safety alignment is divergence estimation in disguise. arXiv preprint arXiv:2502.00657.

Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. 2024. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656.

Khyati Khandelwal, Manuel Tonneau, Andrew M Bean, Hannah Rose Kirk, and Scott A Hale. 2023. Casteist but not racist? Quantifying disparities in large language model bias between India and the West. CoRR.

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2023. Understanding the effects of RLHF on LLM generalisation and diversity. arXiv preprint arXiv:2310.06452.

Haoran Li, Yulin Chen, Jinglong Luo, Jiecong Wang, Hao Peng, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Zenglin Xu, and 1 others. 2023. Privacy in large language models: Attacks, defenses and future directions. arXiv preprint arXiv:2310.10383.

Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. 2024. Towards understanding jailbreak attacks in LLMs: A representation space analysis. arXiv preprint arXiv:2406.10794.

Fei Liu and 1 others. 2020. Learning to summarize from human feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 583-592.

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. 2023. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 346-363. IEEE.

Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A Feder Cooper, Daphne Ippolito, Christopher A Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. 2023. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744.

Robin L Plackett. 1975. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193-202.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728-53741.

Jonathan Rystrøm, Hannah Rose Kirk, and Scott Hale. 2025. Multilingual != multicultural: Evaluating gaps between multilingual capabilities and cultural alignment in LLMs. arXiv preprint arXiv:2502.16534.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Reva Schwartz, Apostol Vassilev, Kristen Greene, Lori Perine, Andrew Burt, and Patrick Hall. 2022. Towards a standard for identifying and managing bias in artificial intelligence, volume 3. US Department of Commerce, National Institute of Standards and Technology.

Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, and Seungone Kim. 2024. MM-Eval: A multilingual meta-evaluation benchmark for LLM-as-a-judge and reward models. arXiv preprint arXiv:2410.17578.

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, and 1 others. 2024. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 3.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model.

Gemma Team. 2025. Gemma 3.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and 1 others. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Aniket Vashishtha, Kabir Ahuja, and Sunayana Sitaram. 2023. On evaluating and mitigating gender biases in multilingual settings. arXiv preprint arXiv:2307.01503.
Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D Yao, Shi-Xiong Zhang, and Sambit Sahu. 2024. Preference tuning with human feedback on language, speech, and vision tasks: A survey. arXiv preprint arXiv:2409.11564.

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. 2023. LaMini-LM: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2023. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

A Impact of Alignment on Hidden Representations for Multilingual Corpora

For models other than Llama-2, visualizations of hidden representations before and after alignment are presented in Figures 6 to 10. For Phi-4, since only the aligned model checkpoint is available, only the representation analysis is shown in Figure 11.

B Metrics of cluster quality, before and after alignment of LLMs

The models we used in this study are described below:

• Llama-2: A suite of open-source foundational and fine-tuned chat models. The pre-training corpus includes over 5% non-English high-quality data, though evaluations primarily focus on English.
• Qwen-2.5: An instruction-following model optimized for long-context reasoning and diverse prompts. It explicitly supports 29 languages in output generation.

• Llama-3.1: Fine-tuned versions use SFT and RLHF for preference alignment. Supported languages include English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

• Llama-Guard: A content safety classifier fine-tuned on Llama-3.1-8B. It classifies both inputs (prompt filtering) and outputs (response moderation) and supports multilingual safety alignment in the same languages as Llama-3.1.

• Gemma-2: A 9B parameter instruction-tuned model primarily trained on English data. However, the SFT phase incorporates some multilingual contexts.

• Gemma-3: A 12B parameter multimodal and multilingual model which supports 140 languages. In its RL objectives, it uses a variety of reward functions to improve helpfulness, reasoning, and multilingual abilities, while minimizing model harmfulness.

• Phi-4: Trained on a mixture of synthetic datasets, filtered public-domain websites, and traditional sources using SFT and DPO. Approximately 8% of its training data is explicitly multilingual, covering a wide range of languages, including German, Spanish, French, Portuguese, Italian, Hindi, and Japanese.

This selection of models enables a comprehensive evaluation of alignment in multilingual settings, covering diverse pretraining strategies, alignment techniques, and language coverage. Table 2 reports the metrics used for comparison of PCA clusters with 10 components for all the models evaluated before and after alignment.
Figure 6: Impact of Alignment on Hidden Representations in Qwen-2.5 for Multilingual Corpora. (Panels: π_ref and π_θ for en, hi, zh, de.)
Figure 7: Impact of Alignment on Hidden Representations in Llama-3.1 for Multilingual Corpora.
Figure 8: Impact of Alignment on Hidden Representations in Llama-Guard-3 for Multilingual Corpora.
Figure 9: Impact of Alignment on Hidden Representations in Gemma-2 for Multilingual Corpora.
Figure 10: Impact of Alignment on Hidden Representations in Gemma-3 for Multilingual Corpora.
Figure 11: Impact of Alignment on Hidden Representations in Phi-4 for Multilingual Corpora. (Aligned model π_θ only.)

Table 2: Metric values of different LLMs before and after alignment on the Balanced Toxicity Dataset. We use "BD" for Bhattacharyya distance, "SS" for silhouette score, and "BCV" for between-class variance. A hyphen (-) marks entries where a model checkpoint is not available.

Model          Language   Reference Model (BD / SS / BCV)   Aligned Model (BD / SS / BCV)
Llama-2        English    0.0350 / 0.0142 / 0.0303          2.5871 / 0.5433 / 0.3715
               Hindi      0.1837 / 0.1355 / 0.1309          0.6743 / 0.3036 / 0.1828
               Chinese    0.0044 / 0.0017 / 0.0033          0.0961 / 0.0878 / 0.0671
               German     0.1905 / 0.0471 / 0.0409          0.3907 / 0.2825 / 0.2067
Qwen-2.5       English    0.3037 / 0.1326 / 0.0776          0.8365 / 0.3150 / 0.1987
               Hindi      0.0309 / 0.0172 / 0.0631          0.1746 / 0.1047 / 0.0895
               Chinese    0.0208 / 0.0102 / 0.0119          0.0233 / 0.0138 / 0.0263
               German     0.0813 / 0.0482 / 0.0330          0.0770 / 0.0634 / 0.0703
Llama-3.1      English    0.1114 / 0.0268 / 0.0637          0.9639 / 0.3156 / 0.2047
               Hindi      0.5402 / 0.2313 / 0.1646          0.3723 / 0.2253 / 0.1649
               Chinese    0.0029 / 0.0025 / 0.0053          0.0162 / 0.0140 / 0.0124
               German     0.1262 / 0.0411 / 0.0405          0.2096 / 0.1230 / 0.0904
Llama-Guard-3  English    0.1114 / 0.0268 / 0.0637          0.8627 / 0.2971 / 0.2080
               Hindi      0.5402 / 0.2313 / 0.1646          0.2389 / 0.1456 / 0.1398
               Chinese    0.0029 / 0.0025 / 0.0053          0.1576 / 0.0923 / 0.0720
               German     0.1262 / 0.0411 / 0.0405          0.2440 / 0.1697 / 0.1112
Gemma-2        English    0.1653 / 0.0565 / 0.0781          0.6046 / 0.2544 / 0.1368
               Hindi      0.5061 / 0.2099 / 0.1491          0.1522 / 0.0882 / 0.0795
               Chinese    0.0055 / 0.0051 / 0.0072          0.0182 / 0.0172 / 0.0202
               German     0.0487 / 0.0301 / 0.0305          0.1976 / 0.1028 / 0.0629
Gemma-3        English    0.2087 / 0.0753 / 0.1016          1.1618 / 0.3913 / 0.2342
               Hindi      1.2436 / 0.2389 / 0.1800          0.3940 / 0.1794 / 0.1073
               Chinese    0.0046 / 0.0043 / 0.0102          0.0749 / 0.0362 / 0.0262
               German     0.1070 / 0.0396 / 0.0426          0.2298 / 0.1287 / 0.0790
Phi-4          English    - / - / -                         1.1200 / 0.3712 / 0.1929
               Hindi      - / - / -                         0.7817 / 0.2895 / 0.1475
               Chinese    - / - / -                         0.0015 / 0.0004 / 0.0111
               German     - / - / -                         0.2650 / 0.1543 / 0.0775