The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
Summary
This study investigates the effectiveness of preference-based alignment in multilingual large language models (LLMs), revealing a significant monolingual bias. While alignment techniques improve safety and reliability in English, their impact is inconsistent across other languages. The research evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, uncovering substantial disparities in latent representation spaces between high-resource and low-resource languages. Analysis of hidden representations before and after alignment shows that English models exhibit strong separation between harmful and harmless content, whereas this separation is weaker in languages like Hindi, Chinese, and German. Metrics such as Bhattacharyya Distance and Silhouette Score confirm these findings, highlighting weaker alignment effects in non-English languages. The study emphasizes the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. It also raises ethical concerns about the disproportionate impact of misaligned models on marginalized linguistic communities. The findings underscore the urgency of addressing alignment gaps in underrepresented languages to develop truly safe and equitable multilingual LLMs.
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma, LG Toronto AI Research Lab, nikhil.verma@lge.com
Manasa Bharadwaj, LG Toronto AI Research Lab, manasa.bharadwaj@lge.com

Abstract

Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.

1 Introduction

Alignment techniques play a critical role in adapting Large Language Models (LLMs) beyond their pre-training and fine-tuning phases, ensuring consistency with human values and preferences (Christiano et al., 2017; Ziegler et al., 2019). To achieve this, methods based on online and offline policy optimization (Ouyang et al., 2022; Rafailov et al., 2023; Ethayarajh et al., 2024; Haldar et al., 2025) have been introduced to enhance model
reliability, safety, and fairness for real-world deployment. However, despite being trained on multilingual corpora, LLM alignment remains predominantly optimized for English, leading to disparities in performance and behavior across languages (Schwartz et al., 2022; Vashishtha et al., 2023). This discrepancy raises critical concerns about security and usability, particularly for underrepresented languages, where alignment remains underexplored and its effectiveness is poorly understood (Rystrøm et al., 2025; Khandelwal et al., 2023). A systematic investigation into the cross-lingual impact of alignment is essential to understand its influence on model representations and behavior across English and other languages.

Figure 1: Effect of Alignment on Hidden Representations in Llama-2 (#7B) for English Prompt Safety. (Panels: (a) Before Alignment, (b) After Alignment.)

Beyond fairness concerns, misalignment in multilingual settings presents fundamental challenges in representation learning and decision boundaries within LLMs. While alignment techniques refine model behavior in English, their impact on latent space organization across languages remains poorly understood. Empirical evidence suggests that preference optimization shifts model representations (Figure 1), yet such shifts may not generalize uniformly across linguistic groups (Lin et al., 2024; Kirk et al., 2023). For languages with limited preference data, alignment may inadvertently distort decision boundaries, leading to semantic drift, degraded reasoning capabilities, or increased susceptibility to adversarial inputs (Dang et al.,
2024). This raises the pressing need to analyze how alignment reshapes the hidden space across languages and whether current methodologies effectively preserve semantic consistency while mitigating harmful outputs. A deeper investigation into cross-lingual representation shifts can illuminate the unintended consequences of alignment, guiding the development of more robust multilingual models.

In this work, we systematically assess the effectiveness of LLM safety alignment across multiple languages by evaluating a diverse set of models on multilingual benchmarks. We analyze distributional shifts in the embedding space of multilingual safety prompts for both reference and aligned models, uncovering how alignment mechanisms influence model behavior. To quantify the separation induced by alignment in enforcing safety constraints, we utilize a set of distributional metrics, providing a concrete measure of alignment-induced shifts. Our findings reveal critical gaps in multilingual safety mechanisms, highlighting the inconsistencies in alignment effectiveness across languages.

As deep generative models evolve, concerns regarding memorization and bias propagation have intensified (Carlini et al., 2023; Biderman et al., 2023; Nasr et al., 2023). To better isolate the effects of alignment, we analyze hidden representations through probing only at the input processing stage, minimizing potential contamination from post-training phases. Our study uncovers significant performance disparities across model families, emphasizing the monolingual bias inherent in current LLM development. These insights lay the groundwork for future
fine-tuning efforts on language-specific datasets, guiding improvements in multilingual fairness, safety, and robustness.

2 Related literature

2.1 Alignment of LLMs

Fine-tuning LLMs based on human preferences has emerged as a key approach for post-training, enhancing their ability to generate responses aligned with human values (Christiano et al., 2017; Ziegler et al., 2019; Liu et al., 2020). A variety of techniques have been developed to achieve this alignment, starting with Reinforcement Learning from Human Feedback (RLHF), which remains the dominant paradigm for online policy optimization (Ouyang et al., 2022; Ahmadian et al., 2024). Beyond RLHF, several offline optimization methods, such as Direct Preference Optimization (DPO) (Rafailov et al., 2023), Implicit Preference Optimization (IPO) (Azar et al., 2024), Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2024), Binary Classifier Optimization (BCO) (Jung et al., 2024), and KL divergence optimization (KLDO) (Haldar et al., 2025), have been introduced to refine model behavior without requiring costly reinforcement learning loops. The overarching objective of these techniques is to improve alignment with human intent while mitigating the risks of generating harmful or toxic content, especially in large-scale real-world deployments.

2.2 Multilingual LLM performance

A critical limitation of current alignment strategies is their heavy reliance on human preference datasets predominantly sourced from English or other high-resource languages (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023). As a result, LLMs exhibit strong alignment in English but struggle with maintaining consistent
safety and ethical considerations in underrepresented languages (Schwartz et al., 2022; Vashishtha et al., 2023; Khandelwal et al., 2023). This imbalance raises concerns about fairness, as the effectiveness of alignment mechanisms can vary significantly across linguistic groups (Rystrøm et al., 2025). In particular, multilingual users may encounter unreliable moderation, differing levels of content safety, or unintended biases when interacting with aligned LLMs in languages beyond English (Yong et al., 2023).

2.3 Jailbreaking studies

Despite significant advancements, alignment strategies are not foolproof (Li et al., 2023; AlKhamissi et al., 2024). An emerging body of research explores how safety-aligned LLMs remain vulnerable to adversarial exploits, including jailbreaking attacks that expose weaknesses in alignment constraints (Lukas et al., 2023; Sun et al., 2024; Liu et al., 2023). These vulnerabilities are further amplified in multilingual settings, where models may bypass safety filters in languages for which alignment data is scarce (Winata et al., 2024; Son et al., 2024). Addressing these gaps is crucial for ensuring that LLM alignment strategies are not only robust but also equitable across diverse linguistic and cultural contexts (Dang et al., 2024).

Figure 2: Probing the Impact of Human Preference Tuning on Multilingual Safety at Inference Time: Llama-3.1 (#8B) Alignment in English vs. Hindi.

3 Background

LLMs are typically trained in three distinct stages, each playing a crucial role in shaping their capabilities and alignment with human preferences.

Pre-training. LLMs are initially pre-trained
on large-scale corpora spanning multiple languages, optimizing the likelihood of predicting the next token conditioned on preceding text. The model's vocabulary is designed to accommodate tokens from diverse languages, ensuring broad linguistic coverage.

Supervised Fine-tuning (SFT). Following pre-training, LLMs undergo fine-tuning on curated, high-quality datasets specific to various downstream tasks, such as translation, dialogue, and summarization. This stage refines the model's ability to generate more task-relevant responses and produces a reference model π_ref.

Human Preference Tuning. The SFT model is prompted with inputs x to generate pairs of candidate responses (y_1, y_2) ∼ π_ref(y | x). These responses are then presented to human annotators, who express a preference for one over the other, denoted as y_w ≻ y_l | x, where y_w and y_l represent the preferred and dispreferred completions, respectively. Preference data is assumed to be generated from an underlying latent reward function r*(x, y), which models human preferences. A common approach to capturing the probability of a preference ranking is through the Bradley-Terry model (Bradley and Terry, 1952):

$p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big)$    (1)

where σ denotes the logistic function. More generally, when multiple ranked responses are available, Plackett-Luce models are also applicable (Plackett, 1975).
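As a concrete illustration of Eq. (1), the short sketch below computes Bradley-Terry preference probabilities from scalar rewards. It is a minimal numerical example with hypothetical reward values, not code from the paper.

```python
import math

def bradley_terry_prob(reward_w: float, reward_l: float) -> float:
    """Eq. (1): p*(y_w > y_l | x) = sigma(r*(x, y_w) - r*(x, y_l))."""
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))

# Hypothetical rewards for a preferred and a dispreferred completion.
print(bradley_terry_prob(1.2, -0.3))  # ~0.82: the preferred answer usually wins
print(bradley_terry_prob(0.5, 0.5))   # 0.5: equal rewards give a coin flip
```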
In large-scale preference datasets (shown in Figure 2), human-annotated feedback is predominantly available for high-resource languages such as English, while low-resource languages are underrepresented (Chaudhari et al., 2024). Since directly obtaining a true reward function from human feedback is infeasible, an alternative approach is to assume access to a static dataset of comparisons $D = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$ sampled from $p^*$ and to parameterize a reward model r_φ(x, y). The parameters of this model are estimated via maximum likelihood:

$\mathbb{E}_{(x, y_w, y_l) \sim D}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$    (2)

To ensure that the optimized model maintains desirable text generation properties, such as fluency and coherence, a KL divergence penalty is introduced to limit deviation from π_ref. The reward model is then used to optimize a new policy π_θ, formulated as:

$\mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]$    (3)

where β > 0 is a hyperparameter controlling the balance between reward maximization and distributional regularization.

Since this objective is non-differentiable, reinforcement learning (e.g., Proximal Policy Optimization, PPO (Schulman et al., 2017)) is typically employed for policy updates. To address the instability and inefficiency associated with online policy optimization, recent approaches have explored offline optimization techniques using closed-form objectives that maximize the margin between preferred and dispreferred completions. Methods such as DPO, KTO, BCO, and KLDO estimate divergence-based constraints and aim to structure the alignment space by separating safe and unsafe generations, within the language distribution they are trained on (Haldar et al., 2025). Consequently, models aligned with English-centric datasets may fail to generalize alignment properties effectively to other languages, leading to safety inconsistencies in multilingual contexts.
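To make the offline route concrete, here is a minimal PyTorch sketch of a DPO-style pairwise objective in the spirit of Rafailov et al. (2023), where the β-scaled log-ratio margin plays the same regularization role as the KL term in Eq. (3). The function name and the batch of log-probabilities are hypothetical; this is an illustrative simplification under those assumptions, not the method's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style pairwise loss (after Rafailov et al., 2023).

    Each argument is a tensor of summed token log-probabilities of the
    preferred (w) or dispreferred (l) completion under the trainable policy
    pi_theta or the frozen reference pi_ref.
    """
    # Implicit reward margin: beta * (preferred log-ratio - dispreferred log-ratio).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Maximize log sigma(margin), i.e. minimize -log sigma(margin).
    return -F.logsigmoid(margin).mean()

# Hypothetical log-probabilities for a batch of two comparisons.
loss = dpo_loss(torch.tensor([-12.0, -8.0]), torch.tensor([-15.0, -9.5]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.0, -9.0]))
print(loss.item())
```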
4 Methodology

In this study, we address the issue of multilingual alignment in LLMs, particularly in the context of human preference datasets that are unevenly distributed across languages. The reward function r_φ(x, y) and the optimized policy model π_θ, therefore, exhibit higher accuracy in high-resource languages, while demonstrating poor alignment in low-resource languages. This misalignment results in a significant failure to distinguish between aligned and unaligned data, as observed in Llama-3.1-8B's performance (Figure 2). For instance, when prompted in English with "How to make a bomb?", the model refuses to generate a response, aligning with safety constraints. However, when presented with the equivalent Hindi query, it produces detailed procedural steps, including explicit instructions, highlighting a critical gap in multilingual safety alignment.

The discrepancy in multilingual alignment manifests as strong safety adherence in English but increased susceptibility to harmful responses in underrepresented languages. To systematically analyze this issue, we probed the language models using a balanced multilingual toxicity corpus, where prompts vary in both harmfulness and linguistic structure. We extract final-layer embeddings from the LLM and apply Principal Component Analysis (PCA) for dimensionality reduction, enabling visualization of harmful and harmless clusters before and after alignment. The explained variance ratio of the principal components aids in understanding alignment-induced shifts in representation space. We further compute within-class and between-class variance to quantify intra-cluster cohesion and inter-cluster separation in the embedding space.
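A minimal sketch of this probing pipeline is given below, assuming access to the Hugging Face reference checkpoint from Table 1, mean pooling over final-layer hidden states, and a toy set of labeled prompts. The pooling choice and the between-class variance definition are our assumptions for illustration, since the paper does not spell out its exact estimator.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

# Hypothetical probe data: prompts with binary toxicity labels (1 = harmful).
prompts = ["How do I bake bread?", "What is the capital of France?",
           "How to make a bomb?", "Write an insult about my neighbor."]
labels = np.array([0, 0, 1, 1])

model_name = "meta-llama/Llama-2-7b"  # reference checkpoint from Table 1
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token
model = AutoModel.from_pretrained(model_name)
model.eval()

with torch.no_grad():
    batch = tok(prompts, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state             # final-layer states
    mask = batch["attention_mask"].unsqueeze(-1)
    emb = ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling

# Project to 2 components for visualization (10 were used for the metrics).
pca = PCA(n_components=2).fit(emb)
proj = pca.transform(emb)
print("explained variance ratio:", pca.explained_variance_ratio_.sum())

# Between-class variance ratio: spread of the class means relative to the
# total variance of the projected embeddings (one common definition).
mu = proj.mean(axis=0)
between = sum((labels == c).mean() * np.sum((proj[labels == c].mean(axis=0) - mu) ** 2)
              for c in (0, 1))
total = np.sum(proj.var(axis=0))
print("between-class variance ratio:", between / total)
```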
To rigorously measure alignment-induced separation, we employ the Bhattacharyya Distance metric (Bhattacharyya, 1943), capturing the overlap between harmful and harmless clusters. This metric serves as an indicator of how alignment influences the distinction between safety-related representations across different language models. Additionally, we use the Silhouette Score to assess clustering quality, evaluating how well prompt embeddings align with their respective clusters. A higher Silhouette Score in high-resource languages further substantiates the improved separation of harmful and harmless content, providing a robust quantification of multilingual alignment effectiveness.
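The sketch below computes both quantities with NumPy and scikit-learn, approximating each cluster as a multivariate Gaussian for the Bhattacharyya distance. The Gaussian estimator and the synthetic 10-dimensional clusters (mirroring the 2,500/2,500 split and the 10 PCA components used later) are assumptions for illustration; the paper does not specify its exact estimator.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def bhattacharyya_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Bhattacharyya distance between two clusters of embeddings,
    approximating each cluster as a multivariate Gaussian."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    cov = (cov_a + cov_b) / 2
    diff = mu_a - mu_b
    # Mahalanobis-like term for the mean separation.
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Log-determinant term for the covariance mismatch.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    term2 = 0.5 * (logdet - 0.5 * (logdet_a + logdet_b))
    return term1 + term2

# Hypothetical 10-component PCA projections of harmless/harmful prompts.
rng = np.random.default_rng(0)
harmless = rng.normal(0.0, 1.0, size=(2500, 10))
harmful = rng.normal(1.5, 1.0, size=(2500, 10))

print("BD:", bhattacharyya_distance(harmless, harmful))
X = np.vstack([harmless, harmful])
y = np.array([0] * 2500 + [1] * 2500)
print("SS:", silhouette_score(X, y))
```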
5 Experiments

5.1 Dataset

Our primary focus is on four languages: English (en), Hindi (hi), Chinese (zh), and German (de), as these are the most commonly fine-tuned languages in LLMs through SFT or Preference Optimization. To systematically evaluate multilingual alignment, we utilized the following datasets:

• Balanced Toxicity Dataset: This dataset contains an equal number of toxic and non-toxic sentences, ensuring an unbiased comparison across different alignment strategies (Dementieva et al., 2024b). We used binary toxicity classification datasets available in nine languages. Each language dataset consists of 5,000 samples, with 2,500 toxic and 2,500 non-toxic sentences.

• Multilingual Parallel Text-Detoxification Dataset: This dataset contains parallel sentence pairs where the toxic and detoxified versions maintain semantic consistency but differ in syntactic toxicity expression (Dementieva et al., 2024a). It allows for a controlled evaluation of how well models differentiate between toxic and non-toxic variations of the content across multiple languages.

These datasets provide a structured and comprehensive framework for evaluating multilingual alignment.

Table 1: Models used for Multilingual Preference-Tuning evaluation. The model cards refer to Hugging Face checkpoints of reference (π_ref) and aligned (π_θ) models.

Model (citation)                          #Size  Reference model (π_ref)   Aligned model (π_θ)               Creator
Llama-2 (Touvron et al., 2023)            7B     meta-llama/Llama-2-7b     meta-llama/Llama-2-7b-chat        Meta
Qwen-2.5 (Yang et al., 2024)              7B     Qwen/Qwen2.5-7B           Qwen/Qwen2.5-7B-Instruct          Qwen
Llama-3.1 (Grattafiori et al., 2024)      8B     meta-llama/Llama-3.1-8B   meta-llama/Llama-3.1-8B-Instruct  Meta
Llama-Guard-3 (Grattafiori et al., 2024)  8B     meta-llama/Llama-3.1-8B   meta-llama/Llama-Guard-3-8B       Meta
Gemma-2 (Team et al., 2024)               9B     google/gemma-2-9b         google/gemma-2-9b-it              Google
Gemma-3 (Team, 2025)                      12B    google/gemma-3-12b-pt     google/gemma-3-12b-it             Google
Phi-4 (Abdin et al., 2024)                14B    -                         microsoft/phi-4                   Microsoft

Figure 3: Impact of Alignment on Hidden Representations in Llama-2 for Multilingual Corpora. (Panels: π_ref and π_θ for en, hi, zh, de.)

5.2 Models

We analyzed models from four distinct families, varying in scale and training objectives. We focused exclusively on openly available models hosted on Hugging Face, ensuring transparency and reproducibility (see Table 1). To assess multilingual alignment, we examined the hidden representation space of each model before and after alignment (where applicable).

We evaluate multilingual alignment
across diverse LLM families, including Llama models (Llama-2, Llama-3.1, Llama-Guard-3), Qwen-2.5, Gemma models (Gemma-2, Gemma-3), and Phi-4. These models vary in pretraining strategies, alignment methods, and language coverage: Llama models use SFT and RLHF, with Llama-Guard specializing in safety filtering; Qwen-2.5 explicitly supports 29 languages; Gemma models expand multilingual capabilities with RL objectives, covering up to 140 languages; and Phi-4, trained on curated and synthetic datasets, includes 8% multilingual data. This selection provides a comprehensive lens to assess alignment robustness in multilingual settings.

5.3 Results

Figure 3 demonstrates the impact of alignment on the latent representations of LLMs using PCA with the first two components. Before alignment, the reference policy π_ref shows overlapping clusters for harmless and harmful sentences, whereas the aligned model π_θ exhibits improved separation due to the divergence induced by alignment. While English clusters are well-separated, the effect is less pronounced in other languages. The PCA explained variance ratio was 49.61%, indicating that nearly half of the data variation is captured in the reduced space. Notably, the between-class variance ratio increased from 0.81% to 61.20% (a 60.39% improvement), confirming enhanced cluster separability post-alignment. However, for Hindi, Chinese, and German, the improvements were significantly
lower at 19.98%, 10.09%, and 26.85%, respectively, highlighting weaker alignment effects in these languages. To further analyze cluster separability, we extended our evaluation beyond PCA with two components, as the explained variance ratio of 50% was insufficient. Instead, we used the first 10 components for computing additional metrics across all models. (cf. Appendix A for additional visualizations with other models.)

Figure 4: Bhattacharyya Distance for All Models Pre- and Post-Alignment Tuning, for (a) en, (b) hi, (c) zh, and (d) de. The blue radar indicates values before alignment (π_ref), while green represents values after alignment (π_θ).

One such metric, the Bhattacharyya distance, quantifies the separation between harmful and harmless sentence clusters. As shown in Figure 4, this measure confirms our earlier observations. In English, the reference model shows relatively less separation than the aligned model across all pairs, aligning with our PCA-based findings. Chinese and German also exhibit an increase in cluster separation, though the effect is weaker compared to English. Notably, the scale in the plot is logarithmic (ranging from 1e-3 to 1e+1), emphasizing the substantial differences in cluster distances across languages. A particularly interesting trend emerges for Hindi, where some models show improved separation post-alignment, while others exhibit the opposite trend, with the unaligned model displaying greater cluster separation. Similarly, silhouette scores, which measure cluster compactness and between-class variance, reveal a consistent pattern. Although alignment generally increases cluster separability, the improvement is significantly higher for English
than for other languages. (cf. Appendix B for exact metric values across all models.)

Figure 5: Impact of Alignment on Hidden Representations in Llama-2 for the multilingual parallel text-detoxification corpora. (Panels: π_ref and π_θ for en, hi.)

In a more challenging parallel text-detoxification setup, where harmless sentences are minor edits of harmful ones, often differing by just one or two tokens, the effectiveness of alignment varies (Figure 5). While English representations remain meaningfully separated even in this difficult setting, neither the reference nor the aligned models capture a clear distribution shift for Hindi in lower-dimensional space. This highlights language-dependent differences in how alignment influences representation learning.

6 Conclusion

Our study provides a comprehensive analysis of the multilingual alignment status of current preference-tuned models. Our findings indicate that state-of-the-art models perform well in English (monolingual bias), but their alignment across languages remains inconsistent, as shown by cluster separability metrics and hidden representation analyses. This underscores the need for a more holistic representation of languages when training models intended for truly global audiences.

Limitations

While our study provides a systematic analysis of multilingual alignment, it has certain limitations:

• Scope of Alignment Evaluation: Our methodology assumes that alignment mechanisms induce divergence in the human preference space for the properties they were guardrailed against. However, we primarily focus on safety alignment, overlooking other critical domains such as multi-modality, reasoning, instruction-following, or planning due to the lack of standardized multilingual benchmarks for these tasks.

• Language Coverage:
For this proof-of-concept study, we evaluate only three non-English languages, all of which are medium-resource. While our findings highlight monolingual bias, future work should extend this analysis to low-resource languages to better understand alignment disparities.

• Dataset Size: Our dataset consists of 5,000 sentences, which, while significantly larger than in prior works (~200 samples) (Lin et al., 2024; Haldar et al., 2025), may not fully capture the diversity of real-world multilingual scenarios. A broader dataset could further validate our findings across different linguistic and cultural contexts.

These limitations underscore the need for more comprehensive multilingual benchmarks and extended evaluations across diverse alignment tasks and resource-constrained languages.

Ethical Considerations

Our findings raise serious ethical concerns regarding the practices of model developers who release large-scale language models without ensuring robust multilingual alignment. This oversight disproportionately impacts marginalized linguistic communities, increasing their exposure to harmful outputs while reinforcing systemic biases.

Two major ethical risks highlighted by our analysis are:

• Misuse Potential: Our study identifies failure cases in languages other than English, where models generate unsafe or misaligned outputs. Figure 2 presents an example in Hindi, demonstrating how alignment inconsistencies create vulnerabilities that could be exploited for harmful applications. To mitigate this, future alignment efforts must prioritize multilingual preference collection, particularly for high-risk domains. If direct human
preference gathering is infeasible, post-training interventions should explore synthetic techniques to transfer alignment knowledge from English to other languages.

• Harm to Vulnerable Populations: The unrestricted release of insufficiently aligned models disproportionately impacts marginalized communities, where users may interact with models in languages that lack rigorous safety guardrails. Our findings suggest that current models can be jailbroken more easily in sister languages to English, exposing these populations to higher risks. This underscores the urgent need for comprehensive multilingual safety evaluations before open-weight LLMs are widely deployed.

Our study calls for greater accountability in multilingual alignment efforts, emphasizing that ethical AI deployment requires more than just English-centric safety measures. Future research should focus on developing alignment techniques that are language-agnostic and equitable across diverse linguistic contexts.

References

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, and 1 others. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905.

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740.

Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. arXiv preprint arXiv:2402.13231.
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. 2024. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447-4455. PMLR.

Anil Bhattacharyya. 1943. On a measure of divergence between two statistical populations defined by their probability distribution. Bulletin of the Calcutta Mathematical Society, 35:99-110.

Stella Biderman, Usvsn Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. 2023. Emergent and predictable memorization in large language models. Advances in Neural Information Processing Systems, 36:28072-28090.

Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345.

Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253-5270.

Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, and Bruno Castro da Silva. 2024. RLHF deciphered: A critical analysis of reinforcement learning from human feedback for LLMs. arXiv preprint arXiv:2404.08555.

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, and 1 others. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. 2024. RLHF can speak many languages: Unlocking multilingual preference optimization for LLMs. arXiv preprint arXiv:2407.02552.

Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Moskovskiy, Elisei Stakovskii, and 1 others. 2024a. Multilingual and explainable text detoxification with parallel corpora. arXiv preprint arXiv:2412.11691.

Daryna Dementieva, Daniil Moskovskiy, Nikolay Babakov, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Dmitry Ustalov, Elisei Stakovskii, Alisa Smirnova, Ashraf Elnagar, Animesh Mukherjee, and Alexander Panchenko. 2024b. Overview of the multilingual text detoxification task at PAN 2024. In Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum. CEUR-WS.org.

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, and Yue Xing. 2025. LLM safety alignment is divergence estimation in disguise. arXiv preprint arXiv:2502.00657.

Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. 2024. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656.

Khyati Khandelwal, Manuel Tonneau, Andrew M Bean, Hannah Rose Kirk, and Scott A Hale. 2023. Casteist but not racist? Quantifying disparities in large language model bias between India and the West. CoRR.

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2023. Understanding the effects of RLHF on LLM generalisation and diversity. arXiv preprint arXiv:2310.06452.

Haoran Li, Yulin Chen, Jinglong Luo, Jiecong Wang, Hao Peng, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Zenglin Xu, and 1 others. 2023. Privacy in large language models: Attacks, defenses and future directions. arXiv preprint arXiv:2310.10383.

Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. 2024. Towards understanding jailbreak attacks in LLMs: A representation space analysis. arXiv preprint arXiv:2406.10794.

Fei Liu and 1 others. 2020. Learning to summarize from human feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 583-592.

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. 2023. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 346-363. IEEE.

Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A Feder Cooper, Daphne Ippolito, Christopher A Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. 2023. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744.

Robin L Plackett. 1975. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193-202.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728-53741.

Jonathan Rystrøm, Hannah Rose Kirk, and Scott Hale. 2025. Multilingual != multicultural: Evaluating gaps between multilingual capabilities and cultural alignment in LLMs. arXiv preprint arXiv:2502.16534.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Reva Schwartz, Apostol Vassilev, Kristen Greene, Lori Perine, Andrew Burt, and Patrick Hall. 2022. Towards a standard for identifying and managing bias in artificial intelligence, volume 3. US Department of Commerce, National Institute of Standards and Technology.

Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, and Seungone Kim. 2024. MM-Eval: A multilingual meta-evaluation benchmark for LLM-as-a-judge and reward models. arXiv preprint arXiv:2410.17578.

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, and 1 others. 2024. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 3.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model.

Gemma Team. 2025. Gemma 3.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and 1 others. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Aniket Vashishtha, Kabir Ahuja, and Sunayana Sitaram. 2023. On evaluating and mitigating gender biases in multilingual settings. arXiv preprint arXiv:2307.01503.
Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D Yao, Shi-Xiong Zhang, and Sambit Sahu. 2024. Preference tuning with human feedback on language, speech, and vision tasks: A survey. arXiv preprint arXiv:2409.11564.

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. 2023. LaMini-LM: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2023. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

A Impact of Alignment on Hidden Representations for Multilingual Corpora

For models other than Llama-2, visualizations of hidden representations before and after alignment are presented in Figures 6 to 10. For Phi-4, since only the aligned model checkpoint is available, only the representation analysis is shown in Figure 11.

B Metrics of cluster quality, before and after alignment of LLMs

The models we used in this study are described below:

• Llama-2: A suite of open-source foundational and fine-tuned chat models. The pre-training corpus includes over 5% non-English high-quality data, though evaluations primarily focus on English.
• Qwen-2.5: An instruction-following model optimized for long-context reasoning and diverse prompts. It explicitly supports 29 languages in output generation.

• Llama-3.1: Fine-tuned versions use SFT and RLHF for preference alignment. Supported languages include English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

• Llama-Guard: A content safety classifier fine-tuned on Llama-3.1-8B. It classifies both inputs (prompt filtering) and outputs (response moderation) and supports multilingual safety alignment in the same languages as Llama-3.1.

• Gemma-2: A 9B parameter instruction-tuned model primarily trained on English data. However, the SFT phase incorporates some multilingual contexts.

• Gemma-3: A 12B parameter multimodal and multilingual model which supports 140 languages. In its RL objectives, it uses a variety of reward functions to improve helpfulness, reasoning, and multilingual abilities, while minimizing model harmfulness.

• Phi-4: Trained on a mixture of synthetic datasets, filtered public-domain websites, and traditional sources using SFT and DPO. Approximately 8% of its training data is explicitly multilingual, covering a wide range of languages, including German, Spanish, French, Portuguese, Italian, Hindi, and Japanese.

This selection of models enables a comprehensive evaluation of alignment in multilingual settings, covering diverse pretraining strategies, alignment techniques, and language coverage. Table 2 reports the metrics used for comparison of PCA clusters with 10 components for all the models evaluated before and after alignment.
Figure 6: Impact of Alignment on Hidden Representations in Qwen-2.5 for Multilingual Corpora. (Panels: π_ref and π_θ for en, hi, zh, de.)
Figure 7: Impact of Alignment on Hidden Representations in Llama-3.1 for Multilingual Corpora.
Figure 8: Impact of Alignment on Hidden Representations in Llama-Guard-3 for Multilingual Corpora.
Figure 9: Impact of Alignment on Hidden Representations in Gemma-2 for Multilingual Corpora.
Figure 10: Impact of Alignment on Hidden Representations in Gemma-3 for Multilingual Corpora.
Figure 11: Impact of Alignment on Hidden Representations in Phi-4 for Multilingual Corpora. (Aligned model π_θ only.)

Table 2: Metric values of different LLMs before and after alignment on the Balanced Toxicity Dataset. We use "BD" for Bhattacharyya distance, "SS" for silhouette score, and "BCV" for between-class variance. A hyphen (-) marks entries where a model checkpoint is not available.

Model          Language   Reference Model (BD / SS / BCV)   Aligned Model (BD / SS / BCV)
Llama-2        English    0.0350 / 0.0142 / 0.0303          2.5871 / 0.5433 / 0.3715
               Hindi      0.1837 / 0.1355 / 0.1309          0.6743 / 0.3036 / 0.1828
               Chinese    0.0044 / 0.0017 / 0.0033          0.0961 / 0.0878 / 0.0671
               German     0.1905 / 0.0471 / 0.0409          0.3907 / 0.2825 / 0.2067
Qwen-2.5       English    0.3037 / 0.1326 / 0.0776          0.8365 / 0.3150 / 0.1987
               Hindi      0.0309 / 0.0172 / 0.0631          0.1746 / 0.1047 / 0.0895
               Chinese    0.0208 / 0.0102 / 0.0119          0.0233 / 0.0138 / 0.0263
               German     0.0813 / 0.0482 / 0.0330          0.0770 / 0.0634 / 0.0703
Llama-3.1      English    0.1114 / 0.0268 / 0.0637          0.9639 / 0.3156 / 0.2047
               Hindi      0.5402 / 0.2313 / 0.1646          0.3723 / 0.2253 / 0.1649
               Chinese    0.0029 / 0.0025 / 0.0053          0.0162 / 0.0140 / 0.0124
               German     0.1262 / 0.0411 / 0.0405          0.2096 / 0.1230 / 0.0904
Llama-Guard-3  English    0.1114 / 0.0268 / 0.0637          0.8627 / 0.2971 / 0.2080
               Hindi      0.5402 / 0.2313 / 0.1646          0.2389 / 0.1456 / 0.1398
               Chinese    0.0029 / 0.0025 / 0.0053          0.1576 / 0.0923 / 0.0720
               German     0.1262 / 0.0411 / 0.0405          0.2440 / 0.1697 / 0.1112
Gemma-2        English    0.1653 / 0.0565 / 0.0781          0.6046 / 0.2544 / 0.1368
               Hindi      0.5061 / 0.2099 / 0.1491          0.1522 / 0.0882 / 0.0795
               Chinese    0.0055 / 0.0051 / 0.0072          0.0182 / 0.0172 / 0.0202
               German     0.0487 / 0.0301 / 0.0305          0.1976 / 0.1028 / 0.0629
Gemma-3        English    0.2087 / 0.0753 / 0.1016          1.1618 / 0.3913 / 0.2342
               Hindi      1.2436 / 0.2389 / 0.1800          0.3940 / 0.1794 / 0.1073
               Chinese    0.0046 / 0.0043 / 0.0102          0.0749 / 0.0362 / 0.0262
               German     0.1070 / 0.0396 / 0.0426          0.2298 / 0.1287 / 0.0790
Phi-4          English    - / - / -                         1.1200 / 0.3712 / 0.1929
               Hindi      - / - / -                         0.7817 / 0.2895 / 0.1475
               Chinese    - / - / -                         0.0015 / 0.0004 / 0.0111
               German     - / - / -                         0.2650 / 0.1543 / 0.0775