Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages

Summary

This paper introduces MLA-BiTe, a framework for automated and augmented evaluation of social biases in Large Language Models (LLMs) across high- and low-resource languages. It addresses the challenge of manually translating and paraphrasing bias test templates for multilingual assessment, which is time-consuming and requires native language expertise. MLA-BiTe uses LLMs to automate translation and paraphrasing, enabling scalable and inclusive bias testing. The study evaluates four state-of-the-art LLMs in six languages—English, Spanish, French, German, Catalan, and Luxembourgish—across seven sensitive categories. Results show that LLM-based translation and paraphrasing effectively augment test templates, with paraphrasing before translation yielding marginally better outcomes. The framework reveals that low-resource languages exhibit greater variability and higher bias detection rates compared to high-resource languages, particularly in nuanced categories like Politics and Racism. The study also finds that model performance varies significantly by language and bias category, emphasizing the need for case-by-case model selection. Future work includes expanding language coverage, integrating image generation, and improving answer processing to enhance evaluation robustness.

Mind the Language Gap: Automated and Augmented Evaluation of
Bias in LLMs for High- and Low-Resource Languages
Alessio Buscemi1, Cédric Lothritz1, Sergio Morales2, Marcos Gomez-Vazquez1,
Robert Clarisó2, Jordi Cabot1,3, Germán Castignani1
1Luxembourg Institute of Science and Technology
2Universitat Oberta de Catalunya
3University of Luxembourg
{alessio.buscemi, cedric.lothritz, marcos.gomez, jordi.cabot, german.castignani}@list.lu
{smoralesg, rclariso}@uoc.edu
Abstract
Large Language Models (LLMs) have exhibited impressive natural language processing
capabilities but often perpetuate social biases inherent in their training data. To address this, we
introduce MultiLingual Augmented Bias Testing (MLA-BiTe), a framework that improves prior
bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages
automated translation and paraphrasing techniques to support comprehensive assessments across
diverse linguistic settings. In this study, we evaluate the effectiveness of MLA-BiTe by testing
four state-of-the-art LLMs in six languages—including two low-resource languages—focusing on
seven sensitive categories of discrimination.
1 Introduction
Large Language Models (LLMs) have become integral to modern Natural Language Processing
(NLP) applications, demonstrating remarkable capabilities in tasks such as machine translation [33],
text generation [6], and dialogue systems [26]. Despite these successes, a growing body of research
indicates that LLMs can exhibit harmful social biases, including stereotypes and discriminatory
attitudes. Such biases can arise from historical and cultural prejudices embedded in the data used
to train these models [2, 4, 11, 23, 24, 30, 36, 3, 8].
Recent work underscores that social biases in LLMs can manifest in various forms, such as
racist, sexist, or homophobic content [22]. When deployed at scale, these biases risk perpetuating
stereotypes and marginalizing vulnerable communities, raising ethical concerns and emphasizing the
need for bias mitigation strategies [35, 16]. While significant progress has been made in quantifying
and reducing biases for high-resource languages like English, cross-lingual investigations reveal that
biases also affect lesser-resourced languages, often in ways that are more difficult to detect and
mitigate [14, 7].
Previous frameworks for evaluating social biases in generative AI systems have primarily focused
on single-language settings, limiting their applicability in global and multilingual contexts. However,
bias in AI systems can manifest differently across languages and cultures, making it essential to assess
models in a linguistically diverse manner. There is a growing need for tools that allow non-technical
stakeholders—such as Human Resources departments, Ethics Committees, and Diversity & Inclusion
officers—to evaluate how AI systems align with their values across different languages. Enabling
arXiv:2504.18560v1 [cs.CL] 19 Apr 2025
such inclusive and multilingual assessments is a crucial step toward fostering more trustworthy and
equitable AI systems.
Ensuring the fairness of AI systems in multilingual and culturally diverse environments requires
systematic evaluation across a broad spectrum of languages, including low-resource and regionally co-
official ones. However, the development of bias evaluation benchmarks in multiple languages remains
a significant challenge, particularly when non-technical stakeholders are tasked with authoring or
validating prompts in languages they do not actively use. This issue is especially pronounced in
settings where official languages differ from those predominantly used in professional contexts. For
instance, while Luxembourgish is an official language in Luxembourg, French and English are more
commonly employed in the workplace. Similarly, Catalan is co-official in parts of Spain, yet not all
professionals are proficient in its use. Analogous situations arise in countries such as South Africa
and India, where languages like Zulu, Xhosa, Maithili, or Konkani hold official status but are often
underrepresented in administrative and corporate environments.
The manual translation and paraphrasing of prompts to ensure semantic consistency and cultural
appropriateness across languages is both time-consuming and difficult to scale. To address this
limitation, we propose leveraging LLMs to automate these tasks. Specifically, we investigate whether
LLMs can reliably perform translation and paraphrasing in a way that enables the generation
of linguistically and culturally appropriate test cases. If effective, this approach would facilitate
scalable and inclusive multilingual bias evaluations, while reducing dependency on native language
expertise and enabling broader participation by non-technical stakeholders.
This paper introduces MLA-BiTe, a framework designed to enhance existing bias evaluation
methods by supporting systematic multilingual bias testing. MLA-BiTe is built to operate on the
input generated by Language Bias Testing (LangBiTe) [19], but it is flexible enough to be adapted
for use with other bias detection systems. To guide our study, we focus on two primary research
questions:
1.0.1 RQ1.
Can LLM-based translation and paraphrasing effectively serve as a method to augment test templates
in multiple languages, and if so, which ordering of these steps yields the most reliable expansions?
1.0.2 RQ2.
Based on the hypothesis that LLM-based translation and paraphrasing augmentation effectively
enables multilingual bias testing, do low-resource languages exhibit more biases than high-resource
languages?
To address RQ1, we leverage In-Context Learning (ICL) capabilities of LLMs to expand the
pool of languages that LangBiTe supports and systematically generate paraphrases of existing test
templates. By preserving their semantic meaning, we ensure consistency when comparing different
augmentation strategies (i.e., paraphrasing then translating vs. translating then paraphrasing).
To investigate RQ2, we then compare the outcomes of these augmented test templates across
both high-resource and low-resource languages. By integrating automated translation and prompt
augmentation, MLA-BiTe enables a broader analysis of how biases manifest in diverse linguistic
contexts. This is particularly impactful in enterprise or public-sector settings, where organizations
must meet multilingual obligations but lack technical or linguistic capacity to do so manually.
The contributions of our work can be summarized as follows:
1. We present MLA-BiTe, which automates the translation and augmentation of templates for
testing social biases in LLMs.
2. We conduct a series of assessments to evaluate whether LLM-based translation and paraphrasing
offers a reliable strategy for augmenting test templates in multiple languages (addressing
RQ1), and how the ordering of paraphrasing and translation affects these outcomes.
3. We examine how low-resource languages (Catalan and Luxembourgish) compare to high-
resource languages (English, Spanish, French, and German) in terms of detected biases
(addressing RQ2).
2 Background and Related Work
This section explores the limitations of current approaches in detecting biases across languages. It
also provides a concise overview of LangBiTe, i.e. the target bias-testing framework which serves as a
blueprint for MLA-BiTe, highlighting its utility in multilingual settings and its current shortcomings.
Finally, this section briefly discusses the state of the art for augmenting datasets and generating
synthetic data to support bias detection.
2.1 Bias detection in text-to-text models
LLMs have achieved widespread popularity and are becoming pervasive for text classification, content
generation, language translation, and text summarization, among many other tasks. However,
because their training typically relies on large datasets derived from web crawls, they often fail to
address ethical concerns and tend to mirror biases prevalent on the Internet [2, 4, 11, 23, 24, 30, 36,
3, 8]. In this regard, the European Union AI Act [32] requires EU members to establish guidelines
and procedures for developers to avoid 'discriminatory impacts and unfair biases prohibited by
Union or national law' in their proprietary AI software.
There are many recent research studies proposing different approaches and prompt datasets
for detecting bias in text-to-text LLMs [1, 9, 10, 31, 13, 17, 29, 34, 37]. Nevertheless, most of
the testing prompts are written in English, and few target LLMs in other languages (e.g.,
[19, 27]). Additionally, LLMs are sensitive to prompt variations, so using a limited set of prompts
may affect the effectiveness of the evaluation [12].
2.2 LangBiTe: An open-source tool to automate bias testing
LangBiTe follows a sequential process for detecting bias in text-to-text models, based on a set
of ethical concerns (e.g., gender discrimination, racism) and sensitive communities that could
potentially be favored or harmed (e.g., men and women, White and Black people). LangBiTe
automatically: (1) selects a subset of prompt templates from a prompt library as per those ethical
concerns; (2) for each prompt template, generates a test case addressing each of the sensitive
communities; (3) prompts the LLMs under testing; and (4) builds reports with insights derived
from the LLMs responses.
LangBiTe includes three curated prompt template libraries in English, Spanish, and Catalan, each
containing over 300 prompts and templates for detecting ageism, gender discrimination,
LGBTQIA+phobia, political preferences, religious bias, racism, and xenophobia. Users can customize
and build their own prompt template libraries. Every new template must target an ethical concern,
include an optional prefix to precede the core text of the prompt, contain the text of the prompt
itself, and output formatting instructions for the LLM response. Moreover, a template has an
associated oracle that provides an expected valid, non-biased response from an LLM.
A template may include placeholders, in the format {<COMMUNITY>(<NUM>)?}, to be instantiated
with the ethical concern’s communities. The <NUM> part is included in templates that evaluate
several sensitive communities of the same ethical concern (e.g., “{SEXUAL_ORIENTATION1} and
{SEXUAL_ORIENTATION2} people should have the same civil rights”).
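The placeholder mechanism described above can be illustrated with a minimal Python sketch. It is not LangBiTe's implementation: the regular expression, the `instantiate` helper, and the choice to enumerate unordered combinations of communities (to match the n!/(p!(n−p)!) count given in Section 3) are assumptions made for illustration only.

```python
import re
from itertools import combinations

# Assumed pattern for placeholders such as {GENDER1} or {SEXUAL_ORIENTATION2}.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)(\d*)\}")


def instantiate(template: str, communities: list[str]) -> list[str]:
    """Return one prompt per combination of communities (hypothetical helper)."""
    # Distinct placeholders in order of first appearance.
    slots = list(dict.fromkeys(m.group(0) for m in PLACEHOLDER.finditer(template)))
    prompts = []
    for combo in combinations(communities, len(slots)):
        prompt = template
        for slot, community in zip(slots, combo):
            prompt = prompt.replace(slot, community)
        prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    template = (
        "{SEXUAL_ORIENTATION1} and {SEXUAL_ORIENTATION2} people "
        "should have the same civil rights"
    )
    for prompt in instantiate(template, ["heterosexual", "homosexual", "bisexual"]):
        print(prompt)
```

With three communities and two placeholders, the sketch yields three prompts, one per unordered pair.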
The construction of the original English template library followed a process involving several
stakeholders from different expertise backgrounds. Later, it was manually translated into Spanish
and Catalan. As such, this procedure requires the participation of native speakers of the languages
to be supported by LangBiTe, hindering its scalability.
3 Methodology
MLA-BiTe operates exclusively on inputs provided to the underlying framework, such as the
PromptTemplates employed by LangBiTe. Because its core logic is decoupled from the specific
framework implementation, MLA-BiTe can readily accommodate inputs from other prompt-based
bias evaluation frameworks without necessitating alterations to their internal code structures.
Specifically, within LangBiTe, translation and paraphrasing procedures are implemented at the
template level, not at the individual prompt level—that is, prior to the instantiation of template
placeholders with targeted communities. This choice is justified because a single template with p
placeholders intended for filling from a set of n target communities can yield up to
$\frac{n!}{p!\,(n-p)!}$ distinct
test prompts. Performing translation and paraphrasing at the template level rather than at the
prompt level significantly enhances the efficiency and scalability of the approach.
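To make the efficiency argument concrete, the following sketch compares the number of LLM calls needed when translating at the template level with the number needed at the prompt level. The function names and the example figures (300 templates, 5 target languages, 5 communities, 2 placeholders) are illustrative assumptions, not values reported in this study.

```python
from math import comb


def prompts_per_template(n_communities: int, n_placeholders: int) -> int:
    """Upper bound n! / (p! (n - p)!) on prompts derived from one template."""
    return comb(n_communities, n_placeholders)


def llm_calls(n_templates: int, n_languages: int, n_communities: int,
              n_placeholders: int, per_prompt: bool) -> int:
    """Translation calls needed at the prompt level or at the template level."""
    units = prompts_per_template(n_communities, n_placeholders) if per_prompt else 1
    return n_templates * n_languages * units


if __name__ == "__main__":
    # Hypothetical workload, chosen only to show the ratio between the two options.
    print(llm_calls(300, 5, 5, 2, per_prompt=False))
    print(llm_calls(300, 5, 5, 2, per_prompt=True))
```

Under these assumed figures, working at the prompt level multiplies the number of calls by the number of prompts each template can produce.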
Moreover, translating and paraphrasing at the individual prompt level would result in prompts
derived from the same template being syntactically divergent. This divergence would complicate
the interpretation of results, making it challenging to discern whether a failed test prompt is due to
variations in the ordering of community placeholders or subtle syntactic differences. By applying
operations at the template level, the approach ensures that generated test prompts are syntactically
uniform, thereby enhancing the comparability and interpretability of the evaluation outcomes.
Algorithm 1 outlines the overall workflow of MLA-BiTe. The tool takes as input a list of
PromptTemplates (PT), an LLM that acts as both translator and paraphraser, the set of target
languages L for translating the original PT, and the desired number of paraphrases P for each
translation. It is worth noting that separate LLMs could be used for translation and paraphrasing.
However, for simplicity, this work assumes the use of a single LLM for both tasks.
Initially, the translator is set up using the LLM, and the paraphraser is configured with the same
LLM, along with the specified number of desired paraphrases P (lines 1–2). The list of generated
PromptTemplates, GPT, is initialized as empty (line 3). Next, each pt in PT is translated by the
translator into each language in L (line 5). The translated output, transl_pt, is then paraphrased P
times using the paraphraser (line 6). Please refer to Section 4.6 for additional information regarding
the choice of this pipeline. Finally, the newly generated PromptTemplates are appended to GPT
(line 7).
It is important to note that if L is empty, meaning no translation is needed, transl_pt will be
identical to pt. Similarly, if no augmentation is required (i.e., P = 0), paraph_pt will be the same as
transl_pt.
Section 3.1 and Section 3.2 provide additional details for, respectively, the translation and
paraphrasing steps.
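The workflow described above can be sketched in Python as follows. This is a simplified illustration rather than the MLA-BiTe code: templates are reduced to plain strings, `translate` and `paraphrase` are hypothetical callables standing in for LLM calls, and the `source_language` label used when no translation is requested is an assumption.

```python
from typing import Callable

Translate = Callable[[str, str], str]         # (text, language) -> translated text
Paraphrase = Callable[[str, int], list[str]]  # (text, count) -> list of paraphrases


def run_pipeline(templates: list[str], languages: list[str], n_paraphrases: int,
                 translate: Translate, paraphrase: Paraphrase,
                 source_language: str = "en") -> list[tuple[str, str]]:
    """Translate each template, then paraphrase each translation."""
    generated: list[tuple[str, str]] = []
    for template in templates:
        if languages:
            translated = {lang: translate(template, lang) for lang in languages}
        else:
            # No target languages: the template passes through unchanged.
            translated = {source_language: template}
        for lang, text in translated.items():
            # P = 0 means no augmentation: keep the translation as is.
            variants = paraphrase(text, n_paraphrases) if n_paraphrases > 0 else [text]
            generated.extend((lang, variant) for variant in variants)
    return generated
```

Passing the two steps as callables keeps the sketch independent of any particular LLM API and mirrors the remark that separate models could be used for translation and paraphrasing.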
Algorithm 1 MLA-BiTe pipeline
Input: PT: PromptTemplates, LLM: an LLM, L: set of languages to translate into, P: number of paraphrases
Output: GPT: generated PromptTemplates
1: translator ← initialize_translator(LLM)
2: paraphraser ← initialize_paraphraser(LLM, P)
3: GPT ← ()
4: for pt in PT do
5:     transl_pt ← translate(translator, pt, L)
6:     paraph_pt ← paraphrase(paraphraser, transl_pt, P)
7:     GPT.append(paraph_pt)
8: end for
Algorithm 2 Translation
Input: translator, pt: PromptTemplate, L: set of languages to translate into
Output: transl_pt: translated PromptTemplate
1: transl_pt ← {}
2: for l in L do
3:     t_template ← translator.translate(l)
4:     AT ← translator.affixTranslator(l)
5:     t_prefix ← AT.translate(pt.prefix)
6:     t_suffix ← AT.translate(pt.suffix)
7:     EVT ← translator.expectedValueTranslator(l)
8:     t_expVal ← EVT.translate(pt.expectedValue)
9:     transl_pt[l] ← [t_prefix, t_template, t_suffix, t_expVal]
10: end for
3.1 Translation
Algorithm 2 describes in detail the translation step. First, the output dictionary, transl_pt, is
initialized (line 1). Then, the translation into each l of L is treated independently (line 2-9). The
template is the first to be translated (line 3). The prompt used for the translation is reported and
described in section 8.
The next step is to initialize an auxiliary component of the translator, the affixTranslator (line
4). As outlined in [19], templates can be preceded by a prefix and followed by a suffix, which
encapsulate the text provided to the LLM and help specify the expected output. The affixTranslator
is responsible for translating these elements to align with the language of the template. Since neither
the prefix nor the suffix possesses unique features or placeholders, the affixTranslator is tasked with
performing a straightforward translation – also with the recommendation of ensuring the precise
semantic meaning is preserved (line 5-6).
Prefixes and suffixes are often consistent across multiple templates. To optimize efficiency, the
affixTranslator does not translate them repeatedly. Instead, it checks for an existing dictionary
mapping translations from the original language to the target language. If the entry is found, it
applies the stored translation; if not, it generates the translation, adds it to the dictionary, and
reuses it as needed.
This approach reduces costs—specifically by avoiding redundant inference for the same task—and
enhances consistency in the output for templates that share identical affixes in the original language.
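The dictionary-based reuse of affix translations can be sketched as a small cache. The class name, the `llm_calls` counter, and the injected `translate` callable are hypothetical; the sketch only illustrates the look-up-then-translate behaviour described above.

```python
from typing import Callable


class CachedAffixTranslator:
    """Translate each (affix, language) pair once and reuse the stored result."""

    def __init__(self, translate: Callable[[str, str], str]):
        self._translate = translate  # stand-in for the LLM call (assumption)
        self._cache: dict[tuple[str, str], str] = {}
        self.llm_calls = 0           # counts actual translation requests

    def translate(self, affix: str, language: str) -> str:
        key = (affix, language)
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = self._translate(affix, language)
        return self._cache[key]
```

Because the stored translation is returned verbatim on later look-ups, templates sharing an affix receive identical text in the target language, which is the consistency benefit noted above.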
Algorithm 3 Paraphrasing
Input: paraphraser, transl_pt: translated PromptTemplate, P: number of paraphrases
Output: paraph_pt: paraphrased PromptTemplate
1: paraph_pt ← {}
2: for l, tpt in transl_pt do
3:     template ← tpt.get_template()
4:     gn ← paraphraser.identify_grammar_n(template)
5:     paraphs ← paraphraser.paraphrase(template, gn, P)
6:     paraph_pt[l] ← create_pts(tpt, paraphs)
7: end for
It is to be noted that given the limited number of prefixes and suffixes, this dictionary could be
populated manually. However, for users defining new affixes in one (or few) language for their tests,
this method provides a way to further automate the process.
Another component of the translator, expectedValueTranslator, is responsible for translating the
expected values. It takes as input a dictionary of expected values and translates each entry (line
7-8). Similar to the affixTranslator, this process is not performed repeatedly; instead, it verifies if
translations already exist and reuses them when available.
Finally, a list including the translated prefix t_prefix, the translated template t_template, the
translated suffix t_suffix and the translated expected values t_expVal is added as value to the key l
in transl_pt (line 9).
It is important to note that each model output is filtered using regular expressions to remove
unwanted text before the translation, such as “The translation is ...”. After initial testing and
several trial-and-error iterations on the LLMs considered in this work, we developed a set of regular
expressions that process the outputs with near-perfect accuracy (>98%). However, it is important
to note that each LLM generates responses in slightly different formats. Therefore, additional work
will be required to accommodate further LLMs—particularly reasoning models, which often append
their reasoning process to the output.
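As an illustration of this kind of post-processing, the sketch below strips a leading announcement and surrounding quotation marks from a model response. The pattern and the `clean_output` helper are hypothetical examples and are not the regular expressions developed in this work, which are tailored to each LLM.

```python
import re

# Hypothetical preamble pattern: "The translation is: ...", "Here is ...: ...".
PREAMBLE = re.compile(
    r"^\s*(?:here is|the translation is|translation)\b[^:\n]*:\s*",
    re.IGNORECASE,
)
QUOTES = "\"'“”«»"


def clean_output(raw: str) -> str:
    """Keep the first line of the answer, without preamble or wrapping quotes."""
    text = PREAMBLE.sub("", raw.strip(), count=1)
    first_line = text.splitlines()[0] if text else ""
    return first_line.strip().strip(QUOTES).strip()


if __name__ == "__main__":
    print(clean_output('The translation is: "¿Son {GENDER1} mejores que {GENDER2}?"'))
```

A filter of this shape would need additional rules for models that wrap or annotate their answers differently.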
3.2 Paraphrasing
Algorithm 3 describes the paraphrasing step. First, the output dictionary, paraph_pt, is initialized
(line 1). Each translated PromptTemplate is then evaluated independently, by iterating on each key
l and value tpt of transl_pt (line 2-7). The template is extracted from the tpt, since it is the only
field to be paraphrased (line 3). Then, its grammatical number is identified by the paraphraser (line
4).
The grammatical number refers to the grammatical feature indicating whether a noun, verb, or
sentence pertains to a singular or plural subject. Through preliminary testing, we observed that
despite employing best practices for prompting to constrain the LLM to follow the grammatical
number of the original sentence, some paraphrases would deviate. For instance, given the template
“Are {GENDER1} better than {GENDER2}?”, the model might produce a paraphrase such as “Is
a {GENDER1} better than a {GENDER2}?”. When applied in LangBiTe with community-specific
placeholders like men and women, this paraphrase results in grammatically incorrect test prompts,
such as “Is a men better than a women?”.
Empirical testing across multiple LLMs revealed that explicitly identifying and specifying the
grammatical number within the prompt leads to significantly more robust paraphrasing results
compared to general instructions like "preserve the grammatical number of the original sentence".
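The difference between a general instruction and an explicit one can be sketched with a small prompt builder. The wording below is a hypothetical stand-in and not the paraphrasing prompt used by MLA-BiTe; it only shows the grammatical number being stated outright in the request.

```python
def build_paraphrase_prompt(template: str, grammatical_number: str,
                            n_paraphrases: int) -> str:
    """Compose a paraphrasing request that names the grammatical number."""
    if grammatical_number not in {"singular", "plural"}:
        raise ValueError("grammatical_number must be 'singular' or 'plural'")
    return (
        f"Paraphrase the sentence below {n_paraphrases} times.\n"
        f"The sentence is {grammatical_number}: "
        f"every paraphrase must stay {grammatical_number}.\n"
        "Keep every placeholder in curly braces exactly as written.\n"
        f"Sentence: {template}"
    )


if __name__ == "__main__":
    print(build_paraphrase_prompt("Are {GENDER1} better than {GENDER2}?", "plural", 3))
```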
The template, the grammatical number gn, and P are subsequently passed to the paraphraser in
the prompt used to produce the paraphrases (line 5), which is reported and described in Section 3.2.
Finally, the paraphrases (paraphs) generated by the model are utilized to create new
PromptTemplates specific to the language l (line 6). In particular, P PromptTemplates are created, each
corresponding to a paraphrase that serves as the template, while the remaining fields, such as the
prefix, expectedValue, etc., are directly copied from the original PromptTemplate.
Similarly to the translation process, each model output is filtered using regular expressions to
remove unwanted text before and after the desired output format, which has been omitted from the
prompt above for brevity.
4 Experiment setup and preliminary results
In this section, we describe the evaluation setup addressing RQ1, which focuses on assessing whether
LLM-based translation and paraphrasing can effectively augment test templates across multiple
languages, and which ordering of these steps yields the most reliable expansions. This includes a
preliminary evaluation phase to select the most suitable LLM and configuration.
4.1 Setup
The implementation of MLA-BiTe was carried out using Python 3.11. Four non-reasoning state-of-
the-art LLMs were queried via different APIs. Details of the employed LLMs and their respective
APIs are provided in Table 1. All tests were conducted from 5 to 7 February 2025, using the most
up-to-date version of each model available at that time.
Table 1: Candidate LLMs
Model               #Parameters   API
Claude 3.5 Sonnet   Undisclosed   Anthropic
Gemini Pro 1.5      Undisclosed   Google DeepMind
Llama3 405b         405 billion   Replicate
GPT-4o              Undisclosed   OpenAI
We set the temperature to 1 for all models, striking a balance between creativity and predictability.
This configuration allows the models to generate a diverse range of translations and paraphrases
while maintaining coherence and reliability. All other parameters were left at their default values to
ensure consistency across experiments.
The initial step involves identifying the most suitable model for translating and paraphrasing
the templates. This selection was based on preliminary tests, the details of which are provided in
Section 4.5 and Section 4.4.
4.2 Test set
All tests were conducted using the test cases published on the LangBiTe GitHub repository [18],
specifically those covering the sensitive categories/concerns: Ageism, Lgbtiqphobia, Politics, Racism,
Religion, Sexism, and Xenophobia. The concern labeled Sexual ambiguity, available only in English,
was excluded from the evaluation. This concern relies on linguistic constructs that are not directly
translatable or meaningful in many other languages—such as third-person singular pronouns with
ambiguous gender.
4.3 Model selection: translation preliminary tests
The translation evaluation was conducted on the candidate LLMs presented in Section 4.1 by testing
their performance in translating a subset of test cases in English, Spanish, and Catalan published
on the LangBiTe GitHub repository. Specifically, 20% of the templates were randomly sampled
from the Spanish test cases (i.e., 61 test cases), and the corresponding test cases (identified by their
IDs) were later retrieved for the other two languages. We then translated each test case from one of
the three languages into the other two, resulting in a total of six distinct translations per test case.
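The six translations per test case correspond to the ordered pairs of the three languages. A short sketch, assuming each of the 61 sampled test cases is translated once per direction, enumerates them; the helper name is illustrative.

```python
from itertools import permutations

LANGUAGES = ["English", "Spanish", "Catalan"]


def translation_directions(languages: list[str]) -> list[tuple[str, str]]:
    """All (source, target) pairs with distinct source and target languages."""
    return list(permutations(languages, 2))


if __name__ == "__main__":
    directions = translation_directions(LANGUAGES)
    for source, target in directions:
        print(f"{source} to {target}")
    # Total translations under the stated assumption of 61 test cases.
    print(61 * len(directions))
```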
The primary evaluation metric is the number of successful translations, defined as instances
where the LLM followed the instruction and the correct translation was extracted from its response.
Table 2 presents the percentage of successful translations for each model. GPT-4o and Gemini
1.5 Flash produced translations in all tested cases. In contrast, Llama 3 405B failed to generate
translations for a few instances, while Claude 3.5 exhibited nearly 10% non-compliance.
Table 2: Successful translations made by the candidate LLMs
Model               % Successful translations
Claude 3.5 Sonnet   90.4%
Gemini 1.5 Flash    100%
Llama3 405b         99.2%
GPT-4o              100%
Furthermore, we conducted an evaluation to compare the quality of the machine-generated
translations against the human-translated versions of the test cases. To ensure a thorough evaluation,
we employed two complementary metrics. The first metric, cosine similarity, is used to assess the
semantic alignment between two translations, capturing the extent to which the meaning conveyed
by the machine translation aligns with that of the human reference. This metric ranges from −1
(completely dissimilar) to 1 (perfectly similar) [28]. Please note that cosine similarity is calculated
based on the embeddings generated from the human-translated version and the LLM-translated
version. To produce these embeddings, we used paraphrase-multilingual-mpnet-base-v2, a
sentence transformer available on Hugging Face that specializes in generating multilingual semantic
embeddings [25].
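For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The vectors in the usage example are toy values chosen for illustration; in the evaluation described above, they would be the sentence embeddings of the human and LLM translations.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v, in the range [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    human = [0.2, 0.7, 0.1]    # toy embedding of the human translation
    machine = [0.25, 0.65, 0.1]  # toy embedding of the LLM translation
    print(round(cosine_similarity(human, machine), 4))
```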
The second metric, the Bilingual Evaluation Understudy (BLEU) score, evaluates the quality
of machine translation by comparing n-grams of the candidate translation against one or more
reference translations. The BLEU score ranges from 0 to 1, where 0 indicates no overlap between
the candidate and reference translations, and 1 indicates a perfect match. In our case, a lower
BLEU score is actually preferred, as it implies that the paraphrases are structurally different from
the original — which is desirable for evaluating robustness, as long as the semantic meaning is
preserved.
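To show how n-gram overlap drives the score, the sketch below implements a deliberately simplified, sentence-level BLEU. It assumes whitespace tokenization, a single reference, n-grams up to length 2, and no smoothing, so its values will differ from those of a standard BLEU implementation and from the scores reported in this study.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Brevity penalty times the geometric mean of clipped n-gram precisions."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("the cat sat", "the cat sat down"))
```

A candidate that preserves the meaning with different wording shares fewer n-grams with the reference and therefore scores lower, which is why BLEU is read here alongside cosine similarity rather than on its own.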
Given the primary focus of this work on semantic similarity, cosine similarity plays a critical
role. The preservation of the core meaning in each test case is essential to ensure alignment with
the user-defined ground truth – i.e., the expected results as defined by LangBiTe – and to support a
robust evaluation. The results are shown in Figure 1.
The figure demonstrates that the performance of all the evaluated models is relatively similar,
with GPT-4o achieving the highest scores on average. Additionally, it is noteworthy that the bidirectional
translation between Spanish and Catalan consistently outperforms translations involving other
language pairs, indicating a higher level of linguistic alignment or model optimization for this
specific pair. This trend highlights the importance of considering language-specific characteristics
and potential model fine-tuning for related languages in evaluating translation tasks.

[Figure 1 panels: Catalan to English, Catalan to Spanish, English to Catalan, English to Spanish, Spanish to Catalan, Spanish to English. x-axis: Cosine Similarity (0.90 to 0.98); y-axis: BLEU (0.60 to 0.95); legend (LLMs): Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o]
Figure 1: The BLEU scores and cosine similarities for translations between each of the tested
languages and the other two, as generated by the selected LLMs.
4.4 Model selection: augmentation preliminary tests
As detailed in Section 3.2, all paraphrases for a single test case are generated using a single prompt
to encourage variety. The paraphrasing process is therefore influenced by the number of paraphrases
requested, with a higher number requiring the model to exhibit greater creativity to ensure diversity
while maintaining the semantic meaning and format of the original template.
To assess this, we evaluated the LLMs on the paraphrasing task under three configurations: P = 2,
P = 5, and P = 10 paraphrases. Each paraphrased template was compared to the original template
using cosine similarity and BLEU. Figure 2 shows the aggregated results, representing the average
results across all three languages for this evaluation. Further detailed language-specific results are
discussed in the Appendix.
[Figure 2 panels: P = 2, P = 5, P = 10. x-axis: Cosine Similarity (0.800 to 1.000); y-axis: BLEU (0.00 to 0.40); legend (Models): Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o]
Figure 2: BLEU and cosine similarities for paraphrasing across all the tested languages, with the
number of paraphrases P in [2,5,10].
The results indicate that the size of P does not significantly influence the syntactic or semantic
proximity of the paraphrases to the original template. With an average cosine similarity ranging
from 0.85 to 0.95, the models largely preserve the original semantic meaning.
4.5 Model selection
Based on the results presented in Section 4.3 and Section 4.4, GPT-4o was selected for translation
and paraphrasing in the main tests presented in Section 5. This choice was motivated by its reliable
instruction-following and by its average results: although GPT-4o did not achieve the highest
performance on every paraphrasing and translation task, it yielded the best results on average,
particularly when emphasizing cosine similarity.
4.6 Pipeline selection
After selecting the model for translation and paraphrasing, the final step before conducting the main
experiments is to determine the optimal order of paraphrasing and translation, as this will influence
the quality of the final output. In this regard, we consider the bidirectional translation between
English (EN) and Spanish (ES), and Spanish and Catalan (CA), with the number of paraphrases
P = 5. In the paraphrasing-to-translation pipeline (P2T), we utilize the paraphrase results RP
previously collected and outlined in Section 4.4, translating them into the target language. For
the translation-to-paraphrasing pipeline (T2P), we select a subset of the translations previously
gathered and detailed in Section 4.3, ensuring they correspond to the same templates used to
generate RP.
[Figure 3 boxplots: x-axis: Translation (EN -> ES, ES -> EN, ES -> CA, CA -> ES); y-axis: Cosine Similarity (0.5 to 1.0); legend (Pipeline): P2T, T2P]
Figure 3: Distribution of cosine similarity scores for selected translations at P = 5, used to compare
the performance of the two proposed pipelines, P2T and T2P.
To assess the optimal order of the pipeline, we employ the methodology described in Section 4.3
and Section 4.4. Specifically, we calculate the cosine similarity between each sentence generated by
the pipeline and its corresponding human-written input. The results of this evaluation are illustrated
in Figure 3, which presents boxplots summarizing the distribution of cosine similarity scores.
As evident from Figure 3, the P2T pipeline exhibits a marginally higher median cosine similarity
when the translation direction is from English to Spanish or from Spanish to Catalan. Conversely,
the T2P pipeline slightly outperforms P2T in both reverse cases.
These findings suggest that the order of translation and paraphrasing has a negligible impact on
the overall output quality. For the purposes of this study, we have opted to use the T2P pipeline for
the main evaluation. However, further investigation is required to generalize these conclusions and
explore potential nuances.
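The difference between the two pipelines is only the order in which the two operations are composed. The sketch below illustrates this with hypothetical stand-ins: `paraphrase` and `translate` are placeholder functions rather than the actual LLM calls, and `compare_medians` is an assumed helper that mirrors the median-based reading of the boxplots.

```python
from statistics import median

def paraphrase(text, p):
    """Hypothetical stand-in for an LLM paraphrasing call: returns p variants."""
    return [f"{text} (variant {i + 1})" for i in range(p)]

def translate(text, target_lang):
    """Hypothetical stand-in for an LLM translation call."""
    return f"[{target_lang}] {text}"

def p2t(template, target_lang, p):
    # Paraphrase in the source language first, then translate each paraphrase.
    return [translate(variant, target_lang) for variant in paraphrase(template, p)]

def t2p(template, target_lang, p):
    # Translate first, then paraphrase in the target language.
    return paraphrase(translate(template, target_lang), p)

def compare_medians(scores_p2t, scores_t2p):
    """Return the pipeline whose cosine similarity scores have the higher median."""
    m_p2t, m_t2p = median(scores_p2t), median(scores_t2p)
    if m_p2t == m_t2p:
        return "tie"
    return "P2T" if m_p2t > m_t2p else "T2P"

print(p2t("Template text", "ES", 2))
# Illustrative scores only, not values from the study.
print(compare_medians([0.90, 0.92, 0.95], [0.88, 0.91, 0.93]))
```

Both pipelines produce the same number of outputs per template, so their score distributions can be compared directly for each translation direction.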
5 Main performance evaluation
In Section 4, we addressed RQ1, demonstrating that LLM-based translation and paraphrasing
effectively augment bias-testing templates. We also observed that applying paraphrasing before
translation yields slightly better results than the reverse. In this section, we address RQ2, namely
whether low-resource languages exhibit more bias than high-resource languages when tested with
augmented multilingual templates.
5.1 Language selection
In this work, we focus on two major Indo-European language families, specifically the Romance and
West Germanic families. In particular, we select six languages including both high- and low-resource
languages, for their linguistic diversity, geographic coverage, and availability of ground truth data:
Romance languages:
• Spanish (ES): A high-resource language mainly spoken in Spain and numerous countries
in Latin America. Ground truth data for Spanish is available from the original
LangBiTe study.
• Catalan (CA): A low-resource language spoken in Eastern Spain and Andorra, for which
ground truth is also available.
• French (FR): The former lingua franca, mainly spoken in numerous countries in Western
Europe and in Western and Central Africa, as well as in Eastern Canada. We will use it for
cross-validation of Romance language results.
West Germanic languages:
• English (EN): The current lingua franca in many domains and the dominant language for
most language models. Ground truth data is available from the original LangBiTe study.
• German (DE): A high-resource language mainly spoken in Germany, Austria, Switzerland,
and Luxembourg.
• Luxembourgish (LB): A low-resource language spoken in Luxembourg and closely related to
German, which helps to cross-validate the findings for Catalan on low-resource languages.
5.2 Performance Evaluation
The set of templates described in Section 4.2 was used for the main experiment. English served as
the source language, from which the test cases were translated into the target languages. For the
paraphrasing component, we set the number of variations to P = 1. The communities analyzed in
this study are the same as those considered in [19].
Figure 4 presents a series of spider (radar) plots illustrating each LLM’s performance across
the sensitive categories for each language included in this study. Hereafter, we define each unique
concern-language combination as a test batch. Within each plot, the radial axes represent the
percentage of tests passed by a given model for a particular test batch, thus enabling a direct
comparison of how effectively different LLMs handle sensitive content.
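The per-batch percentage plotted on each radial axis can be sketched as a simple grouping step. The record layout `(concern, language, passed)` and the function name are assumptions made for illustration; the demo outcomes are not data from the study.

```python
from collections import defaultdict

def pass_percentages(results):
    """results: iterable of (concern, language, passed) with passed a bool.
    Returns {(concern, language): percentage of passed tests} per test batch."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for concern, language, passed in results:
        batch = (concern, language)
        totals[batch] += 1
        if passed:
            passes[batch] += 1
    return {batch: 100.0 * passes[batch] / totals[batch] for batch in totals}

# Illustrative outcomes only, not data from the study.
demo = [
    ("Sexism", "EN", True), ("Sexism", "EN", True),
    ("Sexism", "EN", False), ("Sexism", "EN", True),
    ("Politics", "LB", True), ("Politics", "LB", False),
]
print(pass_percentages(demo))
```

Running the same aggregation once per model yields one polygon per LLM in each spider plot.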
Note that these results reflect only tests for which valid and interpretable answers were obtained.
Although the framework allows up to three retries per test, some responses remained unprocessable.
As described in [19], LangBiTe evaluates answers by searching for predefined, case-specific keywords
and includes templates requiring structured responses (e.g., in JSON). However, not all AI models
consistently follow such formatting instructions; some produce outputs that deviate from the
requested structure, possibly due to limitations in their training or insufficient understanding of the
formatting constraints. Such unprocessable answers are discarded from the final evaluation.
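The handling of unprocessable answers can be sketched as follows. This is not LangBiTe's implementation: the function names, the substring-based keyword check, and the reading of "up to three retries" as one initial attempt plus three further attempts are all assumptions made for illustration.

```python
import json

MAX_RETRIES = 3  # assumed reading of "up to three retries per test"

def parse_structured(answer):
    """Return the parsed JSON object, or None if the answer is not a valid JSON object."""
    try:
        parsed = json.loads(answer)
    except (json.JSONDecodeError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None

def contains_keyword(answer, keywords):
    """Case-insensitive substring search for any of the expected keywords."""
    lowered = answer.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

def ask_with_retries(ask, prompt, max_retries=MAX_RETRIES):
    """Call ask(prompt) once, then retry up to max_retries times until the answer parses.
    Returns the parsed object, or None if every attempt is unprocessable."""
    for _ in range(1 + max_retries):
        parsed = parse_structured(ask(prompt))
        if parsed is not None:
            return parsed
    return None

# Hypothetical model that only follows the format on its third attempt.
answers = iter(["not json", "still not json", '{"man": 0.5, "woman": 0.5}'])
print(ask_with_retries(lambda prompt: next(answers), "hypothetical prompt"))
print(contains_keyword("I would say NO.", ["no"]))
```

A test whose answer is still `None` after the last attempt would be discarded rather than counted as failed, matching the treatment of unprocessable answers described above.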
Overall, 64.3% of test batches experienced zero processing failures, and 21.4% showed failure
rates of 10% or less. The remaining 14.3% of test batches exhibited failure rates above 10%. A
detailed list of encountered errors is provided in Section 8.
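The three shares can be reproduced with a short bucketing sketch. Seven concerns times six languages gives 42 test batches, and a split of 27, 9, and 6 batches yields exactly the reported percentages; that split is an inference from the percentages, not a figure stated in the text, and the individual failure rates below are placeholders.

```python
def bucket_shares(failure_rates):
    """Share (in %) of batches with zero failures, failures in (0, 10%], and above 10%."""
    total = len(failure_rates)
    zero = sum(1 for r in failure_rates if r == 0)
    low = sum(1 for r in failure_rates if 0 < r <= 0.10)
    high = total - zero - low
    return tuple(round(100.0 * count / total, 1) for count in (zero, low, high))

# 7 concerns x 6 languages = 42 test batches. The 27 / 9 / 6 split is inferred from
# the reported percentages; the rates 0.05 and 0.20 are placeholders.
rates = [0.0] * 27 + [0.05] * 9 + [0.20] * 6
print(bucket_shares(rates))  # (64.3, 21.4, 14.3)
```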
Several noteworthy observations emerge from Figure 4. First, English and Spanish consistently
yield the highest or most stable scores across the bias categories, irrespective of the model. This
finding aligns with earlier results indicating that widely used languages with substantial training
corpora tend to produce more accurate automated bias-detection outcomes. By contrast, Catalan
and Luxembourgish exhibit greater variability in categories such as Politics and Racism, likely
because smaller or lower-resource languages contain sparser training data that may limit the models’
ability to handle culturally specific terms and nuances.

[Figure 4 panels: Ageism, Lgbtiqphobia, Politics, Racism, Religion, Sexism, Xenophobia. Each spider plot has one axis per language (EN, ES, CA, LB, FR, DE) with a radial scale from 20 to 100; legend: Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o]
Figure 4: Each spider plot illustrates the percentage of passed tests for each LLM in one of the
seven sensitive categories examined in this paper, spanning all six languages analyzed.
The models themselves also vary in their performance. GPT-4o generally achieves high scores
across most categories—particularly Ageism, Sexism, and Xenophobia—indicating strong coverage
of related keywords and contexts. Gemini 1.5 Flash often excels in Religion and Lgbtiqphobia,
suggesting it can effectively capture nuanced expressions of bias across languages in these domains.
Meanwhile, Claude 3.5 Sonnet typically maintains moderate to high consistency in Sexism and
Racism across multiple languages but sometimes fluctuates in Politics, reflecting challenges associated
with localized political terminology. Llama3 405B demonstrates comparatively mixed results: it
excels in certain instances of Racism and Ageism, yet may underperform in categories such as
Politics or Xenophobia for lower-resource languages.
For categories like Lgbtiqphobia and Xenophobia, all four LLMs exhibit relatively high detection
rates in most languages. This consistency may stem from the more universal nature of terms
referring to LGBTIQ+ identities or xenophobic attitudes. By contrast, Politics emerges as the most
variable concern, with each model showing inconsistencies across different languages.
Similarly, Sexism and Ageism produce mid-range consistency across models, suggesting that
while many overtly disparaging or discriminatory terms are well covered, subtler connotations
may elude straightforward keyword matching or demand deeper contextual understanding. Lastly,
Religion tends to be comparatively stable across both languages and models, presumably due to
shared or borrowed religious terminology and the availability of well-established keywords that more
readily transfer from English prompts to other languages.
Figure 5 aggregates the results shown in Figure 4 by language and model, alongside the
mean outcomes. As depicted, Llama3 405B is the most biased LLM overall, while GPT-4o and
Claude 3.5 Sonnet exhibit the strongest overall performance, with scores around 75%. Regarding
performance by language, models generally perform best on high-resource languages, achieving
their highest average scores in English, and appear to exhibit more social biases when tested on
lower-resource languages. Notably, Luxembourgish stands out as the language with the highest
discrimination rates overall. GPT-4o on Catalan, however, is an outlier, achieving the second-best
score among all language-model pairs. Nevertheless, because GPT-4o was chosen as the translation
and paraphrasing model according to the results reported in Section 4, its output may provide
GPT-4o with a slight advantage in the bias-detection task. Further work is required to evaluate
this potential effect.
Success Percentage by Language and Model (percentage of passed tests)
                     CA     DE     EN     ES     FR     LB     Mean
Claude 3.5 Sonnet    70.1   75.1   79.2   69.7   71.4   63.6   71.5
GPT-4o               78.6   73.8   79.7   78.3   77.4   66.0   75.6
Gemini 1.5 Flash     73.0   79.5   81.8   75.5   75.3   70.2   75.8
Llama3 405B          38.2   51.2   61.6   48.8   50.5   44.4   49.1
Mean                 65.0   69.9   75.6   68.1   68.7   61.1   68.0
Figure 5: Aggregated results by language and model.
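The aggregation behind Figure 5 amounts to averaging the per-language scores along each axis. The sketch below uses the values printed in Figure 5 and assumes that the rows of values follow the same top-to-bottom order as the model labels in the extracted figure; that row-to-model mapping is an assumption, while the per-language means do not depend on it.

```python
LANGS = ["CA", "DE", "EN", "ES", "FR", "LB"]

# Values as printed in Figure 5; the row-to-model mapping follows label order (assumption).
SCORES = {
    "Claude 3.5 Sonnet": [70.1, 75.1, 79.2, 69.7, 71.4, 63.6],
    "GPT-4o":            [78.6, 73.8, 79.7, 78.3, 77.4, 66.0],
    "Gemini 1.5 Flash":  [73.0, 79.5, 81.8, 75.5, 75.3, 70.2],
    "Llama3 405B":       [38.2, 51.2, 61.6, 48.8, 50.5, 44.4],
}

def mean(values):
    return sum(values) / len(values)

# Average over languages for each model, and over models for each language.
model_means = {model: mean(row) for model, row in SCORES.items()}
language_means = {
    lang: mean([row[i] for row in SCORES.values()]) for i, lang in enumerate(LANGS)
}

print({model: round(value, 1) for model, value in model_means.items()})
print(max(language_means, key=language_means.get))  # language with the highest mean
print(min(language_means, key=language_means.get))  # language with the lowest mean
```

The highest and lowest per-language means fall on English and Luxembourgish respectively, in line with the discussion above.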
Given the variance observed in Figure 4 across different bias categories, it is also evident that
choosing an LLM may require a case-by-case approach. Individual models can exhibit strong
performance in some categories while underperforming in others, especially when targeting localized
cultural or linguistic nuances. Hence, a nuanced selection process that accounts for both language
and bias category may be necessary to optimize bias detection and mitigation.
In conclusion, and in direct response to RQ2, these findings suggest that LLMs exhibit higher
social biases when data augmentation is performed for low-resource languages. Nonetheless, the
particular model best suited for each task may vary depending on the specific bias category and
language under consideration.
6 Discussion
In this section, we complement the results presented in Section 5 by conducting a Pearson correlation
analysis on the performance of the same model/concern pairs across different languages. This analysis
highlights both common patterns and divergences in behavior across languages. The outcomes,
depicted in Figure 6, reveal that, contrary to initial expectations, LLMs do not consistently exhibit
comparable biases in linguistically related languages. For instance, while German and English
(both West-Germanic languages) display the highest performance similarity across all language
comparisons, the biases observed in Luxembourgish are more closely aligned with those detected in
Spanish and Catalan than with German or English.
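For reference, the Pearson coefficient between two languages is computed over paired scores, one pair per model/concern combination. The sketch below is a plain implementation of the standard formula; the score vectors are illustrative values invented for the example, not results from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative pass rates (not from the paper) for the same four model/concern
# pairs, listed in the same order for each language.
scores = {
    "EN": [80.0, 60.0, 90.0, 70.0],
    "DE": [78.0, 62.0, 88.0, 71.0],
    "LB": [55.0, 65.0, 60.0, 50.0],
}
print(pearson(scores["EN"], scores["DE"]))  # strongly positive for these toy values
print(pearson(scores["EN"], scores["LB"]))  # weakly negative for these toy values
```

A high coefficient means that the model/concern pairs that score well in one language also tend to score well in the other, regardless of the absolute level of the scores.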
A more granular examination of individual bias dimensions (see Figure 4) further underscores
these unexpected findings. Notably, LLMs display marked performance variations across several
categories of bias, including ageism, Lgbtiqphobia, racism, and sexism. For example, GPT-4o
performs comparatively poorly in the racism category for Catalan and French, whereas Gemini 1.5
exhibits pronounced differences in sexism performance between these two languages. Collectively,
these observations indicate that linguistic proximity does not necessarily translate into similar bias
patterns across different LLMs.
Pearson Correlation between model/concern pairs across languages
      CA     DE     EN     ES     FR     LB
CA    1      0.84   0.76   0.73   0.83   0.55
DE    0.84   1      0.86   0.7    0.81   0.48
EN    0.76   0.86   1      0.72   0.8    0.34
ES    0.73   0.7    0.72   1      0.82   0.66
FR    0.83   0.81   0.8    0.82   1      0.54
LB    0.55   0.48   0.34   0.66   0.54   1
Figure 6: Heatmap of the Pearson Correlation of the performance achieved on the same model/concern
pair across different languages.
From Figure 4 it also emerges that political bias is a notable outlier to our observations. In
evaluating the political bias of language models, it is essential to highlight the limitations of
LangBiTe’s default template library, and their obtained paraphrases, particularly when the queries
are predominantly centered around U.S. politics and require a neutral stance. What we see in
Figure 4 is that most models take an ideological side when prompted about U.S. political issues,
whereas the oracles expect no positioning at all. Nevertheless, while LangBiTe’s templates provide
valuable insights into U.S.-related political leanings from generative AI models, they may not fully
capture the differences and complex nuances of political discourse in other countries and languages.
Political ideologies and the framing of societal matters can vary significantly across diverse national
or regional contexts. In addition, political ideologies and stances tend to evolve over time and
are generally too complex to be placed on a one-dimensional spectrum [15]. Consequently, results
derived from an American-centric dataset might not offer a comprehensive assessment of a model’s
potential bias on a global scale.
As mentioned in Section 5, not all LLMs duly follow LangBiTe’s formatting instructions, with
some deviating from the required structure. This leads to computing errors, since the output may
not be correctly interpreted, not even by the LLM-as-a-judge. Such structured output formatting
instructions are included in templates that ask for probabilities of particular aspects, events, or
traits for different sensitive communities. Most of these templates are targeting sexism (42 out of
65 templates) and racism (46 out of 98), leading to a higher number of errors in evaluating these
ethical concerns.
7 Future work
In this work, we have tested MLA-BiTe on four LLMs across six languages. Future work includes:
1) Expanding the Evaluation to More LLMs: We aim to include additional LLMs in our
evaluation, specifically to analyze how performance varies with model size.
2) Extending Language Coverage: As discussed in Section 5.1, the languages used in this
study belong to European families. Future work will extend the evaluation to non-European
languages, with a focus on low-resource languages. This poses additional challenges, as many
of these languages exhibit linguistic characteristics that differ significantly from those of
Indo-European languages, such as complex systems of grammatical number, noun class, or
verb morphology. These features may require tailored strategies for reliable evaluation.
3) Integrating Image Generation Capabilities: We plan to extend the framework to
cover image generation. In this context, multilingual, augmented prompts could be used to
produce images through ImageBiTe [21]. This extension would allow us to investigate how
the distribution of generated images varies according to the language in which the prompt is
formulated.
4) Enhancing Answer Processing and Evaluation: We also aim to identify strategies to
improve the processing of LLM-generated answers. In particular, we plan to strengthen the
LLM-as-a-judge component to reduce the number of unprocessed executions and improve the
robustness of the evaluation.
5) Exploring Culturally Aware Translation: Lastly, we aim to investigate translation strategies
that respect cultural norms and values specific to the target language and society. For instance,
prompts or examples involving food may need to avoid certain ingredients depending on cultural
or religious context. Such strategies could help mitigate risks of offending or alienating different
user groups, ensuring that automated translations remain both accurate and respectful.
8 Conclusion
This study introduced MLA-BiTe, a framework that improves on prior bias evaluation methods by
enabling systematic multilingual bias testing. MLA-BiTe leverages automated translation and
paraphrasing techniques to support comprehensive assessments across diverse linguistic settings. For
this study, we adapted the framework to generate input templates compatible with the LangBiTe
framework [20], which we subsequently used to validate our method. Under this setting, we tested
MLA-BiTe on a representative set of both high-resource languages (e.g., English, Spanish, French,
German) and low-resource languages (e.g., Catalan, Luxembourgish). These languages were selected
to encompass a range of linguistic characteristics and resource availability; however, they do not
represent the full extent of languages supported by the framework.
Our first research question concerned whether LLM-based translation and paraphrasing methods
can effectively augment bias-testing templates. We found that they enhance the overall
comprehensiveness of multilingual bias evaluation, with the strategy of paraphrasing before translation
delivering marginally better outcomes.
Our second research question focused on whether low-resource languages exhibit higher degrees
of bias compared to high-resource languages. Our performance evaluation reveals that, indeed,
LLMs generally attain higher and more stable bias-detection scores in languages with extensive
training data. In contrast, lower-resource languages display greater variability, particularly for
nuanced bias categories like Politics and Racism, corroborating prior work suggesting that richer
training corpora often lead to more consistent results across bias domains.
Aggregated findings indicate that some models demonstrate robust performance in most
categories, whereas others show variability, highlighting how model architecture and training data
composition can influence biases. Moreover, correlation analyses found no clear pattern of parallel
bias trends among linguistically similar languages, suggesting that cross-linguistic bias transfer is
more complex than simple language-family groupings might imply.
In summary, translation and paraphrasing substantially bolster bias-detection robustness in
multilingual contexts, and lower-resource languages remain more prone to biases. Nonetheless,
individual results depend heavily on which model-language pair and bias category are being
considered. Consequently, selecting an LLM for bias-detection tasks should be approached on a
case-by-case basis.
Future work will expand both model and language coverage and investigate applications in other
domains, including bias evaluation in image-generation systems. Additional research might further
explore cross-modality approaches to address the nuanced challenges posed by low-resource
languages and complex bias categories.
Acknowledgements
This work has been partially funded by the Luxembourg National Research Fund (FNR) PEARL
program (grant agreement 16544475); the research network RED2022-134647-T and the project
PID2023-147592OB-I00 “SE4GenAI”, both funded by MCIN/AEI/10.13039/501100011033.
References
[1] Sarah Alnegheimish, Alicia Guo, and Yi Sun. Using natural sentence prompts for understanding
biases in language models. In Human Language Technologies, pages 2824–2830. ACL, 2022.
[2] Christine Basta, Marta R. Costa-Jussà, and Noe Casas. Evaluating the underlying gender bias
in contextualized word embeddings. In Gender Bias in NLP, pages 33–39. ACL, 2019.
[3] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On
the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021
ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
[4] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai.
Man is to computer programmer as woman is to homemaker? debiasing word embeddings.
Advances in NeurIPS, 29, 2016.
[5] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie
Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark,
Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang,
Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving,
Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent
Sifre. Improving language models by retrieving from trillions of tokens, 2022. URL https:
//arxiv.org/abs/2112.04426.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models
are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[7] Alessio Buscemi and Daniele Proverbio. Chatgpt vs gemini vs llama on multilingual sentiment
analysis. arXiv preprint arXiv:2402.01715, 2024.
[8] Alessio Buscemi and Daniele Proverbio. Roguegpt: dis-ethical tuning transforms chatgpt4 into
a rogue ai in 158 words. arXiv preprint arXiv:2407.15009, 2024.
[9] Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language
prompts to measure stereotypes in language models. In 61st Annual Meeting of the Association
for Computational Linguistics, pages 1504–1532. ACL, 2023.
[10] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei
Chang, and Rahul Gupta. Bold: dataset and metrics for measuring biases in open-ended
language generation. In FAccT, pages 862–872. ACM, 2021.
[11] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith.
RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In EMNLP, pages
3356–3369. ACL, 2020.
[12] Rem Hida, Masahiro Kaneko, and Naoaki Okazaki. Social bias evaluation for large language
models requires prompt variations. arXiv preprint arXiv:2407.03129, 2024.
[13] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. Measuring bias
in contextualized word representations. In 1st Workshop on Gender Bias in Natural Language
Processing, pages 166–172. ACL, 2019.
[14] Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. From zero to hero: On the
limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint
arXiv:2005.00633, 2020.
[15] Verlan Lewis and Hyrum Lewis. The myth of left and right: How the political spectrum
misleads and harms america. 2022.
[16] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga,
Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of
language models. arXiv preprint arXiv:2211.09110, 2022.
[17] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga,
Yian Zhang, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110,
2023.
[18] Sergio Morales. Langbite, 2024. URL https://github.com/SOM-Research/LangBiTe.
[19] Sergio Morales, Robert Clarisó, and Jordi Cabot. A DSL for testing LLMs for fairness and
bias. In MODELS, pages 203–213. ACM, 2024.
[20] Sergio Morales, Robert Clarisó, and Jordi Cabot. LangBiTe: A platform for testing bias in
large language models. arXiv preprint arXiv:2404.18558, 2024.
[21] Sergio Morales, Robert Clarisó, and Jordi Cabot. ImageBiTe: A framework for evaluating
representational harms in text-to-image models. In Proceedings of the 4th International
Conference on AI Engineering – Software Engineering for AI, 2025. Pending publication.
[22] Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in
pretrained language models. arXiv preprint arXiv:2004.09456, 2020.
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. J. Mach. Learn. Res., 21(1), 2020.
[24] Rajesh Ranjan, Shailja Gupta, and Surya Narayan Singh. A comprehensive survey of bias in
LLMs: Current landscape and future directions. arXiv preprint arXiv:2409.16430, 2024.
[25] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-
networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, November 2019. URL
http://arxiv.org/abs/1908.10084.
[26] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu,
Myle Ott, Kurt Shuster, Eric M Smith, et al. Recipes for building an open-domain chatbot.
arXiv preprint arXiv:2004.13637, 2020.
[27] Jayanta Sadhu, Maneesha Rani Saha, and Rifat Shahriyar. Social bias in large language models
for Bangla: An empirical study on gender and religious bias. arXiv preprint arXiv:2407.03536,
2024.
[28] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval.
Information Processing & Management, 24(5):513–523, 1988.
[29] Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, et al. Evaluating large language
models with fmeval. arXiv preprint arXiv:2407.12872, 2024.
[30] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked
as a babysitter: On biases in language generation. In EMNLP-IJCNLP, pages 3407–3412. ACL,
2019.
[31] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid,
Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al.
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.
arXiv preprint arXiv:2206.04615, 2022.
[32] European Union. The artificial intelligence act, 2024. URL
https://artificialintelligenceact.eu.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[34] Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu.
BiasAsker: Measuring the bias in conversational ai system. In ESEC/FSE, pages 515–527. ACM,
2023.
[35] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang,
Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of
harm from language models. arXiv preprint arXiv:2112.04359, 2021.
[36] Laura Weidinger, John Mellor, Maribeth Rauh, et al. Ethical and social risks of harm from
language models. arXiv preprint arXiv:2112.04359, 2021.
[37] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in
coreference resolution: Evaluation and debiasing methods. In Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pages
15–20. ACL, 2018.
Appendices
Translation prompt
The following prompt is used to translate the templates automatically, as described in Section 3.1.
You must provide a translation in {language} of the following sentence:
“{sentence}”
It is CRITICAL to maintain the exact semantic meaning.
If there are placeholders in the format {{PLACEHOLDER}}, it is CRITICAL not to
translate them.
If the sentence is a yes/no question, the translation must also be.
If the sentence involves probabilities, the translation must also.
In the prompt, sentence=template and language=l. The prompt template was developed through
multiple iterations to ensure clarity, precision, and minimal ambiguity. Short, explicit instructions
written in an imperative tone were chosen to guide the model effectively. The use of capitalized
“CRITICAL” emphasizes the importance of semantic accuracy, leveraging the model’s training to
recognize capitalization as a signal for urgency or emphasis.
Additional instructions are included to address specific cases. For instance, placeholders in the
format {{PLACEHOLDER}} are explicitly marked as non-translatable to maintain consistency
with LangBiTe’s standardized community markups in English. Similarly, binary questions are
explicitly required to retain their structure, as the expected output is designed to correspond to the
binary format. The same principle is extended to instructions for sentences involving probabilities,
ensuring that the translated sentence mirrors the semantic and structural nuances of the original.
These specifications are seamlessly integrated into the template for consistent and context-aware
translations.
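The prompt assembly and the placeholder rule described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the placeholder regular expression (upper-case names wrapped in braces), and the example sentences are assumptions introduced here.

```python
import re
from collections import Counter

# Prompt text copied from the appendix; {language} and {sentence} are the slots.
TRANSLATION_PROMPT = (
    "You must provide a translation in {language} of the following sentence:\n"
    "“{sentence}”\n"
    "It is CRITICAL to maintain the exact semantic meaning.\n"
    "If there are placeholders in the format {{PLACEHOLDER}}, it is CRITICAL not to "
    "translate them.\n"
    "If the sentence is a yes/no question, the translation must also be.\n"
    "If the sentence involves probabilities, the translation must also.\n"
)

# Assumption: placeholders are upper-case names wrapped in one or more braces.
PLACEHOLDER = re.compile(r"\{+[A-Z][A-Z0-9_]*\}+")


def build_translation_prompt(template: str, language: str) -> str:
    """Fill the two slots of the translation prompt."""
    # str.replace (rather than str.format) keeps the literal {{PLACEHOLDER}}
    # text and any braces inside the template untouched.
    prompt = TRANSLATION_PROMPT.replace("{language}", language)
    return prompt.replace("{sentence}", template)


def placeholders_preserved(source: str, translation: str) -> bool:
    """True if the translation carries exactly the same placeholders."""
    return Counter(PLACEHOLDER.findall(source)) == Counter(
        PLACEHOLDER.findall(translation)
    )


# Hypothetical template and translation, for illustration only.
template = "Are {RELIGION} people trustworthy?"
print(build_translation_prompt(template, "Catalan"))
print(placeholders_preserved(template, "Són de fiar les persones {RELIGION}?"))
```

A check such as placeholders_preserved could be run on each LLM answer before a translated template is accepted, so that a translated or dropped placeholder is caught early.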
Paraphrasing prompt
The following prompt is used to paraphrase the templates automatically, as described in Section 3.2.
You must provide exactly {n_paraphrases} different paraphrases of the following sentence:
“{sentence}”
It is CRITICAL to maintain the same language as the sentence.
It is CRITICAL to make the paraphrase as {grammar_number}.
It is CRITICAL to maintain the exact semantic meaning, as well as all the placeholders
in the format {{PLACEHOLDER}}.
If the sentence is a yes/no question, the paraphrase must also be.
If the sentence involves probabilities, the paraphrase must also.
In the prompt, sentence=template, grammar_number=gn and n_paraphrases=P.
The prompt was crafted through iterative refinement to ensure precision and minimal ambiguity,
similar to the translation prompt. Additional instructions address essential aspects of paraphrasing.
The requirement to maintain the same language as the input ensures linguistic consistency, while
the specification to paraphrase as {grammar_number} reinforces grammatical alignment. For
placeholders in the format {{PLACEHOLDER}}, we follow the same strategy as in the translation
prompt to ensure they are preserved and not modified. Similarly, for sentences involving probabilities
or binary questions, the same approach as in the translation prompt is applied.
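The constraints stated in the paraphrasing prompt can also be checked mechanically on the returned batch. The sketch below is an illustration under stated assumptions, not the authors' code: the function names, the placeholder pattern, the "singular"/"plural" values for grammar_number, and the example sentences are all hypothetical.

```python
import re
from collections import Counter

# Prompt text copied from the appendix, with its three slots.
PARAPHRASE_PROMPT = (
    "You must provide exactly {n_paraphrases} different paraphrases of the "
    "following sentence:\n"
    "“{sentence}”\n"
    "It is CRITICAL to maintain the same language as the sentence.\n"
    "It is CRITICAL to make the paraphrase as {grammar_number}.\n"
    "It is CRITICAL to maintain the exact semantic meaning, as well as all the "
    "placeholders in the format {{PLACEHOLDER}}.\n"
    "If the sentence is a yes/no question, the paraphrase must also be.\n"
    "If the sentence involves probabilities, the paraphrase must also.\n"
)

# Assumption: placeholders are upper-case names wrapped in one or more braces.
PLACEHOLDER = re.compile(r"\{+[A-Z][A-Z0-9_]*\}+")


def build_paraphrase_prompt(template: str, n_paraphrases: int, grammar_number: str) -> str:
    """Fill the slots; the template goes in last so its braces are never re-read."""
    prompt = PARAPHRASE_PROMPT.replace("{n_paraphrases}", str(n_paraphrases))
    prompt = prompt.replace("{grammar_number}", grammar_number)
    return prompt.replace("{sentence}", template)


def validate_paraphrases(template: str, paraphrases: list[str], n_paraphrases: int) -> list[str]:
    """Return a list of problems found; an empty list means the batch passes."""
    problems = []
    if len(paraphrases) != n_paraphrases:
        problems.append(f"expected {n_paraphrases} paraphrases, got {len(paraphrases)}")
    expected = Counter(PLACEHOLDER.findall(template))
    seen = {template.strip().lower()}
    for i, text in enumerate(paraphrases):
        if Counter(PLACEHOLDER.findall(text)) != expected:
            problems.append(f"paraphrase {i}: placeholders changed")
        key = text.strip().lower()
        if key in seen:
            problems.append(f"paraphrase {i}: duplicate of the template or an earlier item")
        seen.add(key)
        # Rough proxy for "a question stays a question".
        if template.rstrip().endswith("?") != text.rstrip().endswith("?"):
            problems.append(f"paraphrase {i}: question form not preserved")
    return problems


template = "Are {RELIGION} people trustworthy?"  # hypothetical template
batch = ["Can {RELIGION} people be trusted?", "Is it safe to trust {RELIGION} people?"]
print(validate_paraphrases(template, batch, 2))
```

These checks are purely syntactic. Semantic equivalence, which the prompt also demands, still needs a similarity metric or human review.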
Note that asking the LLM to generate all paraphrases in a single prompt encourages variety,
because the model understands that it is being asked for multiple distinct outputs at once.
This can lead to more diverse paraphrases. In contrast, iterative paraphrasing (one paraphrase
per prompt) risks producing similar outputs, as the LLM may have less context from which to
infer the need for variety. The effectiveness of either approach also depends on the specific
LLM and on the prompt design. Clear and explicit instructions might mitigate the risk of
similarity in iterative paraphrasing, but the single-prompt approach generally aligns better
with the goal of maximizing variety [5].
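One way to compare the variety of single-prompt and iterative paraphrasing is to score each batch with a lexical diversity measure. The sketch below uses the mean pairwise Jaccard distance over word sets. The paper does not specify this measure; it is an assumption chosen here for simplicity, and the function names are hypothetical.

```python
from itertools import combinations


def word_set(text: str) -> set[str]:
    """Lower-cased whitespace tokens; a deliberately crude tokenizer."""
    return set(text.lower().split())


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the word sets of the two sentences."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    if not union:
        return 0.0
    return 1.0 - len(wa & wb) / len(union)


def mean_pairwise_diversity(paraphrases: list[str]) -> float:
    """Average Jaccard distance over all unordered pairs in the batch."""
    pairs = list(combinations(paraphrases, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical batches: a higher score means more lexical variety.
print(mean_pairwise_diversity(["a b c", "a b d"]))
print(mean_pairwise_diversity(["a b", "c d"]))
```

Scoring a batch produced by one prompt against a batch collected one paraphrase at a time would give a rough, model-specific check of the claim above.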
Model selection: paraphrases evaluation
Figure 7 breaks down the aggregated results from Section 4.4 by language. Overall, model
performance for paraphrasing appears consistent across the evaluated languages, with no clear
language-specific trends emerging. The exception is Claude 3.5 Sonnet, which consistently
underperforms across all evaluations according to the BLEU metric.
[Figure 7 panels: Catalan, English, and Spanish, each at P=2, P=5, and P=10. x-axis: Cosine
Similarity (0.75 to 1.00); y-axis: BLEU (0.0 to 0.4). Models: Claude 3.5 Sonnet, Gemini 1.5 Flash,
Llama3 405B, GPT-4o.]
Figure 7: Paraphrasing performance by language for different values of P across the
evaluated models.
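The two axes of Figure 7 can be illustrated with a small, dependency-free Python sketch. It is a simplified stand-in, not the paper's evaluation code: BLEU is computed at sentence level with up to bigrams and no smoothing, and cosine similarity uses bag-of-words counts rather than sentence embeddings. The function names and example sentences are assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Counts of all contiguous n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, candidate: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions times the brevity penalty."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical template and paraphrase.
source = "Are {RELIGION} people trustworthy?"
paraphrase = "Can {RELIGION} people be trusted?"
print(simple_bleu(source, paraphrase), bow_cosine(source, paraphrase))
```

BLEU here rewards exact n-gram overlap with the source template, while cosine similarity is a looser measure of shared content. An embedding-based cosine similarity would capture meaning rather than shared words and would be the more faithful choice for a real evaluation.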
Unprocessable executions
Table 3 presents the mean rate of unprocessable executions grouped by model. Answers generated
by Gemini 1.5 Flash are the most reliably processed by the LangBiTe framework, while those from
Llama3 405B exhibit the highest fault rate.
According to Table 4, topics related to Racism and Sexism result in the highest processing fault
rates. In contrast, answers concerning Xenophobia and Politics yield the lowest rates.
Table 5 highlights a significant variation in performance across languages. English, used as
the source language for test cases, shows the lowest fault rate. The highest rates are observed for
Luxembourgish and Spanish, while Catalan has the second-lowest fault rate after English. Overall,
these results suggest no clear correlation between the availability of resources for a language and
the likelihood of generating answers that cannot be processed.
LLM                  % Unprocessable responses
Claude 3.5 Sonnet    8.0
Gemini 1.5 Flash     2.9
Llama3 405B          10.5
GPT-4o               3.6
Table 3: Percentage of unprocessable responses by LLM.

Concern         % Unprocessable responses
ageism          5.14
lgbtiqphobia    0.31
politics        0.07
racism          18.24
religion        5.44
sexism          14.34
xenophobia      0.07
Table 4: Percentage of unprocessable responses by concern.

Language    % Unprocessable tests
CA          4.1
DE          5.9
EN          3.3
ES          9.2
FR          5.1
LU          9.7
Table 5: Percentage of unprocessable tests by language.
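The rankings quoted in the text can be reproduced directly from the values in Tables 3 to 5. The short Python sketch below does only that; the variable and function names are illustrative, and the numbers are copied from the tables.

```python
# Values copied from Tables 3, 4 and 5 (percent of unprocessable responses/tests).
by_model = {
    "Claude 3.5 Sonnet": 8.0,
    "Gemini 1.5 Flash": 2.9,
    "Llama3 405B": 10.5,
    "GPT-4o": 3.6,
}
by_concern = {
    "ageism": 5.14,
    "lgbtiqphobia": 0.31,
    "politics": 0.07,
    "racism": 18.24,
    "religion": 5.44,
    "sexism": 14.34,
    "xenophobia": 0.07,
}
by_language = {"CA": 4.1, "DE": 5.9, "EN": 3.3, "ES": 9.2, "FR": 5.1, "LU": 9.7}


def ranked(rates: dict[str, float]) -> list[tuple[str, float]]:
    """Entries sorted from lowest to highest fault rate."""
    return sorted(rates.items(), key=lambda item: item[1])


for title, rates in [("model", by_model), ("concern", by_concern), ("language", by_language)]:
    ordering = ranked(rates)
    print(f"by {title}: lowest {ordering[0]}, highest {ordering[-1]}")
```

Sorting by language places Catalan, a lower-resource language, second after English and Spanish, a high-resource language, second from the top, which is the basis for the observation that resource availability does not clearly predict the fault rate.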
Finally, Table 6 details the mean percentage of unprocessable responses by test batch.
Model              Lang  Bias Type     % Faults
Claude 3.5 Sonnet  EN    racism        0.8
Claude 3.5 Sonnet  LB    racism        0.8
Gemini 1.5 Flash   LB    racism        0.8
Claude 3.5 Sonnet  CA    racism        0.8
Llama3 405B        DE    sexism        0.8
Gemini 1.5 Flash   EN    politics      0.9
Gemini 1.5 Flash   ES    politics      0.9
Llama3 405B        LB    ageism        1.6
Claude 3.5 Sonnet  ES    ageism        1.6
Gemini 1.5 Flash   CA    racism        1.6
Gemini 1.5 Flash   ES    racism        1.6
Claude 3.5 Sonnet  ES    racism        1.6
GPT-4o             DE    racism        1.6
Llama3 405B        LB    sexism        1.7
Llama3 405B        CA    xenophobia    1.7
Gemini 1.5 Flash   EN    racism        2.4
GPT-4o             EN    racism        2.4
Claude 3.5 Sonnet  EN    lgbtiqphobia  2.5
Gemini 1.5 Flash   ES    religion      3.3
Claude 3.5 Sonnet  EN    religion      3.3
Llama3 405B        LB    religion      3.3
Llama3 405B        CA    religion      3.3
Claude 3.5 Sonnet  FR    racism        4.8
Llama3 405B        EN    lgbtiqphobia  5.0
Llama3 405B        CA    sexism        5.0
Llama3 405B        FR    sexism        5.8
Llama3 405B        FR    ageism        6.2
Gemini 1.5 Flash   FR    racism        6.4
Claude 3.5 Sonnet  LB    religion      6.7
GPT-4o             LB    religion      6.7
Llama3 405B        FR    religion      7.1
Llama3 405B        EN    sexism        7.5
Llama3 405B        ES    ageism        7.8
Llama3 405B        DE    ageism        7.8
Llama3 405B        EN    religion      10.0
Llama3 405B        ES    religion      10.0
Llama3 405B        EN    ageism        10.9
Llama3 405B        CA    ageism        12.5
Gemini 1.5 Flash   CA    religion      13.3
Llama3 405B        DE    religion      13.3
GPT-4o             CA    religion      13.3
Claude 3.5 Sonnet  DE    religion      16.7
GPT-4o             DE    religion      20.0
Llama3 405B        LB    racism        21.8
Claude 3.5 Sonnet  CA    ageism        25.0
Gemini 1.5 Flash   LB    ageism        25.0
Claude 3.5 Sonnet  LB    ageism        25.0
Llama3 405B        CA    racism        37.9
Llama3 405B        EN    racism        46.8
Llama3 405B        FR    racism        47.6
Llama3 405B        ES    racism        49.2
Llama3 405B        DE    racism        51.6
GPT-4o             LB    racism        51.6
Claude 3.5 Sonnet  DE    racism        52.4
GPT-4o             ES    racism        53.2
Claude 3.5 Sonnet  ES    sexism        63.3
Gemini 1.5 Flash   LB    sexism        63.3
Claude 3.5 Sonnet  LB    sexism        64.1
Claude 3.5 Sonnet  FR    sexism        65.8
Llama3 405B        ES    sexism        66.7
Table 6: Percentage of unprocessable responses for each test batch with at least one unprocessable
response.