Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages
Summary
This paper introduces MLA-BiTe, a framework for automated and augmented evaluation of social biases in Large Language Models (LLMs) across high- and low-resource languages. It addresses the challenge of manually translating and paraphrasing bias test templates for multilingual assessment, which is time-consuming and requires native language expertise. MLA-BiTe uses LLMs to automate translation and paraphrasing, enabling scalable and inclusive bias testing. The study evaluates four state-of-the-art LLMs in six languages (English, Spanish, French, German, Catalan, and Luxembourgish) across seven sensitive categories. Results show that LLM-based translation and paraphrasing effectively augment test templates, with paraphrasing before translation yielding marginally better outcomes. The framework reveals that low-resource languages exhibit greater variability and higher bias detection rates compared to high-resource languages, particularly in nuanced categories like Politics and Racism. The study also finds that model performance varies significantly by language and bias category, emphasizing the need for case-by-case model selection. Future work includes expanding language coverage, integrating image generation, and improving answer processing to enhance evaluation robustness.
Mind the Language Gap: Automated and Augmented Evaluation of
Bias in LLMs for High- and Low-Resource Languages
Alessio Buscemi1, Cédric Lothritz1, Sergio Morales2, Marcos Gomez-Vazquez1,
Robert Clarisó2, Jordi Cabot1,3, Germán Castignani1
1Luxembourg Institute of Science and Technology
2Universitat Oberta de Catalunya
3University of Luxembourg
{alessio.buscemi, cedric.lothritz, marcos.gomez, jordi.cabot, german.castignani}@list.lu
{smoralesg, rclariso}@uoc.edu
Abstract
Large Language Models (LLMs) have exhibited impressive natural language processing
capabilities but often perpetuate social biases inherent in their training data. To address this, we
introduce MultiLingual Augmented Bias Testing (MLA-BiTe), a framework that improves prior
bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages
automated translation and paraphrasing techniques to support comprehensive assessments across
diverse linguistic settings. In this study, we evaluate the effectiveness of MLA-BiTe by testing
four state-of-the-art LLMs in six languages, including two low-resource languages, focusing on
seven sensitive categories of discrimination.
1 Introduction
Large Language Models (LLMs) have become integral to modern Natural Language Processing
(NLP) applications, demonstrating remarkable capabilities in tasks such as machine translation [33],
text generation [6], and dialogue systems [26]. Despite these successes, a growing body of research
indicates that LLMs can exhibit harmful social biases, including stereotypes and discriminatory
attitudes. Such biases can arise from historical and cultural prejudices embedded in the data used
to train these models [2, 4, 11, 23, 24, 30, 36, 3, 8].
Recent work underscores that social biases in LLMs can manifest in various forms, such as
racist, sexist, or homophobic content [22]. When deployed at scale, these biases risk perpetuating
stereotypes and marginalizing vulnerable communities, raising ethical concerns and
emphasizing the need for bias mitigation strategies [35, 16]. While significant progress has been made in quantifying and reducing biases for high-resource languages like English, cross-lingual investigations reveal that biases also affect lesser-resourced languages, often in ways that are more difficult to detect and mitigate [14, 7].
Previous frameworks for evaluating social biases in generative AI systems have primarily focused on single-language settings, limiting their applicability in global and multilingual contexts. However, bias in AI systems can manifest differently across languages and cultures, making it essential to assess models in a linguistically diverse manner. There is a growing need for tools that allow non-technical stakeholders (such as Human Resources departments, Ethics Committees, and Diversity & Inclusion officers) to evaluate how AI systems align with their values across different languages. Enabling such inclusive and multilingual assessments is a crucial step toward fostering more trustworthy and equitable AI systems.
arXiv:2504.18560v1 [cs.CL] 19 Apr 2025
Ensuring the fairness of AI systems in multilingual and culturally diverse environments requires systematic evaluation across a broad spectrum of languages, including low-resource and regionally co-official ones. However, the development of bias evaluation benchmarks in multiple languages remains a significant challenge, particularly when non-technical stakeholders are tasked with authoring or validating prompts in languages they do not actively use. This issue is especially pronounced in settings where official languages differ from those predominantly
used in professional contexts. For instance, while Luxembourgish is an official language in Luxembourg, French and English are more commonly employed in the workplace. Similarly, Catalan is co-official in parts of Spain, yet not all professionals are proficient in its use. Analogous situations arise in countries such as South Africa and India, where languages like Zulu, Xhosa, Maithili, or Konkani hold official status but are often underrepresented in administrative and corporate environments.
The manual translation and paraphrasing of prompts to ensure semantic consistency and cultural appropriateness across languages is both time-consuming and difficult to scale. To address this limitation, we propose leveraging LLMs to automate these tasks. Specifically, we investigate whether LLMs can reliably perform translation and paraphrasing in a way that enables the generation of linguistically and culturally appropriate test cases. If effective, this approach would facilitate scalable and inclusive multilingual bias evaluations, while reducing dependency on native language expertise and enabling broader participation by non-technical stakeholders.
This paper introduces MLA-BiTe, a framework designed to enhance existing bias evaluation methods by supporting systematic multilingual bias testing. MLA-BiTe is built to operate on the input generated by Language Bias Testing (LangBiTe) [19], but it is flexible enough to be adapted for use with other bias detection systems. To guide our study, we focus on two primary research questions:
1.0.1 RQ1. Can LLM-based translation and paraphrasing effectively serve as a method to augment test templates in multiple
languages, and if so, which ordering of these steps yields the most reliable expansions?
1.0.2 RQ2. Based on the hypothesis that LLM-based translation and paraphrasing augmentation effectively enable multilingual bias testing, do low-resource languages exhibit more biases than high-resource languages?
To address RQ1, we leverage the In-Context Learning (ICL) capabilities of LLMs to expand the pool of languages that LangBiTe supports and systematically generate paraphrases of existing test templates. By preserving their semantic meaning, we ensure consistency when comparing different augmentation strategies (i.e., paraphrasing then translating vs. translating then paraphrasing). To investigate RQ2, we then compare the outcomes of these augmented test templates across both high-resource and low-resource languages.
By integrating automated translation and prompt augmentation, MLA-BiTe enables a broader analysis of how biases manifest in diverse linguistic contexts. This is particularly impactful in enterprise or public-sector settings, where organizations must meet multilingual obligations but lack the technical or linguistic capacity to do so manually.
The contributions of our work can be summarized as follows:
1. We present MLA-BiTe, which automates the translation and augmentation of templates for testing social biases in LLMs.
2. We conduct a series of assessments to evaluate whether LLM-based translation and paraphrasing offers a reliable strategy for augmenting test templates in multiple languages (addressing RQ1), and how the ordering of paraphrasing and translation affects these outcomes.
3. We examine how low-resource languages (Catalan
and Luxembourgish) compare to high-resource languages (English, Spanish, French, and German) in terms of detected biases (addressing RQ2).
2 Background and Related Work
This section explores the limitations of current approaches in detecting biases across languages. It also provides a concise overview of LangBiTe, i.e., the target bias-testing framework that serves as a blueprint for MLA-BiTe, highlighting its utility in multilingual settings and its current shortcomings. Finally, this section briefly discusses the state of the art for augmenting datasets and generating synthetic data to support bias detection.
2.1 Bias detection in text-to-text models
LLMs have achieved widespread popularity and are becoming pervasive for text classification, content generation, language translation, and text summarization, among many other tasks. However, because their training typically relies on large datasets derived from web crawls, they often fail to address ethical concerns and tend to mirror biases prevalent on the Internet [2, 4, 11, 23, 24, 30, 36, 3, 8]. In this sense, the European Union AI Act [32] requires EU members to establish guidelines and procedures for developers to avoid "discriminatory impacts and unfair biases prohibited by Union or national law" in their proprietary AI software.
There are many recent research studies proposing different approaches and prompt datasets for detecting bias in text-to-text LLMs [1, 9, 10, 31, 13, 17, 29, 34, 37]. Nevertheless, most of the testing prompts are written in English, and few target LLMs in other languages (e.g., [19, 27]). Additionally, LLMs are sensitive to prompt variations, thus using a
limited set of prompts
may affect the effectiveness of the evaluation [12].
2.2 LangBiTe: An open-source tool to automate bias testing
LangBiTe follows a sequential process for detecting bias in text-to-text models, based on a set
of ethical concerns (e.g., gender discrimination, racism) and sensitive communities that could
potentially be favored or harmed (e.g., men and women, White and Black people). LangBiTe
automatically: (1) selects a subset of prompt templates from a prompt library as per those ethical
concerns; (2) for each prompt template, generates a test case addressing each of the sensitive
communities; (3) prompts the LLMs under testing; and (4) builds reports with insights derived
from the LLMs' responses.
LangBiTe includes 3 curated prompt template libraries in English, Spanish and Catalan, each
of which contains over 300 prompts and templates for detecting ageism, gender discrimination,
LGBTQIA+phobia, political preferences, religious bias, racism, and xenophobia. Users can customize
and build their own prompt template libraries. Every new template must target an ethical concern,
include an optional prefix to precede the core text of the prompt, contain the text of the prompt
itself, and provide output formatting instructions for the LLM response. Moreover, a template has an
associated oracle that provides an expected valid, non-biased response from an LLM.
A template may include placeholders, in the format {<COMMUNITY>(<NUM>)?}, to be instantiated
with the ethical concern's communities. The <NUM> part is included in templates that evaluate
several sensitive communities of the same ethical concern (e.g., "{SEXUAL_ORIENTATION1} and
{SEXUAL_ORIENTATION2} people should have the same civil rights").
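As an illustration of the placeholder format {<COMMUNITY>(<NUM>)?}, the following minimal sketch extracts placeholders from a template with a regular expression. It is not taken from LangBiTe: the function name find_placeholders and the assumption that community names consist only of uppercase letters and underscores are ours, inferred from the examples in the text.

```python
import re
from typing import Optional

# Assumption: community names are uppercase letters and underscores,
# optionally followed by a number, e.g. {GENDER} or {SEXUAL_ORIENTATION1}.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)(\d+)?\}")


def find_placeholders(template: str) -> list[tuple[str, Optional[int]]]:
    """Return (community, number) pairs in order of appearance.

    The number is None when the placeholder carries no <NUM> part.
    """
    return [
        (m.group(1), int(m.group(2)) if m.group(2) else None)
        for m in PLACEHOLDER.finditer(template)
    ]


example = (
    "{SEXUAL_ORIENTATION1} and {SEXUAL_ORIENTATION2} people "
    "should have the same civil rights"
)
placeholders = find_placeholders(example)
```

Keeping the community name and the number as separate fields makes it straightforward to check that a translated or paraphrased template still contains exactly the placeholders of the original.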
The construction of the original English template library followed a process involving several
stakeholders with different areas of expertise. Later, it was manually translated into Spanish
and Catalan. As such, this procedure requires the participation of native speakers of the languages
to be supported by LangBiTe, hindering its scalability.
3 Methodology
MLA-BiTe operates exclusively on inputs provided to the underlying framework, such as the
PromptTemplates employed by LangBiTe. Because its core logic is decoupled from the specific
framework implementation, MLA-BiTe can readily accommodate inputs from other prompt-based
bias evaluation frameworks without necessitating alterations to their internal code structures.
Specifically, within LangBiTe, translation and paraphrasing procedures are implemented at the
template level, not at the individual prompt level, that is, prior to the instantiation of template
placeholders with targeted communities. This choice is justified because a single template with p
placeholders intended for filling from a set of n target communities can yield up to
$\frac{n!}{p!\,(n-p)!}$ distinct
test prompts. Performing translation and paraphrasing at the template level rather than at the
prompt level significantly enhances the efficiency and scalability of the approach.
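To make the scalability argument concrete, the sketch below evaluates the bound n!/(p!(n-p)!) and compares the number of translation requests needed at the template level and at the prompt level. The function names and the numbers used (300 templates, 3 target languages, 5 communities, 2 placeholders) are illustrative assumptions, not figures from our experiments.

```python
from math import comb


def max_prompts_per_template(n_communities: int, n_placeholders: int) -> int:
    """Upper bound n! / (p! (n - p)!) on prompts instantiated from one template."""
    return comb(n_communities, n_placeholders)


def translation_requests(n_templates: int, n_languages: int,
                         n_communities: int, n_placeholders: int,
                         per_prompt: bool) -> int:
    """Count LLM translation requests at the template or the prompt level."""
    items_per_template = (
        max_prompts_per_template(n_communities, n_placeholders) if per_prompt else 1
    )
    return n_templates * n_languages * items_per_template


# Hypothetical workload: 300 templates, 3 languages, 5 communities, 2 placeholders.
template_level = translation_requests(300, 3, 5, 2, per_prompt=False)
prompt_level = translation_requests(300, 3, 5, 2, per_prompt=True)
```

Under these assumed numbers, working at the prompt level multiplies the number of requests by the bound itself, here a factor of 10.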
Moreover, translating and paraphrasing at the individual prompt level would result in prompts
derived from the same template being syntactically divergent. This divergence would complicate
the interpretation of results, making it challenging to discern whether a failed test
prompt is due to variations in the ordering of community placeholders or subtle syntactic differences. By applying operations at the template level, the approach ensures that generated test prompts are syntactically uniform, thereby enhancing the comparability and interpretability of the evaluation outcomes.
Algorithm 1 outlines the overall workflow of MLA-BiTe. The tool takes as input a list of PromptTemplates (PT), an LLM that acts as both translator and paraphraser, the set of target languages L for translating the original PT, and the desired number of paraphrases P for each translation. It is worth noting that separate LLMs could be used for translation and paraphrasing. However, for simplicity, this work assumes the use of a single LLM for both tasks.
Initially, the translator is set up using the LLM, and the paraphraser is configured with the same LLM, along with the specified number of desired paraphrases P (lines 1-2). The list of generated PromptTemplates, GPT, is initialized as empty (line 3). Next, each pt in PT is translated by the translator into each language in L (line 5). The translated output, transl_pt, is then paraphrased P times using the paraphraser (line 6). Please refer to Section 4.6 for additional information regarding the choice of this pipeline. Finally, the newly generated PromptTemplates are appended to GPT (line 7).
It is important to note that if L is empty, meaning no translation is needed, transl_pt will be identical to pt. Similarly, if no augmentation is required (i.e., P = 0), paraph_pt will be the same as transl_pt.
Section 3.1 and Section 3.2 provide additional details for, respectively, the translation
and paraphrasing steps.
Algorithm 1 MLA-BiTe pipeline
Input: PT: PromptTemplates, LLM: an LLM, L: set of languages to translate into, P: number of
paraphrases
Output: GPT: generated PromptTemplates
1: translator ← initialize_translator(LLM)
2: paraphraser ← initialize_paraphraser(LLM, P)
3: GPT ← ()
4: for pt in PT do
5:   transl_pt ← translate(translator, pt, L)
6:   paraph_pt ← paraphrase(paraphraser, transl_pt, P)
7:   GPT.append(paraph_pt)
8: end for
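As a rough illustration of Algorithm 1, the following Python sketch mirrors its control flow, including the two edge cases (empty L and P = 0). It is a simplified sketch under stated assumptions and not the MLA-BiTe implementation: the PromptTemplate fields, the prompt strings, and the echo_llm stand-in are hypothetical, and the sketch issues one request per paraphrase, whereas MLA-BiTe requests all paraphrases of a template in a single prompt.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical LLM interface: takes a prompt string, returns the completion text.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class PromptTemplate:
    template: str
    language: str = "en"


def translate(llm: LLM, pt: PromptTemplate, languages: list[str]) -> dict[str, PromptTemplate]:
    """Translate one template into every target language (Algorithm 1, line 5)."""
    if not languages:  # no translation requested: transl_pt is identical to pt
        return {pt.language: pt}
    return {
        lang: replace(pt, template=llm(f"Translate into {lang}: {pt.template}"), language=lang)
        for lang in languages
    }


def paraphrase(llm: LLM, transl_pt: dict[str, PromptTemplate], p: int) -> dict[str, list[PromptTemplate]]:
    """Produce p paraphrased templates per language (Algorithm 1, line 6)."""
    if p == 0:  # no augmentation requested: keep the translated templates unchanged
        return {lang: [tpt] for lang, tpt in transl_pt.items()}
    return {
        lang: [
            replace(tpt, template=llm(f"Paraphrase {i + 1} of {p} in {lang}: {tpt.template}"))
            for i in range(p)
        ]
        for lang, tpt in transl_pt.items()
    }


def mla_bite(templates: list[PromptTemplate], llm: LLM, languages: list[str], p: int) -> list:
    """Run the translate-then-paraphrase loop (Algorithm 1, lines 3-8)."""
    generated = []
    for pt in templates:
        transl_pt = translate(llm, pt, languages)
        paraph_pt = paraphrase(llm, transl_pt, p)
        generated.append(paraph_pt)
    return generated


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: wraps the prompt so the data flow is visible."""
    return f"<{prompt}>"


source = PromptTemplate("Are {GENDER1} better than {GENDER2}?")
augmented = mla_bite([source], echo_llm, ["es", "ca"], 2)
unchanged = mla_bite([source], echo_llm, [], 0)
```

Replacing echo_llm with a function that calls an actual model API would be the only change needed to run the loop against a real LLM.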
Algorithm 2 Translation
Input: translator, pt: PromptTemplate, L: set of languages to translate into
Output: transl_pt: translated PromptTemplate
1: transl_pt ← {}
2: for l in L do
3:   t_template ← translator.translate(l)
4:   AT ← translator.affixTranslator(l)
5:   t_prefix ← AT.translate(pt.prefix)
6:   t_suffix ← AT.translate(pt.suffix)
7:   EVT ← translator.expectedValueTranslator(l)
8:   t_expVal ← EVT.translate(pt.expectedValue)
9:   transl_pt[l] ← [t_prefix, t_template, t_suffix, t_expVal]
10: end for
3.1 Translation
Algorithm 2 describes in detail the translation step. First, the output dictionary, transl_pt, is
initialized (line 1). Then, the translation into each l of L is treated independently (lines 2-9). The
template is the first to be translated (line 3). The prompt used for the translation is reported and
described in Section 8.
The next step is to initialize an auxiliary component of the translator, the affixTranslator (line
4). As outlined in [19], templates can be preceded by a prefix and followed by a suffix, which
encapsulate the text provided to the LLM and help specify the expected output. The affixTranslator
is responsible for translating these elements to align with the language of the template. Since neither
the prefix nor the suffix possesses unique features or placeholders, the affixTranslator is tasked with
performing a straightforward translation, again with the instruction to preserve the precise
semantic meaning (lines 5-6).
Prefixes and suffixes are often consistent across multiple templates. To optimize efficiency, the
affixTranslator does not translate them repeatedly. Instead, it checks for an existing dictionary
mapping translations from the original language to the target language. If the entry is found, it
applies the stored translation; if not, it generates the translation, adds it to the dictionary, and
reuses it as needed.
This approach reduces costs, specifically by avoiding redundant inference for the same task, and
enhances consistency in the output for templates that share identical affixes in the original language.
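The dictionary lookup described above amounts to memoizing the translation call. A minimal sketch follows, assuming a hypothetical llm_translate(text, target_language) callable. The class name CachedAffixTranslator and the llm_calls counter are ours and serve only to make the saving observable.

```python
from typing import Callable


class CachedAffixTranslator:
    """Translate prefixes and suffixes, reusing stored translations when available."""

    def __init__(self, llm_translate: Callable[[str, str], str], target_language: str):
        self._llm_translate = llm_translate   # hypothetical model call
        self._target = target_language
        self._cache: dict[str, str] = {}      # original affix -> translated affix
        self.llm_calls = 0                    # number of actual inference requests

    def translate(self, affix: str) -> str:
        if affix not in self._cache:
            # Entry not found: generate the translation once and store it.
            self._cache[affix] = self._llm_translate(affix, self._target)
            self.llm_calls += 1
        return self._cache[affix]


def fake_translate(text: str, language: str) -> str:
    """Stand-in for a real model call, used only to exercise the cache."""
    return f"[{language}] {text}"


translator = CachedAffixTranslator(fake_translate, "lb")
first = translator.translate("Answer with yes or no.")
second = translator.translate("Answer with yes or no.")
```

Because models sampled at a non-zero temperature may return different wordings for the same input, serving the stored translation also guarantees that templates sharing an affix receive exactly the same translated affix.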
Algorithm 3 Paraphrasing
Input: paraphraser, transl_pt: translated PromptTemplate, P: number of paraphrases
Output: paraph_pt: paraphrased PromptTemplate
1: paraph_pt ← {}
2: for l, tpt in transl_pt do
3:   template ← tpt.get_template()
4:   gn ← paraphraser.identify_grammar_n(template)
5:   paraphs ← paraphraser.paraphrase(template, gn, P)
6:   paraph_pt[l] ← create_pts(tpt, paraphs)
7: end for
It is to be noted that, given the limited number of prefixes and suffixes, this dictionary could be
populated manually. However, for users defining new affixes in one (or a few) languages for their tests,
this method provides a way to further automate the process.
Another component of the translator, expectedValueTranslator, is responsible for translating the
expected values. It takes as input a dictionary of expected values and translates each entry (lines 7-8). Similar to the affixTranslator, this process is not performed repeatedly; instead, it verifies if translations already exist and reuses them when available.
Finally, a list including the translated prefix t_prefix, the translated template t_template, the translated suffix t_suffix and the translated expected values t_expVal is added as value to the key l in transl_pt (line 9).
It is important to note that each model output is filtered using regular expressions to remove unwanted text preceding the translation, such as "The translation is ...". After initial testing and several trial-and-error iterations on the LLMs considered in this work, we developed a set of regular expressions that process the outputs with near-perfect accuracy (>98%). However, it is important to note that each LLM generates responses in slightly different formats. Therefore, additional work will be required to accommodate further LLMs, particularly reasoning models, which often append their reasoning process to the output.
3.2 Paraphrasing
Algorithm 3 describes the paraphrasing step. First, the output dictionary, paraph_pt, is initialized (line 1). Each translated PromptTemplate is then evaluated independently, by iterating on each key l and value tpt of transl_pt (lines 2-7). The template is extracted from the tpt, since it is the only field to be paraphrased (line 3). Then, its grammatical number is identified by the paraphraser (line 4).
The grammatical number refers to the grammatical feature indicating whether a noun, verb, or sentence pertains to a singular or plural subject. Through
preliminary testing, we observed that
despite employing best practices for prompting to constrain the LLM to follow the grammatical
number of the original sentence, some paraphrases would deviate. For instance, given the template
"Are {GENDER1} better than {GENDER2}?", the model might produce a paraphrase such as "Is
a {GENDER1} better than a {GENDER2}?". When applied in LangBiTe with community-specific
placeholders like men and women, this paraphrase results in grammatically incorrect test prompts,
such as "Is a men better than a women?".
Empirical testing across multiple LLMs revealed that explicitly identifying and specifying the
grammatical number within the prompt leads to significantly more robust paraphrasing results
compared to general instructions like "preserve the grammatical number of the original sentence".
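The failure mode above can be reproduced with a few lines of code. The sketch below instantiates both the original template and the deviating paraphrase, and then builds a paraphrasing prompt that states the grammatical number explicitly. The helper names and the prompt wording are hypothetical and are not the prompt used by MLA-BiTe.

```python
def instantiate(template: str, communities: dict[str, str]) -> str:
    """Fill each {PLACEHOLDER} with the community assigned to it."""
    prompt = template
    for placeholder, community in communities.items():
        prompt = prompt.replace("{" + placeholder + "}", community)
    return prompt


def build_paraphrase_prompt(template: str, grammatical_number: str, p: int) -> str:
    """Hypothetical prompt that names the grammatical number instead of
    asking the model to 'preserve' it."""
    return (
        f"Write {p} paraphrases of the sentence below. "
        f"The sentence is {grammatical_number}; every paraphrase must also be "
        f"{grammatical_number}. Keep all placeholders in curly braces unchanged.\n"
        f"Sentence: {template}"
    )


communities = {"GENDER1": "men", "GENDER2": "women"}
original = instantiate("Are {GENDER1} better than {GENDER2}?", communities)
deviating = instantiate("Is a {GENDER1} better than a {GENDER2}?", communities)
prompt = build_paraphrase_prompt("Are {GENDER1} better than {GENDER2}?", "plural", 5)
```

Since the communities are substituted only after paraphrasing, a change of grammatical number cannot be corrected downstream, which is why it has to be constrained in the paraphrasing prompt itself.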
The template, the grammatical number gn, and P are subsequently passed to the paraphraser in
the prompt used to produce the paraphrases (line 5), which is reported and described in Section 3.2.
Finally, the paraphrases (paraphs) generated by the model are utilized to create new
PromptTemplates specific to the language l (line 6). In particular, P PromptTemplates are created, each
corresponding to a paraphrase that serves as the template, while the remaining fields, such as the
prefix, expectedValue, etc., are directly copied from the original PromptTemplate.
Similarly to the translation process, each model output is filtered using regular expressions to
remove unwanted text before and after the desired output format, which has been omitted from the
prompt above for brevity.
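The kind of post-processing described here can be sketched as follows. The regular expression is a hypothetical example covering only preambles such as "The translation is ..." or "Here is the translation:". It is not the set of expressions developed for the LLMs in this work, which is tuned to each model's output format.

```python
import re

# Hypothetical preamble pattern; real model outputs need model-specific patterns.
PREAMBLE = re.compile(
    r"^\s*(?:here is|the)\s+(?:the\s+)?translation(?:\s+is)?\s*:?\s*",
    re.IGNORECASE,
)


def clean_output(raw: str) -> str:
    """Strip a leading preamble and surrounding quotation marks from a model reply."""
    text = PREAMBLE.sub("", raw.strip(), count=1)
    # Remove straight or curly double quotes wrapping the payload.
    return text.strip().strip('"\u201c\u201d').strip()


with_preamble = clean_output('The translation is: "Sind {GENDER1} besser als {GENDER2}?"')
without_preamble = clean_output("Sind {GENDER1} besser als {GENDER2}?")
```

A reply that matches none of the known patterns is returned unchanged, so a downstream check (for example, verifying that all placeholders of the source template are present) is still advisable.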
4 Experiment setup and preliminary results
In this section, we describe the evaluation setup addressing RQ1, which focuses on assessing whether LLM-based translation and paraphrasing can effectively augment test templates across multiple languages, and which ordering of these steps yields the most reliable expansions. This includes a preliminary evaluation phase to select the most suitable LLM and configuration.
4.1 Setup
The implementation of MLA-BiTe was carried out using Python 3.11. Four non-reasoning state-of-the-art LLMs were queried via different APIs. Details of the employed LLMs and their respective APIs are provided in Table 1. All tests were conducted from 5 to 7 February 2025, using the most up-to-date version of each model available at that time.
Table 1: Candidate LLM
Model               #Parameters   API
Claude 3.5 Sonnet   Undisclosed   Anthropic
Gemini Pro 1.5      Undisclosed   Google Deepmind
Llama3 405b         405 billion   Replicate
GPT-4o              Undisclosed   OpenAI
We set the temperature to 1 for all models, striking a balance between creativity and predictability. This configuration allows the models to generate a diverse range of translations and paraphrases while maintaining coherence and reliability. All other parameters were left at their default values to ensure consistency across experiments.
The initial step involves identifying the most suitable model for translating and paraphrasing the templates. This selection was based on preliminary tests, the details of which are provided in Section 4.5 and Section 4.4.
4.2 Test set
All tests were conducted using the test cases published on the LangBiTe GitHub repository [18], specifically those covering the sensitive categories/concerns: Ageism,
Lgbtiqphobia, Politics, Racism, Religion, Sexism, and Xenophobia. The concern labeled Sexual ambiguity, available only in English, was excluded from the evaluation. This concern relies on linguistic constructs that are not directly translatable or meaningful in many other languages, such as third-person singular pronouns with ambiguous gender.
4.3 Model selection: translation preliminary tests
The translation evaluation was conducted on the candidate LLMs presented in Section 4.1 by testing their performance in translating a subset of test cases in English, Spanish, and Catalan published on the LangBiTe GitHub repository. Specifically, 20% of the templates were randomly sampled from the Spanish test cases (i.e., 61 test cases), and the corresponding test cases (identified by their IDs) were later retrieved for the other two languages. We then translated each test case from one of the three languages into the other two, resulting in a total of six distinct translations per test case.
The primary evaluation metric is the number of successful translations, defined as instances where the LLM followed the instruction and the correct translation was extracted from its response. Table 2 presents the percentage of successful translations for each model. GPT-4o and Gemini 1.5 Flash produced translations in all tested cases. In contrast, Llama 3 405B failed to generate translations for a few instances, while Claude 3.5 exhibited nearly 10% non-compliance.
Table 2: Successful translations made by the candidate LLMs
Model               %Successful translations
Claude 3.5 Sonnet   90.4%
Gemini 1.5 Flash    100%
Llama3 405b         99.2%
GPT-4o              100%
Furthermore, we conducted
an evaluation to compare the quality of the machine-generated translations against the human-translated versions of the test cases. To ensure a thorough evaluation, we employed two complementary metrics.
The first metric, cosine similarity, is used to assess the semantic alignment between two translations, capturing the extent to which the meaning conveyed by the machine translation aligns with that of the human reference. This metric ranges from -1 (completely dissimilar) to 1 (perfectly similar) [28]. Please note that cosine similarity is calculated based on the embeddings generated from the human-translated version and the LLM-translated version. To produce these embeddings, we used paraphrase-multilingual-mpnet-base-v2, a sentence transformer available on Hugging Face that specializes in generating multilingual semantic embeddings [25].
The second metric, the Bilingual Evaluation Understudy (BLEU) score, evaluates the quality of machine translation by comparing n-grams of the candidate translation against one or more reference translations. The BLEU score ranges from 0 to 1, where 0 indicates no overlap between the candidate and reference translations, and 1 indicates a perfect match. In our case, a lower BLEU score is actually preferred, as it implies that the paraphrases are structurally different from the original, which is desirable for evaluating robustness, as long as the semantic meaning is preserved.
Given the primary focus of this work on semantic similarity, cosine similarity plays a critical role. The preservation of the core meaning in each test case is essential to ensure alignment with the user-defined ground truth, i.e., the
expected results as defined by LangBiTe, and to support a robust evaluation.
The results are shown in Figure 1. The figure demonstrates that the performance of all the evaluated models is relatively similar, with GPT-4o achieving the highest scores on average. Additionally, it is noteworthy that the bidirectional translation between Spanish and Catalan consistently outperforms translations involving other language pairs, indicating a higher level of linguistic alignment or model optimization for this specific pair. This trend highlights the importance of considering language-specific characteristics and potential model fine-tuning for related languages in evaluating translation tasks.
[Figure 1 panels: Catalan to English, Catalan to Spanish, English to Catalan, English to Spanish, Spanish to Catalan, Spanish to English; axes: Cosine Similarity and BLEU; legend (LLMs): Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o]
Figure 1: The BLEU scores and cosine similarities for translations between each of the tested languages and the other two, as generated by the selected LLMs.
4.4 Model selection: augmentation preliminary tests
As detailed in
Section 3.2, all paraphrases for a single test case are generated using a single prompt to encourage variety. The paraphrasing process is therefore influenced by the number of paraphrases requested: a higher number requires the model to exhibit greater creativity to ensure diversity while maintaining the semantic meaning and format of the original template. To assess this, we evaluated the LLMs on the paraphrasing task under three configurations: P = 2, P = 5 and P = 10 paraphrases. Each paraphrased template was compared to the original template using cosine similarity and BLEU. Figure 2 shows the aggregated results, averaged across all three languages for this evaluation. Detailed language-specific results are discussed in the Appendix.

[Figure 2 panels: P = 2, P = 5, P = 10; axes: Cosine Similarity and BLEU; legend (Models): Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o]
Figure 2: BLEU and cosine similarities for paraphrasing across all the tested languages, with the number of paraphrases P in {2, 5, 10}.

The results indicate that the size of P does not significantly influence the syntactic or semantic proximity of the paraphrases to the original template. With an average cosine similarity ranging from 0.85 to 0.95, the models largely preserve the original semantic meaning.

4.5 Model selection
Based on the results
presented in Section 4.3 and Section 4.4, GPT-4o was selected for translation and paraphrasing in the main tests presented in Section 5. This choice was motivated by its reliable instruction-following and by its results: although it did not achieve the highest performance on every paraphrasing and translation task, the model yielded the best average results, particularly when emphasizing cosine similarity.

4.6 Pipeline selection
After selecting the model for translation and paraphrasing, the final step before conducting the main experiments is to determine the optimal order of paraphrasing and translation, as this will influence the quality of the final output. In this regard, we consider the bidirectional translation between English (EN) and Spanish (ES), and between Spanish and Catalan (CA), with the number of paraphrases P = 5. In the paraphrasing-to-translation pipeline (P2T), we use the paraphrase results RP previously collected and outlined in Section 4.4, translating them into the target language. For the translation-to-paraphrasing pipeline (T2P), we select a subset of the translations previously gathered and detailed in Section 4.3, ensuring they correspond to the same templates used to generate RP.

[Figure 3: boxplots; x-axis (Translation): EN -> ES, ES -> EN, ES -> CA, CA -> ES; y-axis: Cosine Similarity; legend (Pipeline): P2T, T2P]
Figure 3: Distribution of cosine similarity scores for selected translations at P = 5, used to compare the performance of the two proposed pipelines, P2T and T2P.

To assess the optimal order of the pipeline, we employ the methodology described in Section 4.3 and Section 4.4. Specifically, we calculate the cosine similarity between each
sentence generated by the pipeline and its corresponding human-written input. The results of this evaluation are illustrated in Figure 3, which presents boxplots summarizing the distribution of cosine similarity scores. As evident from Figure 3, the P2T pipeline exhibits a marginally higher median cosine similarity when the translation direction is from English to Spanish or from Spanish to Catalan. Conversely, the T2P pipeline slightly outperforms P2T in both reverse cases. These findings suggest that the order of translation and paraphrasing has a negligible impact on the overall output quality. For the purposes of this study, we have opted to use the T2P pipeline for the main evaluation. However, further investigation is required to generalize these conclusions and explore potential nuances.

5 Main performance evaluation
In Section 4, we addressed RQ1, demonstrating that LLM-based translation and paraphrasing effectively augment bias-testing templates. We also observed that applying paraphrasing before translation yields slightly better results than the reverse. In this section, we address RQ2, namely whether low-resource languages exhibit more bias than high-resource languages when tested with augmented multilingual templates.

5.1 Language selection
In this work, we focus on two major Indo-European language families, specifically the Romance and West Germanic families. In particular, we select six languages, including both high- and low-resource languages, for their linguistic diversity, geographic coverage, and availability of ground truth data:

Romance languages:
• Spanish (ES): A high-resource language mainly spoken in Spain
and numerous countries in Latin America. Ground truth data for Spanish is available from the original LangBiTe study.
• Catalan (CA): A low-resource language spoken in Eastern Spain and Andorra, for which ground truth is also available.
• French (FR): The former lingua franca, mainly spoken in numerous countries in Western Europe and in Western and Central Africa, as well as in Eastern Canada. We will use it for cross-validation of Romance language results.

West Germanic languages:
• English (EN): The current lingua franca in many domains and the dominant language for most language models. Ground truth data is available from the original LangBiTe study.
• German (DE): A high-resource language mainly spoken in Germany, Austria, Switzerland, and Luxembourg.
• Luxembourgish (LB): A low-resource language spoken in Luxembourg and closely related to German, which helps to cross-validate the findings for Catalan on low-resource languages.

5.2 Performance Evaluation
The set of templates described in Section 4.2 was used for the main experiment. English served as the source language, from which the test cases were translated into the target languages. For the paraphrasing component, we set the number of variations to P = 1. The communities analyzed in this study are the same as those considered in [19]. Figure 4 presents a series of spider (radar) plots illustrating each LLM's performance across the sensitive categories for each language included in this study. Hereafter, we define each unique concern-language combination as a test batch. Within each plot, the radial axes represent the percentage of tests passed by a given model for a particular test
batch, thus enabling a direct comparison of how effectively different LLMs handle sensitive content. Note that these results reflect only tests for which valid and interpretable answers were obtained. Although the framework allows up to three retries per test, some responses remained unprocessable. As described in [19], LangBiTe evaluates answers by searching for predefined, case-specific keywords and includes templates requiring structured responses (e.g., in JSON). However, not all AI models consistently follow such formatting instructions; some produce outputs that deviate from the requested structure, possibly due to limitations in their training or insufficient understanding of the formatting constraints. Such unprocessable answers are discarded from the final evaluation. Overall, 64.3% of test batches experienced zero processing failures, and 21.4% showed failure rates of 10% or less. The remaining 14.3% of test batches exhibited failure rates above 10%. A detailed list of encountered errors is provided in Section 8.

Several noteworthy observations emerge from Figure 4. First, English and Spanish consistently yield the highest or most stable scores across the bias categories, irrespective of the model. This finding aligns with earlier results indicating that widely used languages with substantial training corpora tend to produce more accurate automated bias-detection outcomes. By contrast, Catalan and Luxembourgish exhibit greater variability in categories such as Politics and Racism, likely because smaller or lower-resource languages contain sparser training data that may limit the models' ability to handle culturally specific terms and
nuances.

[Figure 4 panels: Ageism, Lgbtiqphobia, Politics, Racism, Religion, Sexism, Xenophobia; spokes: EN, ES, CA, LB, FR, DE; radial axis: percentage of passed tests (20 to 100); legend: Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o]
Figure 4: Each spider plot illustrates the percentage of passed tests for each LLM in one of the seven sensitive categories examined in this paper, spanning all six languages analyzed.

The models themselves also vary in their performance. GPT-4o generally achieves high scores across most categories (particularly Ageism, Sexism, and Xenophobia), indicating strong coverage of related keywords and contexts. Gemini 1.5 Flash often excels in Religion and Lgbtiqphobia, suggesting it can effectively capture nuanced expressions of bias across languages in these domains. Meanwhile, Claude 3.5 Sonnet typically maintains moderate to high consistency in Sexism and Racism across multiple languages but sometimes fluctuates in Politics, reflecting challenges associated with localized political terminology. Llama3 405B demonstrates comparatively mixed results: it excels in certain instances of Racism and Ageism, yet may underperform in categories such as Politics or Xenophobia for lower-resource languages.

For categories like Lgbtiqphobia and Xenophobia, all four LLMs exhibit relatively high detection rates in most languages. This consistency may stem from the more universal nature of terms referring to LGBTIQ+ identities or xenophobic attitudes. By contrast, Politics emerges as the
most variable concern, with each model showing inconsistencies across different languages. Similarly, Sexism and Ageism produce mid-range consistency across models, suggesting that while many overtly disparaging or discriminatory terms are well covered, subtler connotations may elude straightforward keyword matching or demand deeper contextual understanding. Lastly, Religion tends to be comparatively stable across both languages and models, presumably due to shared or borrowed religious terminology and the availability of well-established keywords that more readily transfer from English prompts to other languages.

Figure 5 aggregates the results shown in Figure 4 by language and model, alongside the mean outcomes. As depicted, Llama3 405B is the most biased LLM overall, while Gemini 1.5 Flash and GPT-4o exhibit the strongest overall performance, with mean scores around 75%.

Figure 5: Aggregated results by language and model (success percentage, i.e., percentage of passed tests).

Model               CA    DE    EN    ES    FR    LB    Mean
Claude 3.5 Sonnet   70.1  75.1  79.2  69.7  71.4  63.6  71.5
GPT-4o              78.6  73.8  79.7  78.3  77.4  66.0  75.6
Gemini 1.5 Flash    73.0  79.5  81.8  75.5  75.3  70.2  75.8
Llama3 405B         38.2  51.2  61.6  48.8  50.5  44.4  49.1
Mean                65.0  69.9  75.6  68.1  68.7  61.1  68.0

Regarding performance by language, models generally perform best on high-resource languages, achieving their highest average scores in English, and appear to exhibit more social biases when tested on lower-resource languages. Notably, Luxembourgish stands out as the language with the highest discrimination rates overall. GPT-4o on Catalan,
however, is an outlier: at 78.6% it is the model's second-best language after English and the highest score of any model on a low-resource language (Figure 5). Nevertheless, because GPT-4o was chosen as the translation and paraphrasing model according to the results reported in Section 4, its output may provide GPT-4o with a slight advantage in the bias-detection task. Further work is required to evaluate this potential effect.

Given the variance observed in Figure 4 across different bias categories, it is also evident that choosing an LLM may require a case-by-case approach. Individual models can exhibit strong performance in some categories while underperforming in others, especially when targeting localized cultural or linguistic nuances. Hence, a selection process that accounts for both language and bias category may be necessary to optimize bias detection and mitigation.

In conclusion, and in direct response to RQ2, these findings suggest that LLMs exhibit higher social biases when data augmentation is performed for low-resource languages. Nonetheless, the particular model best suited for each task may vary depending on the specific bias category and language under consideration.

6 Discussion
In this section, we complement the results presented in Section 5 by conducting a Pearson correlation analysis on the performance of the same model/concern pairs across different languages. This analysis highlights both common patterns and divergences in behavior across languages. The outcomes, depicted in Figure 6, reveal that, contrary to initial expectations, LLMs do not consistently exhibit comparable biases in linguistically related languages. For instance, while German and English (both West Germanic languages)
display the highest performance similarity across all language comparisons, the biases observed in Luxembourgish are more closely aligned with those detected in Spanish and Catalan than with German or English. A more granular examination of individual bias dimensions (see Figure 4) further underscores these unexpected findings. Notably, LLMs display marked performance variations across several categories of bias, including Ageism, Lgbtiqphobia, Racism, and Sexism. For example, GPT-4o performs comparatively poorly in the Racism category for Catalan and French, whereas Gemini 1.5 Flash exhibits pronounced differences in Sexism performance between these two languages. Collectively, these observations indicate that linguistic proximity does not necessarily translate into similar bias patterns across different LLMs.

Figure 6: Heatmap of the Pearson correlation of the performance achieved on the same model/concern pair across different languages.

      CA    DE    EN    ES    FR    LB
CA    1.00  0.84  0.76  0.73  0.83  0.55
DE    0.84  1.00  0.86  0.70  0.81  0.48
EN    0.76  0.86  1.00  0.72  0.80  0.34
ES    0.73  0.70  0.72  1.00  0.82  0.66
FR    0.83  0.81  0.80  0.82  1.00  0.54
LB    0.55  0.48  0.34  0.66  0.54  1.00

From Figure 4 it also emerges that political bias is a notable outlier to our observations. In evaluating the political bias of language models, it is essential to highlight the limitations of LangBiTe's default template library, and of the paraphrases derived from it, particularly when the queries are predominantly centered around U.S. politics and require a neutral stance.
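For reference, the correlation analysis summarized in Figure 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name language_correlations, the LANGS ordering, and the toy pass rates are hypothetical, and each language is assumed to be represented by a vector of pass percentages with one entry per model/concern pair, listed in the same order for every language.

```python
import numpy as np

# Language codes in the order used by Figure 6.
LANGS = ["CA", "DE", "EN", "ES", "FR", "LB"]


def language_correlations(pass_rates):
    """Return the Pearson correlation matrix between languages.

    pass_rates maps each language code to a list of pass percentages,
    one per model/concern pair, in the same order for every language.
    """
    data = np.array([pass_rates[lang] for lang in LANGS], dtype=float)
    # np.corrcoef treats each row as one variable, so the result is a
    # len(LANGS) x len(LANGS) matrix of Pearson coefficients.
    return np.corrcoef(data)


# Hypothetical pass rates (not taken from the paper), four pairs per language.
toy_rates = {
    "CA": [70.0, 80.0, 40.0, 65.0],
    "DE": [75.0, 85.0, 45.0, 70.0],  # CA shifted by +5, so r(CA, DE) = 1
    "EN": [80.0, 82.0, 60.0, 75.0],
    "ES": [69.0, 78.0, 50.0, 66.0],
    "FR": [71.0, 77.0, 52.0, 68.0],
    "LB": [60.0, 50.0, 45.0, 62.0],
}

corr = language_correlations(toy_rates)
for lang, row in zip(LANGS, corr):
    print(lang, np.round(row, 2))
```

In the study's setup, each vector would presumably hold 28 entries (four models times seven concerns); the toy data uses four entries only to keep the example short.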
What we see in Figure 4 is that most models take an ideological side when prompted about U.S. political issues, whereas the oracles expect no positioning at all. Nevertheless, while LangBiTe's templates provide valuable insights into U.S.-related political leanings from generative AI models, they may not fully capture the differences and complex nuances of political discourse in other countries and languages. Political ideologies and the framing of societal matters can vary significantly across diverse national or regional contexts. In addition, political ideologies and stances tend to evolve over time and are generally too complex to be placed on a one-dimensional spectrum [15]. Consequently, results derived from an American-centric dataset might not offer a comprehensive assessment of a model's potential bias on a global scale.

As mentioned in Section 5, not all LLMs duly follow LangBiTe's formatting instructions, with some deviating from the required structure. This leads to processing errors, since the output may not be correctly interpreted, not even by the LLM-as-judge. Such structured output formatting instructions are included in templates that ask for probabilities of particular aspects, events, or traits for different sensitive communities. Most of these templates target sexism (42 out of 65 templates) and racism (46 out of 98), leading to a higher number of errors in evaluating these ethical concerns.

7 Future work
In this work, we have tested MLA-BiTe on four LLMs across six languages. Future work includes:
1) Expanding the Evaluation to More LLMs: We aim to include additional LLMs in our evaluation, specifically to analyze how
performance varies with model size.
2) Extending Language Coverage: As discussed in Section 5.1, the languages used in this study belong to European families. Future work will extend the evaluation to non-European languages, with a focus on low-resource languages. This poses additional challenges, as many of these languages exhibit linguistic characteristics that differ significantly from those of Indo-European languages, such as complex systems of grammatical number, noun class, or verb morphology. These features may require tailored strategies for reliable evaluation.
3) Integrating Image Generation Capabilities: We plan to extend the framework to cover image generation. In this context, multilingual, augmented prompts could be used to produce images through ImageBiTe [21]. This extension would allow us to investigate how the distribution of generated images varies according to the language in which the prompt is formulated.
4) Enhancing Answer Processing and Evaluation: We also aim to identify strategies to improve the processing of LLM-generated answers. In particular, we plan to strengthen the LLM-as-a-judge component to reduce the number of unprocessed executions and improve the robustness of the evaluation.
5) Exploring Culturally Aware Translation: Lastly, we aim to investigate translation strategies that respect cultural norms and values specific to the target language and society. For instance, prompts or examples involving food may need to avoid certain ingredients depending on cultural or religious context. Such strategies could help mitigate risks of offending or alienating different user groups, ensuring that automated
translations remain both accurate and respectful.

8 Conclusion
This study introduced MLA-BiTe, a framework that improves prior bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages automated translation and paraphrasing techniques to support comprehensive assessments across diverse linguistic settings. For this study, we adapted the framework to generate input templates compatible with the LangBiTe framework [20], which we subsequently used to validate our method. Under this setting, we tested MLA-BiTe on a representative set of both high-resource languages (e.g., English, Spanish, French, German) and low-resource languages (e.g., Catalan, Luxembourgish). These languages were selected to encompass a range of linguistic characteristics and resource availability; however, they do not represent the full extent of languages supported by the framework.

Our first research question concerned whether LLM-based translation and paraphrasing methods can effectively augment bias-testing templates. We found that they enhance the overall comprehensiveness of multilingual bias evaluation, with the strategy of paraphrasing before translation delivering marginally better outcomes.

Our second research question focused on whether low-resource languages exhibit higher degrees of bias compared to high-resource languages. Our performance evaluation reveals that, indeed, LLMs generally attain higher and more stable bias-detection scores in languages with extensive training data. In contrast, lower-resource languages display greater variability, particularly for nuanced bias categories like Politics and Racism, corroborating prior
work suggesting that richer training corpora often lead to more consistent results across bias domains.

Aggregated findings indicate that some models demonstrate robust performance in most categories, whereas others show variability, highlighting how model architecture and training data composition can influence biases. Moreover, correlation analyses found no clear pattern of parallel bias trends among linguistically similar languages, suggesting that cross-linguistic bias transfer is more complex than simple language-family groupings might imply.

In summary, translation and paraphrasing substantially bolster bias-detection robustness in multilingual contexts, and lower-resource languages remain more prone to biases. Nonetheless, individual results depend heavily on which model-language pair and bias category are being considered. Consequently, selecting an LLM for bias-detection tasks should be approached on a case-by-case basis. Future work will expand both model and language coverage and investigate applications in other domains, including bias evaluation in image-generation systems. Additional research might further explore cross-modality approaches to address the nuanced challenges posed by low-resource languages and complex bias categories.

Acknowledgements
This work has been partially funded by the Luxembourg National Research Fund (FNR) PEARL program (grant agreement 16544475); the research network RED2022-134647-T and the project PID2023-147592OB-I00 "SE4GenAI", both funded by MCIN/AEI/10.13039/501100011033.

References
[1] Sarah Alnegheimish, Alicia Guo, and Yi Sun. Using natural sentence prompts for
understanding biases in language models. In Human Language Technologies, pages 2824–2830. ACL, 2022.
[2] Christine Basta, Marta R. Costa-Jussà, and Noe Casas. Evaluating the underlying gender bias in contextualized word embeddings. In Gender Bias in NLP, pages 33–39. ACL, 2019.
[3] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
[4] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in NeurIPS, 29, 2016.
[5] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens, 2022. URL https://arxiv.org/abs/2112.04426.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[7] Alessio Buscemi and Daniele Proverbio. Chatgpt vs gemini vs llama on multilingual
sentiment analysis. arXiv preprint arXiv:2402.01715, 2024.
[8] Alessio Buscemi and Daniele Proverbio. Roguegpt: dis-ethical tuning transforms chatgpt4 into a rogue ai in 158 words. arXiv preprint arXiv:2407.15009, 2024.
[9] Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models. In 61st Annual Meeting of the Association for Computational Linguistics, pages 1504–1532. ACL, 2023.
[10] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: dataset and metrics for measuring biases in open-ended language generation. In FAccT, pages 862–872. ACM, 2021.
[11] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In EMNLP, pages 3356–3369. ACL, 2020.
[12] Rem Hida, Masahiro Kaneko, and Naoaki Okazaki. Social bias evaluation for large language models requires prompt variations. arXiv preprint arXiv:2407.03129, 2024.
[13] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. Measuring bias in contextualized word representations. In 1st Workshop on Gender Bias in Natural Language Processing, pages 166–172. ACL, 2019.
[14] Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint arXiv:2005.00633, 2020.
[15] Verlan Lewis and Hyrum Lewis. The myth of left and right: How the political spectrum misleads and harms america. 2022.
[16] Percy Liang, Rishi Bommasani,
Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
[17] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2023.
[18] Sergio Morales. Langbite, 2024. URL https://github.com/SOM-Research/LangBiTe.
[19] Sergio Morales, Robert Clarisó, and Jordi Cabot. A DSL for testing LLMs for fairness and bias. In MODELS, pages 203–213. ACM, 2024.
[20] Sergio Morales, Robert Clarisó, and Jordi Cabot. LangBiTe: A platform for testing bias in large language models. arXiv preprint arXiv:2404.18558, 2024.
[21] Sergio Morales, Robert Clarisó, and Jordi Cabot. ImageBiTe: A framework for evaluating representational harms in text-to-image models. In Proceedings of the 4th International Conference on AI Engineering – Software Engineering for AI, 2025. Pending publication.
[22] Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456, 2020.
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), 2020.
[24] Rajesh Ranjan, Shailja Gupta, and Surya Narayan Singh. A comprehensive survey of bias in LLMs: Current landscape and future directions. arXiv preprint arXiv:2409.16430, 2024.
[25] Nils Reimers and Iryna
[25] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL http://arxiv.org/abs/1908.10084.
[26] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637, 2020.
[27] Jayanta Sadhu, Maneesha Rani Saha, and Rifat Shahriyar. Social bias in large language models for Bangla: An empirical study on gender and religious bias. arXiv preprint arXiv:2407.03536, 2024.
[28] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[29] Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, et al. Evaluating large language models with fmeval. arXiv preprint arXiv:2407.12872, 2024.
[30] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In EMNLP-IJCNLP, pages 3407–3412. ACL, 2019.
[31] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[32] European Union. The artificial intelligence act, 2024. URL https://artificialintelligenceact.eu.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[34] Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu.
BiasAsker: Measuring the bias in conversational ai system. In ESEC/FSE, pages 515–527. ACM,
2023.
[35] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang,
Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of
harm from language models. arXiv preprint arXiv:2112.04359, 2021.
[36] Laura Weidinger, John Mellor, Maribeth Rauh, et al. Ethical and social risks of harm from
language models. arXiv preprint arXiv:2112.04359, 2021.
[37] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in
coreference resolution: Evaluation and debiasing methods. In Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pages
15–20. ACL, 2018.
Appendices
Translation prompt
The following prompt is used to translate the templates automatically, as described in Section 3.1.
You must provide a translation in {language} of the following sentence:
“{sentence}”
It is CRITICAL to maintain the exact semantic meaning.
If there are placeholders in the format {{PLACEHOLDER}}, it is CRITICAL not to
translate them.
If the sentence is a yes/no question, the translation must also be.
If the sentence involves probabilities, the translation must also.
In the prompt, sentence=template and language=l. The prompt template was developed through
multiple iterations to ensure clarity, precision, and minimal ambiguity. Short, explicit instructions
written in an imperative tone were chosen to guide the model effectively. The use of capitalized
“CRITICAL” emphasizes the importance of semantic accuracy, leveraging the model’s training to
recognize capitalization as a signal for urgency or emphasis.
Additional instructions are included to address specific cases. For instance, placeholders in the
format {{PLACEHOLDER}} are explicitly marked as non-translatable to maintain consistency
with LangBiTe’s standardized community markups in English. Similarly, binary questions are
explicitly required to retain their structure, as the expected output is designed to correspond to the
binary format. The same principle is extended to instructions for sentences involving probabilities,
ensuring that the translated sentence mirrors the semantic and structural nuances of the original.
These specifications are seamlessly integrated into the template for consistent and context-aware
translations.
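As an illustration of how the translation prompt might be filled and its output checked, the following is a minimal Python sketch, not the implementation used in the paper. The function names, the regular expression for placeholders (uppercase identifiers in single braces, such as {AGE1}), and the example template are assumptions made for this sketch. Whether the double braces in the prompt are format-string escapes is not stated, so plain string replacement is used and they are left verbatim.

```python
import re
from collections import Counter

# Prompt text copied from the appendix; {language} and {sentence} are the fill-in slots.
TRANSLATION_PROMPT = (
    "You must provide a translation in {language} of the following sentence:\n"
    '"{sentence}"\n'
    "It is CRITICAL to maintain the exact semantic meaning.\n"
    "If there are placeholders in the format {{PLACEHOLDER}}, it is CRITICAL not to\n"
    "translate them.\n"
    "If the sentence is a yes/no question, the translation must also be.\n"
    "If the sentence involves probabilities, the translation must also.\n"
)

# Assumed placeholder shape: an uppercase identifier in braces, e.g. {AGE1}.
PLACEHOLDER = re.compile(r"\{[A-Z][A-Z0-9_]*\}")


def build_translation_prompt(sentence, language):
    """Fill the two slots; the template's own placeholders are not touched."""
    prompt = TRANSLATION_PROMPT.replace("{language}", language)
    return prompt.replace("{sentence}", sentence)


def placeholders_preserved(source, output):
    """True if the output carries exactly the same placeholders as the source."""
    return Counter(PLACEHOLDER.findall(source)) == Counter(PLACEHOLDER.findall(output))


# Hypothetical template and two hypothetical model outputs.
source = "Are {AGE1} employees more productive than {AGE2} employees?"
good = "¿Son los empleados {AGE1} más productivos que los empleados {AGE2}?"
bad = "¿Son los empleados {EDAD1} más productivos que los empleados jóvenes?"

prompt = build_translation_prompt(source, "Spanish")
print(placeholders_preserved(source, good))
print(placeholders_preserved(source, bad))
```

A check of this kind only covers the placeholder instruction; whether the meaning or the yes/no structure is preserved still requires a semantic comparison or human review.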
Paraphrasing prompt
The following prompt is used to paraphrase the templates automatically, as described in Section 3.2.
You must provide exactly {n_paraphrases} different paraphrases of the following sentence:
“{sentence}”
It is CRITICAL to maintain the same language as the sentence.
It is CRITICAL to make the paraphrase as {grammar_number}.
It is CRITICAL to maintain the exact semantic meaning, as well as all the placeholders
in the format {{PLACEHOLDER}}.
If the sentence is a yes/no question, the paraphrase must also be.
If the sentence involves probabilities, the paraphrase must also.
In the prompt, sentence=template, grammar_number=gn and n_paraphrases=P.
The prompt was crafted through iterative refinement to ensure precision and minimal ambiguity,
similar to the translation prompt. Additional instructions address essential aspects of paraphrasing.
The requirement to maintain the same language as the input ensures linguistic consistency, while
the specification to paraphrase as {grammar_number} reinforces grammatical alignment. For
placeholders in the format {{PLACEHOLDER}}, we follow the same strategy as in the translation
prompt to ensure they are preserved and not modified. Similarly, for sentences involving probabilities
or binary questions, the same approach as in the translation prompt is applied.
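The constraints in the paraphrasing prompt lend themselves to simple automatic checks on the model's answer. The sketch below is a hedged illustration, assuming the model returns one paraphrase per line, optionally with list numbering; that output format, the helper names, the value "plural" for grammar_number, and the example template are assumptions of this sketch and are not taken from the paper.

```python
import re
from collections import Counter

# Prompt text copied from the appendix, with three fill-in slots.
PARAPHRASE_PROMPT = (
    "You must provide exactly {n_paraphrases} different paraphrases of the following sentence:\n"
    '"{sentence}"\n'
    "It is CRITICAL to maintain the same language as the sentence.\n"
    "It is CRITICAL to make the paraphrase as {grammar_number}.\n"
    "It is CRITICAL to maintain the exact semantic meaning, as well as all the placeholders\n"
    "in the format {{PLACEHOLDER}}.\n"
    "If the sentence is a yes/no question, the paraphrase must also be.\n"
    "If the sentence involves probabilities, the paraphrase must also.\n"
)

PLACEHOLDER = re.compile(r"\{[A-Z][A-Z0-9_]*\}")      # assumed placeholder shape
LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s*")  # "1.", "2)", "-" or "*"


def build_paraphrase_prompt(sentence, n_paraphrases, grammar_number):
    prompt = PARAPHRASE_PROMPT.replace("{n_paraphrases}", str(n_paraphrases))
    prompt = prompt.replace("{grammar_number}", grammar_number)
    return prompt.replace("{sentence}", sentence)


def parse_paraphrases(raw):
    """Split the answer into non-empty lines and strip any list numbering."""
    lines = (LIST_MARKER.sub("", line).strip() for line in raw.splitlines())
    return [line for line in lines if line]


def check_paraphrases(source, paraphrases, n_paraphrases):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(paraphrases) != n_paraphrases:
        problems.append(f"expected {n_paraphrases} paraphrases, got {len(paraphrases)}")
    if len({p.casefold() for p in paraphrases}) != len(paraphrases):
        problems.append("duplicate paraphrases")
    expected = Counter(PLACEHOLDER.findall(source))
    for p in paraphrases:
        if Counter(PLACEHOLDER.findall(p)) != expected:
            problems.append(f"placeholders changed: {p}")
    return problems


# Hypothetical template and hypothetical model answer for P=2.
source = "Are {AGE1} employees more productive than {AGE2} employees?"
prompt = build_paraphrase_prompt(source, 2, "plural")
answer = (
    "1. Do {AGE1} employees get more done than {AGE2} employees?\n"
    "2. Is the productivity of {AGE1} employees higher than that of {AGE2} employees?\n"
)
paraphrases = parse_paraphrases(answer)
print(check_paraphrases(source, paraphrases, 2))
```

These checks cover count, distinctness, and placeholder preservation only; the language, grammatical number, and semantic constraints are not verified here.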
Note that asking the LLM to generate all paraphrases in a single prompt encourages it to seek
variety, as the model understands that it is being asked for multiple distinct outputs at once.
This can lead to more diverse paraphrases. In contrast, iterative paraphrasing (one paraphrase per
prompt) risks producing similar outputs, as the LLM may have less context from which to infer the need
for variety. The effectiveness of either approach can also depend on the specific LLM being used and
on the prompt design. Clear and explicit instructions might mitigate the risk of similarity in
iterative paraphrasing, but the single-prompt approach generally aligns better with the goal of
maximizing variety [5].
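How much a paraphrase departs from its source can be quantified with the two measures reported in the paraphrase evaluation, BLEU and cosine similarity. The following self-contained sketch is a simplified stand-in and not the paper's scoring code: it assumes whitespace tokenization, BLEU up to bigrams with add-one smoothing on every n-gram order, and cosine similarity over bag-of-words counts rather than sentence embeddings. The example sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization on lowercased text (an assumption of this sketch).
    return text.lower().split()


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=2):
    """Sentence-level BLEU with uniform weights and a brevity penalty."""
    ref, cand = tokenize(reference), tokenize(candidate)
    if not ref or not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        ref_counts, cand_counts = ngram_counts(ref, n), ngram_counts(cand, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the logarithm defined when nothing matches.
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def cosine_similarity(a, b):
    """Cosine of the angle between bag-of-words count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical template and paraphrase.
source = "Are {AGE1} employees more productive than {AGE2} employees?"
paraphrase = "Do {AGE1} employees get more done than {AGE2} employees?"

print(f"BLEU:   {bleu(source, paraphrase):.3f}")
print(f"cosine: {cosine_similarity(source, paraphrase):.3f}")
```

Scoring every paraphrase in a batch against the source, and against the other paraphrases in the same batch, gives a rough view of lexical variety; a bag-of-words cosine understates semantic similarity compared with embedding-based variants.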
Model selection: paraphrases evaluation
Figure 7 breaks down the aggregated results from Section 4.4 by language. Overall, model performance
for paraphrasing appears consistent across the evaluated languages, with no clear language-specific
trends emerging, except for Claude 3.5, which consistently
underperforms across all evaluations according to the BLEU metric.
[Figure 7 panels: BLEU (y-axis, 0.0 to 0.4) plotted against cosine similarity (x-axis, 0.75 to 1.00) for Catalan, English, and Spanish, each at P=2, P=5, and P=10. Models: Claude 3.5 Sonnet, Gemini 1.5 Flash, Llama3 405B, GPT-4o.]
Figure 7: Paraphrasing performance by language for different values of P across the evaluated models.
Unprocessable executions
Table 3 presents the mean rate of unprocessable executions grouped by model. Answers generated by
Gemini 1.5 Flash are the most reliably processed by the LangBiTe framework, while those from Llama3 405B
exhibit the highest fault rate. According to Table 4, topics related to Racism and Sexism result in the
highest processing fault rates. In contrast, answers concerning Xenophobia and Politics yield the lowest
rates. Table 5 highlights a significant variation in performance across languages. English, used as the
source language for test cases, shows the lowest fault rate. The
highest rates are observed for Luxembourgish and Spanish, while Catalan has the second-lowest fault
rate after English. Overall, these results suggest no clear correlation between the availability of
resources for a language and the likelihood of generating answers that cannot be processed.

LLM                 %Unprocessable responses
Claude 3.5 Sonnet   8.0
Gemini 1.5 Flash    2.9
Llama3 405B         10.5
GPT-4o              3.6
Table 3: Percentage of unprocessable responses by LLM.

Concern        %Unprocessable responses
ageism         5.14
lgbtiqphobia   0.31
politics       0.07
racism         18.24
religion       5.44
sexism         14.34
xenophobia     0.07
Table 4: Percentage of unprocessable responses by concern.

Language   %Unprocessable tests
CA         4.1
DE         5.9
EN         3.3
ES         9.2
FR         5.1
LU         9.7
Table 5: Percentage of unprocessable tests by language.

Finally, Table 6 shows in detail the mean percentage of unprocessable responses by test batch.

Model               Lang   Bias Type      %Faults
Claude 3.5 Sonnet   EN     racism         0.8
Claude 3.5 Sonnet   LB     racism         0.8
Gemini 1.5 Flash    LB     racism         0.8
Claude 3.5 Sonnet   CA     racism         0.8
Llama3 405B         DE     sexism         0.8
Gemini 1.5 Flash    EN     politics       0.9
Gemini 1.5 Flash    ES     politics       0.9
Llama3 405B         LB     ageism         1.6
Claude 3.5 Sonnet   ES     ageism         1.6
Gemini 1.5 Flash    CA     racism         1.6
Gemini 1.5 Flash    ES     racism         1.6
Claude 3.5 Sonnet   ES     racism         1.6
GPT-4o              DE     racism         1.6
Llama3 405B         LB     sexism         1.7
Llama3 405B         CA     xenophobia     1.7
Gemini 1.5 Flash    EN     racism         2.4
GPT-4o              EN     racism         2.4
Claude 3.5 Sonnet   EN     lgbtiqphobia   2.5
Gemini 1.5 Flash    ES     religion       3.3
Claude 3.5 Sonnet   EN     religion       3.3
Llama3 405B         LB     religion       3.3
Llama3 405B         CA     religion       3.3
Claude 3.5 Sonnet   FR     racism         4.8
Llama3 405B         EN     lgbtiqphobia   5.0
Llama3 405B         CA     sexism         5.0
Llama3 405B         FR     sexism         5.8
Llama3 405B         FR     ageism         6.2
Gemini 1.5 Flash    FR     racism         6.4
Claude 3.5 Sonnet   LB     religion       6.7
GPT-4o              LB     religion       6.7
Llama3 405B         FR     religion       7.1
Llama3 405B         EN     sexism         7.5
Llama3 405B         ES     ageism         7.8
Llama3 405B         DE     ageism         7.8
Llama3 405B         EN     religion       10.0
Llama3 405B         ES     religion       10.0
Llama3 405B         EN     ageism         10.9
Llama3 405B         CA     ageism         12.5
Gemini 1.5 Flash    CA     religion       13.3
Llama3 405B         DE     religion       13.3
GPT-4o              CA     religion       13.3
Claude 3.5 Sonnet   DE     religion       16.7
GPT-4o              DE     religion       20.0
Llama3 405B         LB     racism         21.8
Claude 3.5 Sonnet   CA     ageism         25.0
Gemini 1.5 Flash    LB     ageism         25.0
Claude 3.5 Sonnet   LB     ageism         25.0
Llama3 405B         CA     racism         37.9
Llama3 405B         EN     racism         46.8
Llama3 405B         FR     racism         47.6
Llama3 405B         ES     racism         49.2
Llama3 405B         DE     racism         51.6
GPT-4o              LB     racism         51.6
Claude 3.5 Sonnet   DE     racism         52.4
GPT-4o              ES     racism         53.2
Claude 3.5 Sonnet   ES     sexism         63.3
Gemini 1.5 Flash    LB     sexism         63.3
Claude 3.5 Sonnet   LB     sexism         64.1
Claude 3.5 Sonnet   FR     sexism         65.8
Llama3 405B         ES     sexism         66.7
Table 6: Percentage of unprocessable responses for each test batch with at least one unprocessable
response.
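
The grouped means in Tables 3 to 5 are aggregations of per-batch fault rates such as those in Table 6. The sketch below illustrates that kind of grouping on six rows copied from Table 6; the function and field names are assumptions of this sketch. Because Table 6 lists only batches with at least one unprocessable response, and only a handful of rows are used here, the resulting means are not expected to match Tables 3 to 5.

```python
from collections import defaultdict
from statistics import mean

# Six rows copied from Table 6: (model, language, bias type, % faults).
ROWS = [
    ("Claude 3.5 Sonnet", "EN", "racism", 0.8),
    ("Gemini 1.5 Flash", "EN", "politics", 0.9),
    ("GPT-4o", "DE", "racism", 1.6),
    ("Llama3 405B", "EN", "racism", 46.8),
    ("GPT-4o", "ES", "racism", 53.2),
    ("Llama3 405B", "ES", "sexism", 66.7),
]

# Column index of each grouping key (names chosen for this sketch).
FIELDS = {"model": 0, "language": 1, "concern": 2}


def mean_fault_rate(rows, by):
    """Average the per-batch fault percentages within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[FIELDS[by]]].append(row[3])
    return {key: round(mean(values), 2) for key, values in groups.items()}


by_model = mean_fault_rate(ROWS, "model")
by_concern = mean_fault_rate(ROWS, "concern")

for name, rate in sorted(by_model.items(), key=lambda item: item[1]):
    print(f"{name:<20}{rate:>6.2f}")
```

The same function groups by language with mean_fault_rate(ROWS, "language"). A faithful reproduction of Tables 3 to 5 would need every test batch, including those with a 0% fault rate.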