Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

Summary

This paper addresses "translationese bias" in multilingual Large Language Model (LLM) judges, where models systematically favor machine-translated text over human-authored references, particularly in low-resource languages. The authors attribute this bias to two spurious correlations: latent manifold alignment with English and cross-lingual predictability. To mitigate it, they propose DIBJUDGE, a fine-tuning framework based on the Disentangled Information Bottleneck principle. DIBJUDGE separates input representations into a robust branch for judgment-critical semantics and a bias branch that isolates spurious factors. It applies variational information compression so that the robust representation stays minimally sufficient, and a cross-covariance penalty that suppresses statistical dependence between the two branches. Evaluations on benchmarks such as M-RewardBench and MM-Eval show that DIBJUDGE consistently outperforms strong baselines, including GPT-4o and Gemini-2.5-Flash, in multilingual reward modeling accuracy. It also substantially reduces translationese bias severity across high-, mid-, and low-resource languages, with the largest improvements in under-resourced settings. The method generalizes to unseen biases, such as length and self-preference bias, without degrading performance on English-centric tasks. These results indicate that explicitly disentangling semantic content from translation artifacts improves the reliability and fairness of automated multilingual evaluation.

Mitigating Translationese Bias in Multilingual
LLM-as-a-Judge via Disentangled Information Bottleneck
Hongbin Zhang 1 2 Kehai Chen 1 2 Xuefeng Bai 1 Youcheng Pan 2 Yang Xiang 2 Jinpeng Wang 3 Min Zhang 1 2
Abstract
Large language models (LLMs) have become a
standard for multilingual evaluation, yet they ex-
hibit a severe systematic “translationese bias”. In
this paper, “translationese bias” is characterized as
LLMs systematically favoring machine-translated
text over human-authored references, particularly
in low-resource languages. We attribute this bias
to spurious correlations with (i) latent manifold
alignment with English and (ii) cross-lingual pre-
dictability. To mitigate this bias, we propose DIB-
JUDGE, a robust fine-tuning framework that learns
a minimally sufficient, judgment-critical represen-
tation via variational information compression,
while explicitly isolating spurious factors into the
dedicated bias branch. Furthermore, we incor-
porate a cross-covariance penalty that explicitly
suppresses statistical dependence between robust
and bias representations, thereby encouraging
effective disentanglement. Extensive evaluations
on multilingual reward modeling benchmarks
and a dedicated translationese bias evaluation
suite demonstrate that the proposed DIBJUDGE
consistently outperforms strong baselines and
substantially mitigates translationese bias.
1. Introduction
The emergence of Large Language Models (LLMs) has
revolutionized evaluation paradigms (Gu et al., 2024; Li
et al., 2025), establishing “LLM-as-a-Judge” as a standard
framework for multilingual assessment (Son et al., 2024;
Pombal et al., 2025; Anugraha et al., 2025; Hada et al., 2024;
Fu & Liu, 2025; Doddapaneni et al., 2025). Consequently,
ensuring the accuracy and robustness of these automated
judges across diverse languages has become a critical
necessity (Padarha et al., 2025; Bogavelli et al., 2026).
1 Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China. 2 Pengcheng Laboratory, Shenzhen, China. 3 Keeta AI, Meituan, Beijing, China. Correspondence to: Kehai Chen <chenkehai@hit.edu.cn>.
Copyright 2026 by the author(s).
[Figure 1: horizontal bar chart. x-axis: bias severity (0.1 to 0.8; higher = more bias). y-axis: languages grouped into low-, mid-, and high-resource tiers, listed from top to bottom as Southern Pashto, Plateau Malagasy, Yoruba, Zulu, Kyrgyz, Nepali, Amharic, Sinhala, Irish, Telugu, Cebuano, Lithuanian, Filipino, Standard Malay, Urdu, Bengali, Tamil, Thai, Ukrainian, Indonesian, Basque, Finnish, Hindi, Vietnamese, Standard Arabic, Portuguese, Japanese, Spanish, Simplified Chinese, English. Legend: highest bias, lowest bias, local trend (window = 5), overall mean.]
Figure 1. Translationese Bias Severity of GPT-4o across
languages. Languages are sorted by resource availability from
low (top) to high (bottom). The trend line illustrates the inverse
relationship between resource availability and translationese bias.
However, the reliability of LLM judges is frequently under-
mined by systematic biases (Wang et al., 2025a; Ye et al.,
2025; Gao et al., 2025), such as position (Shi et al., 2025;
Wang et al., 2024b) and verbosity bias (Saito et al., 2023).
While these limitations are well-studied in English con-
texts (Chen et al., 2024; Zheng et al., 2023), specific failure
modes within multilingual settings remain underexplored.
In this paper, we characterize a distinct bias of LLM-as-
a-Judge in multilingual contexts, termed translationese
bias, in which LLMs favor machine-translated content over
human-authored reference, even when the former is seman-
tically flawed. To investigate this bias, we first conduct a
comprehensive evaluation across a diverse spectrum of lan-
guages. As shown in Figure 1, this bias is not only pervasive
but is significantly exacerbated in low-resource languages.
Crucially, our further attribution analysis suggests that LLM
judges may conflate generation quality with two potential
spurious factors: (a) latent manifold alignment with English,
and (b) cross-lingual predictability. While recent advance-
ments in multilingual LLM judges have yielded promising
results (Pombal et al., 2025; Anugraha et al., 2025; Zhang
et al., 2025), most existing methods remain grounded in
standard Supervised Fine-Tuning (SFT). However, SFT is
arXiv:2603.10351v1 [cs.CL] 11 Mar 2026
susceptible to exploiting spurious correlations (Shuieh et al.,
2025; Gui & Ji, 2025; Chen et al., 2025c), thereby limiting
its efficacy in mitigating translationese bias.
To this end, we propose the Disentangled Information
Bottleneck Judge (DIBJUDGE), a robust fine-tuning frame-
work that explicitly decouples the latent representation into
two components: a robust representation that preserves
judgment-critical semantic information, and a bias repre-
sentation that isolates the spurious factors identified above
(i.e., latent manifold alignment with English and cross-
lingual predictability). We leverage variational informa-
tion compression to learn a robust, minimally sufficient
representation that preserves only information essential for
accurate judgment. To further encourage disentanglement
between robust and bias representation, we penalize their
mutual dependence during training. Extensive experiments
on multilingual reward modeling benchmarks, including
M-RewardBench (Gureja et al., 2025) and MM-Eval (Son
et al., 2024), demonstrate that DIBJUDGE consistently out-
performs strong baselines, yielding improved multilingual
reward modeling performance. Moreover, evaluations on a
dedicated translationese bias suite confirm that DIBJUDGE
substantially mitigates the severity of translationese bias.
In summary, we make three key claims: (i) we characterize
translationese bias in multilingual LLM judges and identify
two related spurious factors—latent manifold alignment
with English and cross-lingual predictability, (ii) we propose
DIBJUDGE, a robust fine-tuning framework that disentan-
gles judgment-critical semantics from spurious factors, and
(iii) we show that DIBJUDGE consistently outperforms
strong baselines on multilingual reward modeling bench-
marks while effectively mitigating translationese bias.1
2. Preliminary Analysis of Translationese Bias
To systematically study translationese bias, we structure
our preliminary analysis around two research questions: (i)
RQ1: How does translationese bias vary across languages
with different levels of resource availability? (ii) RQ2:
What kinds of spurious factors are associated with this bias?
Bias Evaluation Protocol. We construct a controlled trans-
lationese bias benchmark derived from BELEBELE (Ban-
darkar et al., 2024), a multilingual reading comprehension
dataset spanning 122 languages. Following the language-
resource taxonomy of Joshi et al. (2020), we stratify lan-
guages into high-, medium-, and low-resource tiers and
sample 10 representative languages per tier. For each
language, we evaluate on 200 instances, yielding a balanced
benchmark across resource levels. The full list of selected
languages is provided in Appendix A.1. We formulate
translationese bias evaluation as a pairwise preference task,
where an LLM judge compares two candidate responses
for the same query: (i) a human-authored reference $x_H$,
and (ii) a machine-generated variant $x_M$ obtained via back-translation
to induce translationese artifacts. To control for position
bias, each instance $i$ is evaluated under both the forward
and the reverse ordering. Details are available in Appendix A.2.
1 Our anonymous code repo is here.
Bias Metric Definition. Let $y_i \in \{0, 1\}$ indicate whether
the judge prefers $x_M$ in the forward order, and let $\bar{y}_i \in \{0, 1\}$
denote the corresponding preference in the reverse
order. We define bias severity $S_{\mathrm{bias}}$ as the fraction of
consistent judgments that favor the machine-generated output:
\[ S_{\mathrm{bias}} = \frac{\sum_{i=1}^{N} \mathbb{I}\left[ y_i = 1 \wedge \bar{y}_i = 1 \right]}{\sum_{i=1}^{N} \mathbb{I}\left[ y_i = \bar{y}_i \right]}. \tag{1} \]
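The bias severity metric can be illustrated with a minimal Python sketch. The function name, the toy preference lists, and the choice to return NaN when no judgment is order-consistent are assumptions made for illustration; the source only defines the ratio itself.

```python
def bias_severity(forward, reverse):
    """Fraction of order-consistent judgments that favor the machine output.

    forward[i] and reverse[i] are 1 if the judge prefers x_M in that ordering
    and 0 otherwise.
    """
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse must have the same length")
    # Keep only instances where both orderings agree (y_i == y_bar_i).
    consistent = [y for y, y_bar in zip(forward, reverse) if y == y_bar]
    if not consistent:
        return float("nan")  # assumption: undefined when nothing is consistent
    # Among consistent instances, count those that prefer the machine output.
    return sum(consistent) / len(consistent)


# Toy data (hypothetical): six instances judged in both orderings.
forward = [1, 1, 0, 1, 0, 0]
reverse = [1, 0, 0, 1, 1, 0]
severity = bias_severity(forward, reverse)
print(severity)
```

In this toy case four of the six instances are order-consistent and two of those favor the machine output. Instances where the judge flips with the ordering are excluded from both the numerator and the denominator, which is what separates this metric from a plain machine win rate.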
2.1. Quantifying Bias across Resource Levels (RQ1)
Figure 1 illustrates the translationese bias severity of GPT-
4o across varying language resource levels. We observe
two salient patterns: first, translationese bias is pervasive
across the entire linguistic spectrum; second, there is a
distinct inverse correlation between resource availability
and the magnitude of bias. Specifically, while high-resource
languages exhibit minimal bias, low-resource languages
demonstrate significantly elevated severity. These findings
expose translationese bias as a severe yet previously ne-
glected failure mode in multilingual LLM judges, which
critically undermines evaluation reliability and dispropor-
tionately compromises under-resourced languages.
2.2. Attribution Analysis of Translationese Bias (RQ2)
Due to the scarcity of high-quality native resources (Qin
et al., 2025; Huang et al., 2024), multilingual LLMs are typ-
ically pre-trained on English-dominated corpora (Kreutzer
et al., 2022; Weber et al., 2024) and subsequently adapted
using translated or synthetic data (Muennighoff et al., 2023;
Zhang et al., 2020a). Accordingly, we hypothesize that
translationese bias stems from spurious correlations induced
across these two stages: (i) latent manifold alignment
with English, where non-English representations are im-
plicitly aligned to an English-centric latent space during
pre-training; and (ii) cross-lingual predictability, where the
judge over-relies on probability heuristics that favor the
statistical patterns of machine-translated text, potentially
amplified by exposure to translated or synthetic data during
fine-tuning. However, causally attributing this bias to partic-
ular data mixtures remains non-trivial given the opacity and
heterogeneity of LLM training pipelines (Lai et al., 2025).
To address this, we introduce two measurable latent metrics that serve as quantitative proxies for these two factors.
(i) Language Alignment Score (LAS), defined as the degree to which a representation is geometrically aligned with an English latent manifold:
\[ \mathrm{LAS}(x) = \frac{1}{L} \sum_{l=1}^{L} \cos\big( h_l(x),\, c_{\mathrm{en},l} \big), \]
where $x$ is the input sequence, $h_l(x)$ is the layer-$l$ hidden representation of $x$, $c_{\mathrm{en},l}$ is the English centroid at layer $l$, and $L$ is the total number of layers.
(ii) Cross-lingual Sequence Surprisal (CSS), defined as the length-normalized negative log-likelihood of a target sequence $x$ of $T$ tokens, conditioned on its English translated context $x_{\mathrm{en}}$:
\[ \mathrm{CSS}(x) = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{\mathrm{en}}, x_{<t}). \]
[Figure 2: three panels. (a) Machine Win Rate vs. CAD: x-axis cross-lingual alignment discrepancy (-0.20 to 0.20), y-axis model win rate (0.0 to 0.4); logit fit with β = -5.31, 95% CI band, and binned estimates with 95% CI. (b) SSR Density (Human vs. Machine Win): x-axis Sequence Surprisal Ratio (0.4 to 1.4), y-axis probability density (0.0 to 4.0); medians marked, Δmedian = -0.34. (c) ROC Analysis of CAD and SSR: x-axis false positive rate, y-axis true positive rate; SSR AUC = 0.695, CAD AUC = 0.638.]
Figure 2. Correlation analysis of judge preference with confounding factors. (a) Machine win rate decreases monotonically as CAD
increases, indicating that judge preference spuriously tracks latent manifold isomorphism with English. (b) SSR distributions exhibit a
clear drift between human-win and machine-win cases, showing that the judge systematically favors higher-likelihood outputs. (c) ROC
curves confirm that both CAD and SSR reliably predict judge outcomes, reinforcing the attribution that translationese bias is mediated by
latent manifold isomorphism with English and high predictive confidence.
To answer RQ2, we then investigate the extent to which the distributional divergence of these metrics between $x_H$ and $x_M$ correlates with machine win rate:
(i) Cross-lingual Alignment Discrepancy (CAD): $\mathrm{CAD} = \mathrm{LAS}(x_H) - \mathrm{LAS}(x_M)$, where $\mathrm{CAD} < 0$ implies that $x_M$ exhibits closer alignment to the English latent space than $x_H$ does.
(ii) Sequence Surprisal Ratio (SSR): $\mathrm{SSR} = \mathrm{CSS}(x_M) / \mathrm{CSS}(x_H)$, where $\mathrm{SSR} < 1$ indicates that $x_M$ is more cross-lingually predictable by the model than $x_H$.
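The four quantities defined above (LAS, CSS, CAD, SSR) can be made concrete with a small pure-Python sketch. Everything numeric here is a toy assumption: the two-dimensional "hidden states", the English centroids, and the per-token log-probabilities are invented, and the source does not say how a sequence is pooled into one vector per layer, so one vector per layer is simply assumed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def las(hidden_by_layer, english_centroids):
    """Language Alignment Score: mean over layers of cos(h_l(x), c_en,l)."""
    sims = [cosine(h, c) for h, c in zip(hidden_by_layer, english_centroids)]
    return sum(sims) / len(sims)


def css(token_logprobs):
    """Cross-lingual Sequence Surprisal: length-normalized negative
    log-likelihood, given log P(x_t | x_en, x_<t) for each target token."""
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical two-layer model with 2-d pooled hidden states.
centroids = [[1.0, 0.0], [0.0, 1.0]]
h_human = [[1.0, 1.0], [1.0, 1.0]]
h_machine = [[1.0, 0.0], [0.0, 1.0]]
cad = las(h_human, centroids) - las(h_machine, centroids)

# Hypothetical per-token log-probabilities for each candidate.
logp_human = [-2.0, -1.0, -3.0]
logp_machine = [-1.0, -1.0, -1.0]
ssr = css(logp_machine) / css(logp_human)

print(f"CAD = {cad:.4f}, SSR = {ssr:.2f}")
```

In this toy setup the machine candidate sits exactly on the English centroids and has lower surprisal, so CAD is negative and SSR is below 1, which is the regime the paper associates with a higher machine win rate.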
As shown in Figure 2, the judge's preferences correlate with both
latent metrics. Specifically, Figure 2a reveals a negative
association between CAD and the machine win rate (logit slope β = -5.31),
suggesting that the judge favors outputs
that align closely with the English manifold. Meanwhile,
Figure 2b demonstrates a pronounced distributional shift
in SSR (Δmedian = -0.34): machine-generated outputs preferred by the judge
cluster at lower SSR. This pattern indicates
a bias toward sequences with high cross-lingual
predictability. The observed correlations
are further corroborated by the ROC analysis in Figure 2c
(AUC = 0.695 for SSR and 0.638 for CAD),
which shows that both features carry meaningful discriminative power.
Collectively, these results lead us to ascribe the observed
translationese bias to two confounding factors: latent manifold
alignment with English and cross-lingual predictability.
3. Disentangled Information Bottleneck Judge
To mitigate the spurious correlations identified in § 2.2, we
propose the Disentangled Information Bottleneck Judge
(DIBJUDGE), as illustrated in Figure 3. By explicitly
disentangling these spurious factors, DIBJUDGE learns a
compressed representation that retains sufficient, robust
features essential for accurate quality assessment.
3.1. Preliminaries
Mutual Information. Mutual Information (MI) quantifies
the statistical dependence between two random variables,
A and B. Given the joint distribution p(a, b) and marginals
p(a), p(b), MI is formally defined as:
\[ I(A; B) \triangleq \mathbb{E}_{p(a,b)}\left[ \log \frac{p(a, b)}{p(a)\, p(b)} \right]. \tag{2} \]
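For intuition, the definition of mutual information can be evaluated exactly when both variables are discrete. The sketch below is an illustration only: the function name and the two toy joint distributions are assumptions, and the paper itself works with continuous, high-dimensional representations where such direct evaluation is not available.

```python
import math


def mutual_information(joint):
    """I(A;B) in nats for a discrete joint distribution given as a 2-D list,
    where joint[i][j] = p(A = i, B = j)."""
    p_a = [sum(row) for row in joint]          # marginal over B
    p_b = [sum(col) for col in zip(*joint)]    # marginal over A
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_ab in enumerate(row):
            if p_ab > 0.0:  # terms with p(a, b) = 0 contribute nothing
                mi += p_ab * math.log(p_ab / (p_a[i] * p_b[j]))
    return mi


# Two toy cases: independent fair bits, and B being an exact copy of A.
independent = [[0.25, 0.25], [0.25, 0.25]]
copied = [[0.5, 0.0], [0.0, 0.5]]
mi_independent = mutual_information(independent)
mi_copied = mutual_information(copied)
print(mi_independent, mi_copied)
```

Independent variables give zero mutual information, while a perfect copy of a fair bit gives log 2 nats, the full entropy of the bit.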
Information Bottleneck Principle. The Information Bottleneck
(IB) principle (Tishby et al., 2000; Alemi et al.,
2017) seeks a compressed representation $Z$ that is sufficient
for a target task $Y$ while remaining minimal with respect
to the input $X$. This is formalized by minimizing the
Lagrangian $\mathcal{L}_{\mathrm{IB}} = -I(Y; Z) + \beta I(X; Z)$, where $\beta \ge 0$
governs the trade-off between prediction and compression.
However, compression may discard robust semantic features
in favor of simpler spurious correlations (Liu et al., 2022;
Zhang et al., 2024). Thus, information compactness alone
cannot guarantee robustness.
3.2. Disentangled Information Bottleneck Objective
To prevent LLM judges from exploiting spurious shortcuts
solely through information compression, we are inspired by
the idea of disentangled representation learning (Wang et al.,
2024d) to refine vanilla IB. Let S be a spurious variable,
Zr be a relevant variable encoding features necessary for
predicting the target Y , and Zb be a bias variable serving
as a dedicated “sink” to absorb S. We formalize the
Disentangled Information Bottleneck Objective as:
\[ \mathcal{L}_{\mathrm{DIB}} = \underbrace{-I(Y; Z_r)}_{\text{Prediction}} + \underbrace{\beta\, I(X; Z_r)}_{\text{Compression}} - \underbrace{\gamma\, I(S; Z_b)}_{\text{Bias Capture}} + \underbrace{\lambda\, I(Z_r; Z_b)}_{\text{Disentanglement}}. \tag{3} \]
[Figure 3: architecture diagram with a robust branch and a bias branch. Labeled components: input, robust encoder, bias encoder, variational bottleneck, LLM judge with LoRA, output, Decoder 1, Decoder 2; a legend distinguishes learnable from frozen modules. Numbered steps: 1 decouple, 2 compress, 3 predict, 4 disentangle, 5 spurious-attribute proxy tasks.]
Figure 3. Overview of DIBJUDGE, which is grounded in Equation 3. The framework (1) employs a robust encoder $g_{\phi_r}$ and a bias encoder $g_{\phi_b}$ to
separate the input $X$ into robust representations $Z_r$ and bias representations $Z_b$; (2) introduces a variational bottleneck to minimize the
mutual information $I(X; Z_r)$; (3) feeds the compressed $Z_r$ to the LLM judge, optimized using LoRA (Hu et al., 2022), to generate
the final output $Y$ by maximizing $I(Y; Z_r)$; (4) encourages feature independence by minimizing the dependence $I(Z_r; Z_b)$ between the
robust and bias branches; and (5) explicitly captures spurious attributes $S$ within $Z_b$ by maximizing $I(S; Z_b)$ through two proxy tasks: (i)
cross-lingual alignment contrastive learning and (ii) predictive confidence estimation via log-probability bin classification.
The first two terms apply a vanilla IB constraint restricted to
the robust channel Zr ; the third term makes Zb informative
about spurious attributes S; and the final term penalizes de-
pendence between (Zr , Zb), encouraging Zr to exclude spu-
rious correlations that are explicitly routed into Zb. Directly
optimizing this objective is computationally intractable due
to the difficulty of estimating mutual information in high-
dimensional spaces (Liu et al., 2024b). To address this, we
derive tractable variational surrogate objectives as follows:
Decouple Robust and Bias Representation. Using separate
encoders $g_{\phi_r}$ and $g_{\phi_b}$, we decompose the input $X$ into
a robust representation $Z_r = g_{\phi_r}(X) \in \mathbb{R}^{T \times d}$ and a bias
representation $Z_b = g_{\phi_b}(X) \in \mathbb{R}^{T \times d}$, where $T$ denotes the
sequence length and $d$ the feature dimension. We leverage
$Z_r$ for task prediction while using $Z_b$ only during training.
Compression via Variational Information Constraints.
To facilitate compression, we leverage the Variational Information
Bottleneck (Alemi et al., 2017), which imposes an
upper bound on $I(X; Z_r)$ via variational inference.
Proposition 3.1. Let $Z_r$ be a continuous random variable,
with variational posterior $q_\phi(Z_r \mid X)$ and fixed prior $p(Z_r)$.
Then
\[ I(X; Z_r) \le \mathbb{E}_{x \sim p(X)}\left[ D_{\mathrm{KL}}\big( q_\phi(Z_r \mid x) \,\|\, p(Z_r) \big) \right]. \]
Guided by Proposition 3.1 (proved in Appendix B.1), we
can constrain $I(X; Z_r)$ by penalizing the KL divergence between
the variational posterior and the fixed prior. Accordingly,
we adopt a standard Gaussian prior $p(Z_r) = \mathcal{N}(0, I)$ and
parameterize the variational posterior $q_\phi(z_{r,t} \mid x)$ at each
time step $t$ as a multivariate Gaussian $\mathcal{N}(\mu_t, \sigma_t^2)$ with diagonal covariance. The
resulting compression objective, the KL divergence summed over the
$d$ feature dimensions and averaged over the sequence length $T$,
is derived as follows (details in Appendix B.2):
\[ \mathcal{L}_{\mathrm{compress}} = \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big( \mathcal{N}(\mu_t, \sigma_t^2) \,\|\, \mathcal{N}(0, I) \big) = -\frac{1}{2T} \sum_{t=1}^{T} \sum_{j=1}^{d} \left( 1 + \log \sigma_{t,j}^2 - \mu_{t,j}^2 - \sigma_{t,j}^2 \right). \tag{4} \]
To allow for backpropagation, we sample the latent representation
$z_{r,t}$ using the reparameterization trick (Kingma
et al., 2015): $z_{r,t} = \mu_t + \sigma_t \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
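The closed-form KL term and the reparameterized sample can be checked numerically with a short pure-Python sketch. It assumes, for illustration, that the encoder outputs a mean and a log-variance per feature (a common parameterization, not confirmed by the source) and uses toy values for a sequence of two time steps.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one time step."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def compression_loss(mus, log_vars):
    """Per-step KL averaged over the T time steps, as in Eq. 4."""
    steps = [kl_to_standard_normal(m, lv) for m, lv in zip(mus, log_vars)]
    return sum(steps) / len(steps)


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# Toy posterior parameters (hypothetical): T = 2 time steps, d = 2 features.
mus = [[0.0, 0.0], [1.0, -1.0]]
log_vars = [[0.0, 0.0], [0.0, 0.0]]  # log sigma^2 = 0, i.e. unit variance

loss = compression_loss(mus, log_vars)
z = reparameterize(mus[1], log_vars[1], random.Random(0))
print(loss, z)
```

The first time step matches the prior exactly and contributes zero KL; the second has unit variance but a shifted mean and contributes 1.0, so the averaged loss is 0.5. Because the noise is drawn independently of the parameters, gradients can flow through the mean and the standard deviation in an autodiff framework.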
Variational Mutual Information Maximization. The ob-
jective (Eq. 3) necessitates maximizing mutual information
along two disentangled pathways: the task-predictive term
I(Y ; Zr ) and the bias-capturing term I(S; Zb). We then
maximize a variational lower bound on the mutual informa-
tion guided by Proposition 3.2 (proved in Appendix B.3).
Proposition 3.2. Let U and V be random variables with
joint distribution p(U, V ). For any variational conditional
distribution qθ (U | V ), the mutual information satisfies
I(U ; V ) ≥ E(U,V )∼p(U,V )[log qθ (U | V )] + H(U ), where
H(U ) denotes the marginal entropy of U .
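One way to see why the bound in Proposition 3.2 holds is the short derivation below. It is a sketch of the standard argument, assuming the entropies involved are finite, and is not necessarily the proof given in Appendix B.3; here $p(u \mid v)$ denotes the true conditional distribution.

```latex
\begin{align*}
I(U; V) &= H(U) - H(U \mid V)
         = H(U) + \mathbb{E}_{p(u,v)}\big[ \log p(u \mid v) \big] \\
        &= H(U) + \mathbb{E}_{p(u,v)}\big[ \log q_\theta(u \mid v) \big]
           + \mathbb{E}_{p(v)}\Big[ D_{\mathrm{KL}}\big( p(\cdot \mid v) \,\|\, q_\theta(\cdot \mid v) \big) \Big] \\
        &\ge \mathbb{E}_{p(u,v)}\big[ \log q_\theta(u \mid v) \big] + H(U).
\end{align*}
```

The gap is the expected KL divergence between the true conditional and the variational one, so the bound is tight when $q_\theta$ matches $p(u \mid v)$. Since $H(U)$ does not depend on $\theta$, maximizing the bound amounts to minimizing a negative log-likelihood.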
For the robust pathway, we treat the LLM judge fjudge as
the variational decoder qθ . We condition the generation
on a sequence formed by concatenating the instruction
embeddings Einst with the sampled robust representation
Zr . The task loss is defined as:
\[ \mathcal{L}_{\mathrm{task}} = \mathbb{E}_{X,Y}\left[ -\sum_{t=1}^{|Y|} \log f_{\mathrm{judge}}\big( Y_t \mid [E_{\mathrm{inst}}; Z_r],\, Y_{<t} \big) \right]. \tag{5} \]
For the bias pathway, we employ lightweight MLP decoders $q_{\psi_{\mathrm{bias}}}$ and minimize
the negative log-likelihood of the spurious attribute $S$ given
the bias representation $Z_b$, which encourages spurious
information to be encoded in the bias pathway:
\[ \mathcal{L}_{\mathrm{bias}} = \mathbb{E}_{X,S}\left[ -\log q_{\psi_{\mathrm{bias}}}(S \mid Z_b) \right]. \tag{6} \]
We operationalize the identified spurious factors through
two proxy tasks. First, we address latent manifold
alignment with English using a cross-lingual contrastive
learning objective. Second, we estimate cross-lingual
predictability via a log-probability bin classification task.
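As a rough illustration of the second proxy task, the sketch below turns an average log-probability into a bin label and scores a decoder's logits with cross-entropy, which is the negative log-likelihood form of Eq. 6 for a categorical attribute. The bin edges, the number of bins, and the logits are all hypothetical; the source does not specify how bins are chosen.

```python
import math


def logprob_bin(avg_logprob, edges):
    """Index of the bin containing avg_logprob; edges are ascending cut points."""
    return sum(1 for e in edges if avg_logprob >= e)


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Hypothetical cut points giving four bins of average token log-probability.
edges = [-3.0, -2.0, -1.0]
label = logprob_bin(-1.5, edges)

# Hypothetical decoder logits computed from the bias representation Z_b.
decoder_logits = [0.0, 0.0, 2.0, 0.0]
loss = cross_entropy(decoder_logits, label)
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], label)
print(label, loss, uniform_loss)
```

A decoder that puts more mass on the correct bin obtains a lower loss than a uniform guess. Minimizing this loss with respect to the bias encoder is what pushes predictability information into $Z_b$.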
Disentanglement via Cross-Covariance Penalty. Directly
minimizing the mutual information $I(Z_r; Z_b)$ is generally
intractable. However, in high-dimensional regimes typical
of LLMs, representation distributions are often
well-approximated by Gaussian statistics (Lee et al., 2018;
Hron et al., 2020). Under this Gaussian assumption,
minimizing mutual information reduces to minimizing the
cross-covariance between latent variables (Cover, 1999;
Hyvärinen & Oja, 2000). We formalize this relationship
in Proposition 3.3 (proof provided in Appendix B.4).
Proposition 3.3. Let $Z_r$ and $Z_b$ be jointly Gaussian random
vectors with marginal covariance matrices $\Sigma_r$ and $\Sigma_b$,
and cross-covariance $\Sigma_{rb}$. Define the normalized
cross-covariance matrix as $C = \Sigma_r^{-1/2} \Sigma_{rb} \Sigma_b^{-1/2}$. Provided
the spectral norm $\|C\|_2$ is sufficiently small, the mutual
information admits the following second-order expansion:
\[ I(Z_r; Z_b) = \tfrac{1}{2} \|C\|_F^2 + o\big( \|C\|_F^2 \big), \quad \text{as } \|C\|_2 \to 0. \]
Accordingly, we adopt the cross-covariance penalty as a
computationally efficient surrogate for the disentanglement
term in Eq. 3. Given centered mini-batch representations
$\bar{Z}_r, \bar{Z}_b \in \mathbb{R}^{N \times d}$, we compute the empirical
cross-covariance matrix $\hat{\Sigma}_{rb} = \frac{1}{N-1} \bar{Z}_r^{\top} \bar{Z}_b$. We then minimize
the squared Frobenius norm of $\hat{\Sigma}_{rb}$ to suppress correlations:
\[ \mathcal{L}_{\mathrm{disc}} = \|\hat{\Sigma}_{rb}\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{d} (\hat{\Sigma}_{rb})_{ij}^2. \tag{7} \]
In practice, feature-wise normalization ensures that $Z_r$ and
$Z_b$ have approximately unit variance along each dimension
(Ba et al., 2016). This objective penalizes second-order
dependencies; under the Gaussian assumption above, suppressing them
encourages statistical independence
in the learned representations (Zbontar et al., 2021).
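The cross-covariance penalty of Eq. 7 is simple enough to write out in pure Python. The sketch below is illustrative only: the function names and the tiny mini-batches are assumptions, and a real implementation would use batched tensor operations rather than nested lists.

```python
def center(rows):
    """Subtract the per-feature mean from every row of an N x d matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_covariance_penalty(z_r, z_b):
    """Squared Frobenius norm of the empirical cross-covariance (Eq. 7)."""
    n = len(z_r)
    zr_c, zb_c = center(z_r), center(z_b)
    penalty = 0.0
    for col_r in zip(*zr_c):          # feature i of the robust branch
        for col_b in zip(*zb_c):      # feature j of the bias branch
            cov = sum(a * b for a, b in zip(col_r, col_b)) / (n - 1)
            penalty += cov * cov
    return penalty


# Toy mini-batch (hypothetical): N = 4 samples, d = 2 features per branch.
z_r = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
z_b_copy = [row[:] for row in z_r]                       # fully correlated
z_b_uncorrelated = [[1.0, 2.0], [1.0, 2.0], [-1.0, -2.0], [-1.0, -2.0]]

penalty_copy = cross_covariance_penalty(z_r, z_b_copy)
penalty_uncorrelated = cross_covariance_penalty(z_r, z_b_uncorrelated)
print(penalty_copy, penalty_uncorrelated)
```

When the bias branch is a copy of the robust branch the penalty is large, and it vanishes when every bias feature is uncorrelated with every robust feature. The penalty only measures linear correlation, which is why the Gaussian assumption matters for reading it as a proxy for mutual information.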
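Overall Learning Objective. We optimize DIBJUDGE end-to-end
by minimizing a weighted sum of the tractable objectives
derived above. Concretely, the final training objective is
\[ \mathcal{L} = \mathcal{L}_{\mathrm{task}} + \beta\, \mathcal{L}_{\mathrm{compress}} + \gamma\, \mathcal{L}_{\mathrm{bias}} + \lambda\, \mathcal{L}_{\mathrm{disc}}, \tag{8} \]
where the weights $\beta, \gamma, \lambda$ control the trade-off among accuracy,
compression, bias capture, and independence.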
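A minimal sketch of how the four terms combine is given below. The class name, field names, and all numeric values are placeholders chosen for illustration; the paper's actual hyperparameter values are not stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DIBWeights:
    """Trade-off weights from Eq. 8 (values below are hypothetical)."""
    beta: float   # compression weight
    gamma: float  # bias-capture weight
    lam: float    # disentanglement weight


def total_loss(task, compress, bias, disc, w):
    """Weighted sum L = L_task + beta*L_compress + gamma*L_bias + lam*L_disc."""
    return task + w.beta * compress + w.gamma * bias + w.lam * disc


weights = DIBWeights(beta=0.5, gamma=2.0, lam=0.25)
loss = total_loss(task=1.0, compress=0.5, bias=0.25, disc=2.0, w=weights)
print(loss)
```

Note that the bias term enters with a plus sign here even though Eq. 3 subtracts the mutual information between the spurious attribute and the bias representation: the bias loss is a negative log-likelihood, so minimizing it raises the variational lower bound on that mutual information.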
4. Experiments
Evaluation Benchmarks. To evaluate the effectiveness of
LLM judges across multilingual contexts, we utilize three
primary reward modeling benchmarks (Lambert et al., 2024;
Son et al., 2024; Gureja et al., 2025) selected to ensure
a balanced consideration of the following aspects: a) rea-
soning and safety alignment across diverse conversational
contexts, b) performance across 23 distinct languages, c) the
distinction between translated content and native-speaker
data. Our primary evaluation metric is accuracy, reported
as the category average and the mean of language-specific
micro-averages. More details are provided in Appendix C.1.
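The "mean of language-specific micro-averages" can be read as: compute accuracy within each language, then average those accuracies with equal weight per language. The sketch below follows that reading, which is an interpretation rather than something the source spells out; the language codes and outcomes are toy data.

```python
from collections import defaultdict


def mean_language_accuracy(records):
    """records: iterable of (language, is_correct) pairs.

    Returns (per-language accuracy dict, unweighted mean over languages).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, ok in records:
        totals[lang] += 1
        hits[lang] += int(ok)
    per_lang = {lang: hits[lang] / totals[lang] for lang in totals}
    return per_lang, sum(per_lang.values()) / len(per_lang)


# Toy judgments (hypothetical): two Yoruba and four English instances.
records = [
    ("yo", True), ("yo", False),
    ("en", True), ("en", True), ("en", True), ("en", False),
]
per_lang, mean_acc = mean_language_accuracy(records)
pooled_acc = sum(ok for _, ok in records) / len(records)
print(per_lang, mean_acc, pooled_acc)
```

Averaging per language keeps a language with few instances from being drowned out by a larger one, which is why the per-language mean here differs from the pooled accuracy.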
Training settings. We adopt the same training corpus as
mR3 (Anugraha et al., 2025) and fine-tune using LoRA (Hu
et al., 2022). All models are trained with the
Adam optimizer (Kingma, 2014) using a learning rate of
$1 \times 10^{-4}$ and a maximum sequence length of 16384. Further
implementation details are provided in Appendix C.2.
Baselines. We evaluate DIBJUDGE against proprietary
general-purpose models (GPT-4o (Hurst et al., 2024),
Gemini-2.5-Flash (Comanici et al., 2025)) and open-source
general-purpose LLMs (Qwen2.5/3 (Qwen et al., 2025;
Yang et al., 2025)). Since Qwen3 is our backbone, these
comparisons isolate gains from our training recipe be-
yond base model capacity. We additionally benchmark
multilingual reward models/judges, including Nemotron-
Multilingual-49B (Wang et al., 2025b), M-Prometheus
(3B/7B) (Pombal et al., 2025), mR3 (Anugraha et al., 2025),
and Think-as-Locals (7B) (Zhang et al., 2025).
Main Results. Table 1 reports the mean accuracy and standard
deviation across benchmarks over three independent
runs, with statistical significance assessed using pairwise
t-tests. On M-RewardBench, DIBJUDGE-Qwen3-8B reaches 91.37,
establishing a new SOTA among open-weight models and
outperforming both backbone-matched models (Qwen3-8B at 86.12
and mR3-Qwen3-8B at 88.58) and the substantially larger
Nemotron-Multi-49B (88.83, the best baseline; p < 0.01). These
results confirm the effectiveness of the proposed approach.
In terms of generalization, DIBJUDGE-Qwen3-8B also achieves
the best result on the English-centric RewardBench (91.01),
significantly surpassing the best baseline, mR3-Qwen3-8B (90.10; p < 0.05). This indicates
that the proposed method improves multilingual reward
modeling without degrading performance on monolingual
benchmarks. Detailed results are provided in Appendix D.
Table 1. Performance evaluation on multilingual reward benchmarks. Bold indicates the best performance, and underlined indicates the
second-best. Statistical significance compared to the best baseline is denoted by † (p < 0.05) and ‡ (p < 0.01).
Model                                       M-RewardBench      RewardBench      MM-Eval
                                            (Avg. 23 langs)    (English)        (Avg. 18 langs)
Proprietary Models
GPT-4o (Hurst et al., 2024)                 85.75 ± 0.42       85.96 ± 0.35     71.85 ± 0.81
Gemini-2.5-Flash (Comanici et al., 2025)    88.06 ± 0.49       88.83 ± 0.47     77.47 ± 0.76
General Open Models
Qwen2.5-3B-Instruct (Qwen et al., 2025)     66.97 ± 1.12       68.99 ± 1.05     57.99 ± 1.20
Qwen2.5-7B-Instruct (Qwen et al., 2025)     77.89 ± 0.89       78.59 ± 0.91     65.64 ± 0.95
Qwen3-4B (Yang et al., 2025)                85.06 ± 0.65       87.54 ± 0.55     80.85 ± 0.68
Qwen3-8B (Yang et al., 2025)                86.12 ± 0.52       88.81 ± 0.48     82.20 ± 0.60
Multilingual Open Reward Models
Nemotron-Multi-49B (Wang et al., 2025b)     88.83 ± 0.35       89.71 ± 0.31     76.31 ± 0.55
M-Prometheus 3B (Pombal et al., 2025)       68.45 ± 0.98       69.79 ± 0.92     64.17 ± 1.10
M-Prometheus 7B (Pombal et al., 2025)       78.03 ± 0.85       76.69 ± 0.78     69.38 ± 0.88
mR3-Qwen3-4B (Anugraha et al., 2025)        87.21 ± 0.45       89.75 ± 0.38     82.55 ± 0.52
mR3-Qwen3-8B (Anugraha et al., 2025)        88.58 ± 0.41       90.10 ± 0.40     85.29 ± 0.45
Think-as-Locals 7B (Zhang et al., 2025)     84.51 ± 0.60       88.79 ± 0.52     72.95 ± 0.70
Ours
DIBJUDGE-Qwen3-4B                           89.84 ± 0.28†      90.32 ± 0.25     85.16 ± 0.33
DIBJUDGE-Qwen3-8B                           91.37 ± 0.22‡      91.01 ± 0.20†    87.53 ± 0.28‡
5. Analysis
To substantiate the theoretical claims proposed in this work,
we conduct a series of targeted experiments designed to
rigorously validate the efficacy and internal mechanics of
DIBJUDGE. Our analysis focuses on verifying that the
proposed information disentanglement objective translates
into tangible performance gains and interpretable latent
structures. We organize this empirical investigation around
five core research questions: (i) RQ1 (Bias Mitigation): To
what extent does DIBJUDGE effectively mitigate transla-
tionese bias across languages with varying resource avail-
ability? (ii) RQ2 (Utility Trade-off): How does the
information bottleneck constraint shape the Pareto Frontier
between bias mitigation and downstream task utility? (iii)
RQ3 (Disentanglement): Do the learned latent represen-
tations geometrically disentangle semantic content from
translationese artifacts, as theoretically hypothesized? (iv)
RQ4 (Generalization): Does the model exhibit robustness
against unseen bias types (e.g., length bias) that were not
explicitly included in the spurious proxy task? (v) RQ5
(Ablation Study): How do the distinct components of the
DIB objective (Eq. 3) and spurious proxy task contribute
to bias mitigation and reward modeling utility? Additional
analyses are deferred to the appendix, including studies of
proxy-task design (Appx. F.1), sensitivity to CAD and SSR
(Appx. F.2, F.3), comparisons with alternative compression
and disentanglement mechanisms (Appx. F.4, F.5), and
linear probing to assess information leakage (Appx. F.6).
RQ1: Efficacy in Translationese Bias Mitigation. We
extend the preliminary bias evaluation (§ 2) to a broader
suite of domains and datasets. We evaluate performance
on three diverse benchmarks: BELEBELE (machine reading
comprehension) (Bandarkar et al., 2024), AYA (open-ended
instruction following) (Singh et al., 2024), and XL-SUM
(summarization) (Hasan et al., 2021). This selection
allows us to assess translationese bias across constrained
formats and realistic, open-ended interactions. Human-authored
references are used as ground-truth targets, while
negative samples (rejected responses) are generated via
back-translation as described in § 2. To investigate the
impact of data scarcity, we stratify languages into High-,
Mid-, and Low-Resource tiers (n = 10 languages per tier).
We benchmark DIBJUDGE against three baselines: the Base
model, Vanilla SFT, and a Vanilla IB variant. We quantify
efficacy using the Bias Severity metric (Sbias) defined in
Equation 1. More details are provided in Appendix A.2.
[Figure 4: three bar-chart panels, (a) Belebele, (b) Aya, (c) XL-Sum. x-axis: Resource Tier (Low, Mid, High); y-axis: Bias Severity Sbias (↓). Series: Base Model, Vanilla SFT, Vanilla IB, DIBJudge. Panel annotations: Avg Δ = 80% ↓, 56% ↓, 75% ↓.]
Figure 4. Bias severity across resource tiers. Sbias (lower is
better) on BELEBELE, AYA, and XL-SUM. DIBJUDGE reduces
bias across all tiers, with average reductions of 80%, 56%, and
75%, and the strongest improvements in Low-Resource settings.
Error bars show std over 3 runs; Avg ∆ is relative to Vanilla SFT.
Figure 4 demonstrates the efficacy of DIBJUDGE in miti-
gating translationese bias across diverse language resource
levels. On the benchmark BELEBELE, DIBJudge achieves
a drastic reduction in bias severity, approaching zero
in the mid- and high-resource tiers. This trend
extends to generative tasks such as AYA and XL-SUM,
where we observe consistent bias suppression. Crucially,
DIBJUDGE significantly reduces disparity across resource
tiers; whereas vanilla SFT retains marked bias in low-
resource settings, our approach effectively dampens these
spurious correlations. These findings confirm that DIB-
JUDGE targets the bias amplification that disproportionately
affects underrepresented languages, rather than merely
enhancing general instruction-following capabilities.
[Figure 5: scatter plot. x-axis: Bias severity (lower is better); y-axis: m-RewardBench accuracy (higher is better); color bar: bottleneck strength β (log scale). Legend: DIBJudge, Vanilla IB, Qwen3-4B, GPT-4o, mR3-Qwen3-4B, Gemini-2.5-Flash.]
Figure 5. Bias–utility Pareto Frontier. Trade-off between Bias
Severity (↓; x-axis) and m-RewardBench accuracy (↑; y-axis).
Each point corresponds to a bottleneck strength β (log-scaled,
color-coded). The resulting Pareto frontiers are traced by
DIBJUDGE (solid) and the VANILLA IB baseline (dashed). DIBJUDGE
consistently achieves higher accuracy at comparable bias levels
across β, yielding a uniformly superior bias–utility trade-off.
Markers indicate representative SOTA models, which DIBJUDGE
outperforms in terms of lower bias and higher accuracy.
RQ2: The Robustness-Utility Trade-off. We investigate
the tension between robustness and utility by modulating
the coefficient β of the compression term in Equation 3.
Specifically, we aim to characterize the Pareto Frontier of
this trade-off. Figure 5 illustrates this dynamic by plot-
ting bias severity (Sbias) against m-RewardBench accuracy.
DIBJUDGE achieves a consistently better Pareto frontier
than the VANILLA IB baseline, indicating that it learns a
more compact and robust representation without discarding
key semantic features. Furthermore, DIBJUDGE strictly
dominates its base model and strong proprietary baselines
(e.g., GPT-4o, Gemini-2.5-Flash), consistently achieving
higher accuracy across all fixed levels of bias severity.
These findings confirm that the proposed method mitigates
translationese bias without substantial utility degradation.
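The frontier in this trade-off is the set of operating points that no other point beats on both axes. A minimal sketch of how such a frontier can be extracted from a sweep over β follows; the function name and the data points are hypothetical and are not taken from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (bias, accuracy): lower bias and higher accuracy are better.
    """
    # Sort by bias ascending, breaking ties by accuracy descending.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier = []
    best_acc = float("-inf")
    for bias, acc in ordered:
        # A point survives only if it is more accurate than
        # every point with lower (or equal) bias seen so far.
        if acc > best_acc:
            frontier.append((bias, acc))
            best_acc = acc
    return frontier

# Hypothetical (bias severity, accuracy) pairs, one per beta value.
sweep = [(0.10, 0.870), (0.15, 0.882), (0.20, 0.880),
         (0.25, 0.889), (0.30, 0.885)]
print(pareto_frontier(sweep))
```

Under this definition, one method has a better frontier than another when, at any given bias level, its frontier reaches at least the same accuracy.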
RQ3: Disentanglement of Latent Representations. We
visualize the geometry of the learned representations using
t-SNE (van der Maaten & Hinton, 2008), extracting bias
(Zb) and robust (Zr) features from a held-out evaluation set
comprising human and machine-translated texts. As shown
in Figure 6, the latent spaces exhibit divergent topologies.
The bias space (Fig. 6a) forms distinct clusters based on text
origin, confirming that Zb encodes translationese artifacts.
[Figure 6: two t-SNE scatter plots on Aya, Human vs. Machine text. (a) Bias Representations (Zb), Source Separation: 1.65. (b) Robust Representations (Zr), Source Separation: 0.02.]
Figure 6. Visualization of Latent Representation Disentanglement.
t-SNE projections of embeddings for Human (Blue) vs.
Machine (Red) text. (a) Zb clearly separates domains. (b) Zr
shows a mixed distribution, corroborating domain invariance.
Table 2. Zero-Shot Generalization to Unseen Biases. We
evaluate performance on a held-out subset containing biases not
encountered during training (Out-of-Distribution). DIBJUDGE
achieves the lowest bias scores across both in-distribution
(Translationese) and unseen heuristics (Length, Self-Preference).
               ID                OOD (Unseen Biases)
Method         Trans. Sbias ↓    Length ρ ↓    Self-Pref. Sbias ↓
Vanilla SFT    0.247             0.553         0.314
Vanilla IB     0.168             0.482         0.276
DIBJUDGE       0.083             0.314         0.219
Conversely, the robust space (Fig. 6b) shows substantial
domain overlap. This pattern indicates that Zr is largely
invariant to translationese artifacts, effectively
disentangling them from the underlying semantic content.
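Figure 6 reports a "Source Separation" score for each space, but this section does not define it. As a rough sketch of how such separability could be quantified, the code below divides the distance between the two group centroids by the pooled root-mean-square distance of points to their own centroid. This measure, the function names, and the toy coordinates are all assumptions for illustration and may differ from the authors' metric.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def separation_score(group_a, group_b):
    """Centroid distance divided by pooled RMS within-group spread.

    Higher values mean the groups are easier to tell apart.
    Undefined (division by zero) if every point sits on its centroid.
    """
    ca, cb = centroid(group_a), centroid(group_b)
    between = math.dist(ca, cb)
    sq_within = [math.dist(p, ca) ** 2 for p in group_a]
    sq_within += [math.dist(p, cb) ** 2 for p in group_b]
    within = math.sqrt(sum(sq_within) / len(sq_within))
    return between / within

# Toy 2-D embeddings (hypothetical): clearly separated vs. fully mixed.
human_sep = [(0, 0), (0, 2), (2, 0), (2, 2)]
machine_sep = [(10, 0), (10, 2), (12, 0), (12, 2)]
human_mix = [(0, 0), (2, 2)]
machine_mix = [(0, 2), (2, 0)]
print(separation_score(human_sep, machine_sep))
print(separation_score(human_mix, machine_mix))
```

A score like this is better computed on the original embeddings than on t-SNE coordinates, since t-SNE does not preserve global distances.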
RQ4: Zero-Shot Generalization to Unseen Biases.
We evaluate the generalization capability of DIBJUDGE
against biases not encountered during training—specifically
length (Saito et al., 2023) and self-preference (Wataoka
et al., 2024)—using a held-out subset of Skywork-Reward-
Preference-80K (Liu et al., 2024a). Length bias is quantified
via the Spearman rank correlation (ρ) between response
length and predicted ratings, while self-preference is evaluated
using bias severity (Sbias) as defined in Eq. 1, applied to
the judge's preference for the model's own generations. As detailed
in Table 2, DIBJUDGE demonstrates superior robustness
compared to the vanilla SFT and IB baselines. Our method
reduces the correlation between response length
and predicted ratings (ρ = 0.314 versus 0.553 for vanilla SFT)
and lowers self-preference bias (Sbias = 0.219 versus 0.314). These
results suggest that DIBJUDGE learns to filter
superficial heuristics (e.g., verbosity) rather than merely
memorizing specific artifacts such as translationese.
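The length-bias measurement is a Spearman rank correlation between response length and the judge's rating. A minimal standard-library sketch is given below: values are converted to ranks (ties share the mean rank), and the Pearson correlation of the ranks is returned. The function names and the sample data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

lengths = [120, 340, 95, 510, 270]   # hypothetical response lengths
ratings = [6.0, 7.5, 6.5, 8.0, 7.0]  # hypothetical judge ratings
print(round(spearman_rho(lengths, ratings), 3))
```

A value near zero means the judge's ratings do not track length monotonically, which is why lower ρ is better in Table 2.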
RQ5: Impact of DIB Objective Components and Spurious
Proxy Tasks. To assess the individual contributions of
the components in the DIB objective (Eq. 3), we evaluate
various combinations of the compression, bias-capture, and
disentanglement objectives. As shown in Table 3, isolated
objectives yield suboptimal results. While the disentanglement
term is particularly effective at reducing the bias score
relative to the baseline, it remains insufficient to maximize
predictive accuracy. The highest performance across both
metrics is achieved by integrating all three terms
(Sbias = 0.031, accuracy 89.85), suggesting
that the synergy between information-bottleneck-driven
compression and latent-space disentanglement is critical
for balancing fairness and model utility. Furthermore, we
investigate the impact of the proposed spurious proxy tasks:
Cross-Lingual Alignment (CLA) and Log-Probability Bin
Classification (LPBC). Table 4 illustrates that omitting these
tasks leads to a significant degradation in bias mitigation
(Sbias = 0.421 with neither task versus 0.147 with both).
The substantial improvement in Sbias when both tasks are
employed confirms their effectiveness in capturing spurious
correlations related to translationese. Detailed proxy task
ablation studies are presented in Appendix F.1.
Table 3. Ablation study of the DIB objectives. We report the bias
score (Sbias, lower is better) and Accuracy (higher is better). The
combination of all terms achieves the best trade-off.
Objectives (Compression, Bias, Disentangle)    Sbias (↓)    Acc. (↑)
✓                                              0.124        85.25
✓                                              0.150        86.60
✓                                              0.053        85.85
✓ ✓                                            0.091        88.20
✓ ✓                                            0.039        88.55
✓ ✓                                            0.035        89.10
✓ ✓ ✓                                          0.031        89.85
Table 4. Ablation study of proxy tasks. CLA indicates the
cross-lingual alignment proxy task, and LPBC stands for log-probability
bin classification proxy task.
Configuration    None     CLA      LPBC     CLA + LPBC
Sbias (↓)        0.421    0.312    0.279    0.147
Acc. (↑)         87.12    87.86    88.43    89.18
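The paper's summary describes the disentanglement term as a cross-covariance penalty between the robust and bias branches. Eq. 3 is not reproduced in this section, so the following is only a sketch of one common form of such a penalty: the squared Frobenius norm of the batch cross-covariance matrix between Zr and Zb. The function name, the 1/n normalization, and the toy batches are assumptions.

```python
def cross_covariance_penalty(z_r, z_b):
    """Squared Frobenius norm of the batch cross-covariance between
    robust features z_r (n x d_r) and bias features z_b (n x d_b)."""
    n = len(z_r)
    d_r, d_b = len(z_r[0]), len(z_b[0])
    # Per-dimension batch means, used to center both feature sets.
    mean_r = [sum(row[i] for row in z_r) / n for i in range(d_r)]
    mean_b = [sum(row[j] for row in z_b) / n for j in range(d_b)]
    penalty = 0.0
    for i in range(d_r):
        for j in range(d_b):
            cov_ij = sum(
                (z_r[k][i] - mean_r[i]) * (z_b[k][j] - mean_b[j])
                for k in range(n)
            ) / n  # assumption: biased (1/n) covariance estimate
            penalty += cov_ij ** 2
    return penalty

# Toy batches (hypothetical): identical features vs. uncorrelated features.
correlated = cross_covariance_penalty([[1], [2], [3], [4]], [[1], [2], [3], [4]])
uncorrelated = cross_covariance_penalty([[1], [-1], [1], [-1]], [[1], [1], [-1], [-1]])
print(correlated, uncorrelated)
```

Driving this quantity to zero removes linear dependence between the two branches; it does not by itself guarantee full statistical independence. In training, the same computation would be expressed with tensor operations so that gradients flow to both encoders.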
6. Related Work
LLM-as-a-Judge. The LLM-as-a-Judge paradigm marks a
fundamental shift from traditional n-gram (e.g., BLEU (Pa-
pineni et al., 2002), ROUGE (Lin, 2004)) and embedding-
based metrics (e.g., BERTScore (Zhang et al., 2020b; Rei
et al., 2020)) toward generative evaluation (Gu et al., 2024;
Li et al., 2025). While early adoption relied on proprietary
models like GPT-4 due to their high correlation with human
judgment (Liu et al., 2023; Zheng et al., 2023), concerns
regarding cost and transparency have catalyzed a transition
to open-weight evaluators (Wang et al., 2024e;c;a; Kim et al.,
2024a;b). Recently, this paradigm has further evolved from
direct generative judgment to incorporating explicit reasoning
steps to enhance reliability (Chen et al., 2025a; Guo et al., 2025;
Chen et al., 2025b). However, despite these advancements,
LLM judges remain susceptible to systematic biases (Ye
et al., 2025; Wang et al., 2024b; Zheng et al., 2024), such
as position bias (Shi et al., 2025; Ko et al., 2020), verbosity
bias (Saito et al., 2023), and self-preference bias (Wang
et al., 2024c). In contrast to prior work that primarily
studies bias in English-centric settings, we investigate
translationese bias in multilingual contexts and analyze
the spurious correlations underlying it.
Multilingual Judges. Compared to the English context,
multilingual LLM-as-a-Judge remains significantly under-
explored. Initial efforts to bridge this gap, such as Her-
cule (Doddapaneni et al., 2025) and M-Prometheus (Pombal
et al., 2025), rely heavily on fine-tuning with translated or
synthetic instruction sets. More recently, approaches like
mR3 (Anugraha et al., 2025) and Think-as-Locals (Zhang
et al., 2025) have advanced the field by integrating reasoning
capabilities, employing Chain-of-Thought (CoT) (Wei et al.,
2022) distillation and reinforcement learning to enhance
multilingual reward modeling. However, although these
approaches achieve promising results, the robustness of these
evaluators remains unexamined. Crucially, existing frameworks
fail to account for the systematic artifacts introduced by
translation-based training data. To address this reliability
gap, our work provides the first dedicated mitigation of
translationese bias, resolving specific failures in cross-
lingual evaluation that prior methodologies overlook.
Information Bottleneck in LLMs. Originally formulated
to extract minimal sufficient statistics (Tishby et al., 2000),
the Information Bottleneck (IB) principle has recently
emerged as a vital framework for analyzing and optimizing
LLMs, spanning diverse objectives including enhancing
interpretability by mapping hidden states to human-readable
concepts (Sun et al., 2025; Li et al., 2023), optimizing CoT
reasoning paths to be invariant to prompt nuances (Lei et al.,
2025), and compressing contexts in Retrieval-Augmented
Generation to filter noise (Zhu et al., 2024). In safety
domains, methods such as IBProtector (Liu et al., 2024b)
leverage IB to strip adversarial triggers. Diverging from
these approaches, we present the first application of a
disentangled IB designed to debias LLM judges.
7. Conclusion
In this study, we systematically investigated translationese
bias in multilingual LLM-as-a-Judge frameworks and identi-
fied key spurious factors that undermine reliable evaluation,
particularly in low-resource languages. Guided by these
insights, we proposed DIBJUDGE, a disentangled informa-
tion bottleneck–based fine-tuning framework that separates
judgment-critical semantics from spurious translationese
attributes. Extensive experiments across multilingual re-
ward modeling benchmarks and dedicated bias evaluations
demonstrate that DIBJUDGE substantially mitigates trans-
lationese bias while maintaining strong utility.
Impact Statement
This paper presents work whose goal is to advance the
field of machine learning, with a focus on improving the
robustness of multilingual evaluation using large language
models. The methods proposed in this work are intended
for model evaluation and benchmarking rather than direct
user-facing applications. While improved evaluation may
have downstream benefits for the development of more
reliable and inclusive multilingual systems, we do not
foresee significant or immediate negative societal impacts
arising from this work.
References
Alain, G. and Bengio, Y. Understanding intermediate
layers using linear classifier probes. arXiv preprint
arXiv:1610.01644, 2016.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K.
Deep variational information bottleneck. In International
Conference on Learning Representations, 2017.
Anugraha, D., Hung, S.-Y., Tang, Z., Lee, A. E.-S., Wi-
jaya, D. T., and Winata, G. I. mr3: Multilingual
rubric-agnostic reward reasoning models. arXiv preprint
arXiv:2510.01146, 2025.
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.
Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla,
S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer,
L., and Khabsa, M. The belebele benchmark: a par-
allel reading comprehension dataset in 122 language
variants. In Ku, L.-W., Martins, A., and Srikumar, V.
(eds.), Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pp. 749–775, Bangkok, Thailand, August
2024. Association for Computational Linguistics. doi:
10.18653/v1/2024.acl-long.44.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S.,
Bengio, Y., Courville, A., and Hjelm, D. Mutual in-
formation neural estimation. In International conference
on machine learning, pp. 531–540. PMLR, 2018.
Bogavelli, T., Bamgbose, O., Melançon, G. G., Riols, F.,
and Sharma, R. Evaluating robustness of large language
models in enterprise applications: Benchmarks for pertur-
bation consistency across formats and languages. arXiv
preprint arXiv:2601.06341, 2026.
Chen, G. H., Chen, S., Liu, Z., Jiang, F., and Wang, B.
Humans or LLMs as the judge? a study on judgement
bias. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N.
(eds.), Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing, pp. 8301–
8327, Miami, Florida, USA, November 2024. Association
for Computational Linguistics. doi: 10.18653/v1/2024.
emnlp-main.474.
Chen, N., Hu, Z., Zou, Q., Wu, J., Wang, Q., Hooi, B., and
He, B. Judgelrm: Large reasoning models as a judge.
arXiv preprint arXiv:2504.00050, 2025a.
Chen, X., Li, G., Wang, Z., Jin, B., Qian, C., Wang,
Y., Wang, H., Zhang, Y., Zhang, D., Zhang, T., et al.
Rm-r1: Reward modeling as reasoning. arXiv preprint
arXiv:2505.02387, 2025b.
Chen, Y., Yao, Y., Zhang, Y., Shen, B., Liu, G., and Liu,
S. Safety mirage: How spurious correlations undermine
vlm safety fine-tuning. arXiv preprint arXiv:2503.11832,
2025c.
Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin,
L. Club: A contrastive log-ratio upper bound of mutual
information. In International conference on machine
learning, pp. 1779–1788. PMLR, 2020.
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I.,
Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang,
D., Rosen, E., et al. Gemini 2.5: Pushing the frontier
with advanced reasoning, multimodality, long context,
and next generation agentic capabilities. arXiv preprint
arXiv:2507.06261, 2025.
Costa-Jussà, M. R., Cross, J., Çelebi, O., Elbayad, M.,
Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht,
D., Maillard, J., et al. No language left behind: Scaling
human-centered machine translation. arXiv preprint
arXiv:2207.04672, 2022.
Cover, T. M. Elements of information theory. John Wiley &
Sons, 1999.
Dao, T. FlashAttention-2: Faster attention with better paral-
lelism and work partitioning. In International Conference
on Learning Representations (ICLR), 2024.
Doddapaneni, S., Khan, M. S. U. R., Venkatesh, D., Dabre,
R., Kunchukuttan, A., and Khapra, M. M. Cross-lingual
auto evaluation for assessing multilingual LLMs. In Che,
W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.),
Proceedings of the 63rd Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long
Papers), pp. 29297–29329, Vienna, Austria, July 2025.
Association for Computational Linguistics. ISBN 979-8-
89176-251-0. doi: 10.18653/v1/2025.acl-long.1419.
Fu, X. and Liu, W. How reliable is multilingual llm-as-a-
judge? arXiv preprint arXiv:2505.12201, 2025.
Gao, J., Chen, C., Jia, Y., Gong, X., Lam, K.-Y., and Wang,
Q. Evaluating and mitigating llm-as-a-judge bias in com-
munication systems. arXiv preprint arXiv:2510.12462,
2025.
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian,
A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A.,
Vaughan, A., et al. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783, 2024.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and
Schölkopf, B. Kernel methods for measuring independence.
J. Mach. Learn. Res., 6:2075–2129, December
2005. ISSN 1532-4435.
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W.,
Shen, Y., Ma, S., Liu, H., et al. A survey on llm-as-a-
judge. The Innovation, 2024.
Gui, S. and Ji, S. Mitigating spurious correlations in
llms via causality-aware post-training. arXiv preprint
arXiv:2506.09433, 2025.
Guo, J., Chi, Z., Dong, L., Dong, Q., Wu, X., Huang, S., and
Wei, F. Reward reasoning models. In The Thirty-ninth
Annual Conference on Neural Information Processing
Systems, 2025.
Gureja, S., Miranda, L. J. V., Islam, S. B., Maheshwary,
R., Sharma, D., Winata, G. T., Lambert, N., Ruder, S.,
Hooker, S., and Fadaee, M. M-RewardBench: Evaluating
reward models in multilingual settings. In Che, W.,
Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Pro-
ceedings of the 63rd Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers),
pp. 43–58, Vienna, Austria, July 2025. Association for
Computational Linguistics. ISBN 979-8-89176-251-0.
doi: 10.18653/v1/2025.acl-long.3.
Hada, R., Gumma, V., de Wynter, A., Diddee, H., Ahmed,
M., Choudhury, M., Bali, K., and Sitaram, S. Are large
language model-based evaluators the solution to scaling
up multilingual evaluation? In Graham, Y. and Purver, M.
(eds.), Findings of the Association for Computational Lin-
guistics: EACL 2024, pp. 1051–1070, St. Julian’s, Malta,
March 2024. Association for Computational Linguistics.
Hasan, T., Bhattacharjee, A., Islam, M. S., Mubasshir, K., Li,
Y.-F., Kang, Y.-B., Rahman, M. S., and Shahriyar, R. XL-
sum: Large-scale multilingual abstractive summarization
for 44 languages. In Zong, C., Xia, F., Li, W., and Navigli,
R. (eds.), Findings of the Association for Computational
Linguistics: ACL-IJCNLP 2021, pp. 4693–4703, Online,
August 2021. Association for Computational Linguistics.
doi: 10.18653/v1/2021.findings-acl.413.
Hron, J., Bahri, Y., Sohl-Dickstein, J., and Novak, R. Infinite
attention: Nngp and ntk for deep attention networks. In
International Conference on Machine Learning, pp. 4376–
4386. PMLR, 2020.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation
of large language models. ICLR, 1(2):3, 2022.
Huang, K., Mo, F., Zhang, X., Li, H., Li, Y., Zhang, Y.,
Yi, W., Mao, Y., Liu, J., Xu, Y., et al. A survey on large
language models with multilingualism: Recent advances
and new frontiers. arXiv preprint arXiv:2405.10936,
2024.
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh,
A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A.,
Radford, A., et al. Gpt-4o system card. arXiv preprint
arXiv:2410.21276, 2024.
Hyvärinen, A. and Oja, E. Independent component analysis:
algorithms and applications. Neural Networks, 13(4):
411–430, 2000. ISSN 0893-6080. doi: https://doi.org/10.
1016/S0893-6080(00)00026-5.
Joshi, P., Santy, S., Budhiraja, A., Bali, K., and Choudhury,
M. The state and fate of linguistic diversity and inclusion
in the NLP world. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J. (eds.), Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics,
pp. 6282–6293, Online, July 2020. Association for Com-
putational Linguistics. doi: 10.18653/v1/2020.acl-main.
560.
Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun,
S., Shin, S., Kim, S., Thorne, J., and Seo, M. Prometheus:
Inducing fine-grained evaluation capability in language
models. In The Twelfth International Conference on
Learning Representations, 2024a.
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck,
S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus
2: An open source language model specialized in evalu-
ating other language models. In Al-Onaizan, Y., Bansal,
M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Con-
ference on Empirical Methods in Natural Language Pro-
cessing, pp. 4334–4353, Miami, Florida, USA, November
2024b. Association for Computational Linguistics. doi:
10.18653/v1/2024.emnlp-main.248.
Kingma, D. P. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P., Salimans, T., and Welling, M. Variational
dropout and the local reparameterization trick. In Cortes,
C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett,
R. (eds.), Advances in Neural Information Processing
Systems, volume 28. Curran Associates, Inc., 2015.
Ko, M., Lee, J., Kim, H., Kim, G., and Kang, J. Look at
the first sentence: Position bias in question answering. In
Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceed-
ings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pp. 1109–1121,
Online, November 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.emnlp-main.84.
Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch,
D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov,
A., Sikasote, C., et al. Quality at a glance: An audit
of web-crawled multilingual datasets. Transactions of
the Association for Computational Linguistics, 10:50–72,
2022.
Lai, H., Liu, X., Gao, J., Cheng, J., Qi, Z., Xu, Y., Yao, S.,
Zhang, D., Du, J., Hou, Z., Lv, X., Huang, M., Dong,
Y., and Tang, J. A survey of post-training scaling in
large language models. In Che, W., Nabende, J., Shutova,
E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 2771–2791,
Vienna, Austria, July 2025. Association for Compu-
tational Linguistics. ISBN 979-8-89176-251-0. doi:
10.18653/v1/2025.acl-long.140.
Lambert, N., Pyatkin, V., Morrison, J., Miranda, L., Lin,
B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi,
Y., Smith, N. A., and Hajishirzi, H. Rewardbench:
Evaluating reward models for language modeling, 2024.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington,
J., and Sohl-Dickstein, J. Deep neural networks as
gaussian processes. In International Conference on
Learning Representations, 2018.
Lei, S., Cheng, Z., Jia, K., and Tao, D. Revisiting llm
reasoning via information bottleneck. arXiv preprint
arXiv:2507.18391, 2025.
Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z.,
Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., Shu, K.,
Cheng, L., and Liu, H. From generation to judgment:
Opportunities and challenges of LLM-as-a-judge. In
Christodoulopoulos, C., Chakraborty, T., Rose, C., and
Peng, V. (eds.), Proceedings of the 2025 Conference on
Empirical Methods in Natural Language Processing, pp.
2757–2791, Suzhou, China, November 2025. Association
for Computational Linguistics. ISBN 979-8-89176-332-6.
doi: 10.18653/v1/2025.emnlp-main.138.
Li, Q., Wu, Z., Kong, L., and Bi, W. Explanation re-
generation via information bottleneck. In Rogers, A.,
Boyd-Graber, J., and Okazaki, N. (eds.), Findings of the
Association for Computational Linguistics: ACL 2023, pp.
12081–12102, Toronto, Canada, July 2023. Association
for Computational Linguistics. doi: 10.18653/v1/2023.
findings-acl.765.
Lin, C.-Y. ROUGE: A package for automatic evaluation
of summaries. In Text Summarization Branches Out,
pp. 74–81, Barcelona, Spain, July 2004. Association for
Computational Linguistics.
Liu, C. Y., Zeng, L., Liu, J., Yan, R., He, J., Wang, C., Yan,
S., Liu, Y., and Zhou, Y. Skywork-reward: Bag of tricks
for reward modeling in llms, 2024a.
Liu, X., Sanchez, P., Thermos, S., O’Neil, A. Q., and
Tsaftaris, S. A. Learning disentangled representations in
the imaging domain. Medical Image Analysis, 80:102516,
2022.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C.
G-eval: NLG evaluation using gpt-4 with better human
alignment. In Bouamor, H., Pino, J., and Bali, K. (eds.),
Proceedings of the 2023 Conference on Empirical Meth-
ods in Natural Language Processing, pp. 2511–2522, Sin-
gapore, December 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.emnlp-main.153.
Liu, Z., Wang, Z., Xu, L., Wang, J., Song, L., Wang, T.,
Chen, C., Cheng, W., and Bian, J. Protecting your LLMs
with information bottleneck. In The Thirty-eighth Annual
Conference on Neural Information Processing Systems,
2024b.
Muennighoff, N., Wang, T., Sutawika, L., Roberts, A.,
Biderman, S., Le Scao, T., Bari, M. S., Shen, S., Yong,
Z. X., Schoelkopf, H., Tang, X., Radev, D., Aji, A. F.,
Almubarak, K., Albanie, S., Alyafeai, Z., Webson, A.,
Raff, E., and Raffel, C. Crosslingual generalization
through multitask finetuning. In Rogers, A., Boyd-Graber,
J., and Okazaki, N. (eds.), Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 15991–16111, Toronto,
Canada, July 2023. Association for Computational Lin-
guistics. doi: 10.18653/v1/2023.acl-long.891.
Padarha, S., Hale, S. A., Mahdi, A., Semenova, E., and
Vidgen, B. Evaluating LLM-as-a-judge under multi-
lingual, multimodal and multi-domain constraints. In
NeurIPS 2025 Workshop on Evaluating the Evolving LLM
Lifecycle: Benchmarks, Emergent Abilities, and Scaling,
2025.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a
method for automatic evaluation of machine translation.
In Isabelle, P., Charniak, E., and Lin, D. (eds.), Proceed-
ings of the 40th Annual Meeting of the Association for
Computational Linguistics, pp. 311–318, Philadelphia,
Pennsylvania, USA, July 2002. Association for Computa-
tional Linguistics. doi: 10.3115/1073083.1073135.
Pombal, J., Yoon, D., Fernandes, P., Wu, I., Kim, S., Rei,
R., Neubig, G., and Martins, A. M-prometheus: A suite
of open multilingual LLM judges. In Second Conference
on Language Modeling, 2025.
Qin, L., Chen, Q., Zhou, Y., Chen, Z., Li, Y., Liao, L., Li,
M., Che, W., and Yu, P. S. A survey of multilingual large
language models. Patterns, 6(1):101118, 2025. ISSN
2666-3899. doi: https://doi.org/10.1016/j.patter.2024.
101118.
Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B.,
Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang,
J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J.,
Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue,
M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang,
T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y.,
Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5
technical report, 2025.
Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. Deep-
speed: System optimizations enable training deep learn-
ing models with over 100 billion parameters. Proceedings
of the 26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, 2020.
Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. COMET:
A neural framework for MT evaluation. In Webber, B.,
Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of
the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 2685–2702, Online,
November 2020. Association for Computational Linguis-
tics. doi: 10.18653/v1/2020.emnlp-main.213.
Saito, K., Wachi, A., Wataoka, K., and Akimoto, Y. Ver-
bosity bias in preference labeling by large language
models. In NeurIPS 2023 Workshop on Instruction Tuning
and Instruction Following, 2023.
Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., and Vosoughi,
S. Judging the judges: A systematic study of position
bias in LLM-as-a-judge. In Inui, K., Sakti, S., Wang,
H., Wong, D. F., Bhattacharyya, P., Banerjee, B., Ekbal,
A., Chakraborty, T., and Singh, D. P. (eds.), Proceedings
of the 14th International Joint Conference on Natural
Language Processing and the 4th Conference of the Asia-
Pacific Chapter of the Association for Computational
Linguistics, pp. 292–314, Mumbai, India, December
2025. The Asian Federation of Natural Language Process-
ing and The Association for Computational Linguistics.
ISBN 979-8-89176-298-5.
Shuieh, J., Singhal, P., Shanker, A., Heyer, J., Pu, G.,
and Denton, S. M. Assessing robustness to spurious
correlations in post-training language models. In Workshop on Spurious
Correlation and Shortcut Learning: Foundations and
Solutions, 2025.
Singh, S., Vargus, F., D’souza, D., Karlsson, B. F., Mahendi-
ran, A., Ko, W.-Y., Shandilya, H., Patel, J., Mataciunas,
D., O’Mahony, L., Zhang, M., Hettiarachchi, R., Wilson,
J., Machado, M., Moura, L., Krzemiński, D., Fadaei,
H., Ergun, I., Okoh, I., Alaagib, A., Mudannayake,
O., Alyafeai, Z., Chien, V., Ruder, S., Guthikonda, S.,
Alghamdi, E., Gehrmann, S., Muennighoff, N., Bartolo,
M., Kreutzer, J., Üstün, A., Fadaee, M., and Hooker, S.
Aya dataset: An open-access collection for multilingual
instruction tuning. In Ku, L.-W., Martins, A., and
Srikumar, V. (eds.), Proceedings of the 62nd Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 11521–11567, Bangkok,
Thailand, August 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.acl-long.620.
Son, G., Yoon, D., Suk, J., Aula-Blasco, J., Aslan, M., Kim,
V. T., Islam, S. B., Prats-Cristià, J., Tormo-Bañuelos, L.,
and Kim, S. Mm-eval: A multilingual meta-evaluation
benchmark for llm-as-a-judge and reward models. arXiv
preprint arXiv:2410.17578, 2024.
Sun, C.-E., Oikarinen, T., Ustun, B., and Weng, T.-W. Con-
cept bottleneck large language models. In The Thirteenth
International Conference on Learning Representations,
2025.
Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard,
N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A.,
Rivière, M., et al. Gemma 3 technical report. arXiv
preprint arXiv:2503.19786, 2025.
Tishby, N., Pereira, F. C., and Bialek, W. The information
bottleneck method. arXiv preprint physics/0004057,
2000.
Van Den Oord, A., Vinyals, O., et al. Neural discrete
representation learning. Advances in neural information
processing systems, 30, 2017.
van der Maaten, L. and Hinton, G. Visualizing data using
t-sne. Journal of Machine Learning Research, 9(86):
2579–2605, 2008.
Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang,
T. Interpretable preferences via multi-objective reward
modeling and mixture-of-experts. In Al-Onaizan, Y.,
Bansal, M., and Chen, Y.-N. (eds.), Findings of the
Association for Computational Linguistics: EMNLP
2024, pp. 10582–10592, Miami, Florida, USA, November
2024a. Association for Computational Linguistics. doi:
10.18653/v1/2024.findings-emnlp.620.
Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao,
Y., Kong, L., Liu, Q., Liu, T., and Sui, Z. Large language
models are not fair evaluators. In Ku, L.-W., Martins, A.,
and Srikumar, V. (eds.), Proceedings of the 62nd Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 9440–9450, Bangkok,
Thailand, August 2024b. Association for Computational
Linguistics. doi: 10.18653/v1/2024.acl-long.511.
Wang, Q., Lou, Z., Tang, Z., Chen, N., Zhao, X., Zhang,
W., Song, D., and He, B. Assessing judging bias in large
reasoning models: An empirical study. arXiv preprint
arXiv:2504.09946, 2025a.
Wang, T., Kulikov, I., Golovneva, O., Yu, P., Yuan, W.,
Dwivedi-Yu, J., Pang, R. Y., Fazel-Zarandi, M., Weston,
J., and Li, X. Self-taught evaluators. arXiv preprint
arXiv:2408.02666, 2024c.
Wang, X., Chen, H., Tang, S., Wu, Z., and Zhu, W.
Disentangled representation learning. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 46(12):
9677–9696, 2024d.
Wang, Y., Yu, Z., Yao, W., Zeng, Z., Yang, L., Wang, C.,
Chen, H., Jiang, C., Xie, R., Wang, J., Xie, X., Ye, W.,
Zhang, S., and Zhang, Y. PandaLM: An automatic evalua-
tion benchmark for LLM instruction tuning optimization.
In The Twelfth International Conference on Learning
Representations, 2024e.
Wang, Z., Zeng, J., Delalleau, O., Egert, D., Evans, E.,
Shin, H.-C., Soares, F., Dong, Y., and Kuchaiev, O.
HelpSteer3: Human-annotated feedback and edit data to
empower inference-time scaling in open-ended general-
domain tasks. In Che, W., Nabende, J., Shutova, E.,
and Pilehvar, M. T. (eds.), Proceedings of the 63rd
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 25640–25662,
Vienna, Austria, July 2025b. Association for Compu-
tational Linguistics. ISBN 979-8-89176-251-0. doi:
10.18653/v1/2025.acl-long.1246.
Wataoka, K., Takahashi, T., and Ri, R. Self-preference
bias in LLM-as-a-judge. In Neurips Safe Generative AI
Workshop 2024, 2024.
Weber, M., Fu, D. Y., Anthony, Q., Oren, Y., Adams, S.,
Alexandrov, A., Lyu, X., Nguyen, H., Yao, X., Adams,
V., Athiwaratkun, B., Chalamala, R., Chen, K., Ryabinin,
M., Dao, T., Liang, P., Ré, C., Rish, I., and Zhang, C.
Redpajama: an open dataset for training large language
models. NeurIPS Datasets and Benchmarks Track, 2024.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi,
E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting
elicits reasoning in large language models. Advances in
neural information processing systems, 35:24824–24837,
2022.
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B.,
Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical
report. arXiv preprint arXiv:2505.09388, 2025.
Ye, J., Wang, Y., Huang, Y., Chen, D., Zhang, Q., Moniz, N.,
Gao, T., Geyer, W., Huang, C., Chen, P.-Y., Chawla, N. V.,
and Zhang, X. Justice or prejudice? quantifying biases
in LLM-as-a-judge. In The Thirteenth International
Conference on Learning Representations, 2025.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S.
Barlow twins: Self-supervised learning via redundancy
reduction. In International conference on machine learn-
ing, pp. 12310–12320. PMLR, 2021.
Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improv-
ing massively multilingual neural machine translation
and zero-shot translation. In Jurafsky, D., Chai, J.,
Schluter, N., and Tetreault, J. (eds.), Proceedings of
the 58th Annual Meeting of the Association for Com-
putational Linguistics, pp. 1628–1639, Online, July
2020a. Association for Computational Linguistics. doi:
10.18653/v1/2020.acl-main.148.
Zhang, B., Xie, H., Gao, Z., and Wang, Y. Choose what you
need: Disentangled representation learning for scene text
recognition removal and editing. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pp. 28358–28368, 2024.
Zhang, H., Chen, K., Bai, X., Xiang, Y., and Zhang, M.
Evaluating and improving cultural awareness of reward
models for llm alignment. arXiv preprint arXiv:2509.21798,
2025.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi,
Y. Bertscore: Evaluating text generation with bert. In
International Conference on Learning Representations,
2020b.
Zheng, C., Zhou, H., Meng, F., Zhou, J., and Huang, M.
Large language models are not robust multiple choice
selectors. In The Twelfth International Conference on
Learning Representations, 2024.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z.,
Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H.,
Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge
with MT-bench and chatbot arena. In Thirty-seventh
Conference on Neural Information Processing Systems
Datasets and Benchmarks Track, 2023.
Zhu, K., Feng, X., Du, X., Gu, Y., Yu, W., Wang, H., Chen,
Q., Chu, Z., Chen, J., and Qin, B. An information bottle-
neck perspective for effective noise filtering on retrieval-
augmented generation. In Ku, L.-W., Martins, A., and
Srikumar, V. (eds.), Proceedings of the 62nd Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 1044–1069, Bangkok,
Thailand, August 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.acl-long.59.
Appendix Contents

| Section | Contents |
|---|---|
| Appendix A | Translationese Bias Evaluation Suite. Language taxonomy, dataset construction, and qualitative examples. |
| Appendix B | Theoretical Analysis. Information bottleneck bounds, derivation of the compression loss function, identifiability discussion, and disentanglement proofs. |
| Appendix C | Experimental Details. Benchmarks, evaluation metrics, model configurations, and training protocols. |
| Appendix D | Comprehensive Multilingual Results. Full results on RewardBench, M-RewardBench, and MM-Eval. |
| Appendix E | Translationese Bias Results. Detailed quantitative translationese bias evaluation results across language pairs. |
| Appendix F | Additional Experiments and Ablations. Spurious proxy task analysis, sensitivity analysis, method robustness checks, and information leakage validation of the disentangled representations. |
A. Bias Evaluation Suite
A.1. Language Selection and Taxonomy
We adopt the language taxonomy proposed by Joshi et al. (2020) to categorize languages based on their resource availability.
Specifically, we partition the selected languages into three distinct tiers based on their assigned resource classes: High-
resource (Classes 4 and 5), Mid-resource (Class 3), and Low-resource (Classes 0, 1, and 2).
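The tier assignment above is a simple lookup on the Joshi et al. (2020) class. The sketch below illustrates it; the function name `resource_tier` and the string labels are hypothetical and not taken from the paper.

```python
def resource_tier(joshi_class: int) -> str:
    """Map a Joshi et al. (2020) resource class (0-5) to a tier label.

    Classes 4 and 5 -> "high", class 3 -> "mid", classes 0-2 -> "low",
    following the partition described in the text. The labels themselves
    are illustrative.
    """
    if joshi_class not in range(6):
        raise ValueError(f"Joshi class must be in 0..5, got {joshi_class}")
    if joshi_class >= 4:
        return "high"
    if joshi_class == 3:
        return "mid"
    return "low"
```

The per-language class values would have to come from the taxonomy itself; none are assumed here.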
Our evaluation spans three primary datasets: Aya (Singh et al., 2024), Belebele (Bandarkar et al., 2024), and XL-Sum
(Hasan et al., 2021). The complete distribution of evaluated languages across these resource tiers is summarized in Table 5.
Table 5. Classification of evaluated languages across datasets based on Joshi et al. (2020) taxonomy.

| Dataset | High-Resource | Mid-Resource | Low-Resource |
|---|---|---|---|
| Aya | Basque, English, Finnish, Hindi, Japanese, Portuguese, Simp. Chinese, Spanish, Arabic, Vietnamese | Bengali, Cebuano, Filipino, Indonesian, Lithuanian, Malay, Tamil, Thai, Ukrainian, Urdu | Amharic, Irish, Kyrgyz, Nepali, Malagasy, Sinhala, S. Pashto, Telugu, Yoruba, Zulu |
| Belebele | Arabic, English, Finnish, Hindi, Japanese, Korean, Russian, Turkish, Vietnamese, Simp. Chinese | Bengali, Greek, Hebrew, Georgian, Kazakh, Tamil, Thai, Ukrainian, Urdu, Malay | Amharic, Tibetan, Guarani, Kannada, Khmer, Kyrgyz, Burmese, Punjabi, Pashto, Zulu |
| XL-Sum | Arabic, Simp. Chinese, English, French, Hindi, Japanese, Korean, Russian, Turkish, Vietnamese | Azerbaijani, Bengali, Indonesian, Tamil, Thai, Ukrainian, Urdu, Uzbek | Amharic, Burmese, Hausa, Kyrgyz, Marathi, Nepali, Pashto, Sinhala, Telugu, Welsh |
A.2. Test Set Construction
We formulate a pairwise preference task where an LLM evaluator compares two candidate responses for a given query:
(i) Chosen (xH): the original human-authored or high-quality translated reference; and (ii) Rejected (xM): a machine-generated
counterpart produced via back-translation using NLLB-200-3.3B (Costa-Jussà et al., 2022)² to inject subtle
translationese artifacts.
To isolate translationese as the primary variable and mitigate length-based confounding, we enforce a length constraint
where the token count differential between xH and xM is within ±5%. We further ensure evaluation robustness through
a position-swapping protocol, retaining only consistent judgments where the model’s preference remains invariant to the
presentation order.
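The two controls described above (the ±5% length constraint and the position-swapping protocol) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `judge` callable signature, and the choice to measure the length differential relative to xH are all hypothetical, since the text does not specify them.

```python
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: (query, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def within_length_tolerance(tokens_h: Sequence[str],
                            tokens_m: Sequence[str],
                            tol: float = 0.05) -> bool:
    """True if the token count of x_M is within +/- tol of that of x_H.

    Assumption: the differential is measured relative to len(x_H).
    """
    n_h, n_m = len(tokens_h), len(tokens_m)
    if n_h == 0:
        return False
    return abs(n_m - n_h) <= tol * n_h


def consistent_preference(judge: Judge, query: str,
                          x_h: str, x_m: str) -> Optional[str]:
    """Ask the judge in both presentation orders.

    Returns "H" or "M" when both orders agree on the same candidate,
    and None when the verdict flips with position (the pair is dropped).
    """
    first = judge(query, x_h, x_m)    # x_H shown as Assistant A
    second = judge(query, x_m, x_h)   # x_H shown as Assistant B
    pick_first = "H" if first == "A" else "M"
    pick_second = "H" if second == "B" else "M"
    return pick_first if pick_first == pick_second else None
```

A judge that always answers "A" regardless of content yields `None` for every pair, so purely positional preferences never enter the retained set.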
Beyond the baseline comparison, we introduce two distinct experimental configurations (summarized with examples in the
following subsection A.3):
• Parallel: Both candidates are semantically equivalent and factually correct. This isolates the model’s stylistic preference
for human vs. machine-translated syntax.
• Perturbed: We introduce minor, controlled edits to the machine-generated candidate (xM) to alter its factual correctness
or logical coherence, testing whether the judge prioritizes correctness over stylistic fluency.
² https://huggingface.co/facebook/nllb-200-3.3B
A.3. Detailed Dataset Examples
In this section, we provide the full text for the construction of our test sets across the three primary datasets: Aya, Belebele,
and XL-Sum. For each dataset, we present both the Parallel setting (testing stylistic preference/bias) and the Perturbed
setting (testing sensitivity to factual correctness).
Example: Aya Dataset
User Question:
问题：油的营养价值如何？
Setting 1: Parallel (Stylistic Variation)
Both answers are factually correct; B uses back-translated "translationese."
[Assistant A (Chosen xH )]
回答：油是脂肪的重要来源，提供能量和必需脂肪酸，但摄入应适量控制，避免摄入过多的脂肪。
[Assistant B (Rejected xM )]
回答：油脂乃脂肪主来源，供应热能与必备脂肪酸，然食量宜适当限制，防止摄取过多的脂质。
Setting 2: Perturbed (Factual Error)
Assistant B is modified to suggest an incorrect health outcome.
[Assistant A (Chosen xH )]
回答：油是脂肪的重要来源，提供能量和必需脂肪酸，但摄入应适量控制，避免摄入过多的脂肪。
[Assistant B (Rejected xM )]
回答：油是脂肪的重要来源，提供能量和必需脂肪酸，但摄入应适量控制，以促进脂肪摄入。
Example: Belebele Dataset
User Question (System Prompt):
Create an example of multiple-choice reading test in Chinese (Simplified). You may choose any topic you want...
Output only the test (passage, questions, answer choices, and correct answer).
Setting 1: Parallel
[Assistant A (Chosen xH )]
### Passage: 乘坐头等舱或商务舱最明显的方式是... 航空公司非常清楚，有一些核心乘客愿花高价...
### Question: 根据这段文字，预订某些票价舱位时，搜索什么是浪费时间？
### Answer Choices: (1) 直达航班(2) 商务舱(3) 折扣(4) 头等舱
### Correct Answer: (3) 折扣
[Assistant B (Rejected xM )]
### Passage: 想体验头等或商务舱最直接的途径... 航司深知，部分核心客户乐意支付高昂费用...
### Question: 依照本文内容，在订特定票价舱位时，查找何事乃浪费光阴？
### Answer Choices: (1) 不停航程(2) 公务舱(3) 优惠(4) 首等舱
### Correct Answer: (3) 优惠
Setting 2: Perturbed
Assistant B’s "Correct Answer" index is modified to be factually wrong.
[Assistant B (Rejected xM )]
... ### Answer Choices: (1) 不停航程(2) 公务舱(3) 优惠(4) 首等舱
### Correct Answer: (4) 首等舱
Example: XL-Sum Dataset
User Question (Summarization Task):
Generate a concise, coherent abstractive summary in Chinese Simplified... Do not include information not present in
the source text.
Source Text: [Long article regarding the Brazilian "Operation Weak Flesh" meat scandal involving President Michel
Temer, JBS, and BRF...]
Setting 1: Parallel (Stylistic Variation)
Both summaries accurately reflect that companies sold unsafe meat for several years and that several countries suspended imports.
[Assistant A (Chosen xH )]
近期巴西”问题肉”丑闻揭露巴西的一些公司已经数年销售不安全肉类产品。包括中国在内的一些国家和组
织已经叫停巴西的进口肉。
[Assistant B (Rejected xM )]
最近巴西“劣质肉”风波披露，部分巴西公司长达数年贩卖问题肉品。中国等许多国家与机构，均已暂停来
自巴西的肉类进口。
Setting 2: Perturbed (Factual Hallucination)
Assistant B is modified to falsely state that restrictions were "relaxed" (放宽限制) and that the scandal "just started"
(刚开始).
[Assistant A (Chosen xH )]
近期巴西”问题肉”丑闻揭露巴西的一些公司已经数年销售不安全肉类产品。包括中国在内的一些国家和组
织已经叫停巴西的进口肉。
[Assistant B (Rejected xM )]
近期巴西”问题肉”丑闻揭露巴西的一些公司刚开始销售不安全肉类产品。包括中国在内的一些国家和组织
已经放宽限制巴西的进口肉。
B. Theory Supplementary
B.1. Proof of the Upper Bound on $I(X; Z_r)$
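Proof. By definition, the mutual information $I(X; Z_r)$ is the KL divergence between the joint distribution $q_\phi(x, z_r) = p(x)\, q_\phi(z_r \mid x)$ and the product of the marginals $p(x)\, q_\phi(z_r)$, where $q_\phi(z_r) = \int q_\phi(z_r \mid x)\, p(x)\, dx$ is the aggregate posterior (marginal) distribution. This is expressed as:
\[
I(X; Z_r) = \mathbb{E}_{x \sim p(x)} \left[ \int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r \mid x)}{q_\phi(z_r)} \, dz_r \right]. \tag{9}
\]
To derive the upper bound, we introduce an arbitrary fixed prior $p(z_r)$. We multiply and divide the argument of the logarithm by this prior $p(z_r)$:
\[
\begin{aligned}
I(X; Z_r) &= \mathbb{E}_{x \sim p(x)} \left[ \int q_\phi(z_r \mid x) \log \left( \frac{q_\phi(z_r \mid x)}{q_\phi(z_r)} \cdot \frac{p(z_r)}{p(z_r)} \right) dz_r \right] \\
&= \mathbb{E}_{x \sim p(x)} \left[ \int q_\phi(z_r \mid x) \left( \log \frac{q_\phi(z_r \mid x)}{p(z_r)} - \log \frac{q_\phi(z_r)}{p(z_r)} \right) dz_r \right].
\end{aligned} \tag{10}
\]
Using the linearity of the expectation, we separate the integral into two terms:
\[
I(X; Z_r) = \mathbb{E}_{x \sim p(x)} \left[ \int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r \mid x)}{p(z_r)} \, dz_r \right] - \mathbb{E}_{x \sim p(x)} \left[ \int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r)}{p(z_r)} \, dz_r \right]. \tag{11}
\]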
The first term is exactly the expected KL divergence between the posterior and the prior. For the second term, we observe that $\log \frac{q_\phi(z_r)}{p(z_r)}$ does not depend on $x$ directly, other than through the integration of the joint density. We can simplify the expectation over $x$:
\[
\begin{aligned}
\mathbb{E}_{x \sim p(x)} \left[ \int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r)}{p(z_r)} \, dz_r \right]
&= \int \left( \int p(x)\, q_\phi(z_r \mid x) \, dx \right) \log \frac{q_\phi(z_r)}{p(z_r)} \, dz_r \\
&= \int q_\phi(z_r) \log \frac{q_\phi(z_r)}{p(z_r)} \, dz_r \\
&= D_{\mathrm{KL}}\big(q_\phi(Z_r) \,\|\, p(Z_r)\big).
\end{aligned} \tag{12}
\]
Substituting this back into the expression for mutual information yields the following decomposition:
\[
I(X; Z_r) = \mathbb{E}_{x \sim p(x)} \left[ D_{\mathrm{KL}}\big(q_\phi(Z_r \mid x) \,\|\, p(Z_r)\big) \right] - D_{\mathrm{KL}}\big(q_\phi(Z_r) \,\|\, p(Z_r)\big). \tag{13}
\]
Since the KL divergence is non-negative (Gibbs’ inequality), i.e., $D_{\mathrm{KL}}(q_\phi(Z_r) \,\|\, p(Z_r)) \ge 0$, it follows that:
\[
I(X; Z_r) \le \mathbb{E}_{x \sim p(x)} \left[ D_{\mathrm{KL}}\big(q_\phi(Z_r \mid x) \,\|\, p(Z_r)\big) \right]. \tag{14}
\]
This completes the proof.
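The decomposition in Eq. (13) and the bound in Eq. (14) can be checked numerically on a small discrete example. The sketch below uses a toy two-state input, encoder, and prior chosen purely for illustration; none of these values come from the paper, and integrals become finite sums.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p_x = [0.5, 0.5]                        # toy input distribution p(x)
q_z_given_x = [[0.9, 0.1], [0.2, 0.8]]  # toy encoder q(z|x), one row per x
prior = [0.5, 0.5]                      # arbitrary fixed prior p(z)

# Aggregate posterior q(z) = sum_x p(x) q(z|x).
q_z = [sum(p_x[i] * q_z_given_x[i][k] for i in range(2)) for k in range(2)]

# I(X; Z) as the expected KL between q(z|x) and the aggregate posterior.
mutual_info = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(2))

# The two terms on the right-hand side of Eq. (13).
expected_kl = sum(p_x[i] * kl(q_z_given_x[i], prior) for i in range(2))
marginal_kl = kl(q_z, prior)

# Eq. (13): exact decomposition; Eq. (14): upper bound.
assert abs(mutual_info - (expected_kl - marginal_kl)) < 1e-12
assert mutual_info <= expected_kl + 1e-12
```

The gap between the bound and the true mutual information is exactly `marginal_kl`, so the bound is tight only when the prior equals the aggregate posterior.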
B.2. Detailed Derivation of $\mathcal{L}_{\text{compress}}$
We derive the analytic form of the compression regularizer used in Eq. (4). At each time step $t \in \{1, \dots, T\}$, we assume a diagonal-covariance Gaussian variational posterior
\[
q_\phi(z_{r,t} \mid x) = \mathcal{N}\big(\mu_t, \operatorname{diag}(\sigma_t^2)\big), \tag{15}
\]
and a standard Gaussian prior
\[
p(z_{r,t}) = \mathcal{N}(0, I), \tag{16}
\]
where $\mu_t \in \mathbb{R}^d$ and $\sigma_t^2 \in \mathbb{R}^d_{>0}$.
We regularize the information capacity by minimizing the average KL divergence across the sequence:
\[
\mathcal{L}_{\text{compress}} = \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(q_\phi(z_{r,t} \mid x) \,\|\, p(z_{r,t})\big). \tag{17}
\]
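Thus, it suffices to derive a closed form for $D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big)$.
For $q = \mathcal{N}(\mu_q, \Sigma_q)$ and $p = \mathcal{N}(\mu_p, \Sigma_p)$ in $\mathbb{R}^d$, the KL divergence admits the well-known closed form:
\[
D_{\mathrm{KL}}(q \,\|\, p) = \frac{1}{2} \left[ \log \frac{|\Sigma_p|}{|\Sigma_q|} - d + \operatorname{tr}\big(\Sigma_p^{-1} \Sigma_q\big) + (\mu_p - \mu_q)^\top \Sigma_p^{-1} (\mu_p - \mu_q) \right]. \tag{18}
\]
In our case, $\mu_q = \mu_t$, $\Sigma_q = \operatorname{diag}(\sigma_t^2)$, $\mu_p = 0$, and $\Sigma_p = I$.
Since $\Sigma_p = I$, we have $|\Sigma_p| = 1$ and $\Sigma_p^{-1} = I$. Plugging into Eq. (18) yields
\[
D_{\mathrm{KL}}(q \,\|\, p) = \frac{1}{2} \left[ \log \frac{1}{|\Sigma_q|} - d + \operatorname{tr}(\Sigma_q) + \mu_t^\top \mu_t \right]. \tag{19}
\]
Because $\Sigma_q = \operatorname{diag}(\sigma_t^2)$ is diagonal,
\begin{align}
|\Sigma_q| &= \prod_{j=1}^{d} \sigma_{t,j}^2, \tag{20} \\
\log |\Sigma_q| &= \sum_{j=1}^{d} \log \sigma_{t,j}^2, \tag{21} \\
\operatorname{tr}(\Sigma_q) &= \sum_{j=1}^{d} \sigma_{t,j}^2, \tag{22} \\
\mu_t^\top \mu_t &= \sum_{j=1}^{d} \mu_{t,j}^2. \tag{23}
\end{align}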
Substituting these into Eq. (19), we obtain
\[
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_{t,j}^2 + \mu_{t,j}^2 - 1 - \log \sigma_{t,j}^2 \right). \tag{24}
\]
Rearranging Eq. (24) gives the equivalent expression
\[
D_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_{t,j}^2 - \mu_{t,j}^2 - \sigma_{t,j}^2 \right). \tag{25}
\]
Finally, averaging the KL divergence across the sequence as defined in Eq. (17), we obtain
\begin{align}
\mathcal{L}_{\text{compress}} &= \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big) \tag{26} \\
&= \frac{1}{2T} \sum_{t=1}^{T} \sum_{j=1}^{d} \left( \sigma_{t,j}^2 + \mu_{t,j}^2 - 1 - \log \sigma_{t,j}^2 \right) \tag{27} \\
&= -\frac{1}{2T} \sum_{t=1}^{T} \sum_{j=1}^{d} \left( 1 + \log \sigma_{t,j}^2 - \mu_{t,j}^2 - \sigma_{t,j}^2 \right), \tag{28}
\end{align}
which matches Eq. (4).
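The closed form in Eqs. (25) and (28) translates directly into a few lines of code. The sketch below is a plain-Python illustration, not the authors' implementation; the function names are hypothetical, and parameterizing the variance by its logarithm is an assumption borrowed from common variational practice.

```python
import math


def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one time step, Eq. (25).

    mu and log_var are equal-length sequences; log_var[j] = log sigma_j^2.
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def compress_loss(mus, log_vars):
    """Average the per-step KL over the T time steps, Eq. (28).

    mus and log_vars are sequences of length T, one vector per time step.
    """
    num_steps = len(mus)
    total = sum(kl_diag_gaussian_to_standard(m, lv)
                for m, lv in zip(mus, log_vars))
    return total / num_steps
```

With zero mean and unit variance the posterior equals the prior and the loss is zero; shifting one mean coordinate to 1 contributes 0.5 to that step's KL, as Eq. (24) predicts.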
B.3. Proof of the Variational Lower Bound on $I(U; V)$
Proof. By definition, the mutual information $I(U; V)$ can be expressed as the difference between the marginal entropy of $U$ and the conditional entropy of $U$ given $V$:
\[
I(U; V) = H(U) - H(U \mid V). \tag{29}
\]
The conditional entropy is defined as the expectation of the negative log-probability of the true conditional distribution $p(u \mid v)$:
\[
H(U \mid V) = \mathbb{E}_{(u,v) \sim p(u,v)} \left[ -\log p(u \mid v) \right]. \tag{30}
\]
Substituting this back into the expression for mutual information yields:
\[
I(U; V) = H(U) + \mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log p(u \mid v) \right]. \tag{31}
\]
Since the true conditional distribution $p(u \mid v)$ is often unknown or intractable, we introduce a variational approximation $q_\theta(u \mid v)$. We consider the expected Kullback-Leibler (KL) divergence between the true conditional distribution and the variational approximation:
\[
\mathbb{E}_{v \sim p(v)} \left[ D_{\mathrm{KL}}\big(p(U \mid v) \,\|\, q_\theta(U \mid v)\big) \right] = \mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log \frac{p(u \mid v)}{q_\theta(u \mid v)} \right]. \tag{32}
\]
By the non-negativity of the KL divergence, we have:
\[
\mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log p(u \mid v) - \log q_\theta(u \mid v) \right] \ge 0, \tag{33}
\]
which implies:
\[
\mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log p(u \mid v) \right] \ge \mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log q_\theta(u \mid v) \right]. \tag{34}
\]
Finally, substituting the inequality (34) into (31), we obtain the lower bound:
\[
\begin{aligned}
I(U; V) &= H(U) + \mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log p(u \mid v) \right] \\
&\ge H(U) + \mathbb{E}_{(u,v) \sim p(u,v)} \left[ \log q_\theta(u \mid v) \right].
\end{aligned} \tag{35}
\]
Consequently, maximizing the expected log-likelihood of the variational distribution $q_\theta(u \mid v)$ maximizes the lower bound of the mutual information $I(U; V)$.
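The bound in Eq. (35) can be illustrated on a toy discrete joint distribution. The numbers below are made up for the example and do not come from the paper; the point is only that the bound is tight when $q_\theta(u \mid v)$ equals the true conditional and strictly looser otherwise.

```python
import math

# Toy joint distribution p(u, v) over binary U and V (rows: u, columns: v).
p_uv = [[0.4, 0.1],
        [0.1, 0.4]]
p_u = [sum(row) for row in p_uv]
p_v = [p_uv[0][v] + p_uv[1][v] for v in range(2)]

# Marginal entropy H(U) and exact mutual information I(U; V).
entropy_u = -sum(pu * math.log(pu) for pu in p_u)
mutual_info = sum(p_uv[u][v] * math.log(p_uv[u][v] / (p_u[u] * p_v[v]))
                  for u in range(2) for v in range(2))


def lower_bound(q_u_given_v):
    """H(U) + E_{p(u,v)}[log q(u|v)], with q_u_given_v[v][u] = q(u|v)."""
    expected_log_q = sum(p_uv[u][v] * math.log(q_u_given_v[v][u])
                         for u in range(2) for v in range(2))
    return entropy_u + expected_log_q


# True conditional p(u|v) and a deliberately imperfect approximation.
exact_q = [[p_uv[u][v] / p_v[v] for u in range(2)] for v in range(2)]
crude_q = [[0.6, 0.4], [0.4, 0.6]]

assert abs(lower_bound(exact_q) - mutual_info) < 1e-9   # bound is tight
assert lower_bound(crude_q) < mutual_info               # bound is loose
```

Improving the approximation (raising the expected log-likelihood) moves `lower_bound` up toward `mutual_info`, which is the behaviour the final sentence of the proof relies on.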
B.4. Relationship Between Mutual Information and Cross-Covariance
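Proof. Let $(Z_r, Z_b)$ be jointly Gaussian with mean $0$ (without loss of generality, since mutual information is invariant under translations) and block covariance
\[
\Sigma = \begin{pmatrix} \Sigma_r & \Sigma_{rb} \\ \Sigma_{br} & \Sigma_b \end{pmatrix}, \qquad \Sigma_{br} = \Sigma_{rb}^\top,
\]
where $\Sigma_r \succ 0$ and $\Sigma_b \succ 0$ so that $\Sigma_r^{-1/2}$ and $\Sigma_b^{-1/2}$ are well-defined. Define
\[
C := \Sigma_r^{-1/2} \, \Sigma_{rb} \, \Sigma_b^{-1/2}.
\]
Step 1: Mutual information for Gaussians from the definition. By definition,
\[
I(Z_r; Z_b) = \mathbb{E} \left[ \log \frac{p_{Z_r, Z_b}(Z_r, Z_b)}{p_{Z_r}(Z_r) \, p_{Z_b}(Z_b)} \right].
\]
For a centered $d$-dimensional Gaussian $X \sim \mathcal{N}(0, \Sigma_X)$ with $\Sigma_X \succ 0$, its density is
\[
p_X(x) = (2\pi)^{-d/2} (\det \Sigma_X)^{-1/2} \exp\left( -\tfrac{1}{2} x^\top \Sigma_X^{-1} x \right).
\]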
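Applying this to $(Z_r, Z_b)$ and to the marginals $Z_r$ and $Z_b$, we obtain
\[
\begin{aligned}
\log \frac{p_{Z_r, Z_b}(z_r, z_b)}{p_{Z_r}(z_r) \, p_{Z_b}(z_b)} = {} & -\tfrac{1}{2} \log \det \Sigma + \tfrac{1}{2} \log \det \Sigma_r + \tfrac{1}{2} \log \det \Sigma_b \\
& - \tfrac{1}{2} \begin{pmatrix} z_r \\ z_b \end{pmatrix}^{\!\top} \Sigma^{-1} \begin{pmatrix} z_r \\ z_b \end{pmatrix} + \tfrac{1}{2} z_r^\top \Sigma_r^{-1} z_r + \tfrac{1}{2} z_b^\top \Sigma_b^{-1} z_b.
\end{aligned}
\]
Taking expectation under the joint law of $(Z_r, Z_b)$ yields
\[
\begin{aligned}
I(Z_r; Z_b) = {} & -\tfrac{1}{2} \log \det \Sigma + \tfrac{1}{2} \log \det \Sigma_r + \tfrac{1}{2} \log \det \Sigma_b \\
& - \tfrac{1}{2} \, \mathbb{E} \left[ \begin{pmatrix} Z_r \\ Z_b \end{pmatrix}^{\!\top} \Sigma^{-1} \begin{pmatrix} Z_r \\ Z_b \end{pmatrix} \right] + \tfrac{1}{2} \, \mathbb{E}\big[ Z_r^\top \Sigma_r^{-1} Z_r \big] + \tfrac{1}{2} \, \mathbb{E}\big[ Z_b^\top \Sigma_b^{-1} Z_b \big].
\end{aligned}
\]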
Using the identity E[X⊤AX] = tr(A Cov(X)) for any centered random vector X with finite second moment and any
matrix A of compatible size, we get
$$\mathbb{E}\left[\begin{pmatrix} Z_r \\ Z_b \end{pmatrix}^{\top}\Sigma^{-1}\begin{pmatrix} Z_r \\ Z_b \end{pmatrix}\right] = \operatorname{tr}(\Sigma^{-1}\Sigma) = \operatorname{tr}(I) = d_r + d_b,$$
$$\mathbb{E}\left[Z_r^\top\Sigma_r^{-1}Z_r\right] = \operatorname{tr}(\Sigma_r^{-1}\Sigma_r) = d_r, \qquad \mathbb{E}\left[Z_b^\top\Sigma_b^{-1}Z_b\right] = \operatorname{tr}(\Sigma_b^{-1}\Sigma_b) = d_b,$$
so the quadratic terms cancel. Hence
$$I(Z_r; Z_b) = \tfrac{1}{2}\log\frac{\det\Sigma_r\,\det\Sigma_b}{\det\Sigma}. \tag{36}$$
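Step 2: Expressing $\det\Sigma$ in terms of $C$. By the block determinant (Schur complement) formula with $\Sigma_b \succ 0$,
$$\det\Sigma = \det(\Sigma_b)\,\det\left(\Sigma_r - \Sigma_{rb}\Sigma_b^{-1}\Sigma_{br}\right).$$
Next,
$$\begin{aligned}
\Sigma_r - \Sigma_{rb}\Sigma_b^{-1}\Sigma_{br} &= \Sigma_r^{1/2}\left(I - \Sigma_r^{-1/2}\Sigma_{rb}\Sigma_b^{-1}\Sigma_{br}\Sigma_r^{-1/2}\right)\Sigma_r^{1/2} \\
&= \Sigma_r^{1/2}\left(I - \Sigma_r^{-1/2}\Sigma_{rb}\Sigma_b^{-1/2}\,\Sigma_b^{-1/2}\Sigma_{br}\Sigma_r^{-1/2}\right)\Sigma_r^{1/2} \\
&= \Sigma_r^{1/2}\left(I - CC^\top\right)\Sigma_r^{1/2}.
\end{aligned}$$
Therefore,
$$\det\left(\Sigma_r - \Sigma_{rb}\Sigma_b^{-1}\Sigma_{br}\right) = \det(\Sigma_r)\,\det(I - CC^\top),$$
and thus
$$\det\Sigma = \det(\Sigma_r)\,\det(\Sigma_b)\,\det(I - CC^\top).$$
Substituting into (36) gives
$$I(Z_r; Z_b) = -\tfrac{1}{2}\log\det(I - CC^\top). \tag{37}$$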
Step 3: Second-order expansion for small $\|C\|_2$. Let $A := CC^\top$. Then $A \succeq 0$ and $\|A\|_2 = \|CC^\top\|_2 = \|C\|_2^2$. Assume
$\|C\|_2$ is sufficiently small so that $\|A\|_2 < 1$. In this regime, the matrix power series for the principal logarithm holds:
$$\log(I - A) = -\sum_{k=1}^{\infty}\frac{A^k}{k} \qquad (\text{convergent in operator norm since } \|A\|_2 < 1).$$
Taking traces and using (37) together with $\log\det(I - A) = \operatorname{tr}(\log(I - A))$ yields
$$I(Z_r; Z_b) = \tfrac{1}{2}\sum_{k=1}^{\infty}\frac{\operatorname{tr}(A^k)}{k}. \tag{38}$$
The leading term is $\tfrac{1}{2}\operatorname{tr}(A) = \tfrac{1}{2}\operatorname{tr}(CC^\top) = \tfrac{1}{2}\|C\|_F^2$.
It remains to show that the remainder is $o(\|C\|_F^2)$ as $\|C\|_2 \to 0$. Since $A \succeq 0$ has eigenvalues $\{\lambda_i\}_{i=1}^{m}$ (with $m = \operatorname{rank}(A)$)
in $[0, \|A\|_2]$, we have for every $k \ge 2$,
$$\operatorname{tr}(A^k) = \sum_{i=1}^{m}\lambda_i^k \le \left(\max_i \lambda_i\right)^{k-1}\sum_{i=1}^{m}\lambda_i = \|A\|_2^{k-1}\operatorname{tr}(A).$$
Therefore, the tail of (38) satisfies
$$0 \le \sum_{k=2}^{\infty}\frac{\operatorname{tr}(A^k)}{k} \le \operatorname{tr}(A)\sum_{k=2}^{\infty}\frac{\|A\|_2^{k-1}}{k} \le \operatorname{tr}(A)\sum_{k=2}^{\infty}\|A\|_2^{k-1} = \operatorname{tr}(A)\,\frac{\|A\|_2}{1 - \|A\|_2}.$$
Since $\|A\|_2 = \|C\|_2^2 \to 0$, we have $\frac{\|A\|_2}{1 - \|A\|_2} \to 0$, and hence
$$\sum_{k=2}^{\infty}\frac{\operatorname{tr}(A^k)}{k} = o\big(\operatorname{tr}(A)\big) = o\big(\|C\|_F^2\big) \quad \text{as } \|C\|_2 \to 0.$$
Combining this with (38) yields
$$I(Z_r; Z_b) = \tfrac{1}{2}\|C\|_F^2 + o\big(\|C\|_F^2\big), \quad \text{as } \|C\|_2 \to 0,$$
which is the desired second-order expansion.
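The expansion can be checked numerically. The following is a minimal sketch, assuming NumPy is available and, purely for simplicity, identity marginal covariances so that C equals the cross-covariance block; the helper names and the test matrix are illustrative, not part of the paper.

```python
import numpy as np

def inv_sqrt(m):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v / np.sqrt(w)) @ v.T

def gaussian_mi(sigma, d_r):
    """Exact I(Z_r; Z_b) = 0.5 * log(det S_r * det S_b / det S), as in (36)."""
    _, ld = np.linalg.slogdet(sigma)
    _, ld_r = np.linalg.slogdet(sigma[:d_r, :d_r])
    _, ld_b = np.linalg.slogdet(sigma[d_r:, d_r:])
    return 0.5 * (ld_r + ld_b - ld)

def whitened_cross_cov(sigma, d_r):
    """C = S_r^{-1/2} S_rb S_b^{-1/2}."""
    s_r, s_b, s_rb = sigma[:d_r, :d_r], sigma[d_r:, d_r:], sigma[:d_r, d_r:]
    return inv_sqrt(s_r) @ s_rb @ inv_sqrt(s_b)

rng = np.random.default_rng(0)
d_r, d_b = 3, 2
direction = rng.standard_normal((d_r, d_b))
direction /= np.linalg.norm(direction, 2)  # scale to unit spectral norm

results = []
for eps in (0.3, 0.1, 0.03):
    sigma = np.eye(d_r + d_b)
    sigma[:d_r, d_r:] = eps * direction    # cross-covariance block S_rb
    sigma[d_r:, :d_r] = eps * direction.T
    c = whitened_cross_cov(sigma, d_r)
    exact = gaussian_mi(sigma, d_r)
    approx = 0.5 * np.linalg.norm(c, "fro") ** 2
    results.append((eps, exact, approx))
    print(f"eps={eps:<5} exact={exact:.6f}  0.5*||C||_F^2={approx:.6f}")
```

The relative gap between the exact value and the quadratic term shrinks as the spectral norm of C decreases, consistent with the o(‖C‖_F²) remainder.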
C. Detailed Experimental Settings
C.1. Evaluation Benchmarks
To evaluate the efficacy of LLM-as-a-judge frameworks and monitor the preservation of core English language capabilities,
we utilize RewardBench (Lambert et al., 2024). This benchmark comprises approximately 3,000 pairwise comparisons
across four primary dimensions: Chat, Chat Hard, Reasoning, and Safety. For the assessment of multilingual performance,
we incorporate the following benchmarks:
• M-RewardBench (Gureja et al., 2025): A multilingual adaptation of RewardBench covering 23 languages through
expert-verified translations.
• MM-Eval (Son et al., 2024): A diverse suite encompassing 18 languages. Unlike translated benchmarks, MM-Eval
prioritizes native-speaker data and includes specialized subsets such as Linguistics (e.g., homophone disambiguation)
and Language Hallucination (e.g., evaluating unintended code-switching).
Metrics. Accuracy serves as our primary evaluation metric. For RewardBench, we report the arithmetic mean across the
four category scores. For multilingual benchmarks, we compute the micro-average accuracy per language and subsequently
report the macro-average across all supported languages.
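The two-level averaging described above can be sketched as follows. The record layout (a language code paired with a correctness flag) and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict

def macro_over_languages(records):
    """records: iterable of (language, is_correct) pairs, one per pairwise comparison."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, is_correct in records:
        totals[lang] += 1
        hits[lang] += int(is_correct)
    # Micro-average within each language: pooled accuracy over that language's samples.
    per_language = {lang: hits[lang] / totals[lang] for lang in totals}
    # Macro-average across languages: every language counts equally.
    macro = sum(per_language.values()) / len(per_language)
    return per_language, macro

toy = [("de", True), ("de", True), ("de", False), ("de", True),
       ("sw", True), ("sw", False)]
per_language, macro = macro_over_languages(toy)
pooled = sum(ok for _, ok in toy) / len(toy)
print(per_language, macro, pooled)
```

In the toy data the macro-average differs from the pooled accuracy because the language with fewer samples receives the same weight as the larger one.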
C.2. Training Settings
Implementation Details. All experiments were conducted on a single node equipped with 8× NVIDIA H20 (96GB)
GPUs. To ensure training stability and memory efficiency, we utilized DeepSpeed (Rasley et al., 2020) ZeRO Stage 3 with
CPU offloading and leveraged FlashAttention-2 (Dao, 2024) for accelerated computation. Optimization was performed
using the Adam optimizer (Kingma, 2014).
Model Architecture. We employed Supervised Fine-Tuning (SFT) combined with Low-Rank Adaptation (LoRA) (Hu
et al., 2022), specifically targeting the attention linear projections. To generate robust and bias-aware representations, we
utilized the Qwen3-0.6B-Embedding model (Yang et al., 2025) as an encoder. This encoder shares the same architecture as
the LLM judge and processes features via separate one-layer MLP heads. The proxy task decoder is implemented using a
linear projection layer.
Training Procedure. We adopted a two-stage training strategy to accommodate the variational framework. In the first
stage, we froze the LLM judge parameters and trained only the projection layer to align the feature space. In the second stage,
we jointly fine-tuned the projection layer and the LoRA modules to enhance compression. Additionally, we implemented
dynamic loss scheduling to facilitate effective multi-task learning.
Hyperparameters. We set the maximum sequence length to 16,384. All models were trained for 3 epochs using a cosine
learning rate scheduler with a warmup ratio of 0.1. The global learning rate was set to 1 × 10−4 for the LoRA modules, bias
heads, and the proxy task decoder. We used a per-device training batch size of 1 with 8 gradient accumulation steps. These
settings are summarized in Table 6.
Table 6. Hyperparameter settings for the training experiments.
| Hyperparameter | Value |
| --- | --- |
| Base Model | Qwen3 Family (Yang et al., 2025) |
| Optimizer | Adam |
| Learning Rate | 1 × 10−4 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Max Sequence Length | 16,384 |
| Batch Size (per GPU) | 1 |
| Gradient Accumulation Steps | 8 |
| Epochs | 3 |
| Hardware | 8× NVIDIA H20 (96GB) |
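To make the schedule concrete, the sketch below derives the effective batch size and a learning-rate curve from the listed values. It assumes all eight GPUs act as data-parallel replicas and that the cosine scheduler uses linear warmup followed by decay to zero; neither detail is stated in the source, and `lr_at` is a hypothetical helper rather than the authors' code.

```python
import math

per_device_batch = 1
grad_accum_steps = 8
num_gpus = 8  # assumption: pure data parallelism across all 8 GPUs
effective_batch = per_device_batch * grad_accum_steps * num_gpus

def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print("effective batch size:", effective_batch)
for step in (0, 99, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```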
D. Comprehensive Results of Reward Modeling Benchmarks
We present the fine-grained performance analysis across all evaluated benchmarks in the following sections. Detailed results
for the five core subsets of MM-Eval (Son et al., 2024) are summarized in Table 7, while the language-specific performance
breakdowns are distributed across Tables 8 and 9.
For M-RewardBench (Gureja et al., 2025), comprehensive category-wise metrics are provided in Table 10, with language-
level results detailed in Tables 11 and 12.
Finally, the per-category accuracy for the original RewardBench (Lambert et al., 2024) is reported in Table 13.
E. Comprehensive Results of Translationese Evaluation Suites
Detailed translationese bias results under perturbed settings, on evaluation suites adapted from the Aya (Singh et al.,
2024), Belebele (Bandarkar et al., 2024), and XL-Sum (Hasan et al., 2021) datasets, are provided in Tables 14, 15, and 16,
respectively.
F. Additional Experiments
F.1. Ablation Studies on Spurious Proxy Tasks
In § 2, we identified spurious factors contributing to the systematic bias towards translationese. To mitigate this, we proposed
two proxy tasks in § 3: (i) Cross-Lingual Alignment (CLA), utilizing InfoNCE to align learned representations with a
back-translation manifold; and (ii) Log-Probability Bin Classification (LPBC), which encodes predictive confidence by
classifying representations into discrete log-probability bins.
In this section, we conduct a comprehensive ablation study to validate the effectiveness of these components. We examine
the contribution of each proxy task to bias mitigation, analyze the impact of the back-translation system on the CLA task,
and evaluate the robustness of different heuristic signals for the LPBC task.
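The two proxy objectives can be illustrated with a short NumPy sketch. It assumes in-batch negatives and cosine similarity for InfoNCE, and equal-frequency bin edges for the log-probability labels; the temperature, the number of bins, and both function names are hypothetical choices, not the paper's settings.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Batch InfoNCE: row i of `positives` is the positive for row i of `anchors`;
    every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def log_prob_bins(mean_log_probs, num_bins=4):
    """Assign each sample a bin label using equal-frequency (quantile) edges."""
    inner_quantiles = np.linspace(0, 1, num_bins + 1)[1:-1]
    edges = np.quantile(mean_log_probs, inner_quantiles)
    return np.digitize(mean_log_probs, edges)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 16)))
shuffled = info_nce(z, rng.standard_normal((8, 16)))
labels = log_prob_bins(np.array([-3.2, -0.4, -1.1, -2.5, -0.9, -1.8, -0.2, -2.9]))
print(aligned, shuffled, labels)
```

In a CLA-style setup the positives would be embeddings of the back-translated counterparts, and in an LPBC-style setup the bin labels would serve as classification targets for the bias head.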
Table 7. Full detailed results by category of MM-Eval. Bold indicates the best performance, and underlined indicates the second-best.
| Model | Chat (Accuracy) | Lang. Hallu. (Accuracy) | Linguistics (Accuracy) | Reasoning (Accuracy) | Safety (Accuracy) | Avg. (Avg. 18 lang) |
| --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | |
| GPT-4o | 84.20 | 65.40 | 79.15 | 55.30 | 75.20 | 71.85 ± 0.81 |
| Gemini-2.5-Flash | 88.50 | 70.10 | 82.45 | 63.80 | 82.50 | 77.47 ± 0.76 |
| General Open Models | | | | | | |
| Qwen2.5-3B-Instruct | 66.50 | 52.30 | 58.15 | 40.80 | 72.20 | 57.99 ± 1.20 |
| Qwen2.5-7B-Instruct | 76.20 | 60.50 | 68.40 | 48.90 | 74.20 | 65.64 ± 0.95 |
| Qwen3-4B | 90.46 | 67.34 | 84.00 | 84.35 | 76.56 | 80.85 ± 0.68 |
| Qwen3-8B | 91.17 | 67.79 | 83.78 | 80.31 | 85.87 | 82.20 ± 0.60 |
| Multilingual Open Reward Models | | | | | | |
| Nemotron-Multi-49B | 91.47 | 68.92 | 87.56 | 38.29 | 95.59 | 76.31 ± 0.55 |
| M-PROMETHEUS 3B | 68.20 | 58.40 | 62.10 | 50.85 | 81.30 | 64.17 ± 1.10 |
| M-PROMETHEUS 7B | 62.61 | 61.55 | 61.33 | 63.50 | 91.37 | 69.38 ± 0.88 |
| mR3-Qwen3-4B | 90.05 | 69.14 | 83.56 | 81.62 | 90.69 | 82.55 ± 0.52 |
| mR3-Qwen3-8B | 92.28 | 67.34 | 84.89 | 87.20 | 92.52 | 85.29 ± 0.45 |
| Think-as-Locals 7B | 88.98 | 65.54 | 80.67 | 58.53 | 70.49 | 72.95 ± 0.70 |
| Ours | | | | | | |
| DIBJudge-Qwen3-4B | 91.05 | 72.50 | 88.10 | 89.45 | 84.70 | 85.16 ± 0.33 |
| DIBJudge-Qwen3-8B | 92.80 | 74.20 | 90.50 | 91.20 | 93.50 | 87.53 ± 0.28‡ |
Effectiveness of Proxy Task Combination. We first evaluate the individual and combined contributions of the CLA and
LPBC tasks. Table 4 summarizes the bias severity scores across different configurations. We observe that while both tasks
individually reduce bias compared to the baseline, the combination of both yields the most significant reduction. This
suggests that the two tasks capture complementary aspects of the spurious features—latent manifold isomorphism and
predictive confidence—thereby providing a more robust signal for bias mitigation.
Impact of Back-Translation Systems. The CLA task relies on back-translated data to approximate the "translationese"
manifold. A critical question is whether the choice of translation system influences the bias mitigation capabilities.
We compared our default system (NLLB-200-3.3B) against a suite of varying architectures, including Gemma-3-4B,
Llama-3.1-8B-Instruct, Qwen3-4B, GPT-4o, Gemini 2.5 Flash, and Google Translate.
As shown in Table 17, while stronger systems (e.g., GPT-4o, Google Translate) achieve higher BLEU scores (> 50), higher
translation quality does not necessarily correlate with lower bias severity in the downstream task. This indicates that the
CLA task is robust to the generator’s quality, provided the generator produces sufficient translationese artifacts to serve as a
negative contrastive pivot.
Heuristic Signals for Bin Classification. Finally, we ablate the heuristic metric used to partition samples for the LPBC
task. While our method uses Negative Log-Likelihood (NLL), we compare this against Type-Token Ratio (TTR) and
Perplexity (PPL).
Table 18 shows that NLL performs best, but only marginally; the differences among the three metrics are negligible. All three
metrics capture the confidence disparity required for the auxiliary task, indicating that our method is largely agnostic
to the specific heuristic used to approximate predictive confidence.
F.2. Sensitivity to Cross-Lingual Alignment Discrepancy (CAD)
To test the hypothesis that conventional automatic judges exhibit an English-anchoring bias—i.e., a preference for translations
that closely mirror the syntactic structure of the English source—we analyze model performance as a function of Cross-
Lingual Alignment Discrepancy (CAD). As introduced in §2, CAD measures the degree of structural divergence between a
candidate translation and its source sentence.
Table 8. Detailed results for MM-Eval for each language (Part 1). Bold indicates the best performance, and underlined indicates the second-best.
| Model | Ar | Bn | Ca | De | En | Es | Eu | Fr | Gl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | |
| GPT-4o | 70.50 | 62.10 | 73.50 | 75.20 | 78.50 | 76.80 | 65.40 | 75.90 | 71.20 |
| Gemini-2.5-Flash | 76.20 | 70.50 | 79.10 | 81.50 | 82.80 | 81.20 | 72.50 | 80.40 | 76.80 |
| General Open Models | | | | | | | | | |
| Qwen2.5-3B-Instruct | 55.40 | 42.10 | 60.50 | 63.80 | 68.50 | 64.20 | 48.50 | 62.10 | 56.80 |
| Qwen2.5-7B-Instruct | 62.80 | 54.50 | 68.20 | 71.50 | 74.80 | 72.10 | 58.40 | 70.50 | 65.20 |
| Qwen3-4B | 78.50 | 74.20 | 82.10 | 83.50 | 85.20 | 84.10 | 76.50 | 83.80 | 80.50 |
| Qwen3-8B | 80.20 | 75.80 | 83.50 | 84.80 | 86.50 | 85.40 | 78.10 | 84.90 | 81.80 |
| Multilingual Open Reward Models | | | | | | | | | |
| Nemotron-Multi-49B | 74.50 | 68.20 | 78.50 | 80.20 | 82.50 | 80.80 | 70.50 | 79.40 | 75.10 |
| M-PROMETHEUS 3B | 61.20 | 52.50 | 66.80 | 69.50 | 72.40 | 70.10 | 56.80 | 68.50 | 63.20 |
| M-PROMETHEUS 7B | 66.50 | 59.80 | 71.50 | 74.20 | 76.80 | 74.50 | 62.50 | 73.10 | 68.40 |
| mR3-Qwen3-4B | 81.50 | 76.50 | 83.80 | 85.50 | 87.20 | 86.10 | 78.50 | 85.80 | 82.20 |
| mR3-Qwen3-8B | 84.20 | 79.50 | 86.50 | 88.10 | 89.50 | 88.80 | 81.20 | 87.50 | 85.10 |
| Think-as-Locals-7B | 71.50 | 64.80 | 75.20 | 77.50 | 80.10 | 78.40 | 66.50 | 76.80 | 72.50 |
| Ours | | | | | | | | | |
| DIBJudge-Qwen3-4B | 83.80 | 79.20 | 86.10 | 87.80 | 89.80 | 88.50 | 81.50 | 87.20 | 84.80 |
| DIBJudge-Qwen3-8B | 86.50 | 82.10 | 88.50 | 90.20 | 91.50 | 90.80 | 84.20 | 89.50 | 87.10 |
Table 9. Detailed results for MM-Eval for each language (Part 2). Bold indicates the best performance, and underlined indicates the second-best.
| Model | It | Ja | Ko | Ru | Sw | Te | Th | Vn | Zh |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | |
| GPT-4o | 75.80 | 74.50 | 73.20 | 72.50 | 60.50 | 58.20 | 68.50 | 70.80 | 77.50 |
| Gemini-2.5-Flash | 80.50 | 79.80 | 78.50 | 78.10 | 68.50 | 66.20 | 75.40 | 76.80 | 82.50 |
| General Open Models | | | | | | | | | |
| Qwen2.5-3B-Instruct | 63.50 | 58.20 | 56.50 | 55.80 | 40.50 | 38.20 | 52.10 | 54.50 | 67.80 |
| Qwen2.5-7B-Instruct | 71.20 | 66.50 | 64.80 | 63.50 | 52.80 | 49.50 | 60.20 | 62.80 | 74.50 |
| Qwen3-4B | 84.50 | 81.20 | 80.50 | 79.80 | 72.50 | 70.20 | 77.80 | 79.50 | 84.80 |
| Qwen3-8B | 85.80 | 82.50 | 81.80 | 81.20 | 74.20 | 71.80 | 79.10 | 80.80 | 86.20 |
| Multilingual Open Reward Models | | | | | | | | | |
| Nemotron-Multi-49B | 79.50 | 76.20 | 75.50 | 77.80 | 66.20 | 64.50 | 73.80 | 75.20 | 81.50 |
| M-PROMETHEUS 3B | 69.20 | 64.50 | 62.80 | 62.10 | 50.50 | 48.20 | 58.50 | 61.20 | 71.50 |
| M-PROMETHEUS 7B | 74.50 | 70.20 | 68.50 | 67.80 | 56.20 | 54.50 | 64.80 | 66.50 | 76.20 |
| mR3-Qwen3-4B | 86.20 | 83.50 | 82.10 | 81.50 | 74.80 | 72.50 | 79.50 | 81.20 | 86.50 |
| mR3-Qwen3-8B | 87.80 | 86.10 | 84.50 | 84.20 | 77.50 | 75.80 | 82.10 | 83.50 | 88.80 |
| Think-as-Locals-7B | 77.20 | 72.50 | 71.80 | 73.50 | 62.80 | 60.50 | 69.20 | 71.50 | 79.80 |
| Ours | | | | | | | | | |
| DIBJudge-Qwen3-4B | 88.50 | 85.80 | 84.20 | 83.50 | 78.20 | 75.50 | 81.50 | 83.20 | 88.50 |
| DIBJudge-Qwen3-8B | 90.20 | 88.50 | 87.20 | 86.80 | 81.50 | 79.20 | 84.50 | 86.20 | 90.50 |
Table 10. Full detailed results by category of M-RewardBench. Bold indicates the best performance, and underlined indicates the
second-best.
| Model | Chat (Accuracy) | Chat Hard (Accuracy) | Safety (Accuracy) | Reasoning (Accuracy) | Average (Avg. 23 lang) |
| --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | |
| GPT-4o | 90.10 | 75.50 | 88.20 | 89.20 | 85.75 ± 0.42 |
| Gemini-2.5-Flash | 93.40 | 80.25 | 87.80 | 90.80 | 88.06 ± 0.49 |
| General Open Models | | | | | |
| Qwen2.5-3B-Instruct | 76.50 | 48.20 | 70.10 | 73.10 | 66.97 ± 1.12 |
| Qwen2.5-7B-Instruct | 86.10 | 61.50 | 78.80 | 85.15 | 77.89 ± 0.89 |
| Qwen3-4B | 89.10 | 72.64 | 85.20 | 93.30 | 85.06 ± 0.65 |
| Qwen3-8B | 91.00 | 73.50 | 86.00 | 93.98 | 86.12 ± 0.52 |
| Multilingual Open Reward Models | | | | | |
| Nemotron-Multi-49B | 92.80 | 79.50 | 87.20 | 95.80 | 88.83 ± 0.35 |
| M-PROMETHEUS 3B | 73.40 | 51.20 | 76.80 | 72.40 | 68.45 ± 0.98 |
| M-PROMETHEUS 7B | 90.50 | 60.50 | 83.00 | 78.12 | 78.03 ± 0.85 |
| mR3-Qwen3-4B | 86.55 | 78.00 | 88.50 | 95.80 | 87.21 ± 0.45 |
| mR3-Qwen3-8B | 87.95 | 80.19 | 89.50 | 96.68 | 88.58 ± 0.41 |
| Think-as-Locals 7B | 91.80 | 69.50 | 83.85 | 92.90 | 84.51 ± 0.60 |
| Ours | | | | | |
| DIBJudge-Qwen3-4B | 93.50 | 82.50 | 88.20 | 95.15 | 89.84 ± 0.28† |
| DIBJudge-Qwen3-8B | 94.60 | 84.80 | 90.10 | 96.00 | 91.37 ± 0.22‡ |
We partition the held-out evaluation set into disjoint CAD bins (e.g., [0.0, 0.1), [0.1, 0.2), . . .) and compute the win rate of
Translationese outputs within each interval. Figure 7 summarizes the resulting trends. Baseline judges exhibit a pronounced
positive correlation between CAD and win rate, indicating that their preferences are increasingly influenced by surface-level
alignment artifacts rather than semantic adequacy. In contrast, DIBJUDGE demonstrates substantially reduced sensitivity
to CAD, as evidenced by a near-flat win-rate profile across bins. This invariance suggests that DIBJUDGE successfully
decouples evaluation quality from structural isomorphism, mitigating spurious biases that confound existing evaluation
approaches.
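The binning procedure described above can be sketched in a few lines. The sample layout (a CAD value paired with a flag for whether the translationese output won) and the toy values are assumptions for illustration; the sketch does not reproduce the paper's data or Figure 7.

```python
from collections import defaultdict

def win_rate_by_cad_bin(samples, bin_width=0.1):
    """samples: iterable of (cad, translationese_won) pairs.
    Returns {bin_index: win_rate}; bin k covers [k*width, (k+1)*width)."""
    wins, totals = defaultdict(int), defaultdict(int)
    for cad, translationese_won in samples:
        # The small epsilon guards against float round-off at bin edges.
        k = int(cad / bin_width + 1e-9)
        totals[k] += 1
        wins[k] += int(translationese_won)
    return {k: wins[k] / totals[k] for k in sorted(totals)}

toy = [(0.02, False), (0.07, True), (0.12, True),
       (0.18, False), (0.25, True), (0.28, True)]
rates = win_rate_by_cad_bin(toy)
print(rates)
```

A judge that is insensitive to CAD would show roughly the same win rate in every bin, whereas a trend across bins signals dependence on structural alignment.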
F.3. Analysis of Distributional Shortcuts
To assess whether our approach mitigates reliance on superficial statistical artifacts, we analyze the bias spectrum shift
induced by different judges using the Sequence Surprisal Ratio (SSR). SSR measures the relative predictability of machine-
generated responses compared to human-written ones, thereby capturing spurious correlations arising from differences in
predictive confidence (e.g., perplexity). Values of SSR close to 1 indicate distributional parity, where model preferences are
not driven by confidence-related artifacts, whereas lower values reflect an over-reliance on highly predictable, low-entropy
text.
We focus on evaluation instances for which the judge selects the machine-generated translation (denoted as Machine Wins)
and examine the empirical distribution of their SSR scores. Figure 8 presents a kernel density estimate (KDE) of these
scores, revealing how different judges respond to distributional discrepancies associated with predictive confidence.
As shown in Figure 8, the baseline judge exhibits a pronounced leftward shift in SSR, with probability mass concentrated at
low values. This behavior indicates a distributional shortcut, wherein outputs with artificially high predictive confidence
are systematically favored, independent of semantic quality. In contrast, DIBJUDGE induces a clear re-centering of the
SSR distribution toward 1. This shift indicates that DIBJUDGE substantially reduces spurious correlations between model
preference and predictive confidence, promoting judgments that are invariant to low-perplexity artifacts and more reflective
of semantic utility.
Table 11. Detailed results for M-RewardBench for each language (Part 1). Bold indicates the best performance, and underlined indicates the second-best.
| Model | Ar | Cs | De | El | Es | Fa | Fr | He | Hi | Id | It | Ja |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | | | | |
| GPT-4o | 84.50 | 85.20 | 87.10 | 83.50 | 88.00 | 82.00 | 87.50 | 83.00 | 84.00 | 86.50 | 87.80 | 86.50 |
| Gemini-2.5-Flash | 87.20 | 87.90 | 89.50 | 86.10 | 89.80 | 85.50 | 89.20 | 86.40 | 87.10 | 88.50 | 89.40 | 88.80 |
| General Open Models | | | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 64.20 | 66.50 | 69.80 | 60.50 | 70.20 | 61.10 | 69.50 | 60.80 | 65.40 | 68.10 | 69.20 | 67.50 |
| Qwen2.5-7B-Instruct | 76.80 | 78.10 | 80.50 | 74.20 | 81.20 | 73.50 | 80.80 | 74.90 | 77.40 | 79.50 | 80.10 | 79.20 |
| Qwen3-4B | 83.84 | 84.67 | 86.75 | 83.07 | 86.49 | 81.51 | 85.21 | 82.07 | 82.42 | 84.64 | 86.48 | 84.37 |
| Qwen3-8B | 85.33 | 87.43 | 88.01 | 84.85 | 87.39 | 85.06 | 87.57 | 84.40 | 85.64 | 86.95 | 87.25 | 85.60 |
| Multilingual Open Reward Models | | | | | | | | | | | | |
| Nemotron-Multi-49B | 88.72 | 89.30 | 89.68 | 89.35 | 89.97 | 88.26 | 90.09 | 88.06 | 88.25 | 89.23 | 89.19 | 89.41 |
| M-PROMETHEUS 3B | 66.50 | 68.20 | 71.40 | 64.10 | 72.50 | 63.80 | 71.10 | 65.20 | 67.50 | 70.20 | 71.80 | 69.50 |
| M-PROMETHEUS 7B | 74.85 | 74.22 | 76.53 | 72.64 | 77.60 | 74.22 | 71.78 | 75.25 | 77.01 | 76.44 | 73.30 | 75.68 |
| mR3-Qwen3-4B | 87.61 | 87.37 | 87.79 | 86.15 | 88.58 | 85.25 | 88.54 | 86.42 | 86.43 | 87.43 | 87.90 | 86.78 |
| mR3-Qwen3-8B | 88.31 | 88.78 | 89.46 | 88.00 | 88.88 | 86.59 | 88.84 | 88.17 | 87.60 | 87.94 | 89.99 | 88.81 |
| Think-as-Locals-7B | 86.15 | 83.29 | 86.31 | 82.26 | 87.37 | 81.31 | 86.91 | 84.17 | 81.33 | 86.60 | 86.63 | 85.03 |
| Ours | | | | | | | | | | | | |
| DIBJudge-Qwen3-4B | 88.50 | 90.15 | 91.20 | 88.05 | 91.50 | 88.45 | 91.80 | 88.50 | 89.10 | 90.50 | 91.50 | 90.20 |
| DIBJudge-Qwen3-8B | 90.50 | 91.80 | 92.80 | 89.80 | 93.10 | 89.90 | 93.50 | 90.10 | 90.80 | 92.20 | 93.00 | 92.00 |
Table 12. Detailed results for M-RewardBench for each language (Part 2). Bold indicates the best performance, and underlined indicates the second-best.
| Model | Ko | Nl | Pl | Pt | Ro | Ru | Tr | Uk | Vi | Zh | Zh-TW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | | | | |
| GPT-4o | 85.50 | 87.50 | 85.00 | 87.80 | 84.50 | 85.50 | 84.20 | 84.00 | 85.00 | 86.50 | 86.00 |
| Gemini-2.5-Flash | 88.20 | 89.80 | 87.40 | 89.60 | 87.10 | 87.80 | 86.90 | 87.20 | 88.50 | 89.50 | 88.90 |
| General Open Models | | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 66.80 | 70.50 | 66.20 | 69.80 | 65.50 | 67.20 | 64.80 | 65.10 | 68.50 | 74.20 | 73.50 |
| Qwen2.5-7B-Instruct | 78.50 | 81.20 | 78.40 | 80.80 | 77.50 | 78.90 | 76.20 | 77.10 | 79.50 | 82.50 | 81.80 |
| Qwen3-4B | 82.77 | 85.89 | 84.58 | 87.39 | 85.29 | 86.06 | 83.83 | 83.80 | 84.76 | 84.82 | 84.88 |
| Qwen3-8B | 83.77 | 87.54 | 86.78 | 87.10 | 87.47 | 87.77 | 85.42 | 86.20 | 86.90 | 87.20 | 86.76 |
| Multilingual Open Reward Models | | | | | | | | | | | |
| Nemotron-Multi-49B | 88.05 | 90.83 | 89.99 | 89.33 | 89.89 | 90.19 | 88.09 | 88.91 | 89.32 | 88.86 | 86.29 |
| M-PROMETHEUS 3B | 67.80 | 71.50 | 68.20 | 72.10 | 66.80 | 68.50 | 65.90 | 66.50 | 69.50 | 72.40 | 71.80 |
| M-PROMETHEUS 7B | 71.96 | 75.48 | 77.59 | 74.00 | 77.21 | 70.17 | 71.57 | 74.91 | 76.45 | 71.16 | 75.99 |
| mR3-Qwen3-4B | 85.66 | 88.42 | 86.77 | 88.05 | 87.62 | 88.22 | 87.17 | 88.01 | 88.08 | 87.38 | 86.28 |
| mR3-Qwen3-8B | 88.47 | 88.99 | 87.33 | 90.56 | 89.30 | 88.84 | 88.77 | 88.16 | 88.89 | 88.36 | 87.95 |
| Think-as-Locals-7B | 83.49 | 86.04 | 85.67 | 86.21 | 84.61 | 85.31 | 83.31 | 83.50 | 86.67 | 85.90 | 85.42 |
| Ours | | | | | | | | | | | |
| DIBJudge-Qwen3-4B | 89.80 | 91.50 | 89.90 | 90.20 | 89.80 | 90.10 | 89.20 | 88.80 | 90.00 | 90.80 | 90.50 |
| DIBJudge-Qwen3-8B | 91.50 | 93.20 | 91.80 | 93.50 | 91.50 | 92.00 | 90.80 | 91.20 | 91.80 | 92.50 | 92.00 |
Table 13. Full detailed results by category of RewardBench (English). Bold indicates the best performance, and underlined indicates the
second-best.
| Model | Chat (Accuracy) | Chat Hard (Accuracy) | Safety (Accuracy) | Reasoning (Accuracy) | Average (English) |
| --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | |
| GPT-4o | 90.50 | 75.10 | 88.50 | 89.74 | 85.96 ± 0.35 |
| Gemini-2.5-Flash | 93.80 | 81.20 | 89.10 | 91.22 | 88.83 ± 0.47 |
| General Open Models | | | | | |
| Qwen2.5-3B-Instruct | 82.50 | 41.50 | 74.50 | 77.46 | 68.99 ± 1.05 |
| Qwen2.5-7B-Instruct | 89.10 | 58.20 | 82.40 | 84.66 | 78.59 ± 0.91 |
| Qwen3-4B | 92.50 | 76.50 | 86.50 | 94.66 | 87.54 ± 0.55 |
| Qwen3-8B | 92.00 | 82.70 | 87.05 | 93.49 | 88.81 ± 0.48 |
| Multilingual Open Reward Models | | | | | |
| Nemotron-Multi-49B | 93.50 | 85.80 | 90.00 | 89.54 | 89.71 ± 0.31 |
| M-PROMETHEUS 3B | 80.50 | 42.10 | 80.50 | 76.06 | 69.79 ± 0.92 |
| M-PROMETHEUS 7B | 90.00 | 53.00 | 84.00 | 79.76 | 76.69 ± 0.78 |
| mR3-Qwen3-4B | 88.90 | 84.10 | 89.50 | 96.50 | 89.75 ± 0.38 |
| mR3-Qwen3-8B | 88.00 | 84.47 | 90.41 | 97.52 | 90.10 ± 0.40 |
| Think-as-Locals 7B | 91.20 | 79.00 | 89.50 | 95.46 | 88.79 ± 0.52 |
| Ours | | | | | |
| DIBJudge-Qwen3-4B | 94.20 | 86.50 | 89.80 | 90.78 | 90.32 ± 0.25 |
| DIBJudge-Qwen3-8B | 95.50 | 88.10 | 90.80 | 89.64 | 91.01 ± 0.20† |
F.4. Ablation Study: Efficacy of Information Bottlenecks
A central hypothesis of our work is that a variational information constraint provides a superior trade-off between semantic
preservation and artifact suppression compared to discrete or deterministic alternatives. To evaluate this, we compare our
framework against three representative bottleneck mechanisms, assessing their impact on the robustness-utility Pareto
frontier:
• Vector Quantization (VQ): We replace the continuous variational information constraint with a discrete codebook
constraint (Van Den Oord et al., 2017), mapping latent representations to the nearest centroid. This imposes a rigid
structural bottleneck.
• Low-Rank Projection (Low-Rank): We utilize a deterministic linear projection to a lower-dimensional subspace.
This variant relies solely on capacity reduction without stochastic regularization.
• Stochastic Noise (Noise): We apply isotropic Gaussian noise injection, z = h+ϵ, as a baseline stochastic regularization
technique, notably lacking a prior distribution constraint.
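The three alternative bottlenecks listed above can be sketched as simple NumPy transforms of a hidden representation `h`. The shapes, the codebook size, the projection matrix `w`, and the noise scale are placeholder assumptions; in practice the codebook and projection would be learned.

```python
import numpy as np

def vector_quantize(h, codebook):
    """VQ: map each row of h to its nearest codebook centroid (Euclidean distance)."""
    dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

def low_rank_project(h, w):
    """Low-Rank: deterministic linear map to a lower-dimensional subspace."""
    return h @ w

def add_noise(h, sigma, rng):
    """Noise: z = h + eps with eps ~ N(0, sigma^2 I); no prior-matching term."""
    return h + sigma * rng.standard_normal(h.shape)

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))          # toy batch of hidden states
codebook = rng.standard_normal((16, 8))  # placeholder centroids
w = rng.standard_normal((8, 2))          # placeholder projection matrix
z_vq = vector_quantize(h, codebook)
z_lr = low_rank_project(h, w)
z_noise = add_noise(h, 0.1, rng)
print(z_vq.shape, z_lr.shape, z_noise.shape)
```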
Analysis. Table 19 demonstrates that while the Low-Rank variant preserves accuracy, it fails to filter translationese
artifacts, suggesting that dimensionality reduction alone cannot achieve effective disentanglement. Conversely, VQ achieves
the lowest bias severity but suffers from utility collapse, where the rigid discrete constraint discards the nuanced semantic
features necessary for reward modeling.
Our variational information constraint bridges this gap by explicitly optimizing the trade-off between I(Z; X) and I(Z; Y ).
By penalizing I(Z; X) via the KL divergence to a prior, the variational information bottleneck (VIB) selectively purges nuisance factors, such as translationese
artifacts, while retaining task-relevant style features. Consequently, VIB dominates the noise and low-rank baselines in
robustness while maintaining competitive utility.
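The compression term can be illustrated with the closed-form KL divergence between a diagonal Gaussian encoder and a standard normal prior. This is a minimal sketch: the diagonal Gaussian posterior, the N(0, I) prior, and the weight `beta` are common VIB choices assumed here, not details confirmed by this section.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims, averaged over the batch."""
    per_sample = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=1)
    return float(per_sample.mean())

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps so that gradients could flow through mu and log_var."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def vib_objective(task_loss, mu, log_var, beta):
    """Task loss plus beta times the compression term (an upper bound on I(Z; X))."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu, log_var = np.ones((2, 3)), np.zeros((2, 3))
z = reparameterize(mu, log_var, rng)
print(kl_to_standard_normal(mu, log_var), vib_objective(0.7, mu, log_var, beta=0.01))
```

Unlike plain noise injection, the KL term pulls the encoder toward the prior, which is what limits how much information about the input the representation can carry.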
Table 14. Bias severity by language on Aya dataset (Singh et al., 2024) under perturbed setting
Language Base Vanilla SFT Vanilla IB DIBJudge
High-Resource
Basque 0.081 ± 0.008 0.088 ± 0.043 0.056 ± 0.011 0.042 ± 0.009
English 0.045 ± 0.008 0.058 ± 0.010 0.041 ± 0.008 0.031 ± 0.005
Finnish 0.125 ± 0.010 0.165 ± 0.028 0.079 ± 0.015 0.058 ± 0.005
Hindi 0.089 ± 0.008 0.076 ± 0.010 0.048 ± 0.007 0.036 ± 0.012
Japanese 0.049 ± 0.011 0.048 ± 0.010 0.034 ± 0.014 0.026 ± 0.015
Portuguese 0.089 ± 0.011 0.084 ± 0.013 0.056 ± 0.011 0.042 ± 0.006
Simplified Chinese 0.072 ± 0.015 0.086 ± 0.013 0.048 ± 0.007 0.035 ± 0.008
Spanish

Chunk 62 · 1,996 chars

0.165 ± 0.028 0.079 ± 0.015 0.058 ± 0.005
Hindi 0.089 ± 0.008 0.076 ± 0.010 0.048 ± 0.007 0.036 ± 0.012
Japanese 0.049 ± 0.011 0.048 ± 0.010 0.034 ± 0.014 0.026 ± 0.015
Portuguese 0.089 ± 0.011 0.084 ± 0.013 0.056 ± 0.011 0.042 ± 0.006
Simplified Chinese 0.072 ± 0.015 0.086 ± 0.013 0.048 ± 0.007 0.035 ± 0.008
Spanish 0.099 ± 0.012 0.128 ± 0.012 0.064 ± 0.013 0.046 ± 0.007
Vietnamese 0.173 ± 0.022 0.192 ± 0.038 0.099 ± 0.007 0.071 ± 0.006
Avg (High) 0.091 ± 0.015 0.103 ± 0.022 0.058 ± 0.020 0.043 ± 0.014
Mid-Resource
Bengali 0.084 ± 0.014 0.140 ± 0.028 0.096 ± 0.026 0.043 ± 0.016
Cebuano 0.108 ± 0.008 0.113 ± 0.027 0.120 ± 0.013 0.052 ± 0.008
Filipino 0.118 ± 0.018 0.122 ± 0.020 0.132 ± 0.015 0.057 ± 0.010
Indonesian 0.064 ± 0.004 0.059 ± 0.015 0.071 ± 0.024 0.032 ± 0.016
Lithuanian 0.166 ± 0.004 0.202 ± 0.019 0.182 ± 0.029 0.082 ± 0.015
Malay 0.086 ± 0.007 0.113 ± 0.027 0.095 ± 0.012 0.044 ± 0.011
Tamil 0.157 ± 0.034 0.192 ± 0.051 0.170 ± 0.021 0.077 ± 0.015
Thai 0.082 ± 0.018 0.112 ± 0.024 0.090 ± 0.023 0.041 ± 0.017
Ukrainian 0.106 ± 0.018 0.133 ± 0.004 0.118 ± 0.013 0.052 ± 0.012
Urdu 0.139 ± 0.023 0.189 ± 0.025 0.155 ± 0.020 0.069 ± 0.018
Avg (Mid) 0.111 ± 0.016 0.138 ± 0.024 0.123 ± 0.037 0.055 ± 0.016
Low-Resource
Amharic 0.376 ± 0.056 0.201 ± 0.021 0.300 ± 0.036 0.225 ± 0.042
Irish 0.350 ± 0.034 0.348 ± 0.013 0.276 ± 0.049 0.208 ± 0.031
Kyrgyz 0.174 ± 0.024 0.227 ± 0.024 0.138 ± 0.043 0.107 ± 0.036
Nepali 0.111 ± 0.004 0.101 ± 0.015 0.089 ± 0.069 0.070 ± 0.034
Malagasy 0.262 ± 0.024 0.178 ± 0.131 0.210 ± 0.049 0.160 ± 0.030
Sinhala 0.254 ± 0.032 0.310 ± 0.101 0.202 ± 0.050 0.154 ± 0.018
Pashto 0.244 ± 0.011 0.235 ± 0.079 0.195 ± 0.029 0.149 ± 0.053
Telugu 0.138 ± 0.022 0.148 ± 0.012 0.111 ± 0.058 0.085 ± 0.037
Yoruba 0.472 ± 0.026 0.453 ± 0.066 0.372 ± 0.029 0.283 ± 0.048
Zulu 0.338 ± 0.016 0.464 ± 0.021 0.266 ± 0.056 0.201 ± 0.044
Avg (Low) 0.272 ± 0.028 0.266 ± 0.048 0.216 ± 0.089 0.164 ± 0.067
F.5. Ablation Study: Disentanglement Mechanisms
We evaluate the efficacy of our proposed cross-covariance penalty against other methods. Our primary hypothesis is that
a cross-covariance-based constraint provides a computationally efficient proxy for independence, effectively minimizing
mutual information without the overhead of complex density estimators.
Alternative Disentanglement Objectives. We compare our cross-covariance approach, Lcov, against three established
baseline objectives:
• Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005): A kernel-based measure of dependence.
While theoretically robust, its O(n^2) complexity per batch is prohibitive for long-context LLM fine-tuning.
• Mutual Information Estimators (CLUB (Cheng et al., 2020)/MINE (Belghazi et al., 2018)): Variational bounds
on Mutual Information (MI); CLUB is an upper bound and MINE a lower bound. Both require auxiliary neural
networks, increasing the parameter search space and training time.
• Orthogonality Constraint (Lorth): A first-order geometric constraint minimizing absolute cosine similarity:
$$\mathcal{L}_{\text{orth}} = \mathbb{E}\left[\frac{|z_r^\top z_b|}{\|z_r\|_2 \, \|z_b\|_2}\right]. \tag{39}$$
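
The objectives compared above can be sketched on toy batches. This is a minimal NumPy illustration, not the authors' implementation: `orth_penalty` follows Eq. (39), `hsic_biased` is the standard biased HSIC estimator with a Gaussian kernel and a median-heuristic bandwidth (the kernel choice is an assumption), and `cross_cov_penalty` assumes that Lcov is the squared Frobenius norm of the batch cross-covariance between zr and zb, since the exact definition is given outside this section. All function names and the synthetic batches are hypothetical.

```python
import numpy as np

def orth_penalty(z_r, z_b, eps=1e-8):
    """Eq. (39): batch mean of |cosine similarity| between paired rows."""
    dots = np.sum(z_r * z_b, axis=1)
    norms = np.linalg.norm(z_r, axis=1) * np.linalg.norm(z_b, axis=1)
    return float(np.mean(np.abs(dots) / (norms + eps)))

def cross_cov_penalty(z_r, z_b):
    """ASSUMED form of L_cov: squared Frobenius norm of the cross-covariance."""
    n = z_r.shape[0]
    zr_c = z_r - z_r.mean(axis=0, keepdims=True)
    zb_c = z_b - z_b.mean(axis=0, keepdims=True)
    cov = zr_c.T @ zb_c / (n - 1)            # (d_r, d_b) matrix
    return float(np.sum(cov ** 2))

def _gaussian_gram(x):
    """n-by-n Gaussian Gram matrix; bandwidth from the median heuristic (assumption)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * np.median(d2) + 1e-12))

def hsic_biased(z_r, z_b):
    """Biased HSIC estimate trace(K H L H) / (n - 1)^2; needs two n-by-n matrices."""
    n = z_r.shape[0]
    h = np.eye(n) - 1.0 / n                  # centering matrix I - (1/n) 11^T
    k, l = _gaussian_gram(z_r), _gaussian_gram(z_b)
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)

# Synthetic batches (hypothetical): one independent pair, one with leakage.
rng = np.random.default_rng(0)
z_r = rng.normal(size=(256, 8))
z_b_indep = rng.normal(size=(256, 8))
z_b_leaky = 0.9 * z_r + 0.1 * rng.normal(size=(256, 8))

for name, fn in [("orth", orth_penalty), ("cov", cross_cov_penalty), ("hsic", hsic_biased)]:
    print(name, round(fn(z_r, z_b_indep), 4), round(fn(z_r, z_b_leaky), 4))
```

Each penalty should come out larger for the leaky pair than for the independent one. The HSIC estimate builds two batch-sized Gram matrices, which is where its quadratic cost in the batch size comes from, while the cross-covariance matrix has a size set by the representation dimensions.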
Table 15. Bias severity by language on Belebele (Bandarkar et al., 2024) under perturbed setting
Language Base Vanilla SFT Vanilla IB DIBJudge
High-Resource
Arabic 0.100 ± 0.029 0.095 ± 0.016 0.058 ± 0.010 0.014 ± 0.006
English 0.072 ± 0.005 0.063 ± 0.004 0.043 ± 0.014 0.011 ± 0.005
Finnish 0.086 ± 0.015 0.067 ± 0.017 0.049 ± 0.023 0.012 ± 0.010
Hindi 0.091 ± 0.011 0.089 ± 0.011 0.052 ± 0.012 0.013 ± 0.011
Japanese 0.074 ± 0.009 0.069 ± 0.012 0.047 ± 0.008 0.012 ± 0.006
Korean 0.084 ± 0.013 0.078 ± 0.018 0.050 ± 0.011 0.013 ± 0.004
Russian 0.065 ± 0.010 0.058 ± 0.010 0.041 ± 0.021 0.010 ± 0.009
Turkish 0.072 ± 0.012 0.070 ± 0.010 0.045 ± 0.017 0.011 ± 0.009
Vietnamese 0.081 ± 0.006 0.080 ± 0.008 0.048 ± 0.013 0.012 ± 0.009
Avg (High) 0.081 ± 0.012 0.074 ± 0.011 0.048 ± 0.005 0.012 ± 0.001
Mid-Resource
Bengali 0.094 ± 0.014 0.091 ± 0.015 0.066 ± 0.021 0.014 ± 0.005
Greek 0.083 ± 0.016 0.075 ± 0.019 0.060 ± 0.013 0.012 ± 0.009
Hebrew 0.088 ± 0.021 0.079 ± 0.019 0.062 ± 0.012 0.013 ± 0.006
Georgian 0.091 ± 0.016 0.085 ± 0.018 0.063 ± 0.019 0.013 ± 0.006
Kazakh 0.088 ± 0.012 0.081 ± 0.014 0.062 ± 0.014 0.013 ± 0.010
Tamil 0.102 ± 0.018 0.095 ± 0.022 0.072 ± 0.030 0.015 ± 0.014
Thai 0.080 ± 0.012 0.074 ± 0.014 0.057 ± 0.028 0.012 ± 0.006
Ukrainian 0.093 ± 0.018 0.089 ± 0.019 0.068 ± 0.027 0.014 ± 0.009
Urdu 0.099 ± 0.020 0.093 ± 0.019 0.073 ± 0.019 0.015 ± 0.009
Malay 0.090 ± 0.014 0.084 ± 0.016 0.062 ± 0.019 0.013 ± 0.013
Avg (Mid) 0.091 ± 0.016 0.085 ± 0.017 0.065 ± 0.005 0.013 ± 0.001
Low-Resource
Amharic 0.122 ± 0.025 0.111 ± 0.023 0.185 ± 0.049 0.093 ± 0.041
Burmese 0.136 ± 0.029 0.119 ± 0.028 0.205 ± 0.055 0.103 ± 0.031
Guarani 0.098 ± 0.021 0.091 ± 0.019 0.147 ± 0.029 0.074 ± 0.024
Kannada 0.142 ± 0.029 0.132 ± 0.027 0.218 ± 0.051 0.109 ± 0.019
Khmer 0.129 ± 0.026 0.117 ± 0.024 0.197 ± 0.053 0.099 ± 0.034
Kyrgyz 0.111 ± 0.022 0.104 ± 0.020 0.170 ± 0.054 0.085 ± 0.035
Punjabi 0.115 ± 0.023 0.109 ± 0.021 0.179 ± 0.040 0.089 ± 0.023
Pashto 0.168 ± 0.031 0.156 ± 0.029 0.255 ± 0.058 0.129 ± 0.025
Zulu 0.123 ± 0.024 0.114 ± 0.022 0.188 ± 0.044 0.094 ± 0.034
Avg (Low) 0.126 ± 0.026 0.117 ± 0.025 0.194 ± 0.031 0.097 ± 0.016
Quantitative Analysis of Efficiency vs. Robustness. The trade-off between disentanglement strength and computational
overhead is summarized in Table 20. Variational estimators (CLUB, MINE) and kernel-based methods (HSIC) theoretically
offer tighter bounds on independence, but their integration into LLM architectures is bottlenecked by the high
dimensionality of the hidden states. Specifically, the auxiliary network in CLUB increases training latency by 112%
(2.12× the baseline) because of the additional forward-backward passes required for the critic update.
In contrast, our cross-covariance approach, Lcov, achieves a Bias Severity of 0.142, which is competitive with the
0.138 achieved by HSIC, but at a fraction of the computational cost (1.18× vs. 2.78× latency). We observe that simple
orthogonality (Lorth) suffers from significant "information leakage," as evidenced by its high Bias Severity (0.215); this
indicates that first-order geometric constraints are insufficient to capture the complex, non-linear correlations inherent in
LLM representations. By minimizing the cross-covariance, we achieve a second-order alignment that serves as a "practical
optimum": it sufficiently decorrelates the robust and biased subspaces without the prohibitive O(n^2) complexity or training
instability of higher-order estimators.
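
The percentages implied by Table 20 can be checked with a few lines of arithmetic. The sketch below copies the latency and Bias Severity values from that table and derives, for each method, the latency overhead and the relative bias reduction against the no-disentanglement baseline. The dictionary keys and the helper name `summarize` are illustrative choices, not the paper's.

```python
# (normalized training latency, Bias Severity), copied from Table 20.
table20 = {
    "Baseline": (1.00, 0.284),
    "L_orth": (1.05, 0.215),
    "MINE": (2.42, 0.145),
    "CLUB": (2.12, 0.141),
    "HSIC": (2.78, 0.138),
    "L_cov": (1.18, 0.142),
}

base_latency, base_bias = table20["Baseline"]

def summarize(name):
    """Return (latency overhead %, relative bias reduction %) versus the baseline."""
    latency, bias = table20[name]
    overhead_pct = (latency / base_latency - 1.0) * 100.0
    bias_reduction_pct = (1.0 - bias / base_bias) * 100.0
    return round(overhead_pct, 1), round(bias_reduction_pct, 1)

summary = {name: summarize(name) for name in table20 if name != "Baseline"}
for name, (overhead, reduction) in summary.items():
    print(f"{name:7s} overhead {overhead:6.1f}%   bias reduction {reduction:5.1f}%")
```

On these numbers, Lcov halves Bias Severity for an 18% latency overhead, whereas HSIC gains roughly 1.4 further percentage points of reduction for a 178% overhead.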
F.6. Quantitative Analysis: Linear Probing for Information Leakage
To quantify the degree of disentanglement achieved by DIBJUDGE, we employ linear probing (Alain & Bengio, 2016).
Our hypothesis is that a truly disentangled robust representation, zr , should be invariant to the text origin (Human vs.
Machine-translated). Conversely, the bias representation, zb, should explicitly encode these “translationese” artifacts.
Table 16. Bias severity by language on XL-Sum (Hasan et al., 2021) under perturbed setting
Language Base Vanilla SFT Vanilla IB DIBJudge
High-Resource
Arabic 0.079 ± 0.031 0.100 ± 0.030 0.084 ± 0.029 0.025 ± 0.008
English 0.060 ± 0.006 0.076 ± 0.008 0.063 ± 0.022 0.019 ± 0.005
French 0.054 ± 0.010 0.095 ± 0.019 0.069 ± 0.009 0.021 ± 0.005
Hindi 0.101 ± 0.010 0.126 ± 0.006 0.083 ± 0.029 0.024 ± 0.006
Japanese 0.067 ± 0.009 0.066 ± 0.026 0.064 ± 0.025 0.020 ± 0.008
Korean 0.081 ± 0.009 0.096 ± 0.037 0.075 ± 0.022 0.022 ± 0.012
Russian 0.077 ± 0.013 0.072 ± 0.012 0.067 ± 0.017 0.021 ± 0.004
Turkish 0.060 ± 0.016 0.065 ± 0.010 0.062 ± 0.020 0.019 ± 0.012
Vietnamese 0.069 ± 0.005 0.081 ± 0.029 0.081 ± 0.016 0.026 ± 0.006
Avg (High) 0.071 ± 0.014 0.086 ± 0.020 0.072 ± 0.009 0.022 ± 0.003
Mid-Resource
Azerbaijani 0.116 ± 0.024 0.145 ± 0.037 0.046 ± 0.012 0.012 ± 0.010
Bengali 0.080 ± 0.011 0.075 ± 0.015 0.040 ± 0.012 0.011 ± 0.008
Indonesian 0.058 ± 0.012 0.078 ± 0.009 0.036 ± 0.023 0.010 ± 0.005
Tamil 0.109 ± 0.006 0.123 ± 0.030 0.045 ± 0.022 0.012 ± 0.012
Thai 0.074 ± 0.007 0.083 ± 0.025 0.041 ± 0.029 0.011 ± 0.010
Ukrainian 0.092 ± 0.015 0.100 ± 0.022 0.043 ± 0.020 0.011 ± 0.013
Urdu 0.091 ± 0.020 0.117 ± 0.014 0.047 ± 0.029 0.012 ± 0.012
Uzbek 0.141 ± 0.012 0.159 ± 0.009 0.052 ± 0.012 0.013 ± 0.005
Avg (Mid) 0.095 ± 0.016 0.123 ± 0.022 0.044 ± 0.005 0.012 ± 0.001
Low-Resource
Amharic 0.303 ± 0.022 0.271 ± 0.036 0.110 ± 0.042 0.088 ± 0.020
Burmese 0.157 ± 0.028 0.081 ± 0.028 0.087 ± 0.025 0.067 ± 0.019
Hausa 0.215 ± 0.022 0.207 ± 0.015 0.098 ± 0.052 0.076 ± 0.040
Kyrgyz 0.160 ± 0.021 0.165 ± 0.035 0.081 ± 0.060 0.063 ± 0.016
Marathi 0.115 ± 0.016 0.124 ± 0.018 0.079 ± 0.044 0.061 ± 0.031
Nepali 0.084 ± 0.021 0.068 ± 0.018 0.070 ± 0.027 0.054 ± 0.040
Pashto 0.176 ± 0.034 0.170 ± 0.060 0.091 ± 0.063 0.071 ± 0.052
Sinhala 0.246 ± 0.042 0.248 ± 0.022 0.105 ± 0.021 0.082 ± 0.055
Telugu 0.095 ± 0.013 0.121 ± 0.001 0.076 ± 0.027 0.059 ± 0.056
Welsh 0.163 ± 0.026 0.225 ± 0.028 0.084 ± 0.025 0.066 ± 0.038
Avg (Low) 0.172 ± 0.024 0.168 ± 0.026 0.088 ± 0.013 0.069 ± 0.011
Experimental Setup. We freeze the DIBJUDGE encoder and train a linear classifier (the probe) on the extracted
representations. The probe is optimized to distinguish the two domains via binary classification. We report probing accuracy
on a held-out test set; an accuracy of 50% (random chance) signifies perfect invariance, whereas higher accuracy indicates
significant information leakage.
Results. As shown in Table 21, the probe trained on the bias representation zb achieves near-perfect accuracy (96.1%),
confirming that DIBJUDGE successfully isolates translation artifacts. More importantly, the probe trained on the robust
representation zr yields an accuracy of approximately 50%, demonstrating that predictive features are effectively sanitized
of domain-specific signals. In contrast, standard embeddings from the baseline SFT model exhibit substantial leakage
(82.4%), highlighting that standard fine-tuning fails to decouple semantic content from stylistic artifacts.
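
The probing protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's setup: the representations are synthetic stand-ins for frozen-encoder outputs (a bias-like feature that carries an origin-dependent shift and a robust-like feature that is pure noise), the probe is a plain logistic regression trained by gradient descent, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(machine-translated)
        err = p - y                               # log-loss gradient w.r.t. the logit
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def probe_accuracy(x, y, w, b):
    """Fraction of examples whose predicted origin matches the label."""
    return float(np.mean((x @ w + b > 0) == (y == 1)))

rng = np.random.default_rng(0)
n, d, split = 2000, 16, 1600
origin = rng.integers(0, 2, size=n)               # 0 = human, 1 = machine-translated
sign = (2 * origin - 1)[:, None]                  # -1 / +1 per example

# Synthetic features (NOT real DIBJUDGE outputs): the encoder itself is never trained here.
z_b = rng.normal(size=(n, d)) + 0.5 * sign        # origin leaks into every dimension
z_r = rng.normal(size=(n, d))                     # carries no origin signal

def held_out_accuracy(z):
    """Train the probe on the first `split` rows and score it on the rest."""
    w, b = train_linear_probe(z[:split], origin[:split])
    return probe_accuracy(z[split:], origin[split:], w, b)

acc_bias = held_out_accuracy(z_b)
acc_robust = held_out_accuracy(z_r)
print(f"probe on z_b: {acc_bias:.3f}   probe on z_r: {acc_robust:.3f}")
```

Because only the probe's weights are updated, a high held-out accuracy reflects information already present in the representation rather than anything the probe added.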
Table 17. Impact of Back-Translation Systems. We report BLEU scores of the generated back-translations and the resulting Bias
Severity.
Translation System BLEU ↑ Bias Severity ↓
NLLB-200-3.3B (Costa-jussà et al., 2022) (Default) 52.4 0.15
Qwen3-4B (Yang et al., 2025) 54.1 0.16
Gemma-3-4B (Team et al., 2025) 55.8 0.14
Llama-3.1-8B-Instruct (Grattafiori et al., 2024) 56.2 0.18
Gemini-2.5-Flash (Comanici et al., 2025) 57.9 0.15
GPT-4o (Hurst et al., 2024) 58.5 0.17
Google Translate 59.1 0.14
Table 18. Comparison of Heuristic Metrics. Ablation of the metric used for Bin Classification. NLL performs slightly better, but all
metrics provide similar bias mitigation.
Metric Bias Severity ↓
Type-Token Ratio (TTR) 0.17
Perplexity (PPL) 0.16
Negative Log-Likelihood (NLL) 0.15
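
For reference, the three heuristic metrics in Table 18 can be computed as below. This sketch uses the textbook definitions and assumes per-token averaging for NLL. The section does not say which language model supplies the log-probabilities or how scores are binned, so the token list and log-probabilities here are made up for illustration.

```python
import math

def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

def mean_nll(token_logprobs):
    """Average negative log-likelihood per token (natural log)."""
    return -sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs):
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(mean_nll(token_logprobs))

# Hypothetical inputs: 11 tokens with 8 distinct types, and four token log-probabilities.
tokens = "the cat sat on the mat because the mat was warm".split()
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]

ttr = type_token_ratio(tokens)
nll = mean_nll(logprobs)
ppl = perplexity(logprobs)
print(f"TTR={ttr:.3f}  NLL={nll:.3f}  PPL={ppl:.3f}")
```

Since perplexity is a monotone transform of mean NLL, the two rank individual sequences identically; any difference between them in a binning scheme would have to come from where the bin boundaries fall.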
[Figure 7 plot: win rate of translationese outputs (y-axis, 0.35 to 0.75) against binned CAD (x-axis, ten bins from 0.0-0.1 to 0.9-1.0), with the high-CAD regime highlighted. Fitted slopes: DIBJudge +0.07 (smallest), Base +0.16, Vanilla SFT +0.28, Vanilla IB +0.15.]
Figure 7. Sensitivity of Win Rate to Cross-Lingual Alignment Discrepancy (CAD). Win rates of Translationese outputs are plotted
against binned CAD values. Baseline judges (dashed lines) display a strong positive dependence on CAD, reflecting an English-anchoring
bias toward structurally aligned translations. In contrast, DIBJUDGE (solid line) maintains a near-constant win rate across CAD regimes,
including high-CAD regions, indicating robustness to cross-lingual structural divergence.
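
One way to obtain a sensitivity slope of the kind reported for Figure 7 is an ordinary least-squares line through the per-bin win rates. The sketch below assumes that fitting procedure, since this section does not state how the slopes were computed. The win rates are synthetic points generated from exact lines (not the paper's data), and `cad_slope` is a hypothetical helper name.

```python
import numpy as np

def cad_slope(bin_centers, win_rates):
    """Least-squares slope and intercept of win rate as a function of CAD."""
    slope, intercept = np.polyfit(bin_centers, win_rates, deg=1)
    return float(slope), float(intercept)

# Centers of ten equal-width CAD bins on [0, 1].
bin_centers = np.linspace(0.05, 0.95, 10)

# Synthetic win rates lying exactly on a line through (0.5, 0.5).
steep_judge = 0.5 + 0.28 * (bin_centers - 0.5)
flat_judge = 0.5 + 0.07 * (bin_centers - 0.5)

steep_slope, _ = cad_slope(bin_centers, steep_judge)
flat_slope, _ = cad_slope(bin_centers, flat_judge)
print(f"steep: {steep_slope:+.2f}   flat: {flat_slope:+.2f}")
```

A slope near zero means the judge's preference for translationese output barely changes with CAD, which is the behavior the caption attributes to DIBJUDGE.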
Table 19. Comparison of Bottleneck Mechanisms. Effect of different latent constraints on utility (m-RewardBench accuracy) and bias
mitigation (Bias Severity). Lower Bias Severity is better.
Method Constraint Type m-RB Acc. ↑ Bias Sev. ↓
Low-Rank Deterministic (Capacity) 88.2 0.185
Noise Stochastic (Additive) 85.5 0.182
VQ Discrete (Structural) 84.9 0.082
Ours Variational (Information) 89.3 0.091
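
To make the "Variational (Information)" row concrete, the sketch below shows the usual ingredients of a variational bottleneck: a reparameterized Gaussian sample and a KL penalty toward a standard normal prior. It assumes a diagonal-Gaussian posterior and an N(0, I) prior. The paper's exact parameterization is defined outside this section, so the function names and shapes are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch."""
    per_example = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(per_example.mean())

rng = np.random.default_rng(0)
mu = np.zeros((4, 8))          # batch of 4 examples, 8 latent dimensions
log_var = np.zeros((4, 8))     # unit variance in every dimension

z = sample_latent(mu, log_var, rng)
kl_zero = kl_to_standard_normal(mu, log_var)            # posterior equals the prior
kl_shifted = kl_to_standard_normal(mu + 1.0, log_var)   # mean moved by 1 in each dimension
print(z.shape, kl_zero, kl_shifted)
```

The KL term is zero when the posterior matches the prior and grows as the encoder moves away from it, which is what limits how much information the latent can carry.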
[Figure 8 plot: density (y-axis, 0.0 to 4.0) over the Sequence Surprisal Ratio, SSR (x-axis, 0.5 to 1.4), for Vanilla SFT and DIBJudge, with the peak of each distribution marked.]
Figure 8. Bias Spectrum Shift via Sequence Surprisal Ratio (SSR). Kernel density estimates of SSR for instances where the judge selects
the machine-generated output (Machine Wins). Lower SSR values correspond to spurious correlations with high predictive confidence
(i.e., low perplexity), indicating a preference for distributionally simple text. Values near 1 reflect invariance to confidence-related artifacts.
While the baseline judge concentrates mass at low SSR values, DIBJUDGE shifts the spectrum toward 1, demonstrating reduced reliance
on predictive-confidence shortcuts.
Table 20. Efficiency–bias trade-off of disentanglement mechanisms on Llama-3-8B. Training latency is normalized to standard
fine-tuning (1.0×). Bias Severity is defined in § 2, where lower is better.
Method Computational Cost Latency ↓ Bias Severity ↓
Baseline (No disentanglement) – 1.00× 0.284
Lorth (Orthogonality) O(d) 1.05× 0.215
MINE (Belghazi et al., 2018) O(aux. net) 2.42× 0.145
CLUB (Cheng et al., 2020) O(aux. net) 2.12× 0.141
HSIC (Gretton et al., 2005) O(n^2) 2.78× 0.138
Lcov (Ours) O(d^2) 1.18× 0.142
Table 21. Linear probing for domain classification. Higher accuracy indicates stronger domain information. Effective disentanglement
yields high accuracy for bias representations and low accuracy for robust representations.
Model Probed Representation Accuracy (%)
Baseline (SFT) Standard embedding h 82.4
DIBJUDGE (Ours) Bias representation zb 96.1
Robust representation zr 50.3