Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
Summary
This paper addresses "translationese bias" in multilingual Large Language Model (LLM) judges, where models systematically favor machine-translated text over human-authored references, particularly in low-resource languages. The authors attribute this bias to two spurious correlations: latent manifold alignment with English and cross-lingual predictability. To mitigate it, they propose DIBJUDGE, a fine-tuning framework based on the Disentangled Information Bottleneck principle. DIBJUDGE separates input representations into a robust branch for judgment-critical semantics and a bias branch that isolates spurious factors. It uses variational information compression to keep the robust representation minimal, and a cross-covariance penalty to suppress statistical dependence between the two branches. Evaluations on benchmarks such as M-RewardBench and MM-Eval show that DIBJUDGE consistently outperforms strong baselines, including GPT-4o and Gemini-2.5-Flash, in multilingual reward modeling accuracy. It also substantially reduces translationese bias severity across high-, mid-, and low-resource languages, with the largest improvements in under-resourced settings. The method generalizes to unseen biases, such as length and self-preference bias, without degrading performance on English-centric tasks. These results indicate that explicitly disentangling semantic content from translation artifacts improves the reliability and fairness of automated multilingual evaluation.
Chunk 0 · 1,995 chars
Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
Hongbin Zhang 1 2, Kehai Chen 1 2, Xuefeng Bai 1, Youcheng Pan 2, Yang Xiang 2, Jinpeng Wang 3, Min Zhang 1 2
Abstract
Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic “translationese bias”. In this paper, “translationese bias” is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJUDGE, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into the dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJUDGE consistently outperforms strong baselines and substantially mitigates translationese bias.
1. Introduction
The emergence of Large Language Models (LLMs) has revolutionized evaluation paradigms (Gu et al., 2024; Li et al., 2025), establishing “LLM-as-a-Judge” as a standard framework for multilingual assessment (Son et al., 2024; Pombal et al., 2025; Anugraha et al., 2025; Hada et al., 2024; Fu & Liu, 2025; Doddapaneni et al., 2025). Consequently, ensuring the accuracy and robustness of these automated judges across diverse languages has become a critical necessity (Padarha et al., 2025; Bogavelli et al., 2026).
1Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen,
Chunk 1 · 1,997 chars
al., 2024; Fu & Liu, 2025; Doddapaneni et al., 2025). Consequently, ensuring the accuracy and robustness of these automated judges across diverse languages has become a critical necessity (Padarha et al., 2025; Bogavelli et al., 2026).
1Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China 2Pengcheng Laboratory, Shenzhen, China 3Keeta AI, Meituan, Beijing, China. Correspondence to: Kehai Chen <chenkehai@hit.edu.cn>. Copyright 2026 by the author(s).
[Figure 1 plot. x-axis: bias severity (higher = more bias), ticks 0.1 to 0.8. y-axis labels, top to bottom: Southern Pashto, Plateau Malagasy, Yoruba, Zulu, Kyrgyz, Nepali, Amharic, Sinhala, Irish, Telugu, Cebuano, Lithuanian, Filipino, Standard Malay, Urdu, Bengali, Tamil, Thai, Ukrainian, Indonesian, Basque, Finnish, Hindi, Vietnamese, Standard Arabic, Portuguese, Japanese, Spanish, Simplified Chinese, English. Legend: Low Resource, Mid Resource, High Resource, highest bias, lowest bias, local trend (window=5), overall mean.]
Figure 1. Translationese Bias Severity of GPT-4o across languages. Languages are sorted by resource availability from low (top) to high (bottom). The trend line illustrates the inverse relationship between resource availability and translationese bias.
However, the reliability of LLM judges is frequently undermined by systematic biases (Wang et al., 2025a; Ye et al., 2025; Gao et al., 2025), such as position (Shi et al., 2025; Wang et al., 2024b) and verbosity bias (Saito et al., 2023). While these limitations are well-studied in English contexts (Chen et al., 2024; Zheng et al., 2023), specific failure modes within multilingual settings remain underexplored.
In this paper, we characterize a distinct bias of LLM-as-a-Judge in multilingual contexts, termed translationese bias, in which LLMs favor machine-translated content over human-authored references, even when the former is semantically flawed. To investigate this bias, we first conduct a comprehensive evaluation across a diverse spectrum of languages. As shown in Figure 1, this
Chunk 2 · 1,988 chars
e in multilingual contexts, termed translationese bias, in which LLMs favor machine-translated content over human-authored references, even when the former is semantically flawed. To investigate this bias, we first conduct a comprehensive evaluation across a diverse spectrum of languages. As shown in Figure 1, this bias is not only pervasive but is significantly exacerbated in low-resource languages. Crucially, our further attribution analysis suggests that LLM judges may conflate generation quality with two potential spurious factors: (a) latent manifold alignment with English, and (b) cross-lingual predictability. While recent advancements in multilingual LLM judges have yielded promising results (Pombal et al., 2025; Anugraha et al., 2025; Zhang et al., 2025), most existing methods remain grounded in standard Supervised Fine-Tuning (SFT). However, SFT is susceptible to exploiting spurious correlations (Shuieh et al., 2025; Gui & Ji, 2025; Chen et al., 2025c), thereby limiting its efficacy in mitigating translationese bias.
arXiv:2603.10351v1 [cs.CL] 11 Mar 2026
To this end, we propose the Disentangled Information Bottleneck Judge (DIBJUDGE), a robust fine-tuning framework that explicitly decouples the latent representation into two components: a robust representation that preserves judgment-critical semantic information, and a bias representation that isolates the spurious factors identified above (i.e., latent manifold alignment with English and cross-lingual predictability). We leverage variational information compression to learn a robust, minimally sufficient representation that preserves only information essential for accurate judgment. To further encourage disentanglement between the robust and bias representations, we penalize their mutual dependence during training.
Extensive experiments on multilingual reward modeling
Chunk 3 · 1,991 chars
ion compression to learn a robust, minimally sufficient representation that preserves only information essential for accurate judgment. To further encourage disentanglement between robust and bias representation, we penalize their mutual dependence during training.
Extensive experiments on multilingual reward modeling benchmarks, including M-RewardBench (Gureja et al., 2025) and MM-Eval (Son et al., 2024), demonstrate that DIBJUDGE consistently outperforms strong baselines, yielding improved multilingual reward modeling performance. Moreover, evaluations on a dedicated translationese bias suite confirm that DIBJUDGE substantially mitigates the severity of translationese bias.
In summary, we make three key claims: (i) we characterize translationese bias in multilingual LLM judges and identify two related spurious factors (latent manifold alignment with English and cross-lingual predictability), (ii) we propose DIBJUDGE, a robust fine-tuning framework that disentangles judgment-critical semantics from spurious factors, and (iii) we show that DIBJUDGE consistently outperforms strong baselines on multilingual reward modeling benchmarks while effectively mitigating translationese bias.¹
2. Preliminary Analysis of Translationese Bias
To systematically study translationese bias, we structure our preliminary analysis around two research questions: (i) RQ1: How does translationese bias vary across languages with different levels of resource availability? (ii) RQ2: What kinds of spurious factors are associated with this bias?
Bias Evaluation Protocol. We construct a controlled translationese bias benchmark derived from BELEBELE (Bandarkar et al., 2024), a multilingual reading comprehension dataset spanning 122 languages. Following the language-resource taxonomy of Joshi et al. (2020), we stratify languages into high-, medium-, and low-resource tiers and sample 10 representative languages per tier. For each language, we evaluate on 200 instances, yielding a
Chunk 4 · 1,992 chars
2024), a multilingual reading comprehension dataset spanning 122 languages.
Following the language-resource taxonomy of Joshi et al. (2020), we stratify
languages into high-, medium-, and low-resource tiers and sample 10
representative languages per tier. For each language, we evaluate on 200
instances, yielding a balanced benchmark across resource levels. The full list
of selected languages is provided in Appendix A.1. We formulate translationese
bias evaluation as a pairwise preference task, where an LLM judge compares two
candidate responses for the same query: (i) a human-authored reference $x_H$,
and (ii) a machine-generated variant $x_M$ obtained via back-translation to
induce translationese artifacts. To control for position bias, each instance
$i$ is evaluated under both forward and reverse orderings. Details are
available in Appendix A.2.
¹ Our anonymous code repo is here.
Bias Metric Definition. Let $y_i \in \{0, 1\}$ indicate whether the judge
prefers $x_M$ in the forward order, and let $\bar{y}_i \in \{0, 1\}$ denote the
corresponding preference in the reverse order. We define bias severity
$S_{\text{bias}}$ as the fraction of consistent judgments that favor the
machine-generated output:
$$S_{\text{bias}} = \frac{\sum_{i=1}^{N} \mathbb{I}\left[y_i = 1 \wedge \bar{y}_i = 1\right]}{\sum_{i=1}^{N} \mathbb{I}\left[y_i = \bar{y}_i\right]}. \tag{1}$$
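The metric in Eq. 1 can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' evaluation code: the function name `bias_severity` is hypothetical, the inputs are plain 0/1 lists in which 1 means the judge preferred the machine-generated variant, and returning NaN when no judgment is order-consistent is a choice made here because Eq. 1 is undefined in that case.

```python
from typing import Sequence


def bias_severity(forward: Sequence[int], reverse: Sequence[int]) -> float:
    """Eq. 1: share of order-consistent judgments that prefer the machine output.

    forward[i] / reverse[i] are 1 if the judge preferred x_M in that ordering.
    """
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse must have the same length")
    pairs = list(zip(forward, reverse))
    # Denominator: judgments that agree across both orderings.
    consistent = sum(1 for y, y_bar in pairs if y == y_bar)
    # Numerator: consistent judgments that favor the machine output.
    machine = sum(1 for y, y_bar in pairs if y == 1 and y_bar == 1)
    if consistent == 0:
        return float("nan")  # assumption: metric is undefined with no consistent pairs
    return machine / consistent


# Toy data (made up): 3 of 5 judgments are consistent, 2 of those favor x_M.
forward = [1, 1, 0, 0, 1]
reverse = [1, 0, 0, 1, 1]
score = bias_severity(forward, reverse)
print(round(score, 3))
```

Judgments that flip with the ordering are excluded from both numerator and denominator, so the score reflects preference for translationese rather than position effects.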
2.1. Quantifying Bias across Resource Levels (RQ1)
Figure 1 illustrates the translationese bias severity of GPT-4o across varying
language resource levels. We observe two salient patterns: first,
translationese bias is pervasive across the entire linguistic spectrum; second,
there is a distinct inverse correlation between resource availability and the
magnitude of bias. Specifically, while high-resource languages exhibit minimal
bias, low-resource languages demonstrate significantly elevated severity. These
findings expose translationese bias as a severe yet previously neglected
failure mode in multilingual LLM judges, which critically undermines evaluation
reliability and disproportionately compromises under-resourced languages.
2.2.
Chunk 5 · 1,996 chars
mal bias, low-resource languages demonstrate significantly elevated severity. These findings expose translationese bias as a severe yet previously neglected failure mode in multilingual LLM judges, which critically undermines evaluation reliability and disproportionately compromises under-resourced languages.
2.2. Attribution Analysis of Translationese Bias (RQ2)
Due to the scarcity of high-quality native resources (Qin et al., 2025; Huang et al., 2024), multilingual LLMs are typically pre-trained on English-dominated corpora (Kreutzer et al., 2022; Weber et al., 2024) and subsequently adapted using translated or synthetic data (Muennighoff et al., 2023; Zhang et al., 2020a). Accordingly, we hypothesize that translationese bias stems from spurious correlations induced across these two stages: (i) latent manifold alignment with English, where non-English representations are implicitly aligned to an English-centric latent space during pre-training; and (ii) cross-lingual predictability, where the judge over-relies on probability heuristics that favor the statistical patterns of machine-translated text, potentially amplified by exposure to translated or synthetic data during fine-tuning. However, causally attributing this bias to particular data mixtures remains non-trivial given the opacity and heterogeneity of LLM training pipelines (Lai et al., 2025).
To address this, we introduce two measurable latent metrics that serve as quantitative proxies for these two factors: (i) Language Alignment Score (LAS), defined as the degree to which a representation is geometrically aligned with an English latent manifold:
$$\text{LAS}(x) = \frac{1}{L} \sum_{l=1}^{L} \cos\big(h_l(x), c_{en,l}\big),$$
where x is input sequence,
[Figure 2(a) plot. x-axis: cross-lingual alignment discrepancy, ticks -0.20 to 0.20; y-axis: model win rate, ticks 0.0 to 0.4. Legend: logit fit (β = -5.31), fit 95% CI, binned estimate (95% CI).]
Chunk 6 · 1,995 chars
, $c_{en,l}$), where $x$ is the input sequence,
[Figure 2 plots. (a) Machine Win Rate vs. CAD: x-axis cross-lingual alignment discrepancy, y-axis model win rate; logit fit (β = -5.31) with fit 95% CI and binned estimates (95% CI). (b) SSR Density (Human vs. Machine Win): x-axis Sequence Surprisal Ratio, y-axis probability density; human-win and machine-win medians marked, Δmedian = -0.34. (c) ROC Analysis of CAD and SSR: x-axis false positive rate, y-axis true positive rate; SSR (AUC=0.695), CAD (AUC=0.638).]
Figure 2. Correlation analysis of judge preference with confounding factors. (a) Machine win rate decreases monotonically as CAD increases, indicating that judge preference spuriously tracks latent manifold isomorphism with English. (b) SSR distributions exhibit a clear drift between human-win and machine-win cases, showing that the judge systematically favors higher-likelihood outputs. (c) ROC curves confirm that both CAD and SSR reliably predict judge outcomes, reinforcing the attribution that translationese bias is mediated by latent manifold isomorphism with English and high predictive confidence.
$h_l(x)$ is the layer-$l$ hidden representation of $x$, $c_{en,l}$ is the English centroid at layer $l$, and $L$ is the total number of layers. (ii) Cross-lingual Sequence Surprisal (CSS), defined as the length-normalized negative log-likelihood of a target sequence $x$ of $T$ tokens, conditioned on its English translated context $x_{en}$:
$$\text{CSS}(x) = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{en}, x_{<t}).$$
To answer RQ2, we then investigate the extent to which the distributional divergence of these metrics between $x_H$ and $x_M$ correlates with machine win rate: (i) Cross-lingual Alignment Discrepancy (CAD), $\text{CAD} = \text{LAS}(x_H) - \text{LAS}(x_M)$, where $\text{CAD} < 0$ implies that $x_M$ exhibits closer alignment to the English latent space than $x_H$ does.
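The LAS, CSS, and CAD definitions above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation. It assumes that each layer's hidden state has already been pooled to a single vector per layer (the pooling scheme is not specified in the text above), that the English centroids are supplied by the caller, and that per-token log-probabilities conditioned on the English context have been obtained from the model separately. All function names and the toy values are hypothetical.

```python
import numpy as np


def las(hidden, centroids):
    """Language Alignment Score: mean over layers of cos(h_l(x), c_en,l).

    hidden, centroids: arrays of shape (L, d), one vector per layer
    (assumption: hidden states are already pooled to one vector per layer).
    """
    h = np.asarray(hidden, dtype=float)
    c = np.asarray(centroids, dtype=float)
    cos = np.sum(h * c, axis=1) / (
        np.linalg.norm(h, axis=1) * np.linalg.norm(c, axis=1)
    )
    return float(np.mean(cos))


def css(token_logprobs):
    """Cross-lingual Sequence Surprisal: length-normalized negative log-likelihood.

    token_logprobs[t] is log P(x_t | x_en, x_<t), computed elsewhere.
    """
    return float(-np.mean(np.asarray(token_logprobs, dtype=float)))


def cad(las_human, las_machine):
    """Cross-lingual Alignment Discrepancy: LAS(x_H) - LAS(x_M)."""
    return las_human - las_machine


# Toy example with L = 2 layers and d = 2 dimensions (made-up numbers).
centroids = [[1.0, 0.0], [0.0, 1.0]]
h_human = [[1.0, 1.0], [1.0, 1.0]]    # 45 degrees away from each centroid
h_machine = [[1.0, 0.0], [0.0, 1.0]]  # exactly on each centroid
gap = cad(las(h_human, centroids), las(h_machine, centroids))
print(gap < 0)  # negative CAD: the machine text sits closer to the English centroids
```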
Chunk 7 · 1,999 chars
wer RQ2, we then investigate the extent to which the distributional divergence of these metrics between $x_H$ and $x_M$ correlates with machine win rate: (i) Cross-lingual Alignment Discrepancy (CAD), $\text{CAD} = \text{LAS}(x_H) - \text{LAS}(x_M)$, where $\text{CAD} < 0$ implies that $x_M$ exhibits closer alignment to the English latent space than $x_H$ does. (ii) Sequence Surprisal Ratio (SSR): $\text{SSR} = \frac{\text{CSS}(x_M)}{\text{CSS}(x_H)}$, where $\text{SSR} < 1$ indicates that $x_M$ is more cross-lingually predictable by the model than $x_H$.
As shown in Figure 2, LLM judges exhibit strong correlations with the introduced latent metrics. Specifically, Figure 2a reveals a negative association between CAD and the machine win rate, suggesting that the judge favors outputs that align closely with the English manifold. Meanwhile, Figure 2b demonstrates a pronounced distributional shift in SSR: machine-generated outputs preferred by the judge cluster significantly at lower SSR. This pattern indicates a bias toward sequences with high statistical cross-lingual predictability. The robustness of the observed correlations is further corroborated by the ROC analysis in Figure 2c, demonstrating meaningful discriminative power of these features. Collectively, these results ascribe the observed translationese bias to two confounding factors: latent manifold alignment with English and cross-lingual predictability.
3. Disentangled Information Bottleneck Judge
To mitigate the spurious correlations identified in § 2.2, we propose the Disentangled Information Bottleneck Judge (DIBJUDGE), as illustrated in Figure 3. By explicitly disentangling these spurious factors, DIBJUDGE learns a compressed representation that retains sufficient, robust features essential for accurate quality assessment.
3.1. Preliminaries
Mutual Information. Mutual Information (MI) quantifies the statistical dependence between two random variables, A and B. Given the joint distribution p(a, b) and marginals p(a), p(b), MI is formally defined as: I(A; B) ≜ Ep(a,b) log p(a,
Chunk 8 · 1,995 chars
robust
features essential for accurate quality assessment.
3.1. Preliminaries
Mutual Information. Mutual Information (MI) quantifies
the statistical dependence between two random variables,
A and B. Given the joint distribution p(a, b) and marginals
p(a), p(b), MI is formally defined as:
$$I(A; B) \triangleq \mathbb{E}_{p(a,b)}\left[\log \frac{p(a, b)}{p(a)\,p(b)}\right]. \tag{2}$$
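For intuition, Eq. 2 can be evaluated exactly when A and B are discrete and the joint distribution is given as a table. The sketch below is only an illustration of the definition (the function name `mutual_information` is hypothetical, and natural logarithms are assumed, so values are in nats); it is not how MI is estimated for the high-dimensional representations discussed later.

```python
import numpy as np


def mutual_information(joint):
    """I(A; B) for a discrete joint table, following Eq. 2 (result in nats).

    joint[i, j] is proportional to p(A = i, B = j); it is normalized here.
    """
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    p_a = p.sum(axis=1, keepdims=True)  # marginal p(a), shape (|A|, 1)
    p_b = p.sum(axis=0, keepdims=True)  # marginal p(b), shape (1, |B|)
    product = p_a @ p_b                 # outer product p(a) p(b)
    mask = p > 0                        # 0 * log 0 is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / product[mask])))


independent = [[0.25, 0.25], [0.25, 0.25]]  # A and B independent
dependent = [[0.5, 0.0], [0.0, 0.5]]        # B is a copy of A
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

Independent variables give zero MI, while a variable that fully determines the other gives MI equal to its entropy (here log 2).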
Information Bottleneck Principle. The Information Bottleneck (IB) principle
(Tishby et al., 2000; Alemi et al., 2017) seeks a compressed representation
$Z$ that is sufficient for a target task $Y$ while remaining minimal with
respect to the input $X$. This is formalized by minimizing the Lagrangian
$\mathcal{L}_{IB} = -I(Y; Z) + \beta I(X; Z)$, where $\beta \ge 0$ governs the
trade-off between prediction and compression. However, compression may discard
robust semantic features in favor of simpler spurious correlations (Liu et
al., 2022; Zhang et al., 2024). Thus, information compactness alone cannot
guarantee robustness.
3.2. Disentangled Information Bottleneck Objective
Because information compression alone does not prevent LLM judges from
exploiting spurious shortcuts, we draw on disentangled representation learning
(Wang et al., 2024d) to refine vanilla IB. Let $S$ be a spurious variable,
$Z_r$ be a relevant variable encoding features necessary for predicting the
target $Y$, and $Z_b$ be a bias variable serving as a dedicated “sink” to
absorb $S$. We formalize the Disentangled Information Bottleneck Objective as:
$$\mathcal{L}_{DIB} = -\underbrace{I(Y; Z_r)}_{\text{Prediction}} + \beta\,\underbrace{I(X; Z_r)}_{\text{Compression}} - \gamma\,\underbrace{I(S; Z_b)}_{\text{Bias Capture}} + \lambda\,\underbrace{I(Z_r; Z_b)}_{\text{Disentanglement}}. \tag{3}$$
[Figure 3 diagram. Recoverable labels: LLM Judge (learnable and frozen parts marked), LoRA, Robust Encoder, Bias Encoder, Decoder 1, Decoder 2, Output, Robust Branch, Bias Branch, "Minimize dependence", and the proxy tasks (i) Cross-lingual Alignment Contrastive Learning and (ii) Log Probability Bin Classification Task.]
Chunk 9 · 1,994 chars
[Figure 3 diagram. Recoverable labels: LLM Judge (learnable and frozen parts marked), LoRA, Robust Encoder, Bias Encoder, variational bottleneck, robust space, Decoder 1, Decoder 2, Output, Robust Branch, Bias Branch, "Minimize dependence"; numbered steps (1) Decouple, (2) Compress, (3) Predict, (4) Disentangle, (5) Spurious; proxy tasks for encoding spurious attributes: (i) Cross-lingual Alignment Contrastive Learning (latent manifold isomorphism with English) and (ii) Log Probability Bin Classification Task (predictive confidence).]
Figure 3. Overview of our DIBJUDGE, which is grounded in Equation 3. DIBJUDGE (1) employs a robust encoder $g_{\phi_r}$ and a bias encoder $g_{\phi_b}$ to separate the input $X$ into robust representations $Z_r$ and bias representations $Z_b$; (2) introduces a variational bottleneck to minimize the mutual information $I(X; Z_r)$; (3) leverages the compressed $Z_r$ through the LLM judge, optimized using LoRA (Hu et al., 2022), to generate the final output $Y$ by maximizing $I(Y; Z_r)$; (4) ensures feature independence by minimizing the dependence $I(Z_r; Z_b)$ between the robust and bias branches; and (5) explicitly captures spurious attributes $S$ within $Z_b$ by maximizing $I(S; Z_b)$ through two proxy tasks: (i) cross-lingual alignment contrastive learning and (ii) predictive confidence estimation via log-probability bin classification.
The first two terms apply a vanilla IB constraint restricted to the robust channel $Z_r$; the third term makes $Z_b$ informative about spurious attributes $S$; and the final term penalizes dependence between $(Z_r, Z_b)$, encouraging $Z_r$ to exclude spurious correlations that are explicitly routed into $Z_b$. Directly optimizing this objective is computationally intractable due to the difficulty of estimating mutual information in high-dimensional spaces (Liu et al., 2024b). To address this, we derive tractable variational surrogate objectives as follows:
Decouple Robust and Bias Representation. Using separate encoders $g_{\phi_r}$ and $g_{\phi_b}$, we decompose the input $X$ into a
Chunk 10 · 1,995 chars
tionally intractable due to the difficulty of estimating mutual information in high-dimensional spaces (Liu et al., 2024b). To address this, we derive tractable variational surrogate objectives as follows:
Decouple Robust and Bias Representation. Using separate encoders $g_{\phi_r}$ and $g_{\phi_b}$, we decompose the input $X$ into a robust representation $Z_r = g_{\phi_r}(X) \in \mathbb{R}^{T \times d}$ and a bias representation $Z_b = g_{\phi_b}(X) \in \mathbb{R}^{T \times d}$, where $T$ denotes the sequence length and $d$ the feature dimension. We leverage $Z_r$ for task prediction while using $Z_b$ only during training.
Compression via Variational Information Constraints. To facilitate compression, we leverage the Variational Information Bottleneck (Alemi et al., 2017), which imposes an upper bound on $I(X; Z_r)$ via variational inference.
Proposition 3.1. Let $Z_r$ be a continuous random variable, with variational posterior $q_\phi(Z_r \mid X)$ and fixed prior $p(Z_r)$. Then
$$I(X; Z_r) \le \mathbb{E}_{x \sim p(X)}\left[D_{KL}\big(q_\phi(Z_r \mid x) \,\|\, p(Z_r)\big)\right].$$
Guided by Proposition 3.1 (proved in Appendix B.1), we can constrain $I(X; Z_r)$ by penalizing the KL divergence between the variational posterior and the fixed prior. Accordingly, we adopt a standard Gaussian prior $p(Z_r) = \mathcal{N}(0, I)$ and parameterize the variational posterior $q_\phi(z_{r,t} \mid x)$ at each time step $t$ as a multivariate Gaussian $\mathcal{N}(\mu_t, \sigma_t^2)$. The resulting compression objective, defined as the average KL divergence over the sequence length $T$ and feature dimension $d$, is derived as follows (details in Appendix B.2):
$$\mathcal{L}_{compress} = \frac{1}{T} \sum_{t=1}^{T} D_{KL}\big(\mathcal{N}(\mu_t, \sigma_t^2) \,\|\, \mathcal{N}(0, I)\big) = -\frac{1}{2T} \sum_{t=1}^{T} \sum_{j=1}^{d} \left(1 + \log \sigma_{t,j}^2 - \mu_{t,j}^2 - \sigma_{t,j}^2\right). \tag{4}$$
To allow for backpropagation, we sample the latent representation $z_{r,t}$ using the reparameterization trick (Kingma et al., 2015): $z_{r,t} = \mu_t + \sigma_t \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
Variational Mutual Information Maximization. The objective (Eq. 3) necessitates maximizing mutual information along two disentangled pathways: the task-predictive term $I(Y; Z_r)$ and the bias-capturing term $I(S; Z_b)$. We then maximize a
Chunk 11 · 1,999 chars
meterization trick (Kingma et al., 2015): $z_{r,t} = \mu_t + \sigma_t \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
Variational Mutual Information Maximization. The objective (Eq. 3) necessitates maximizing mutual information along two disentangled pathways: the task-predictive term $I(Y; Z_r)$ and the bias-capturing term $I(S; Z_b)$. We then maximize a variational lower bound on the mutual information guided by Proposition 3.2 (proved in Appendix B.3).
Proposition 3.2. Let $U$ and $V$ be random variables with joint distribution $p(U, V)$. For any variational conditional distribution $q_\theta(U \mid V)$, the mutual information satisfies
$$I(U; V) \ge \mathbb{E}_{(U,V) \sim p(U,V)}\left[\log q_\theta(U \mid V)\right] + H(U),$$
where $H(U)$ denotes the marginal entropy of $U$.
For the robust pathway, we treat the LLM judge $f_{judge}$ as the variational decoder $q_\theta$. We condition the generation on a sequence formed by concatenating the instruction embeddings $E_{inst}$ with the sampled robust representation $Z_r$. The task loss is defined as:
$$\mathcal{L}_{task} = \mathbb{E}_{X,Y}\left[-\sum_{t=1}^{|Y|} \log f_{judge}\big(Y_t \mid [E_{inst}; Z_r], Y_{<t}\big)\right]. \tag{5}$$
We employ lightweight MLP decoders $q_{\psi_{bias}}$ and minimize the negative log-likelihood of the spurious attribute $S$ given the bias representation $Z_b$ to facilitate the encoding of spurious information into the bias pathway:
$$\mathcal{L}_{bias} = \mathbb{E}_{X,S}\left[-\log q_{\psi_{bias}}(S \mid Z_b)\right]. \tag{6}$$
We operationalize the identified spurious factors through two designed proxy tasks. First, we address latent-manifold alignment with English using a cross-lingual contrastive learning objective. Second, we estimate cross-lingual predictability via a log-probability bin classification task.
Disentanglement via Cross-Covariance Penalty. Directly minimizing the mutual information $I(Z_r; Z_b)$ is generally intractable. However, in high-dimensional regimes typical of LLMs, representation distributions are often well-approximated by Gaussian statistics (Lee et al., 2018; Hron et al.,
Chunk 12 · 1,978 chars
bin classification task.
Disentanglement via Cross-Covariance Penalty. Directly minimizing the mutual information $I(Z_r; Z_b)$ is generally intractable. However, in high-dimensional regimes typical of LLMs, representation distributions are often well-approximated by Gaussian statistics (Lee et al., 2018; Hron et al., 2020). Under this Gaussian assumption, minimizing mutual information reduces to minimizing the cross-covariance between latent variables (Cover, 1999; Hyvärinen & Oja, 2000). We formalize this relationship in Proposition 3.3 (proof provided in Appendix B.4).
Proposition 3.3. Let $Z_r$ and $Z_b$ be jointly Gaussian random vectors with marginal covariance matrices $\Sigma_r$ and $\Sigma_b$, and cross-covariance $\Sigma_{rb}$. Define the normalized cross-covariance matrix as $C = \Sigma_r^{-1/2} \Sigma_{rb} \Sigma_b^{-1/2}$. Provided the spectral norm $\|C\|_2$ is sufficiently small, the mutual information admits the following second-order expansion:
$$I(Z_r; Z_b) = \frac{1}{2}\|C\|_F^2 + o\big(\|C\|_F^2\big), \quad \text{as } \|C\|_2 \to 0.$$
Accordingly, we adopt the cross-covariance penalty as a computationally efficient surrogate for the disentanglement term in Eq. 3. Given centered mini-batch representations $\bar{Z}_r, \bar{Z}_b \in \mathbb{R}^{N \times d}$, we compute the empirical cross-covariance matrix $\hat{\Sigma}_{rb} = \frac{1}{N-1} \bar{Z}_r^{\top} \bar{Z}_b$. We then minimize the squared Frobenius norm of $\hat{\Sigma}_{rb}$ to suppress correlations:
$$\mathcal{L}_{disc} = \|\hat{\Sigma}_{rb}\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{d} (\hat{\Sigma}_{rb})_{ij}^2. \tag{7}$$
In practice, feature-wise normalization ensures that $Z_r$ and $Z_b$ have approximately unit variance along each dimension (Ba et al., 2016). This objective penalizes second-order dependencies, thereby encouraging statistical independence in the learned representations (Zbontar et al., 2021).
Overall Learning Objective. We optimize DIBJUDGE end-to-end by minimizing a weighted sum of the tractable objectives derived above. Concretely, the final training objective is
$$\mathcal{L} = \mathcal{L}_{task} + \beta\, \mathcal{L}_{compress} + \gamma\, \mathcal{L}_{bias} + \lambda\, \mathcal{L}_{disc}, \tag{8}$$
where the weights $\beta, \gamma, \lambda$ control the accuracy–compression–bias-capture–independence trade-off.
4.
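The closed-form pieces of the training objective (the KL term of Eq. 4, the cross-covariance penalty of Eq. 7, and the weighted sum of Eq. 8) can be sketched numerically as below. This is a NumPy illustration under stated assumptions, not the authors' training code: the function names are hypothetical, the encoder is assumed to output a mean and a log-variance per time step, the task and bias losses are passed in as precomputed scalars, and the weight values in the example are arbitrary placeholders rather than the paper's settings. A real implementation would use an autodiff framework so that gradients flow through these terms.

```python
import numpy as np


def kl_compress(mu, log_var):
    """Eq. 4: KL(N(mu_t, sigma_t^2) || N(0, I)), summed over d, averaged over T.

    mu, log_var: arrays of shape (T, d); log_var holds log(sigma^2).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    kl_per_step = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    return float(np.mean(kl_per_step))


def cross_cov_penalty(z_r, z_b):
    """Eq. 7: squared Frobenius norm of the empirical cross-covariance.

    z_r, z_b: mini-batch representations of shape (N, d); centered here.
    """
    z_r = np.asarray(z_r, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    n = z_r.shape[0]
    zr_c = z_r - z_r.mean(axis=0)
    zb_c = z_b - z_b.mean(axis=0)
    sigma_rb = zr_c.T @ zb_c / (n - 1)
    return float(np.sum(sigma_rb**2))


def total_loss(l_task, l_compress, l_bias, l_disc, beta, gamma, lam):
    """Eq. 8: weighted sum of the four tractable terms."""
    return l_task + beta * l_compress + gamma * l_bias + lam * l_disc


# Toy batch (made-up numbers): z_b_uncorr is uncorrelated with z_r by construction.
z_r = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
z_b_uncorr = np.array([[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]])
penalty_same = cross_cov_penalty(z_r, z_r)           # fully shared features
penalty_uncorr = cross_cov_penalty(z_r, z_b_uncorr)  # no linear dependence
print(penalty_same > penalty_uncorr)
```

The penalty only measures linear (second-order) dependence, which is consistent with the Gaussian approximation invoked in Proposition 3.3.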
Chunk 13 · 1,997 chars
erall Learning Objective. We optimize DIBJUDGE end-to-end by minimizing a weighted sum of tractable objectives derived above. Concretely, the final training objective is
$$\mathcal{L} = \mathcal{L}_{task} + \beta\, \mathcal{L}_{compress} + \gamma\, \mathcal{L}_{bias} + \lambda\, \mathcal{L}_{disc}, \tag{8}$$
where the weights $\beta, \gamma, \lambda$ control the accuracy–compression–bias-capture–independence trade-off.
4. Experiments
Evaluation Benchmarks. To evaluate the effectiveness of LLM judges across multilingual contexts, we utilize three primary reward modeling benchmarks (Lambert et al., 2024; Son et al., 2024; Gureja et al., 2025) selected to ensure a balanced consideration of the following aspects: a) reasoning and safety alignment across diverse conversational contexts, b) performance across 23 distinct languages, c) the distinction between translated content and native-speaker data. Our primary evaluation metric is accuracy, reported as the category average and the mean of language-specific micro-averages. More details are provided in Appendix C.1.
Training settings. We adopt the same training corpus as mR3 (Anugraha et al., 2025) and fine-tune using LoRA (Hu et al., 2022). All experiments are optimized with the Adam optimizer (Kingma, 2014) using a learning rate of $1 \times 10^{-4}$ and a maximum sequence length of 16384. Further implementation details are provided in Appendix C.2.
Baselines. We evaluate DIBJUDGE against proprietary general-purpose models (GPT-4o (Hurst et al., 2024), Gemini-2.5-Flash (Comanici et al., 2025)) and open-source general-purpose LLMs (Qwen2.5/3 (Qwen et al., 2025; Yang et al., 2025)). Since Qwen3 is our backbone, these comparisons isolate gains from our training recipe beyond base model capacity. We additionally benchmark multilingual reward models/judges, including Nemotron-Multilingual-49B (Wang et al., 2025b), M-Prometheus (3B/7B) (Pombal et al., 2025), mR3 (Anugraha et al., 2025), and Think-as-Locals (7B) (Zhang et al., 2025).
Main Results. Table 1 reports the mean accuracy and standard deviation across benchmarks over three
Chunk 14 · 1,996 chars
ark multilingual reward models/judges, including Nemotron-Multilingual-49B (Wang et al., 2025b), M-Prometheus (3B/7B) (Pombal et al., 2025), mR3 (Anugraha et al., 2025), and Think-as-Locals (7B) (Zhang et al., 2025).
Main Results. Table 1 reports the mean accuracy and standard deviation across benchmarks over three independent runs, with statistical significance assessed using pairwise t-tests. On m-RewardBench, DIBJUDGE-Qwen3-8B establishes a new SOTA among open-weight models, significantly outperforming both its backbone-matched counterpart and a substantially larger multilingual baseline. These results confirm the effectiveness of the proposed approach. In terms of generalization, DIBJUDGE-Qwen3-8B achieves superior performance on the English-centric RewardBench, statistically surpassing prior leading methods. This indicates that the proposed method improves multilingual reward modeling without degrading performance on monolingual benchmarks. Detailed results are provided in Appendix D.
5. Analysis
To substantiate the theoretical claims proposed in this work, we conduct a series of targeted experiments designed to rigorously validate the efficacy and internal mechanics of DIBJUDGE. Our analysis focuses on verifying that the proposed information disentanglement objective translates into tangible performance gains and interpretable latent
Table 1. Performance evaluation on multilingual reward benchmarks. Bold indicates the best performance, and underlined indicates the second-best. Statistical significance compared to the best baseline is denoted by † (p < 0.05) and ‡ (p < 0.01).
Model | m-RewardBench (Avg. 23 langs) | RewardBench (English) | MM-Eval (Avg. 18 lang)
Proprietary Models
GPT-4o (Hurst et al., 2024) | 85.75 ± 0.42 | 85.96 ± 0.35 | 71.85 ± 0.81
Gemini-2.5-Flash (Comanici et al., 2025) | 88.06 ± 0.49 | 88.83 ± 0.47 | 77.47 ± 0.76
General Open
Chunk 15 · 1,985 chars
best baseline is denoted by † (p < 0.05) and ‡ (p < 0.01).
Model | m-RewardBench (Avg. 23 langs) | RewardBench (English) | MM-Eval (Avg. 18 lang)
Proprietary Models
GPT-4o (Hurst et al., 2024) | 85.75 ± 0.42 | 85.96 ± 0.35 | 71.85 ± 0.81
Gemini-2.5-Flash (Comanici et al., 2025) | 88.06 ± 0.49 | 88.83 ± 0.47 | 77.47 ± 0.76
General Open Models
Qwen2.5-3B-Instruct (Qwen et al., 2025) | 66.97 ± 1.12 | 68.99 ± 1.05 | 57.99 ± 1.20
Qwen2.5-7B-Instruct (Qwen et al., 2025) | 77.89 ± 0.89 | 78.59 ± 0.91 | 65.64 ± 0.95
Qwen3-4B (Yang et al., 2025) | 85.06 ± 0.65 | 87.54 ± 0.55 | 80.85 ± 0.68
Qwen3-8B (Yang et al., 2025) | 86.12 ± 0.52 | 88.81 ± 0.48 | 82.20 ± 0.60
Multilingual Open Reward Models
Nemotron-Multi-49B (Wang et al., 2025b) | 88.83 ± 0.35 | 89.71 ± 0.31 | 76.31 ± 0.55
M-PROMETHEUS 3B (Pombal et al., 2025) | 68.45 ± 0.98 | 69.79 ± 0.92 | 64.17 ± 1.10
M-PROMETHEUS 7B (Pombal et al., 2025) | 78.03 ± 0.85 | 76.69 ± 0.78 | 69.38 ± 0.88
mR3-Qwen3-4B (Anugraha et al., 2025) | 87.21 ± 0.45 | 89.75 ± 0.38 | 82.55 ± 0.52
mR3-Qwen3-8B (Anugraha et al., 2025) | 88.58 ± 0.41 | 90.10 ± 0.40 | 85.29 ± 0.45
Think-as-Locals 7B (Zhang et al., 2025) | 84.51 ± 0.60 | 88.79 ± 0.52 | 72.95 ± 0.70
Ours
DIBJudge-Qwen3-4B | 89.84 ± 0.28† | 90.32 ± 0.25 | 85.16 ± 0.33
DIBJudge-Qwen3-8B | 91.37 ± 0.22‡ | 91.01 ± 0.20† | 87.53 ± 0.28‡
structures. We organize this empirical investigation around five core research questions: (i) RQ1 (Bias Mitigation): To what extent does DIBJUDGE effectively mitigate translationese bias across languages with varying resource availability? (ii) RQ2 (Utility Trade-off): How does the information bottleneck constraint shape the Pareto Frontier between bias mitigation and downstream task utility? (iii) RQ3 (Disentanglement): Do the learned latent representations geometrically disentangle semantic content from translationese artifacts, as theoretically hypothesized? (iv) RQ4 (Generalization): Does the model exhibit robustness against unseen bias types (e.g., length bias) that were not explicitly included in the spurious proxy task? (v)
Chunk 16 · 1,997 chars
: Do the learned latent representations geometrically disentangle semantic content from translationese artifacts, as theoretically hypothesized? (iv) RQ4 (Generalization): Does the model exhibit robustness against unseen bias types (e.g., length bias) that were not explicitly included in the spurious proxy task? (v) RQ5 (Ablation Study): How do the distinct components of the DIB objective (Eq. 3) and spurious proxy task contribute to bias mitigation and reward modeling utility? Additional analyses are deferred to the appendix, including studies of proxy-task design (Appx. F.1), sensitivity to CAD and SSR (Appx. F.2, F.3), comparisons with alternative compression and disentanglement mechanisms (Appx. F.4, F.5), and linear probing to assess information leakage (Appx. F.6).

RQ1: Efficacy in Translationese Bias Mitigation. We extend the preliminary bias evaluation (§ 2) to a broader suite of domains and datasets. We evaluate performance on three diverse benchmarks: BELEBELE (machine reading comprehension) (Bandarkar et al., 2024), AYA (open-ended instruction following) (Singh et al., 2024), and XL-SUM (summarization) (Hasan et al., 2021). This selection allows us to assess translationese bias across constrained formats and realistic, open-ended interactions. Human-authored references are used as ground-truth targets, while negative samples (rejected responses) are generated via back-translation as described in § 2.

[Figure 4: three panels, (a) Belebele (Avg Δ = 80% ↓), (b) Aya (Avg Δ = 56% ↓), (c) XL-Sum (Avg Δ = 75% ↓). x-axis: Resource Tier (Low, Mid, High); y-axis: Bias Severity Sbias (↓); series: Base Model, Vanilla SFT, Vanilla IB, DIBJudge.]
Figure 4. Bias severity across resource tiers. Sbias (lower is better) on BELEBELE, AYA, and XL-SUM. DIBJUDGE reduces bias across all tiers, with average reductions of 80%, 56%, and 75%, and the strongest improvements in Low-Resource settings. Error bars show std over 3 runs; Avg ∆ is relative to Vanilla SFT.
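The Avg ∆ values reported for Figure 4 are stated to be relative to Vanilla SFT. The following is a minimal sketch of how such a relative reduction could be computed, assuming that Avg ∆ is the mean of the per-tier relative reductions (the text does not state the aggregation). The function names and the Sbias values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_reduction(baseline: float, treated: float) -> float:
    """Fractional reduction of `treated` relative to `baseline` (1.0 means bias fully removed)."""
    if baseline <= 0:
        raise ValueError("baseline bias severity must be positive")
    return (baseline - treated) / baseline


def average_reduction(baseline_by_tier: dict, treated_by_tier: dict) -> float:
    """Mean of per-tier relative reductions over tiers present in both inputs.

    Assumption: the average is taken over tiers; a reduction of the tier-averaged
    score would be an equally plausible reading and can give a different number.
    """
    tiers = sorted(set(baseline_by_tier) & set(treated_by_tier))
    if not tiers:
        raise ValueError("no common resource tiers")
    return mean(
        relative_reduction(baseline_by_tier[t], treated_by_tier[t]) for t in tiers
    )


# Hypothetical Sbias values per resource tier (illustrative only, NOT from the paper).
vanilla_sft = {"low": 0.20, "mid": 0.10, "high": 0.05}
debiased = {"low": 0.05, "mid": 0.02, "high": 0.01}

avg_delta = average_reduction(vanilla_sft, debiased)
print(f"Average relative reduction vs. baseline: {avg_delta:.1%}")
```

Averaging per-tier ratios weights each tier equally regardless of its absolute bias level, which is one reasonable choice when the goal is to compare low- and high-resource settings on the same footing.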
Chunk 17 · 1,978 chars
std over 3 runs; Avg ∆ is relative to Vanilla SFT.

allows us to assess translationese bias across constrained formats and realistic, open-ended interactions. Human-authored references are used as ground-truth targets, while negative samples (rejected responses) are generated via back-translation as described in § 2. To investigate the impact of data scarcity, we stratify languages into High-, Mid-, and Low-Resource tiers (n = 10 languages per tier). We benchmark DIBJUDGE against three baselines: the Base model, Vanilla SFT, and a Vanilla IB variant. We quantify efficacy using the Bias Severity metric (Sbias) defined in Equation 1. More details are provided in Appendix A.2.

Figure 4 demonstrates the efficacy of DIBJUDGE in mitigating translationese bias across diverse language resource levels. On the BELEBELE benchmark, DIBJUDGE achieves a drastic reduction in bias severity, reaching near-zero levels across the mid- and high-resource tiers. This trend extends to generative tasks such as AYA and XL-SUM, where we observe consistent bias suppression. Crucially, DIBJUDGE significantly reduces disparity across resource tiers; whereas Vanilla SFT retains marked bias in low-resource settings, our approach effectively dampens these spurious correlations. These findings confirm that DIBJUDGE targets the bias amplification that disproportionately affects underrepresented languages, rather than merely enhancing general instruction-following capabilities.

[Figure 5: x-axis: Bias severity (↓); y-axis: m-RewardBench accuracy (↑); curves: DIBJudge, Vanilla IB; markers: Qwen3-4B, GPT-4o, mR3-Qwen3-4B, Gemini-2.5-Flash; color bar: Bottleneck strength (log-scale).]
Figure 5. Bias–utility Pareto Frontier. Trade-off between Bias Severity (↓; x-axis) and m-RewardBench accuracy (↑; y-axis). Each point corresponds to a bottleneck strength β
Chunk 18 · 1,997 chars
[Figure 5: y-axis: m-RewardBench accuracy (↑); curves: DIBJudge, Vanilla IB; markers: Qwen3-4B, GPT-4o, mR3-Qwen3-4B, Gemini-2.5-Flash; color bar: Bottleneck strength (log-scale).]
Figure 5. Bias–utility Pareto Frontier. Trade-off between Bias Severity (↓; x-axis) and m-RewardBench accuracy (↑; y-axis). Each point corresponds to a bottleneck strength β (log-scaled, color-coded). The resulting Pareto frontiers are traced by DIBJUDGE (solid) and the VANILLA IB baseline (dashed). DIBJUDGE consistently achieves higher accuracy at comparable bias levels across β, yielding a uniformly superior bias–utility trade-off. Markers indicate representative SOTA models, which DIBJUDGE outperforms in terms of lower bias and higher accuracy.

RQ2: The Robustness-Utility Trade-off. We investigate the tension between robustness and utility by modulating the coefficient β of the compression term in Equation 3. Specifically, we aim to characterize the Pareto Frontier of this trade-off. Figure 5 illustrates this dynamic by plotting bias severity (Sbias) against m-RewardBench accuracy. DIBJUDGE achieves a consistently better Pareto frontier than the VANILLA IB baseline, indicating that it learns a more compact and robust representation without discarding key semantic features. Furthermore, DIBJUDGE strictly dominates its base model and strong proprietary baselines (e.g., GPT-4o, Gemini-2.5-Flash), consistently achieving higher accuracy across all fixed levels of bias severity. These findings confirm that the proposed method mitigates translationese bias without substantial utility degradation.

RQ3: Disentanglement of Latent Representations. We visualize the geometry of the learned representations using t-SNE (van der Maaten & Hinton, 2008), extracting bias (Zb) and robust (Zr) features from a held-out evaluation set comprising human and machine-translated texts. As shown in Figure 6, the latent spaces exhibit divergent topologies. The bias space (Fig. 6a) forms distinct clusters based on text origin, confirming that
Chunk 19 · 1,994 chars
-SNE (van der Maaten & Hinton, 2008), extracting bias (Zb) and robust (Zr) features from a held-out evaluation set comprising human and machine-translated texts. As shown in Figure 6, the latent spaces exhibit divergent topologies. The bias space (Fig. 6a) forms distinct clusters based on text origin, confirming that Zb encodes translationese artifacts. Conversely, the robust space (Fig. 6b) demonstrates substantial domain overlap. This phenomenon demonstrates that Zr achieves invariance to translationese artifacts, effectively disentangling them from the underlying semantic content.

[Figure 6: two t-SNE panels on Aya, Human vs. Machine. (a) Bias Representations (Zb), Source Separation: 1.65; (b) Robust Representations (Zr), Source Separation: 0.02.]
Figure 6. Visualization of Latent Representation Disentanglement. t-SNE projections of embeddings for Human (Blue) vs. Machine (Red) text. (a) Zb clearly separates domains. (b) Zr shows a mixed distribution, corroborating domain invariance.

Table 2. Zero-Shot Generalization to Unseen Biases. We evaluate performance on a held-out subset containing biases not encountered during training (Out-of-Distribution). DIBJUDGE achieves the lowest bias scores across both in-distribution (Translationese) and unseen heuristics (Length, Self-Preference).

Method | ID: Trans. Sbias ↓ | OOD (Unseen Biases): Length ρ ↓ | OOD (Unseen Biases): Self-Pref. Sbias ↓
Vanilla SFT | 0.247 | 0.553 | 0.314
Vanilla IB | 0.168 | 0.482 | 0.276
DIBJUDGE | 0.083 | 0.314 | 0.219

RQ4: Zero-Shot Generalization to Unseen Biases. We evaluate the generalization capability of DIBJUDGE against biases not encountered during training, specifically length (Saito et al., 2023) and self-preference (Wataoka et al., 2024), using a held-out subset of Skywork-Reward-Preference-80K (Liu et al., 2024a). Length bias is quantified via the Spearman rank correlation (ρ) between response length and predicted
Chunk 20 · 1,990 chars
IBJUDGE against biases not encountered during training, specifically length (Saito et al., 2023) and self-preference (Wataoka et al., 2024), using a held-out subset of Skywork-Reward-Preference-80K (Liu et al., 2024a). Length bias is quantified via the Spearman rank correlation (ρ) between response length and predicted ratings, while self-preference is evaluated using bias severity (Sbias) as defined in Eq. 1, applied to the judge's preference for the model’s own generations. As detailed in Table 2, DIBJUDGE demonstrates superior robustness compared to the Vanilla SFT and Vanilla IB baselines. Our method significantly reduces the correlation between response length and predicted ratings and minimizes self-preference bias. These results suggest that DIBJUDGE successfully learns to filter superficial heuristics (e.g., verbosity) rather than merely memorizing specific artifacts such as translationese.

Table 3. Ablation study of the DIB objectives. We report the bias score (Sbias, lower is better) and Accuracy (higher is better). The combination of all terms achieves the best trade-off.

Objectives enabled (columns: Compression, Bias, Disentangle) | Sbias (↓) | Acc. (↑)
✓ | 0.124 | 85.25
✓ | 0.150 | 86.60
✓ | 0.053 | 85.85
✓ ✓ | 0.091 | 88.20
✓ ✓ | 0.039 | 88.55
✓ ✓ | 0.035 | 89.10
✓ ✓ ✓ | 0.031 | 89.85

Table 4. Ablation study of proxy tasks. CLA indicates the cross-lingual alignment proxy task, and LPBC stands for the log-probability bin classification proxy task.

Configuration | None | CLA | LPBC | CLA + LPBC
Sbias (↓) | 0.421 | 0.312 | 0.279 | 0.147
Acc. (↑) | 87.12 | 87.86 | 88.43 | 89.18

RQ5: Impact of DIB Objective Components and Spurious Proxy Tasks. To assess the individual contributions of the components in the DIB objective (Eq. 3), we evaluate various combinations of the compression, bias-capture, and disentanglement objectives. As shown in Table 3, isolated objectives yield suboptimal results. While the disentanglement term is
Chunk 21 · 1,999 chars
udy of proxy tasks. CLA indicates the cross-lingual alignment proxy task, and LPBC stands for the log-probability bin classification proxy task.

Configuration | None | CLA | LPBC | CLA + LPBC
Sbias (↓) | 0.421 | 0.312 | 0.279 | 0.147
Acc. (↑) | 87.12 | 87.86 | 88.43 | 89.18

objectives yield suboptimal results. While the disentanglement term is particularly effective at reducing the bias score relative to the baseline, it remains insufficient to maximize predictive accuracy. The highest performance across both metrics is achieved by integrating all three terms, suggesting that the synergy between information-bottleneck-driven compression and latent-space disentanglement is critical for balancing fairness and model utility. Furthermore, we investigate the impact of the proposed spurious proxy tasks: Cross-Lingual Alignment (CLA) and Log-Probability Bin Classification (LPBC). Table 4 illustrates that omitting these tasks leads to a significant degradation in bias mitigation. The substantial improvement in Sbias when both tasks are employed confirms their effectiveness in capturing spurious correlations related to translationese. Detailed proxy task ablation studies are presented in Appendix F.1.

6. Related Work

LLM-as-a-Judge. The LLM-as-a-Judge paradigm marks a fundamental shift from traditional n-gram (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004)) and embedding-based metrics (e.g., BERTScore (Zhang et al., 2020b; Rei et al., 2020)) toward generative evaluation (Gu et al., 2024; Li et al., 2025). While early adoption relied on proprietary models like GPT-4 due to their high correlation with human judgment (Liu et al., 2023; Zheng et al., 2023), concerns regarding cost and transparency have catalyzed a transition to open-weight evaluators (Wang et al., 2024e;c;a; Kim et al., 2024a;b). Recently, this paradigm has further evolved from direct generative judgment to incorporating explicit reasoning steps to enhance reliability (Chen et al., 2025a; Guo et al., 2025; Chen et al., 2025b). However,
Chunk 22 · 1,989 chars
and transparency have catalyzed a transition to open-weight evaluators (Wang et al., 2024e;c;a; Kim et al., 2024a;b). Recently, this paradigm has further evolved from direct generative judgment to incorporating explicit reasoning steps to enhance reliability (Chen et al., 2025a; Guo et al., 2025; Chen et al., 2025b). However, despite these advancements, LLM judges remain susceptible to systematic biases (Ye et al., 2025; Wang et al., 2024b; Zheng et al., 2024), such as position bias (Shi et al., 2025; Ko et al., 2020), verbosity bias (Saito et al., 2023), and self-preference bias (Wang et al., 2024c). In contrast to prior work that primarily studies bias in English-centric settings, we investigate translationese bias in multilingual contexts and analyze the spurious correlations underlying it.

Multilingual Judges. Compared to the English context, multilingual LLM-as-a-Judge remains significantly underexplored. Initial efforts to bridge this gap, such as Hercule (Doddapaneni et al., 2025) and M-Prometheus (Pombal et al., 2025), rely heavily on fine-tuning with translated or synthetic instruction sets. More recently, approaches like mR3 (Anugraha et al., 2025) and Think-as-Locals (Zhang et al., 2025) have advanced the field by integrating reasoning capabilities, employing Chain-of-Thought (CoT) (Wei et al., 2022) distillation and reinforcement learning to enhance multilingual reward modeling. However, although these approaches achieve promising results, the robustness of the resulting evaluators remains unexamined. Crucially, existing frameworks fail to account for the systematic artifacts introduced by translation-based training data. To address this reliability gap, our work provides the first dedicated mitigation of translationese bias, resolving specific failures in cross-lingual evaluation that prior methodologies overlook.

Information Bottleneck in LLMs. Originally formulated to extract minimal sufficient statistics (Tishby et al., 2000), the Information Bottleneck (IB)
Chunk 23 · 1,999 chars
gap, our work provides the first dedicated mitigation of translationese bias, resolving specific failures in cross-lingual evaluation that prior methodologies overlook.

Information Bottleneck in LLMs. Originally formulated to extract minimal sufficient statistics (Tishby et al., 2000), the Information Bottleneck (IB) principle has recently emerged as a vital framework for analyzing and optimizing LLMs, spanning diverse objectives including enhancing interpretability by mapping hidden states to human-readable concepts (Sun et al., 2025; Li et al., 2023), optimizing CoT reasoning paths to be invariant to prompt nuances (Lei et al., 2025), and compressing contexts in Retrieval-Augmented Generation to filter noise (Zhu et al., 2024). In safety domains, methods such as IBProtector (Liu et al., 2024b) leverage IB to strip adversarial triggers. Diverging from these approaches, we present the first application of a disentangled IB designed to debias LLM judges.

7. Conclusion

In this study, we systematically investigated translationese bias in multilingual LLM-as-a-Judge frameworks and identified key spurious factors that undermine reliable evaluation, particularly in low-resource languages. Guided by these insights, we proposed DIBJUDGE, a disentangled information bottleneck–based fine-tuning framework that separates judgment-critical semantics from spurious translationese attributes. Extensive experiments across multilingual reward modeling benchmarks and dedicated bias evaluations demonstrate that DIBJUDGE substantially mitigates translationese bias while maintaining strong utility.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning, with a focus on improving the robustness of multilingual evaluation using large language models. The methods proposed in this work are intended for model evaluation and
Chunk 24 · 1,998 chars
Impact Statement

This paper presents work whose goal is to advance the field of machine learning, with a focus on improving the robustness of multilingual evaluation using large language models. The methods proposed in this work are intended for model evaluation and benchmarking rather than direct user-facing applications. While improved evaluation may have downstream benefits for the development of more reliable and inclusive multilingual systems, we do not foresee significant or immediate negative societal impacts arising from this work.

References

Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
Anugraha, D., Hung, S.-Y., Tang, Z., Lee, A. E.-S., Wijaya, D. T., and Winata, G. I. mr3: Multilingual rubric-agnostic reward reasoning models. arXiv preprint arXiv:2510.01146, 2025.
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla, S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer, L., and Khabsa, M. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 749–775, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.44.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International conference on machine learning, pp. 531–540. PMLR, 2018.
Bogavelli, T., Bamgbose, O., Melançon, G. G., Riols, F., and Sharma, R. Evaluating
Chunk 25 · 1,999 chars
53/v1/2024.acl-long.44.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International conference on machine learning, pp. 531–540. PMLR, 2018.
Bogavelli, T., Bamgbose, O., Melançon, G. G., Riols, F., and Sharma, R. Evaluating robustness of large language models in enterprise applications: Benchmarks for perturbation consistency across formats and languages. arXiv preprint arXiv:2601.06341, 2026.
Chen, G. H., Chen, S., Liu, Z., Jiang, F., and Wang, B. Humans or LLMs as the judge? a study on judgement bias. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8301–8327, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.474.
Chen, N., Hu, Z., Zou, Q., Wu, J., Wang, Q., Hooi, B., and He, B. Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025a.
Chen, X., Li, G., Wang, Z., Jin, B., Qian, C., Wang, Y., Wang, H., Zhang, Y., Zhang, D., Zhang, T., et al. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025b.
Chen, Y., Yao, Y., Zhang, Y., Shen, B., Liu, G., and Liu, S. Safety mirage: How spurious correlations undermine vlm safety fine-tuning. arXiv preprint arXiv:2503.11832, 2025c.
Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L. Club: A contrastive log-ratio upper bound of mutual information. In International conference on machine learning, pp. 1779–1788. PMLR, 2020.
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
Costa-Jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi,
Chunk 26 · 1,996 chars
Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
Costa-Jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022.
Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.
Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.
Doddapaneni, S., Khan, M. S. U. R., Venkatesh, D., Dabre, R., Kunchukuttan, A., and Khapra, M. M. Cross-lingual auto evaluation for assessing multilingual LLMs. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 29297–29329, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.1419.
Fu, X. and Liu, W. How reliable is multilingual llm-as-a-judge? arXiv preprint arXiv:2505.12201, 2025.
Gao, J., Chen, C., Jia, Y., Gong, X., Lam, K.-Y., and Wang, Q. Evaluating and mitigating llm-as-a-judge bias in communication systems. arXiv preprint arXiv:2510.12462, 2025.
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075–2129, December 2005. ISSN 1532-4435.
Gu, J., Jiang,
Chunk 27 · 1,993 chars
man, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075–2129, December 2005. ISSN 1532-4435.
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on llm-as-a-judge. The Innovation, 2024.
Gui, S. and Ji, S. Mitigating spurious correlations in llms via causality-aware post-training. arXiv preprint arXiv:2506.09433, 2025.
Guo, J., Chi, Z., Dong, L., Dong, Q., Wu, X., Huang, S., and Wei, F. Reward reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
Gureja, S., Miranda, L. J. V., Islam, S. B., Maheshwary, R., Sharma, D., Winata, G. T., Lambert, N., Ruder, S., Hooker, S., and Fadaee, M. M-RewardBench: Evaluating reward models in multilingual settings. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 43–58, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.3.
Hada, R., Gumma, V., de Wynter, A., Diddee, H., Ahmed, M., Choudhury, M., Bali, K., and Sitaram, S. Are large language model-based evaluators the solution to scaling up multilingual evaluation? In Graham, Y. and Purver, M. (eds.), Findings of the Association for Computational Linguistics: EACL 2024, pp. 1051–1070, St. Julian’s, Malta, March 2024. Association for Computational Linguistics.
Hasan, T., Bhattacharjee, A., Islam, M. S., Mubasshir, K., Li, Y.-F., Kang, Y.-B., Rahman, M. S., and Shahriyar, R. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Findings of the Association for Computational Linguistics:
Chunk 28 · 1,989 chars
Linguistics.
Hasan, T., Bhattacharjee, A., Islam, M. S., Mubasshir, K., Li, Y.-F., Kang, Y.-B., Rahman, M. S., and Shahriyar, R. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4693–4703, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.413.
Hron, J., Bahri, Y., Sohl-Dickstein, J., and Novak, R. Infinite attention: Nngp and ntk for deep attention networks. In International Conference on Machine Learning, pp. 4376–4386. PMLR, 2020.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
Huang, K., Mo, F., Zhang, X., Li, H., Li, Y., Zhang, Y., Yi, W., Mao, Y., Liu, J., Xu, Y., et al. A survey on large language models with multilingualism: Recent advances and new frontiers. arXiv preprint arXiv:2405.10936, 2024.
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(00)00026-5.
Joshi, P., Santy, S., Budhiraja, A., Bali, K., and Choudhury, M. The state and fate of linguistic diversity and inclusion in the NLP world. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6282–6293, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.560.
Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., and Seo, M. Prometheus: Inducing fine-grained
Chunk 29 · 1,992 chars
Meeting of the Association for Computational Linguistics, pp. 6282–6293, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.560.
Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., and Seo, M. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024a.
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus 2: An open source language model specialized in evaluating other language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4334–4353, Miami, Florida, USA, November 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.248.
Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
Ko, M., Lee, J., Kim, H., Kim, G., and Kang, J. Look at the first sentence: Position bias in question answering. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1109–1121, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.84.
Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., et al. Quality at a glance: An audit of web-crawled multilingual
Chunk 30 · 1,973 chars
P), pp. 1109–1121, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.84.
Kreutzer, J., Caswell, I., Wang, L., Wahab, A., Van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., et al. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72, 2022.
Lai, H., Liu, X., Gao, J., Cheng, J., Qi, Z., Xu, Y., Yao, S., Zhang, D., Du, J., Hou, Z., Lv, X., Huang, M., Dong, Y., and Tang, J. A survey of post-training scaling in large language models. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2771–2791, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.140.
Lambert, N., Pyatkin, V., Morrison, J., Miranda, L., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., Smith, N. A., and Hajishirzi, H. Rewardbench: Evaluating reward models for language modeling, 2024.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
Lei, S., Cheng, Z., Jia, K., and Tao, D. Revisiting llm reasoning via information bottleneck. arXiv preprint arXiv:2507.18391, 2025.
Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., Shu, K., Cheng, L., and Liu, H. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 2757–2791, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi:
Chunk 31 · 1,985 chars
ities and challenges of LLM-as-a-judge. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 2757–2791, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.138.
Li, Q., Wu, Z., Kong, L., and Bi, W. Explanation regeneration via information bottleneck. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 12081–12102, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.765.
Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
Liu, C. Y., Zeng, L., Liu, J., Yan, R., He, J., Wang, C., Yan, S., Liu, Y., and Zhou, Y. Skywork-reward: Bag of tricks for reward modeling in llms, 2024a.
Liu, X., Sanchez, P., Thermos, S., O’Neil, A. Q., and Tsaftaris, S. A. Learning disentangled representations in the imaging domain. Medical Image Analysis, 80:102516, 2022.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. G-eval: NLG evaluation using gpt-4 with better human alignment. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153.
Liu, Z., Wang, Z., Xu, L., Wang, J., Song, L., Wang, T., Chen, C., Cheng, W., and Bian, J. Protecting your LLMs with information bottleneck. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Le Scao, T., Bari, M. S., Shen, S., Yong, Z. X., Schoelkopf, H., Tang, X., Radev, D., Aji, A.
Chunk 32 · 1,996 chars
g, W., and Bian, J. Protecting your LLMs with information bottleneck. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Le Scao, T., Bari, M. S., Shen, S., Yong, Z. X., Schoelkopf, H., Tang, X., Radev, D., Aji, A. F., Almubarak, K., Albanie, S., Alyafeai, Z., Webson, A., Raff, E., and Raffel, C. Crosslingual generalization through multitask finetuning. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15991–16111, Toronto, Canada, July 2023. Association for Computational Lin- guistics. doi: 10.18653/v1/2023.acl-long.891. Padarha, S., Hale, S. A., Mahdi, A., Semenova, E., and Vidgen, B. Evaluating LLM-as-a-judge under multi- lingual, multimodal and multi-domain constraints. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Isabelle, P., Charniak, E., and Lin, D. (eds.), Proceed- ings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computa- tional Linguistics. doi: 10.3115/1073083.1073135. 11 -- 11 of 35 -- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck Pombal, J., Yoon, D., Fernandes, P., Wu, I., Kim, S., Rei, R., Neubig, G., and Martins, A. M-prometheus: A suite of open multilingual LLM judges. In Second Conference on Language Modeling, 2025. Qin, L., Chen, Q., Zhou, Y., Chen, Z., Li, Y., Liao, L., Li, M., Che, W., and Yu, P. S. A survey of multilingual large language models. Patterns, 6(1):101118, 2025. ISSN 2666-3899. doi: https://doi.org/10.1016/j.patter.2024. 101118. Qwen, :, Yang, A., Yang, B.,
Chunk 33 · 1,985 chars
judges. In Second Conference on Language Modeling, 2025. Qin, L., Chen, Q., Zhou, Y., Chen, Z., Li, Y., Liao, L., Li, M., Che, W., and Yu, P. S. A survey of multilingual large language models. Patterns, 6(1):101118, 2025. ISSN 2666-3899. doi: https://doi.org/10.1016/j.patter.2024. 101118. Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report, 2025. Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. Deep- speed: System optimizations enable training deep learn- ing models with over 100 billion parameters. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020. Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. COMET: A neural framework for MT evaluation. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online, November 2020. Association for Computational Linguis- tics. doi: 10.18653/v1/2020.emnlp-main.213. Saito, K., Wachi, A., Wataoka, K., and Akimoto, Y. Ver- bosity bias in preference labeling by large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., and Vosoughi, S. Judging the judges: A systematic study of position bias in LLM-as-a-judge. In Inui, K., Sakti, S., Wang, H., Wong, D. F., Bhattacharyya, P., Banerjee, B., Ekbal, A., Chakraborty, T., and Singh, D. P. (eds.), Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia- Pacific Chapter of the Association for
Chunk 34 · 1,992 chars
n bias in LLM-as-a-judge. In Inui, K., Sakti, S., Wang, H., Wong, D. F., Bhattacharyya, P., Banerjee, B., Ekbal, A., Chakraborty, T., and Singh, D. P. (eds.), Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia- Pacific Chapter of the Association for Computational Linguistics, pp. 292–314, Mumbai, India, December 2025. The Asian Federation of Natural Language Process- ing and The Association for Computational Linguistics. ISBN 979-8-89176-298-5. Shuieh, J., Singhal, P., Shanker, A., Heyer, J., Pu, G., and Denton, S. M. ASSESSING ROBUSTNESS TO SPURIOUS CORRELATIONS IN POST-TRAINING LANGUAGE MODELS. In Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions, 2025. Singh, S., Vargus, F., D’souza, D., Karlsson, B. F., Mahendi- ran, A., Ko, W.-Y., Shandilya, H., Patel, J., Mataciunas, D., O’Mahony, L., Zhang, M., Hettiarachchi, R., Wilson, J., Machado, M., Moura, L., Krzemi ´nski, D., Fadaei, H., Ergun, I., Okoh, I., Alaagib, A., Mudannayake, O., Alyafeai, Z., Chien, V., Ruder, S., Guthikonda, S., Alghamdi, E., Gehrmann, S., Muennighoff, N., Bartolo, M., Kreutzer, J., ¨Ust ¨un, A., Fadaee, M., and Hooker, S. Aya dataset: An open-access collection for multilingual instruction tuning. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11521–11567, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.620. Son, G., Yoon, D., Suk, J., Aula-Blasco, J., Aslan, M., Kim, V. T., Islam, S. B., Prats-Cristi `a, J., Tormo-Ba ˜nuelos, L., and Kim, S. Mm-eval: A multilingual meta-evaluation benchmark for llm-as-a-judge and reward models. arXiv preprint arXiv:2410.17578, 2024. Sun, C.-E., Oikarinen, T., Ustun, B., and Weng, T.-W. Con- cept bottleneck large language models. In The Thirteenth International Conference on
Chunk 35 · 1,996 chars
Cristi `a, J., Tormo-Ba ˜nuelos, L., and Kim, S. Mm-eval: A multilingual meta-evaluation benchmark for llm-as-a-judge and reward models. arXiv preprint arXiv:2410.17578, 2024. Sun, C.-E., Oikarinen, T., Ustun, B., and Weng, T.-W. Con- cept bottleneck large language models. In The Thirteenth International Conference on Learning Representations, 2025. Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ram ´e, A., Rivi `ere, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000. Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86): 2579–2605, 2008. Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang, T. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10582–10592, Miami, Florida, USA, November 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.620. 12 -- 12 of 35 -- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Kong, L., Liu, Q., Liu, T., and Sui, Z. Large language models are not fair evaluators. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440–9450, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.511. Wang, Q., Lou, Z., Tang, Z., Chen, N., Zhao, X., Zhang, W., Song, D., and He, B. Assessing
Chunk 36 · 1,994 chars
of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9440–9450, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.511. Wang, Q., Lou, Z., Tang, Z., Chen, N., Zhao, X., Zhang, W., Song, D., and He, B. Assessing judging bias in large reasoning models: An empirical study. arXiv preprint arXiv:2504.09946, 2025a. Wang, T., Kulikov, I., Golovneva, O., Yu, P., Yuan, W., Dwivedi-Yu, J., Pang, R. Y., Fazel-Zarandi, M., Weston, J., and Li, X. Self-taught evaluators. arXiv preprint arXiv:2408.02666, 2024c. Wang, X., Chen, H., Tang, S., Wu, Z., and Zhu, W. Disentangled representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12): 9677–9696, 2024d. Wang, Y., Yu, Z., Yao, W., Zeng, Z., Yang, L., Wang, C., Chen, H., Jiang, C., Xie, R., Wang, J., Xie, X., Ye, W., Zhang, S., and Zhang, Y. PandaLM: An automatic evalua- tion benchmark for LLM instruction tuning optimization. In The Twelfth International Conference on Learning Representations, 2024e. Wang, Z., Zeng, J., Delalleau, O., Egert, D., Evans, E., Shin, H.-C., Soares, F., Dong, Y., and Kuchaiev, O. HelpSteer3: Human-annotated feedback and edit data to empower inference-time scaling in open-ended general- domain tasks. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 25640–25662, Vienna, Austria, July 2025b. Association for Compu- tational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.1246. Wataoka, K., Takahashi, T., and Ri, R. Self-preference bias in LLM-as-a-judge. In Neurips Safe Generative AI Workshop 2024, 2024. Weber, M., Fu, D. Y., Anthony, Q., Oren, Y., Adams, S., Alexandrov, A., Lyu, X., Nguyen, H., Yao, X., Adams, V., Athiwaratkun, B., Chalamala, R., Chen, K., Ryabinin, M., Dao, T., Liang, P., R ´e, C., Rish, I., and
Chunk 37 · 1,996 chars
T., and Ri, R. Self-preference bias in LLM-as-a-judge. In Neurips Safe Generative AI Workshop 2024, 2024. Weber, M., Fu, D. Y., Anthony, Q., Oren, Y., Adams, S., Alexandrov, A., Lyu, X., Nguyen, H., Yao, X., Adams, V., Athiwaratkun, B., Chalamala, R., Chen, K., Ryabinin, M., Dao, T., Liang, P., R ´e, C., Rish, I., and Zhang, C. Redpajama: an open dataset for training large language models. NeurIPS Datasets and Benchmarks Track, 2024. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. Ye, J., Wang, Y., Huang, Y., Chen, D., Zhang, Q., Moniz, N., Gao, T., Geyer, W., Huang, C., Chen, P.-Y., Chawla, N. V., and Zhang, X. Justice or prejudice? quantifying biases in LLM-as-a-judge. In The Thirteenth International Conference on Learning Representations, 2025. Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learn- ing, pp. 12310–12320. PMLR, 2021. Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improv- ing massively multilingual neural machine translation and zero-shot translation. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Com- putational Linguistics, pp. 1628–1639, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.148. Zhang, B., Xie, H., Gao, Z., and Wang, Y. Choose what you need: Disentangled representation learning for scene text recognition removal and editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 28358–28368, 2024. Zhang, H.,
Chunk 38 · 1,995 chars
l Linguistics. doi: 10.18653/v1/2020.acl-main.148. Zhang, B., Xie, H., Gao, Z., and Wang, Y. Choose what you need: Disentangled representation learning for scene text recognition removal and editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 28358–28368, 2024. Zhang, H., Chen, K., Bai, X., Xiang, Y., and Zhang, M. Eval- uating and improving cultural awareness of reward mod- els for llm alignment. arXiv preprint arXiv:2509.21798, 2025. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020b. Zheng, C., Zhou, H., Meng, F., Zhou, J., and Huang, M. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2024. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. Zhu, K., Feng, X., Du, X., Gu, Y., Yu, W., Wang, H., Chen, Q., Chu, Z., Chen, J., and Qin, B. An information bottle- neck perspective for effective noise filtering on retrieval- augmented generation. In Ku, L.-W., Martins, A., and 13 -- 13 of 35 -- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1044–1069, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.59. 14 -- 14 of 35 -- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck Appendix Contents Section Contents Appendix A Translationese Bias Evaluation Suite. Language
Chunk 39 · 1,988 chars
ok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.59.
Appendix Contents
| Section | Contents |
| --- | --- |
| Appendix A | Translationese Bias Evaluation Suite. Language taxonomy, dataset construction, and qualitative examples. |
| Appendix B | Theoretical Analysis. Information bottleneck bounds, derivation of the compression loss function, identifiability discussion, and disentanglement proofs. |
| Appendix C | Experimental Details. Benchmarks, evaluation metrics, model configurations, and training protocols. |
| Appendix D | Comprehensive Multilingual Results. Full results on RewardBench, M-RewardBench, and MM-Eval. |
| Appendix E | Translationese Bias Results. Detailed quantitative translationese bias evaluation results across language pairs. |
| Appendix F | Additional Experiments and Ablations. Spurious proxy task analysis, sensitivity analysis, method robustness checks, and information leakage validation of the disentangled representations. |
A. Bias Evaluation Suite
A.1. Language Selection and Taxonomy
We adopt the language taxonomy proposed by Joshi et al. (2020) to categorize languages by resource availability. Specifically, we partition the selected languages into three tiers according to their assigned resource classes: High-resource (Classes 4 and 5), Mid-resource (Class 3), and Low-resource (Classes 0, 1, and 2). Our evaluation spans three primary datasets: Aya (Singh et al., 2024), Belebele (Bandarkar et al., 2024), and XL-Sum (Hasan et al., 2021). The complete distribution of evaluated languages across these resource tiers is summarized in Table 5.
Table 5. Classification of evaluated languages across datasets based on the Joshi et al. (2020) taxonomy.
| Dataset | High-Resource | Mid-Resource | Low-Resource |
| --- | --- | --- | --- |
| Aya | Basque, English, Finnish, Hindi, Japanese, Portuguese, Simp. Chinese, Spanish, Arabic, Vietnamese | Bengali, Cebuano,
Chunk 40 · 1,992 chars
these resource tiers is summarized in Table 5.
Table 5. Classification of evaluated languages across datasets based on the Joshi et al. (2020) taxonomy.
| Dataset | High-Resource | Mid-Resource | Low-Resource |
| --- | --- | --- | --- |
| Aya | Basque, English, Finnish, Hindi, Japanese, Portuguese, Simp. Chinese, Spanish, Arabic, Vietnamese | Bengali, Cebuano, Filipino, Indonesian, Lithuanian, Malay, Tamil, Thai, Ukrainian, Urdu | Amharic, Irish, Kyrgyz, Nepali, Malagasy, Sinhala, S. Pashto, Telugu, Yoruba, Zulu |
| Belebele | Arabic, English, Finnish, Hindi, Japanese, Korean, Russian, Turkish, Vietnamese, Simp. Chinese | Bengali, Greek, Hebrew, Georgian, Kazakh, Tamil, Thai, Ukrainian, Urdu, Malay | Amharic, Tibetan, Guarani, Kannada, Khmer, Kyrgyz, Burmese, Punjabi, Pashto, Zulu |
| XL-Sum | Arabic, Simp. Chinese, English, French, Hindi, Japanese, Korean, Russian, Turkish, Vietnamese | Azerbaijani, Bengali, Indonesian, Tamil, Thai, Ukrainian, Urdu, Uzbek | Amharic, Burmese, Hausa, Kyrgyz, Marathi, Nepali, Pashto, Sinhala, Telugu, Welsh |
A.2. Test Set Construction
We formulate a pairwise preference task where an LLM evaluator compares two candidate responses for a given query: (i) Chosen ($x_H$): The original human-authored or high-quality translated reference; and (ii) Rejected ($x_M$): A machine-generated counterpart produced via back-translation using NLLB-200-3.3B (Costa-Jussà et al., 2022)² to inject subtle translationese artifacts. To isolate translationese as the primary variable and mitigate length-based confounding, we enforce a length constraint where the token count differential between $x_H$ and $x_M$ is within ±5%. We further ensure evaluation robustness through a position-swapping protocol, retaining only consistent judgments where the model's preference remains invariant to the presentation order. Beyond the baseline comparison, we introduce two distinct experimental configurations (summarized with examples in the following subsection A.3):
² https://huggingface.co/facebook/nllb-200-3.3B
Chunk 41 · 1,997 chars
aining only consistent judgments where the model’s preference remains invariant to the presentation order. Beyond the baseline comparison, we introduce two distinct experimental configurations (summarized with examples in the following subsection A.3): 2https://huggingface.co/facebook/nllb-200-3.3B 15 -- 15 of 35 -- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck • Parallel: Both candidates are semantically equivalent and factually correct. This isolates the model’s stylistic preference for human vs. machine-translated syntax. • Perturbed: We introduce minor, controlled edits to the machine-generated candidate (xM ) to alter its factual correctness or logical coherence, testing whether the judge prioritizes correctness over stylistic fluency. A.3. Detailed Dataset Examples In this section, we provide the full text for the construction of our test sets across the three primary datasets: Aya, Belebele, and XL-Sum. For each dataset, we present both the Parallel setting (testing stylistic preference/bias) and the Perturbed setting (testing sensitivity to factual correctness). Example: Aya Dataset User Question: 问题:油的营养价值如何? Setting 1: Parallel (Stylistic Variation) Both answers are factually correct; B uses back-translated ”translationese.” [Assistant A (Chosen xH )] 回答:油是脂肪的重要来源,提供能量和必需脂肪酸,但摄入应适量控制,避免摄入过多的脂肪。 [Assistant B (Rejected xM )] 回答:油脂乃脂肪主来源,供应热能与必备脂肪酸,然食量宜适当限制,防止摄取过多的脂质。 Setting 2: Perturbed (Factual Error) Assistant B is modified to suggest an incorrect health outcome. [Assistant A (Chosen xH )] 回答:油是脂肪的重要来源,提供能量和必需脂肪酸,但摄入应适量控制,避免摄入过多的脂肪。 [Assistant B (Rejected xM )] 回答:油是脂肪的重要来源,提供能量和必需脂肪酸,但摄入应适量控制,以促进脂肪摄入。 Example: Belebele Dataset User Question (System Prompt): Create an example of multiple-choice reading test in Chinese (Simplified). You may choose any topic you want... Output only the test (passage, questions, answer choices, and correct answer). Setting 1: Parallel [Assistant A (Chosen xH )] ###
Chunk 42 · 1,934 chars
肪酸,但摄入应适量控制,以促进脂肪摄入。 Example: Belebele Dataset User Question (System Prompt): Create an example of multiple-choice reading test in Chinese (Simplified). You may choose any topic you want... Output only the test (passage, questions, answer choices, and correct answer). Setting 1: Parallel [Assistant A (Chosen xH )] ### Passage: 乘坐头等舱或商务舱最明显的方式是... 航空公司非常清楚,有一些核心乘客愿花高价... ### Question: 根据这段文字,预订某些票价舱位时,搜索什么是浪费时间? ### Answer Choices: (1) 直达航班(2) 商务舱(3) 折扣(4) 头等舱 ### Correct Answer: (3) 折扣 16 -- 16 of 35 -- Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck [Assistant B (Rejected xM )] ### Passage: 想体验头等或商务舱最直接的途径... 航司深知,部分核心客户乐意支付高昂费用... ### Question: 依照本文内容,在订特定票价舱位时,查找何事乃浪费光阴? ### Answer Choices: (1) 不停航程(2) 公务舱(3) 优惠(4) 首等舱 ### Correct Answer: (3) 优惠 Setting 2: Perturbed Assistant B’s ”Correct Answer” index is modified to be factually wrong. [Assistant B (Rejected xM )] ... ### Answer Choices: (1) 不停航程(2) 公务舱(3) 优惠(4) 首等舱 ### Correct Answer: (4) 首等舱 Example: XL-Sum Dataset User Question (Summarization Task): Generate a concise, coherent abstractive summary in Chinese Simplified... Do not include information not present in the source text. Source Text: [Long article regarding the Brazilian ”Operation Weak Flesh” meat scandal involving President Michel Temer, JBS, and BRF...] Setting 1: Parallel (Stylistic Variation) Both summaries accurately reflect that countries suspended imports for several years. [Assistant A (Chosen xH )] 近期巴西”问题肉”丑闻揭露巴西的一些公司已经数年销售不安全肉类产品。包括中国在内的一些国家和组 织已经叫停巴西的进口肉。 [Assistant B (Rejected xM )] 最近巴西“劣质肉”风波披露,部分巴西公司长达数年贩卖问题肉品。中国等许多国家与机构,均已暂停来 自巴西的肉类进口。 Setting 2: Perturbed (Factual Hallucination) Assistant B is modified to falsely state that restrictions were ”relaxed” (放宽限制) and that the scandal ”just started” (刚开始). [Assistant A (Chosen xH )] 近期巴西”问题肉”丑闻揭露巴西的一些公司已经数年销售不安全肉类产品。包括中国在内的一些国家和组 织已经叫停巴西的进口肉。 [Assistant B (Rejected xM
Chunk 43 · 1,993 chars
达数年贩卖问题肉品。中国等许多国家与机构,均已暂停来自巴西的肉类进口。
Setting 2: Perturbed (Factual Hallucination)
Assistant B is modified to falsely state that restrictions were "relaxed" (放宽限制) and that the scandal "just started" (刚开始).
[Assistant A (Chosen $x_H$)]
近期巴西”问题肉”丑闻揭露巴西的一些公司已经数年销售不安全肉类产品。包括中国在内的一些国家和组织已经叫停巴西的进口肉。
[Assistant B (Rejected $x_M$)]
近期巴西”问题肉”丑闻揭露巴西的一些公司刚开始销售不安全肉类产品。包括中国在内的一些国家和组织已经放宽限制巴西的进口肉。
B. Theory Supplementary
B.1. Proof of the Upper Bound on $I(X; Z_r)$
Proof. By definition, the mutual information $I(X; Z_r)$ is the KL divergence between the joint distribution $q_\phi(X, Z_r) = p(x)\, q_\phi(z_r \mid x)$ and the product of the marginals $p(x)\, q_\phi(z_r)$, where $q_\phi(z_r) = \int q_\phi(z_r \mid x)\, p(x)\, dx$ is the aggregate posterior (marginal) distribution. This is expressed as:
\[
I(X; Z_r) = \mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r \mid x)}{q_\phi(z_r)}\, dz_r\right]. \tag{9}
\]
To derive the upper bound, we introduce an arbitrary fixed prior $p(z_r)$. We multiply and divide the argument of the logarithm by this prior $p(z_r)$:
\begin{align}
I(X; Z_r) &= \mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \log\left(\frac{q_\phi(z_r \mid x)}{q_\phi(z_r)} \cdot \frac{p(z_r)}{p(z_r)}\right) dz_r\right] \notag \\
&= \mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \left(\log \frac{q_\phi(z_r \mid x)}{p(z_r)} - \log \frac{q_\phi(z_r)}{p(z_r)}\right) dz_r\right]. \tag{10}
\end{align}
Using the linearity of the expectation, we separate the integral into two terms:
\[
I(X; Z_r) = \mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r \mid x)}{p(z_r)}\, dz_r\right] - \mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r\right]. \tag{11}
\]
The first term is exactly the expected KL divergence between the posterior and the prior. For the second term, we observe that $\log \frac{q_\phi(z_r)}{p(z_r)}$ does not depend on $x$ directly, other than through the integration of the joint density. We can simplify the expectation over $x$:
\[
\mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r\right]
= \int \left[\int p(x)\, q_\phi(z_r \mid x)\, dx\right] \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r
= \int q_\phi(z_r) \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r
= D_{\mathrm{KL}}\big(q_\phi(Z_r) \,\|\, p(Z_r)\big). \tag{12}
\]
Substituting this back into the expression for mutual information yields the following decomposition: $I(X; Z_r) = \mathbb{E}_{x \sim p(x)}$
Chunk 44 · 1,989 chars
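We can simplify the expectation over $x$:
\[
\mathbb{E}_{x \sim p(x)}\left[\int q_\phi(z_r \mid x) \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r\right]
= \int \left[\int p(x)\, q_\phi(z_r \mid x)\, dx\right] \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r
= \int q_\phi(z_r) \log \frac{q_\phi(z_r)}{p(z_r)}\, dz_r
= D_{\mathrm{KL}}\big(q_\phi(Z_r) \,\|\, p(Z_r)\big). \tag{12}
\]
Substituting this back into the expression for mutual information yields the following decomposition:
\[
I(X; Z_r) = \mathbb{E}_{x \sim p(x)}\big[D_{\mathrm{KL}}\big(q_\phi(Z_r \mid x) \,\|\, p(Z_r)\big)\big] - D_{\mathrm{KL}}\big(q_\phi(Z_r) \,\|\, p(Z_r)\big). \tag{13}
\]
Since the KL divergence is non-negative (Gibbs' inequality), i.e., $D_{\mathrm{KL}}\big(q_\phi(Z_r) \,\|\, p(Z_r)\big) \ge 0$, it follows that:
\[
I(X; Z_r) \le \mathbb{E}_{x \sim p(x)}\big[D_{\mathrm{KL}}\big(q_\phi(Z_r \mid x) \,\|\, p(Z_r)\big)\big]. \tag{14}
\]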
This completes the proof.
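The decomposition in Eq. (13) and the bound in Eq. (14) can be checked numerically. The sketch below is illustrative only: it assumes a discrete latent (sums instead of integrals), and the toy distributions `p_x`, `encoder`, and `prior` are made-up values rather than anything from the paper.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(p_x, q_z_given_x):
    """I(X; Z) for a discrete encoder q(z|x); also returns the aggregate posterior q(z)."""
    n_z = len(q_z_given_x[0])
    q_z = [sum(px * row[z] for px, row in zip(p_x, q_z_given_x)) for z in range(n_z)]
    mi = sum(px * kl(row, q_z) for px, row in zip(p_x, q_z_given_x))
    return mi, q_z

def expected_kl_to_prior(p_x, q_z_given_x, prior):
    """Right-hand side of Eq. (14): E_x[ D_KL(q(z|x) || p(z)) ]."""
    return sum(px * kl(row, prior) for px, row in zip(p_x, q_z_given_x))

# Toy setup (hypothetical values chosen for illustration).
p_x = [0.5, 0.3, 0.2]
encoder = [[0.7, 0.2, 0.1],
           [0.1, 0.8, 0.1],
           [0.3, 0.3, 0.4]]
prior = [1 / 3, 1 / 3, 1 / 3]

mi, q_z = mutual_information(p_x, encoder)
bound = expected_kl_to_prior(p_x, encoder, prior)
gap = kl(q_z, prior)  # D_KL(q(z) || p(z)), the slack term in Eq. (13)

print(f"I(X;Z) = {mi:.6f}, bound = {bound:.6f}, gap = {gap:.6f}")
```

In this toy setting the bound exceeds the mutual information by exactly the gap term, and it becomes tight when the prior equals the aggregate posterior.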
B.2. Detailed Derivation of $\mathcal{L}_{\mathrm{compress}}$
We derive the analytic form of the compression regularizer used in Eq. (4). At each time step $t \in \{1, \ldots, T\}$, we assume a diagonal-covariance Gaussian variational posterior
\[
q_\phi(z_{r,t} \mid x) = \mathcal{N}\big(\mu_t, \operatorname{diag}(\sigma_t^2)\big), \tag{15}
\]
and a standard Gaussian prior
\[
p(z_{r,t}) = \mathcal{N}(0, I), \tag{16}
\]
where $\mu_t \in \mathbb{R}^d$ and $\sigma_t^2 \in \mathbb{R}^d_{>0}$.
We regularize the information capacity by minimizing the average KL divergence across the sequence:
\[
\mathcal{L}_{\mathrm{compress}} = \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(q_\phi(z_{r,t} \mid x) \,\|\, p(z_{r,t})\big). \tag{17}
\]
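Thus, it suffices to derive a closed form for $D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big)$.
For $q = \mathcal{N}(\mu_q, \Sigma_q)$ and $p = \mathcal{N}(\mu_p, \Sigma_p)$ in $\mathbb{R}^d$, the KL divergence admits the well-known closed form:
\[
D_{\mathrm{KL}}(q \,\|\, p) = \frac{1}{2}\left[\log \frac{|\Sigma_p|}{|\Sigma_q|} - d + \operatorname{tr}\big(\Sigma_p^{-1} \Sigma_q\big) + (\mu_p - \mu_q)^{\top} \Sigma_p^{-1} (\mu_p - \mu_q)\right]. \tag{18}
\]
In our case, $\mu_q = \mu_t$, $\Sigma_q = \operatorname{diag}(\sigma_t^2)$, $\mu_p = 0$, and $\Sigma_p = I$.
Since $\Sigma_p = I$, we have $|\Sigma_p| = 1$ and $\Sigma_p^{-1} = I$. Plugging into Eq. (18) yields
\[
D_{\mathrm{KL}}(q \,\|\, p) = \frac{1}{2}\left[\log \frac{1}{|\Sigma_q|} - d + \operatorname{tr}(\Sigma_q) + \mu_t^{\top} \mu_t\right]. \tag{19}
\]
Because $\Sigma_q = \operatorname{diag}(\sigma_t^2)$ is diagonal,
\begin{align}
|\Sigma_q| &= \prod_{j=1}^{d} \sigma_{t,j}^2, \tag{20} \\
\log |\Sigma_q| &= \sum_{j=1}^{d} \log \sigma_{t,j}^2, \tag{21} \\
\operatorname{tr}(\Sigma_q) &= \sum_{j=1}^{d} \sigma_{t,j}^2, \tag{22} \\
\mu_t^{\top} \mu_t &= \sum_{j=1}^{d} \mu_{t,j}^2. \tag{23}
\end{align}
Substituting these into Eq. (19), we obtain
\[
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \left(\sigma_{t,j}^2 + \mu_{t,j}^2 - 1 - \log \sigma_{t,j}^2\right). \tag{24}
\]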
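The closed form in Eq. (24) can be sanity-checked against the general expression in Eq. (18). The following is a minimal sketch under stated assumptions: it restricts Eq. (18) to diagonal covariances (so determinants are products and traces are sums), and the encoder outputs `mus` and `variances` are hypothetical values, not taken from the paper's implementation.

```python
import math

def kl_diag_gaussian_to_standard(mu, var):
    """Eq. (24): KL( N(mu, diag(var)) || N(0, I) ) for a single time step."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))

def kl_general_diag(mu_q, var_q, mu_p, var_p):
    """Eq. (18) specialised to diagonal covariance matrices."""
    d = len(mu_q)
    log_det_ratio = sum(math.log(vp) - math.log(vq) for vq, vp in zip(var_q, var_p))
    trace_term = sum(vq / vp for vq, vp in zip(var_q, var_p))
    quad_term = sum((mp - mq) ** 2 / vp for mq, mp, vp in zip(mu_q, mu_p, var_p))
    return 0.5 * (log_det_ratio - d + trace_term + quad_term)

def compress_loss(mus, variances):
    """Eq. (17): average of the per-step KL terms over a sequence of T steps."""
    T = len(mus)
    return sum(kl_diag_gaussian_to_standard(m, v) for m, v in zip(mus, variances)) / T

# Hypothetical encoder outputs for T = 2 steps and d = 3 latent dimensions.
mus = [[0.5, -1.0, 0.0], [0.2, 0.1, -0.3]]
variances = [[1.0, 0.5, 2.0], [0.8, 1.2, 1.0]]

loss = compress_loss(mus, variances)
print(f"L_compress = {loss:.6f}")
```

In practice an encoder would typically predict $\log \sigma_{t,j}^2$ rather than $\sigma_{t,j}^2$ for numerical stability; that choice is an assumption here and is not stated in this section.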
Rearranging Eq. (24) gives the
Chunk 45 · 1,998 chars
ag($\sigma_t^2$) is diagonal,
\begin{align}
|\Sigma_q| &= \prod_{j=1}^{d} \sigma_{t,j}^2, \tag{20} \\
\log |\Sigma_q| &= \sum_{j=1}^{d} \log \sigma_{t,j}^2, \tag{21} \\
\operatorname{tr}(\Sigma_q) &= \sum_{j=1}^{d} \sigma_{t,j}^2, \tag{22} \\
\mu_t^{\top} \mu_t &= \sum_{j=1}^{d} \mu_{t,j}^2. \tag{23}
\end{align}
Substituting these into Eq. (19), we obtain
\[
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \left(\sigma_{t,j}^2 + \mu_{t,j}^2 - 1 - \log \sigma_{t,j}^2\right). \tag{24}
\]
Rearranging Eq. (24) gives the equivalent expression
\[
D_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_{t,j}^2 - \mu_{t,j}^2 - \sigma_{t,j}^2\right). \tag{25}
\]
Finally, averaging the KL divergence across the sequence as defined in Eq. (17), we obtain
\begin{align}
\mathcal{L}_{\mathrm{compress}} &= \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(\mathcal{N}(\mu_t, \operatorname{diag}(\sigma_t^2)) \,\|\, \mathcal{N}(0, I)\big) \tag{26} \\
&= \frac{1}{2T} \sum_{t=1}^{T} \sum_{j=1}^{d} \left(\sigma_{t,j}^2 + \mu_{t,j}^2 - 1 - \log \sigma_{t,j}^2\right) \tag{27} \\
&= -\frac{1}{2T} \sum_{t=1}^{T} \sum_{j=1}^{d} \left(1 + \log \sigma_{t,j}^2 - \mu_{t,j}^2 - \sigma_{t,j}^2\right), \tag{28}
\end{align}
which matches Eq. (4).
B.3. Proof of the Variational Lower Bound on $I(U; V)$
Proof. By definition, the mutual information $I(U; V)$ can be expressed as the difference between the marginal entropy of $U$ and the conditional entropy of $U$ given $V$:
\[
I(U; V) = H(U) - H(U \mid V). \tag{29}
\]
The conditional entropy is defined as the expectation of the negative log-probability of the true conditional distribution $p(u \mid v)$:
\[
H(U \mid V) = \mathbb{E}_{u,v \sim p(u,v)}\big[-\log p(u \mid v)\big]. \tag{30}
\]
Substituting this back into the expression for mutual information yields:
\[
I(U; V) = H(U) + \mathbb{E}_{u,v \sim p(u,v)}\big[\log p(u \mid v)\big]. \tag{31}
\]
Since the true conditional distribution $p(u \mid v)$ is often unknown or intractable, we introduce a variational approximation $q_\theta(u \mid v)$. We consider the expected Kullback-Leibler (KL) divergence between the true conditional distribution and the variational approximation:
\[
\mathbb{E}_{v \sim p(v)}\big[D_{\mathrm{KL}}\big(p(U \mid v) \,\|\, q_\theta(U \mid v)\big)\big] = \mathbb{E}_{u,v \sim p(u,v)}\left[\log \frac{p(u \mid v)}{q_\theta(u \mid v)}\right]. \tag{32}
\]
By the non-negativity of the KL divergence, we have:
\[
\mathbb{E}_{u,v \sim p(u,v)}\big[\log p(u \mid v) - \log q_\theta(u \mid v)\big] \ge 0, \tag{33}
\]
which implies:
\[
\mathbb{E}_{u,v \sim p(u,v)}\big[\log p(u \mid v)\big] \ge \mathbb{E}_{u,v \sim p(u,v)}\big[\log q_\theta(u \mid v)\big]. \tag{34}
\]
Finally, substituting the inequality (34) into (31), we obtain the lower bound: $I(U; V) =$
Chunk 46 · 1,996 chars
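$U \mid v)\big)\big] = \mathbb{E}_{u,v \sim p(u,v)}\left[\log \frac{p(u \mid v)}{q_\theta(u \mid v)}\right]$. (32)
By the non-negativity of the KL divergence, we have:
\[
\mathbb{E}_{u,v \sim p(u,v)}\big[\log p(u \mid v) - \log q_\theta(u \mid v)\big] \ge 0, \tag{33}
\]
which implies:
\[
\mathbb{E}_{u,v \sim p(u,v)}\big[\log p(u \mid v)\big] \ge \mathbb{E}_{u,v \sim p(u,v)}\big[\log q_\theta(u \mid v)\big]. \tag{34}
\]
Finally, substituting the inequality (34) into (31), we obtain the lower bound:
\[
I(U; V) = H(U) + \mathbb{E}_{u,v \sim p(u,v)}\big[\log p(u \mid v)\big] \ge H(U) + \mathbb{E}_{u,v \sim p(u,v)}\big[\log q_\theta(u \mid v)\big]. \tag{35}
\]
Consequently, maximizing the expected log-likelihood of the variational distribution $q_\theta(u \mid v)$ maximizes the lower bound of the mutual information $I(U; V)$.
B.4. Relationship Between Mutual Information and Cross-Covariance
Proof. Let $(Z_r, Z_b)$ be jointly Gaussian with mean 0 (without loss of generality, since mutual information is invariant under translations) and block covariance
\[
\Sigma = \begin{bmatrix} \Sigma_r & \Sigma_{rb} \\ \Sigma_{br} & \Sigma_b \end{bmatrix}, \qquad \Sigma_{br} = \Sigma_{rb}^{\top},
\]
where $\Sigma_r \succ 0$ and $\Sigma_b \succ 0$ so that $\Sigma_r^{-1/2}$ and $\Sigma_b^{-1/2}$ are well-defined. Define
\[
C := \Sigma_r^{-1/2}\, \Sigma_{rb}\, \Sigma_b^{-1/2}.
\]
Step 1: Mutual information for Gaussians from the definition. By definition,
\[
I(Z_r; Z_b) = \mathbb{E}\left[\log \frac{p_{Z_r, Z_b}(Z_r, Z_b)}{p_{Z_r}(Z_r)\, p_{Z_b}(Z_b)}\right].
\]
For a centered $d$-dimensional Gaussian $X \sim \mathcal{N}(0, \Sigma_X)$ with $\Sigma_X \succ 0$, its density is
\[
p_X(x) = (2\pi)^{-d/2} (\det \Sigma_X)^{-1/2} \exp\left(-\frac{1}{2} x^{\top} \Sigma_X^{-1} x\right).
\]
Applying this to $(Z_r, Z_b)$ and to the marginals $Z_r$ and $Z_b$, we obtain
\[
\log \frac{p_{Z_r, Z_b}(z_r, z_b)}{p_{Z_r}(z_r)\, p_{Z_b}(z_b)} = -\frac{1}{2} \log \det \Sigma + \frac{1}{2} \log \det \Sigma_r + \frac{1}{2} \log \det \Sigma_b - \frac{1}{2} \begin{bmatrix} z_r \\ z_b \end{bmatrix}^{\top} \Sigma^{-1} \begin{bmatrix} z_r \\ z_b \end{bmatrix} + \frac{1}{2} z_r^{\top} \Sigma_r^{-1} z_r + \frac{1}{2} z_b^{\top} \Sigma_b^{-1} z_b.
\]
Taking expectation under the joint law of $(Z_r, Z_b)$ yields
\[
I(Z_r; Z_b) = -\frac{1}{2} \log \det \Sigma + \frac{1}{2} \log \det \Sigma_r + \frac{1}{2} \log \det \Sigma_b - \frac{1}{2}\, \mathbb{E}\left[\begin{bmatrix} Z_r \\ Z_b \end{bmatrix}^{\top} \Sigma^{-1} \begin{bmatrix} Z_r \\ Z_b \end{bmatrix}\right] + \frac{1}{2}\, \mathbb{E}\big[Z_r^{\top} \Sigma_r^{-1} Z_r\big] + \frac{1}{2}\, \mathbb{E}\big[Z_b^{\top} \Sigma_b^{-1} Z_b\big].
\]
Using the identity $\mathbb{E}[X^{\top} A X] = \operatorname{tr}(A \operatorname{Cov}(X))$ for any centered random vector $X$ with finite second moment and any matrix $A$ of compatible size, we get
\[
\mathbb{E}\left[\begin{bmatrix} Z_r \\ Z_b \end{bmatrix}^{\top} \Sigma^{-1} \begin{bmatrix} Z_r \\ Z_b \end{bmatrix}\right] = \operatorname{tr}(\Sigma^{-1} \Sigma) = \operatorname{tr}(I) = d_r + d_b,
\]
$\mathbb{E}\big[Z_r^{\top} \Sigma_r^{-1} Z_r\big] = \operatorname{tr}(\Sigma_r^{-1} \Sigma_r) = d_r$, $\mathbb{E}[Z_b^{\top}$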
Chunk 47 · 1,990 chars
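\[
\cdots \begin{bmatrix} Z_r \\ Z_b \end{bmatrix}^{\top} \Sigma^{-1} \begin{bmatrix} Z_r \\ Z_b \end{bmatrix}\Big] + \frac{1}{2}\, \mathbb{E}\big[Z_r^{\top} \Sigma_r^{-1} Z_r\big] + \frac{1}{2}\, \mathbb{E}\big[Z_b^{\top} \Sigma_b^{-1} Z_b\big].
\]
Using the identity $\mathbb{E}[X^{\top} A X] = \operatorname{tr}(A \operatorname{Cov}(X))$ for any centered random vector $X$ with finite second moment and any matrix $A$ of compatible size, we get
\[
\mathbb{E}\left[\begin{bmatrix} Z_r \\ Z_b \end{bmatrix}^{\top} \Sigma^{-1} \begin{bmatrix} Z_r \\ Z_b \end{bmatrix}\right] = \operatorname{tr}(\Sigma^{-1} \Sigma) = \operatorname{tr}(I) = d_r + d_b,
\]
\[
\mathbb{E}\big[Z_r^{\top} \Sigma_r^{-1} Z_r\big] = \operatorname{tr}(\Sigma_r^{-1} \Sigma_r) = d_r, \qquad \mathbb{E}\big[Z_b^{\top} \Sigma_b^{-1} Z_b\big] = \operatorname{tr}(\Sigma_b^{-1} \Sigma_b) = d_b,
\]
so the quadratic terms cancel. Hence
\[
I(Z_r; Z_b) = \frac{1}{2} \log \frac{\det \Sigma_r \, \det \Sigma_b}{\det \Sigma}. \tag{36}
\]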
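Step 2: Expressing $\det \Sigma$ in terms of $C$. By the block determinant (Schur complement) formula with $\Sigma_b \succ 0$,
\[
\det \Sigma = \det(\Sigma_b)\, \det\big(\Sigma_r - \Sigma_{rb} \Sigma_b^{-1} \Sigma_{br}\big).
\]
Next,
\begin{align*}
\Sigma_r - \Sigma_{rb} \Sigma_b^{-1} \Sigma_{br}
&= \Sigma_r^{1/2} \left(I - \Sigma_r^{-1/2} \Sigma_{rb} \Sigma_b^{-1} \Sigma_{br} \Sigma_r^{-1/2}\right) \Sigma_r^{1/2} \\
&= \Sigma_r^{1/2} \left(I - \Sigma_r^{-1/2} \Sigma_{rb} \Sigma_b^{-1/2}\, \Sigma_b^{-1/2} \Sigma_{br} \Sigma_r^{-1/2}\right) \Sigma_r^{1/2} \\
&= \Sigma_r^{1/2} \left(I - C C^{\top}\right) \Sigma_r^{1/2}.
\end{align*}
Therefore,
\[
\det\big(\Sigma_r - \Sigma_{rb} \Sigma_b^{-1} \Sigma_{br}\big) = \det(\Sigma_r)\, \det(I - C C^{\top}),
\]
and thus
\[
\det \Sigma = \det(\Sigma_r)\, \det(\Sigma_b)\, \det(I - C C^{\top}).
\]
Substituting into (36) gives
\[
I(Z_r; Z_b) = -\frac{1}{2} \log \det(I - C C^{\top}). \tag{37}
\]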
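The equivalence of Eqs. (36) and (37) can be verified numerically. The sketch below is an illustration under stated assumptions: it draws a random positive-definite joint covariance with hypothetical block sizes $d_r = 3$ and $d_b = 2$, builds $C$ from inverse matrix square roots computed by eigendecomposition, and compares both expressions. All helper names are made up for this example.

```python
import numpy as np

def inv_sqrt_pd(S):
    """Inverse symmetric square root of a positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def logdet(M):
    """Log-determinant of a positive-definite matrix."""
    return np.linalg.slogdet(M)[1]

def gaussian_mi_from_dets(Sigma, d_r):
    """Eq. (36): 0.5 * log( det(Sigma_r) det(Sigma_b) / det(Sigma) )."""
    S_r, S_b = Sigma[:d_r, :d_r], Sigma[d_r:, d_r:]
    return 0.5 * (logdet(S_r) + logdet(S_b) - logdet(Sigma))

def gaussian_mi_from_C(Sigma, d_r):
    """Eq. (37): -0.5 * log det(I - C C^T), with C = Sigma_r^{-1/2} Sigma_rb Sigma_b^{-1/2}."""
    S_r, S_b, S_rb = Sigma[:d_r, :d_r], Sigma[d_r:, d_r:], Sigma[:d_r, d_r:]
    C = inv_sqrt_pd(S_r) @ S_rb @ inv_sqrt_pd(S_b)
    return -0.5 * logdet(np.eye(d_r) - C @ C.T), C

# Random positive-definite joint covariance (illustrative values only).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Sigma = M @ M.T + 5.0 * np.eye(5)

mi_det = gaussian_mi_from_dets(Sigma, d_r=3)
mi_C, C = gaussian_mi_from_C(Sigma, d_r=3)
print(mi_det, mi_C)
```

Both routes should agree up to floating-point error, and the spectral norm of $C$ stays below 1 whenever the joint covariance is positive definite.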
Step 3: Second-order expansion for small $\|C\|_2$. Let $A := C C^{\top}$. Then $A \succeq 0$ and $\|A\|_2 = \|C C^{\top}\|_2 = \|C\|_2^2$. Assume $\|C\|_2$ is sufficiently small so that $\|A\|_2 < 1$. In this regime, the matrix power series for the principal logarithm holds:
\[
\log(I - A) = -\sum_{k=1}^{\infty} \frac{A^k}{k} \qquad (\text{convergent in operator norm since } \|A\|_2 < 1).
\]
Taking traces and using (37) together with $\log \det(I - A) = \operatorname{tr}(\log(I - A))$ yields
\[
I(Z_r; Z_b) = \frac{1}{2} \sum_{k=1}^{\infty} \frac{\operatorname{tr}(A^k)}{k}. \tag{38}
\]
The leading term is $\frac{1}{2} \operatorname{tr}(A) = \frac{1}{2} \operatorname{tr}(C C^{\top}) = \frac{1}{2} \|C\|_F^2$.
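A simple illustrative case makes the size of the leading term concrete. Assume, purely for illustration, scalar latents ($d_r = d_b = 1$) with unit variances and correlation $\rho$, so that $C = \rho$ and $A = \rho^2$. Then Eqs. (37) and (38) reduce to:

```latex
\[
I(Z_r; Z_b) = -\tfrac{1}{2} \log(1 - \rho^2)
            = \tfrac{1}{2}\rho^2 + \tfrac{1}{4}\rho^4 + \tfrac{1}{6}\rho^6 + \cdots
\]
% For \rho = 0.1:
%   leading term  \tfrac{1}{2}\|C\|_F^2 = 0.005
%   exact value   -\tfrac{1}{2}\log(0.99) \approx 0.0050252
```

For $\rho = 0.1$ the leading term accounts for all but roughly 0.5% of the mutual information, which is consistent with the higher-order terms being negligible for small $\|C\|_2$.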
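It remains to show that the remainder is $o(\|C\|_F^2)$ as $\|C\|_2 \to 0$. Since $A \succeq 0$ has eigenvalues $\{\lambda_i\}_{i=1}^{m}$ (with $m = \operatorname{rank}(A)$) in $[0, \|A\|_2]$, we have for every $k \ge 2$,
\[
\operatorname{tr}(A^k) = \sum_{i=1}^{m} \lambda_i^k \le \Big(\max_i \lambda_i\Big)^{k-1} \sum_{i=1}^{m} \lambda_i = \|A\|_2^{k-1} \operatorname{tr}(A).
\]
Therefore, the tail of (38) satisfies
\[
0 \le \sum_{k=2}^{\infty} \frac{\operatorname{tr}(A^k)}{k} \le \operatorname{tr}(A) \sum_{k=2}^{\infty} \frac{\|A\|_2^{k-1}}{k} \le \operatorname{tr}(A) \sum_{k=2}^{\infty} \|A\|_2^{k-1} = \operatorname{tr}(A)\, \frac{\|A\|_2}{1 - \|A\|_2}.
\]
Since $\|A\|_2 = \|C\|_2^2 \to 0$, we have $\frac{\|A\|_2}{1 - \|A\|_2} \to 0$, and hence
\[
\sum_{k=2}^{\infty} \frac{\operatorname{tr}(A^k)}{k} = o\big(\operatorname{tr}(A)\big) =
\]
Chunk 48 · 1,999 chars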
e have for every k ≥ 2, tr(Ak) = m X i=1 λk i ≤ max i λi k−1 m X i=1 λi = ∥A∥ k−1 2 tr(A). Therefore, the tail of (38) satisfies 0 ≤ ∞ X k=2 tr(Ak) k ≤ tr(A) ∞ X k=2 ∥A∥ k−1 2 k ≤ tr(A) ∞ X k=2 ∥A∥ k−1 2 = tr(A) ∥A∥2 1 − ∥A∥2 . Since ∥A∥2 = ∥C∥2 2 → 0, we have ∥A∥2 1−∥A∥2 → 0, and hence ∞ X k=2 tr(Ak) k = otr(A) = o∥C∥2 F as ∥C∥2 → 0. Combining this with (38) yields I(Zr ; Zb) = 1 2 ∥C∥2 F + o∥C∥2 F , as ∥C∥2 → 0, which is the desired second-order expansion. C. Detailed Experimental Settings C.1. Evaluation Benchmarks To evaluate the efficacy of LLM-as-a-judge frameworks and monitor the preservation of core English language capabilities, we utilize RewardBench (Lambert et al., 2024). This benchmark comprises approximately 3,000 pairwise comparisons across four primary dimensions: Chat, Chat Hard, Reasoning, and Safety. For the assessment of multilingual performance, we incorporate the following benchmarks: • M-RewardBench (Gureja et al., 2025): A multilingual adaptation of RewardBench covering 23 languages through expert-verified translations. • MM-Eval (Son et al., 2024): A diverse suite encompassing 18 languages. Unlike translated benchmarks, MM-Eval prioritizes native-speaker data and includes specialized subsets such as Linguistics (e.g., homophone disambiguation) and Language Hallucination (e.g., evaluating unintended code-switching). Metrics. Accuracy serves as our primary evaluation metric. For RewardBench, we report the arithmetic mean across the four category scores. For multilingual benchmarks, we compute the micro-average accuracy per language and subsequently report the macro-average across all supported languages. C.2. Training Settings Implementation Details. All experiments were conducted on a single node equipped with 8× NVIDIA H20 (96GB) GPUs. To ensure training stability and memory efficiency, we utilized DeepSpeed (Rasley et al., 2020) ZeRO Stage 3 with CPU offloading and leveraged FlashAttention-2 (Dao, 2024) for accelerated computation.
Optimization was performed using the Adam optimizer (Kingma, 2014).

Model Architecture. We employed Supervised Fine-Tuning (SFT) combined with Low-Rank Adaptation (LoRA) (Hu et al., 2022), specifically targeting the attention linear projections. To generate robust and bias-aware representations, we utilized the Qwen3-0.6B-Embedding model (Yang et al., 2025) as an encoder. This encoder shares the same architecture as the LLM judge and processes features via separate one-layer MLP heads. The proxy task decoder is implemented using a linear projection layer.

Training Procedure. We adopted a two-stage training strategy to accommodate the variational framework. In the first stage, we froze the LLM judge parameters and trained only the projection layer to align the feature space. In the second stage, we jointly fine-tuned the projection layer and the LoRA modules to enhance compression. Additionally, we implemented dynamic loss scheduling to facilitate effective multi-task learning.

Hyperparameters. We set the maximum sequence length to 16,384. All models were trained for 3 epochs using a cosine learning rate scheduler with a warmup ratio of 0.1. The global learning rate was set to $1 \times 10^{-4}$ for the LoRA modules, bias heads, and the proxy task decoder. We used a per-device training batch size of 1 with 8 gradient accumulation steps. These settings are summarized in Table 6.
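The schedule described under Hyperparameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: it assumes a linear warmup from zero, a cosine decay to zero, and pure data parallelism across all 8 GPUs when computing the effective batch size. The function name `lr_at` and the step counts are hypothetical.

```python
import math

# Values taken from the hyperparameter description above.
PEAK_LR = 1e-4
WARMUP_RATIO = 0.1
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 8
NUM_GPUS = 8  # assumption: every GPU acts as one data-parallel worker

# Effective number of sequences per optimizer step under that assumption.
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS


def lr_at(step, total_steps):
    """Learning rate at an optimizer step: linear warmup, then cosine decay.

    Assumes warmup starts at 0 and the cosine phase decays to 0.
    """
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s, total))
    print("effective batch size:", EFFECTIVE_BATCH)
```

With a warmup ratio of 0.1, the peak rate is reached after the first tenth of the optimizer steps, and the rate then follows half a cosine period down to zero.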
Table 6. Hyperparameter settings for the training experiments.

| Hyperparameter | Value |
|---|---|
| Base Model | Qwen3 Family (Yang et al., 2025) |
| Optimizer | Adam |
| Learning Rate | $1 \times 10^{-4}$ |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Max Sequence Length | 16,384 |
| Batch Size (per GPU) | 1 |
| Gradient Accumulation Steps | 8 |
| Epochs | 3 |
| Hardware | 8× NVIDIA H20 (96GB) |

D. Comprehensive Results of Reward Modeling Benchmarks

We present the fine-grained performance analysis across all evaluated benchmarks in the following sections. Detailed results for the five core subsets of MM-Eval (Son et al., 2024) are summarized in Table 7, while the language-specific performance breakdowns are distributed across Tables 8 and 9. For M-RewardBench (Gureja et al., 2025), comprehensive category-wise metrics are provided in Table 10, with language-level results detailed in Tables 11 and 12. Finally, the per-category accuracy for the original RewardBench (Lambert et al., 2024) is reported in Table 13.

E. Comprehensive Results of Translationese Evaluation Suites

Detailed translationese bias evaluation results under perturbed settings, adapted from the Aya (Singh et al., 2024), Belebele (Bandarkar et al., 2024), and XL-Sum (Hasan et al., 2021) datasets, are provided in Tables 14, 15, and 16, respectively.

F. Additional Experiments

F.1. Ablation Studies on Spurious Proxy Tasks

In § 2, we identified spurious factors contributing to the systematic bias towards translationese. To mitigate this, we proposed two proxy tasks in § 3: (i) Cross-Lingual Alignment (CLA), utilizing InfoNCE to align learned representations with a back-translation manifold; and (ii) Log-Probability Bin Classification (LPBC), which encodes predictive confidence by classifying representations into discrete log-probability bins.
In this section, we conduct a comprehensive ablation study to validate the effectiveness of these components. We examine the contribution of each proxy task to bias mitigation, analyze the impact of the back-translation system on the CLA task, and evaluate the robustness of different heuristic signals for the LPBC task.

Table 7. Full detailed results by category of MM-Eval. All category columns report accuracy; Avg. is computed over 18 languages. Bold indicates the best performance, and underlined indicates the second-best.

| Model | Chat | Lang. Hallu. | Linguistics | Reasoning | Safety | Avg. (18 lang) |
|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | |
| GPT-4o | 84.20 | 65.40 | 79.15 | 55.30 | 75.20 | 71.85 ± 0.81 |
| Gemini-2.5-Flash | 88.50 | 70.10 | 82.45 | 63.80 | 82.50 | 77.47 ± 0.76 |
| General Open Models | | | | | | |
| Qwen2.5-3B-Instruct | 66.50 | 52.30 | 58.15 | 40.80 | 72.20 | 57.99 ± 1.20 |
| Qwen2.5-7B-Instruct | 76.20 | 60.50 | 68.40 | 48.90 | 74.20 | 65.64 ± 0.95 |
| Qwen3-4B | 90.46 | 67.34 | 84.00 | 84.35 | 76.56 | 80.85 ± 0.68 |
| Qwen3-8B | 91.17 | 67.79 | 83.78 | 80.31 | 85.87 | 82.20 ± 0.60 |
| Multilingual Open Reward Models | | | | | | |
| Nemotron-Multi-49B | 91.47 | 68.92 | 87.56 | 38.29 | 95.59 | 76.31 ± 0.55 |
| M-Prometheus 3B | 68.20 | 58.40 | 62.10 | 50.85 | 81.30 | 64.17 ± 1.10 |
| M-Prometheus 7B | 62.61 | 61.55 | 61.33 | 63.50 | 91.37 | 69.38 ± 0.88 |
| mR3-Qwen3-4B | 90.05 | 69.14 | 83.56 | 81.62 | 90.69 | 82.55 ± 0.52 |
| mR3-Qwen3-8B | 92.28 | 67.34 | 84.89 | 87.20 | 92.52 | 85.29 ± 0.45 |
| Think-as-Locals 7B | 88.98 | 65.54 | 80.67 | 58.53 | 70.49 | 72.95 ± 0.70 |
| Ours | | | | | | |
| DIBJudge-Qwen3-4B | 91.05 | 72.50 | 88.10 | 89.45 | 84.70 | 85.16 ± 0.33 |
| DIBJudge-Qwen3-8B | 92.80 | 74.20 | 90.50 | 91.20 | 93.50 | 87.53 ± 0.28‡ |
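The two proxy tasks introduced at the start of F.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the InfoNCE variant below uses cosine similarity with in-batch negatives and a hypothetical temperature, and the log-probability bin edges are made-up values. Function names and example vectors are assumptions for illustration only.

```python
import bisect
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE with in-batch negatives.

    anchors[i] is paired with positives[i]; every positives[j], j != i,
    serves as a negative. The temperature value is a hypothetical choice.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def log_prob_bin(mean_log_prob, edges):
    """Map a mean token log-probability to a discrete bin index (LPBC label)."""
    return bisect.bisect_right(edges, mean_log_prob)


# Hypothetical representations: each anchor next to its back-translation embedding.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
mismatched = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = info_nce(anchors, matched)
loss_mismatched = info_nce(anchors, mismatched)

# Hypothetical bin edges on mean log-probability (4 bins in total).
BIN_EDGES = [-3.0, -2.0, -1.0]
label = log_prob_bin(-0.2, BIN_EDGES)

print(loss_matched, loss_mismatched, label)
```

The contrastive loss is low when each representation is closest to its own back-translation and high otherwise, while the bin index turns a continuous confidence signal into a classification target.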
Effectiveness of Proxy Task Combination. We first evaluate the individual and combined contributions of the CLA and LPBC tasks. Table 4 summarizes the bias severity scores across different configurations. We observe that while both tasks individually reduce bias compared to the baseline, the combination of both yields the most significant reduction. This suggests that the two tasks capture complementary aspects of the spurious features (latent manifold isomorphism and predictive confidence), thereby providing a more robust signal for bias mitigation.

Impact of Back-Translation Systems. The CLA task relies on back-translated data to approximate the "translationese" manifold. A critical question is whether the choice of translation system influences the bias mitigation capabilities. We compared our default system (NLLB-200-3.3B) against a suite of varying architectures, including Gemma-3-4B, Llama-3.1-8B-Instruct, Qwen3-4B, GPT-4o, Gemini 2.5 Flash, and Google Translate. As shown in Table 17, while stronger systems (e.g., GPT-4o, Google Translate) achieve higher BLEU scores (> 50), higher translation quality does not necessarily correlate with lower bias severity in the downstream task. This indicates that the CLA task is robust to the generator's quality, provided the generator produces sufficient translationese artifacts to serve as a negative contrastive pivot.

Heuristic Signals for Bin Classification. Finally, we ablate the heuristic metric used to partition samples for the LPBC task. While our method uses Negative Log-Likelihood (NLL), we compare this against Type-Token Ratio (TTR) and Perplexity (PPL). Table 18 shows that NLL performs marginally best, but the differences are negligible.
All three metrics effectively capture the confidence disparity required for the auxiliary task, demonstrating that our method is agnostic to the specific heuristic used to approximate predictive confidence.

F.2. Sensitivity to Cross-Lingual Alignment Discrepancy (CAD)

To test the hypothesis that conventional automatic judges exhibit an English-anchoring bias (i.e., a preference for translations that closely mirror the syntactic structure of the English source), we analyze model performance as a function of Cross-Lingual Alignment Discrepancy (CAD). As introduced in § 2, CAD measures the degree of structural divergence between a candidate translation and its source sentence.

Table 8. Detailed results for MM-Eval for each language (Part 1). Bold indicates the best performance, and underlined indicates the second-best.

| Model | Ar | Bn | Ca | De | En | Es | Eu | Fr | Gl |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | |
| GPT-4o | 70.50 | 62.10 | 73.50 | 75.20 | 78.50 | 76.80 | 65.40 | 75.90 | 71.20 |
| Gemini-2.5-Flash | 76.20 | 70.50 | 79.10 | 81.50 | 82.80 | 81.20 | 72.50 | 80.40 | 76.80 |
| General Open Models | | | | | | | | | |
| Qwen2.5-3B-Instruct | 55.40 | 42.10 | 60.50 | 63.80 | 68.50 | 64.20 | 48.50 | 62.10 | 56.80 |
| Qwen2.5-7B-Instruct | 62.80 | 54.50 | 68.20 | 71.50 | 74.80 | 72.10 | 58.40 | 70.50 | 65.20 |
| Qwen3-4B | 78.50 | 74.20 | 82.10 | 83.50 | 85.20 | 84.10 | 76.50 | 83.80 | 80.50 |
| Qwen3-8B | 80.20 | 75.80 | 83.50 | 84.80 | 86.50 | 85.40 | 78.10 | 84.90 | 81.80 |
| Multilingual Open Reward Models | | | | | | | | | |
| Nemotron-Multi-49B | 74.50 | 68.20 | 78.50 | 80.20 | 82.50 | 80.80 | 70.50 | 79.40 | 75.10 |
| M-Prometheus 3B | 61.20 | 52.50 | 66.80 | 69.50 | 72.40 | 70.10 | 56.80 | 68.50 | 63.20 |
| M-Prometheus 7B | 66.50 | 59.80 | 71.50 | 74.20 | 76.80 | 74.50 | 62.50 | 73.10 | 68.40 |
| mR3-Qwen3-4B | 81.50 | 76.50 | 83.80 | 85.50 | 87.20 | 86.10 | 78.50 | 85.80 | 82.20 |
| mR3-Qwen3-8B | 84.20 | 79.50 | 86.50 | 88.10 | 89.50 | 88.80 | 81.20 | 87.50 | 85.10 |
| Think-as-Locals-7B | 71.50 | 64.80 | 75.20 | 77.50 | 80.10 | 78.40 | 66.50 | 76.80 | 72.50 |
| Ours | | | | | | | | | |
| DIBJudge-Qwen3-4B | 83.80 | 79.20 | 86.10 | 87.80 | 89.80 | 88.50 | 81.50 | 87.20 | 84.80 |
| DIBJudge-Qwen3-8B | 86.50 | 82.10 | 88.50 | 90.20 | 91.50 | 90.80 | 84.20 | 89.50 | 87.10 |

Table 9. Detailed results for MM-Eval for each language (Part 2). Bold indicates the best performance, and underlined indicates the second-best.

| Model | It | Ja | Ko | Ru | Sw | Te | Th | Vn | Zh |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | |
| GPT-4o | 75.80 | 74.50 | 73.20 | 72.50 | 60.50 | 58.20 | 68.50 | 70.80 | 77.50 |
| Gemini-2.5-Flash | 80.50 | 79.80 | 78.50 | 78.10 | 68.50 | 66.20 | 75.40 | 76.80 | 82.50 |
| General Open Models | | | | | | | | | |
| Qwen2.5-3B-Instruct | 63.50 | 58.20 | 56.50 | 55.80 | 40.50 | 38.20 | 52.10 | 54.50 | 67.80 |
| Qwen2.5-7B-Instruct | 71.20 | 66.50 | 64.80 | 63.50 | 52.80 | 49.50 | 60.20 | 62.80 | 74.50 |
| Qwen3-4B | 84.50 | 81.20 | 80.50 | 79.80 | 72.50 | 70.20 | 77.80 | 79.50 | 84.80 |
| Qwen3-8B | 85.80 | 82.50 | 81.80 | 81.20 | 74.20 | 71.80 | 79.10 | 80.80 | 86.20 |
| Multilingual Open Reward Models | | | | | | | | | |
| Nemotron-Multi-49B | 79.50 | 76.20 | 75.50 | 77.80 | 66.20 | 64.50 | 73.80 | 75.20 | 81.50 |
| M-Prometheus 3B | 69.20 | 64.50 | 62.80 | 62.10 | 50.50 | 48.20 | 58.50 | 61.20 | 71.50 |
| M-Prometheus 7B | 74.50 | 70.20 | 68.50 | 67.80 | 56.20 | 54.50 | 64.80 | 66.50 | 76.20 |
| mR3-Qwen3-4B | 86.20 | 83.50 | 82.10 | 81.50 | 74.80 | 72.50 | 79.50 | 81.20 | 86.50 |
| mR3-Qwen3-8B | 87.80 | 86.10 | 84.50 | 84.20 | 77.50 | 75.80 | 82.10 | 83.50 | 88.80 |
| Think-as-Locals-7B | 77.20 | 72.50 | 71.80 | 73.50 | 62.80 | 60.50 | 69.20 | 71.50 | 79.80 |
| Ours | | | | | | | | | |
| DIBJudge-Qwen3-4B | 88.50 | 85.80 | 84.20 | 83.50 | 78.20 | 75.50 | 81.50 | 83.20 | 88.50 |
| DIBJudge-Qwen3-8B | 90.20 | 88.50 | 87.20 | 86.80 | 81.50 | 79.20 | 84.50 | 86.20 | 90.50 |

Table 10. Full detailed results by category of M-RewardBench. All category columns report accuracy; Average is computed over 23 languages. Bold indicates the best performance, and underlined indicates the second-best.

| Model | Chat | Chat Hard | Safety | Reasoning | Average (23 lang) |
|---|---|---|---|---|---|
| Proprietary Models | | | | | |
| GPT-4o | 90.10 | 75.50 | 88.20 | 89.20 | 85.75 ± 0.42 |
| Gemini-2.5-Flash | 93.40 | 80.25 | 87.80 | 90.80 | 88.06 ± 0.49 |
| General Open Models | | | | | |
| Qwen2.5-3B-Instruct | 76.50 | 48.20 | 70.10 | 73.10 | 66.97 ± 1.12 |
| Qwen2.5-7B-Instruct | 86.10 | 61.50 | 78.80 | 85.15 | 77.89 ± 0.89 |
| Qwen3-4B | 89.10 | 72.64 | 85.20 | 93.30 | 85.06 ± 0.65 |
| Qwen3-8B | 91.00 | 73.50 | 86.00 | 93.98 | 86.12 ± 0.52 |
| Multilingual Open Reward Models | | | | | |
| Nemotron-Multi-49B | 92.80 | 79.50 | 87.20 | 95.80 | 88.83 ± 0.35 |
| M-Prometheus 3B | 73.40 | 51.20 | 76.80 | 72.40 | 68.45 ± 0.98 |
| M-Prometheus 7B | 90.50 | 60.50 | 83.00 | 78.12 | 78.03 ± 0.85 |
| mR3-Qwen3-4B | 86.55 | 78.00 | 88.50 | 95.80 | 87.21 ± 0.45 |
| mR3-Qwen3-8B | 87.95 | 80.19 | 89.50 | 96.68 | 88.58 ± 0.41 |
| Think-as-Locals 7B | 91.80 | 69.50 | 83.85 | 92.90 | 84.51 ± 0.60 |
| Ours | | | | | |
| DIBJudge-Qwen3-4B | 93.50 | 82.50 | 88.20 | 95.15 | 89.84 ± 0.28† |
| DIBJudge-Qwen3-8B | 94.60 | 84.80 | 90.10 | 96.00 | 91.37 ± 0.22‡ |
We partition the held-out evaluation set into disjoint CAD bins (e.g., [0.0, 0.1), [0.1, 0.2), . . .) and compute the win rate of translationese outputs within each interval. Figure 7 summarizes the resulting trends. Baseline judges exhibit a pronounced positive correlation between CAD and win rate, indicating that their preferences are increasingly influenced by surface-level alignment artifacts rather than semantic adequacy. In contrast, DIBJUDGE demonstrates substantially reduced sensitivity to CAD, as evidenced by a near-flat win-rate profile across bins. This invariance suggests that DIBJUDGE successfully decouples evaluation quality from structural isomorphism, mitigating spurious biases that confound existing evaluation approaches.

F.3. Analysis of Distributional Shortcuts

To assess whether our approach mitigates reliance on superficial statistical artifacts, we analyze the bias spectrum shift induced by different judges using the Sequence Surprisal Ratio (SSR). SSR measures the relative predictability of machine-generated responses compared to human-written ones, thereby capturing spurious correlations arising from differences in predictive confidence (e.g., perplexity). Values of SSR close to 1 indicate distributional parity, where model preferences are not driven by confidence-related artifacts, whereas lower values reflect an over-reliance on highly predictable, low-entropy text.

We focus on evaluation instances for which the judge selects the machine-generated translation (denoted as Machine Wins) and examine the empirical distribution of their SSR scores.
Figure 8 presents a kernel density estimate (KDE) of these scores, revealing how different judges respond to distributional discrepancies associated with predictive confidence.

As shown in Figure 8, the baseline judge exhibits a pronounced leftward shift in SSR, with probability mass concentrated at low values. This behavior indicates a distributional shortcut, wherein outputs with artificially high predictive confidence are systematically favored, independent of semantic quality. In contrast, DIBJUDGE induces a clear re-centering of the SSR distribution toward 1. This shift indicates that DIBJUDGE substantially reduces spurious correlations between model preference and predictive confidence, promoting judgments that are invariant to low-perplexity artifacts and more reflective of semantic utility.

Table 11. Detailed results for M-RewardBench for each language (Part 1). Bold indicates the best performance, and underlined indicates the second-best.

| Model | Ar | Cs | De | El | Es | Fa | Fr | He | Hi | Id | It | Ja |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | | | |
| GPT-4o | 84.50 | 85.20 | 87.10 | 83.50 | 88.00 | 82.00 | 87.50 | 83.00 | 84.00 | 86.50 | 87.80 | 86.50 |
| Gemini-2.5-Flash | 87.20 | 87.90 | 89.50 | 86.10 | 89.80 | 85.50 | 89.20 | 86.40 | 87.10 | 88.50 | 89.40 | 88.80 |
| General Open Models | | | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 64.20 | 66.50 | 69.80 | 60.50 | 70.20 | 61.10 | 69.50 | 60.80 | 65.40 | 68.10 | 69.20 | 67.50 |
| Qwen2.5-7B-Instruct | 76.80 | 78.10 | 80.50 | 74.20 | 81.20 | 73.50 | 80.80 | 74.90 | 77.40 | 79.50 | 80.10 | 79.20 |
| Qwen3-4B | 83.84 | 84.67 | 86.75 | 83.07 | 86.49 | 81.51 | 85.21 | 82.07 | 82.42 | 84.64 | 86.48 | 84.37 |
| Qwen3-8B | 85.33 | 87.43 | 88.01 | 84.85 | 87.39 | 85.06 | 87.57 | 84.40 | 85.64 | 86.95 | 87.25 | 85.60 |
| Multilingual Open Reward Models | | | | | | | | | | | | |
| Nemotron-Multi-49B | 88.72 | 89.30 | 89.68 | 89.35 | 89.97 | 88.26 | 90.09 | 88.06 | 88.25 | 89.23 | 89.19 | 89.41 |
| M-Prometheus 3B | 66.50 | 68.20 | 71.40 | 64.10 | 72.50 | 63.80 | 71.10 | 65.20 | 67.50 | 70.20 | 71.80 | 69.50 |
| M-Prometheus 7B | 74.85 | 74.22 | 76.53 | 72.64 | 77.60 | 74.22 | 71.78 | 75.25 | 77.01 | 76.44 | 73.30 | 75.68 |
| mR3-Qwen3-4B | 87.61 | 87.37 | 87.79 | 86.15 | 88.58 | 85.25 | 88.54 | 86.42 | 86.43 | 87.43 | 87.90 | 86.78 |
| mR3-Qwen3-8B | 88.31 | 88.78 | 89.46 | 88.00 | 88.88 | 86.59 | 88.84 | 88.17 | 87.60 | 87.94 | 89.99 | 88.81 |
| Think-as-Locals-7B | 86.15 | 83.29 | 86.31 | 82.26 | 87.37 | 81.31 | 86.91 | 84.17 | 81.33 | 86.60 | 86.63 | 85.03 |
| Ours | | | | | | | | | | | | |
| DIBJudge-Qwen3-4B | 88.50 | 90.15 | 91.20 | 88.05 | 91.50 | 88.45 | 91.80 | 88.50 | 89.10 | 90.50 | 91.50 | 90.20 |
| DIBJudge-Qwen3-8B | 90.50 | 91.80 | 92.80 | 89.80 | 93.10 | 89.90 | 93.50 | 90.10 | 90.80 | 92.20 | 93.00 | 92.00 |

Table 12. Detailed results for M-RewardBench for each language (Part 2). Bold indicates the best performance, and underlined indicates the second-best.

| Model | Ko | Nl | Pl | Pt | Ro | Ru | Tr | Uk | Vi | Zh | Zh-TW |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | | |
| GPT-4o | 85.50 | 87.50 | 85.00 | 87.80 | 84.50 | 85.50 | 84.20 | 84.00 | 85.00 | 86.50 | 86.00 |
| Gemini-2.5-Flash | 88.20 | 89.80 | 87.40 | 89.60 | 87.10 | 87.80 | 86.90 | 87.20 | 88.50 | 89.50 | 88.90 |
| General Open Models | | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 66.80 | 70.50 | 66.20 | 69.80 | 65.50 | 67.20 | 64.80 | 65.10 | 68.50 | 74.20 | 73.50 |
| Qwen2.5-7B-Instruct | 78.50 | 81.20 | 78.40 | 80.80 | 77.50 | 78.90 | 76.20 | 77.10 | 79.50 | 82.50 | 81.80 |
| Qwen3-4B | 82.77 | 85.89 | 84.58 | 87.39 | 85.29 | 86.06 | 83.83 | 83.80 | 84.76 | 84.82 | 84.88 |
| Qwen3-8B | 83.77 | 87.54 | 86.78 | 87.10 | 87.47 | 87.77 | 85.42 | 86.20 | 86.90 | 87.20 | 86.76 |
| Multilingual Open Reward Models | | | | | | | | | | | |
| Nemotron-Multi-49B | 88.05 | 90.83 | 89.99 | 89.33 | 89.89 | 90.19 | 88.09 | 88.91 | 89.32 | 88.86 | 86.29 |
| M-Prometheus 3B | 67.80 | 71.50 | 68.20 | 72.10 | 66.80 | 68.50 | 65.90 | 66.50 | 69.50 | 72.40 | 71.80 |
| M-Prometheus 7B | 71.96 | 75.48 | 77.59 | 74.00 | 77.21 | 70.17 | 71.57 | 74.91 | 76.45 | 71.16 | 75.99 |
| mR3-Qwen3-4B | 85.66 | 88.42 | 86.77 | 88.05 | 87.62 | 88.22 | 87.17 | 88.01 | 88.08 | 87.38 | 86.28 |
| mR3-Qwen3-8B | 88.47 | 88.99 | 87.33 | 90.56 | 89.30 | 88.84 | 88.77 | 88.16 | 88.89 | 88.36 | 87.95 |
| Think-as-Locals-7B | 83.49 | 86.04 | 85.67 | 86.21 | 84.61 | 85.31 | 83.31 | 83.50 | 86.67 | 85.90 | 85.42 |
| Ours | | | | | | | | | | | |
| DIBJudge-Qwen3-4B | 89.80 | 91.50 | 89.90 | 90.20 | 89.80 | 90.10 | 89.20 | 88.80 | 90.00 | 90.80 | 90.50 |
| DIBJudge-Qwen3-8B | 91.50 | 93.20 | 91.80 | 93.50 | 91.50 | 92.00 | 90.80 | 91.20 | 91.80 | 92.50 | 92.00 |

Table 13. Full detailed results by category of RewardBench (English). All category columns report accuracy. Bold indicates the best performance, and underlined indicates the second-best.

| Model | Chat | Chat Hard | Safety | Reasoning | Average (English) |
|---|---|---|---|---|---|
| Proprietary Models | | | | | |
| GPT-4o | 90.50 | 75.10 | 88.50 | 89.74 | 85.96 ± 0.35 |
| Gemini-2.5-Flash | 93.80 | 81.20 | 89.10 | 91.22 | 88.83 ± 0.47 |
| General Open Models | | | | | |
| Qwen2.5-3B-Instruct | 82.50 | 41.50 | 74.50 | 77.46 | 68.99 ± 1.05 |
| Qwen2.5-7B-Instruct | 89.10 | 58.20 | 82.40 | 84.66 | 78.59 ± 0.91 |
| Qwen3-4B | 92.50 | 76.50 | 86.50 | 94.66 | 87.54 ± 0.55 |
| Qwen3-8B | 92.00 | 82.70 | 87.05 | 93.49 | 88.81 ± 0.48 |
| Multilingual Open Reward Models | | | | | |
| Nemotron-Multi-49B | 93.50 | 85.80 | 90.00 | 89.54 | 89.71 ± 0.31 |
| M-Prometheus 3B | 80.50 | 42.10 | 80.50 | 76.06 | 69.79 ± 0.92 |
| M-Prometheus 7B | 90.00 | 53.00 | 84.00 | 79.76 | 76.69 ± 0.78 |
| mR3-Qwen3-4B | 88.90 | 84.10 | 89.50 | 96.50 | 89.75 ± 0.38 |
| mR3-Qwen3-8B | 88.00 | 84.47 | 90.41 | 97.52 | 90.10 ± 0.40 |
| Think-as-Locals 7B | 91.20 | 79.00 | 89.50 | 95.46 | 88.79 ± 0.52 |
| Ours | | | | | |
| DIBJudge-Qwen3-4B | 94.20 | 86.50 | 89.80 | 90.78 | 90.32 ± 0.25 |
| DIBJudge-Qwen3-8B | 95.50 | 88.10 | 90.80 | 89.64 | 91.01 ± 0.20† |

F.4. Ablation Study: Efficacy of Information Bottlenecks

A central hypothesis of our work is that a variational information constraint provides a superior trade-off between semantic preservation and artifact suppression compared to discrete or deterministic alternatives. To evaluate this, we compare our framework against three representative bottleneck mechanisms, assessing their impact on the robustness-utility Pareto frontier:

• Vector Quantization (VQ): We replace the continuous variational information constraint with a discrete codebook constraint (Van Den Oord et al., 2017), mapping latent representations to the nearest centroid. This imposes a rigid structural bottleneck.
• Low-Rank Projection (Low-Rank): We utilize a deterministic linear projection to a lower-dimensional subspace. This variant relies solely on capacity reduction without stochastic regularization.
• Stochastic Noise (Noise): We apply isotropic Gaussian noise injection, $z = h + \epsilon$, as a baseline stochastic regularization technique, notably lacking a prior distribution constraint.

Analysis. Table 19 demonstrates that while the Low-Rank variant preserves accuracy, it fails to filter translationese artifacts, suggesting that dimensionality reduction alone cannot achieve effective disentanglement. Conversely, VQ achieves the lowest bias severity but suffers from utility collapse, where the rigid discrete constraint discards the nuanced semantic features necessary for reward modeling. Our variational information bottleneck (VIB) constraint bridges this gap by explicitly optimizing the trade-off between $I(Z; X)$ and $I(Z; Y)$. By penalizing $I(Z; X)$ via the KL divergence to a prior, VIB selectively purges nuisance factors, such as translationese artifacts, while retaining task-relevant style features. Consequently, VIB dominates the noise and low-rank baselines in robustness while maintaining competitive utility.

Table 14. Bias severity by language on the Aya dataset (Singh et al., 2024) under the perturbed setting.

| Language | Base | Vanilla SFT | Vanilla IB | DIBJudge |
|---|---|---|---|---|
| High-Resource | | | | |
| Basque | 0.081 ± 0.008 | 0.088 ± 0.043 | 0.056 ± 0.011 | 0.042 ± 0.009 |
| English | 0.045 ± 0.008 | 0.058 ± 0.010 | 0.041 ± 0.008 | 0.031 ± 0.005 |
| Finnish | 0.125 ± 0.010 | 0.165 ± 0.028 | 0.079 ± 0.015 | 0.058 ± 0.005 |
| Hindi | 0.089 ± 0.008 | 0.076 ± 0.010 | 0.048 ± 0.007 | 0.036 ± 0.012 |
| Japanese | 0.049 ± 0.011 | 0.048 ± 0.010 | 0.034 ± 0.014 | 0.026 ± 0.015 |
| Portuguese | 0.089 ± 0.011 | 0.084 ± 0.013 | 0.056 ± 0.011 | 0.042 ± 0.006 |
| Simplified Chinese | 0.072 ± 0.015 | 0.086 ± 0.013 | 0.048 ± 0.007 | 0.035 ± 0.008 |
| Spanish | 0.099 ± 0.012 | 0.128 ± 0.012 | 0.064 ± 0.013 | 0.046 ± 0.007 |
| Vietnamese | 0.173 ± 0.022 | 0.192 ± 0.038 | 0.099 ± 0.007 | 0.071 ± 0.006 |
| Avg (High) | 0.091 ± 0.015 | 0.103 ± 0.022 | 0.058 ± 0.020 | 0.043 ± 0.014 |
| Mid-Resource | | | | |
| Bengali | 0.084 ± 0.014 | 0.140 ± 0.028 | 0.096 ± 0.026 | 0.043 ± 0.016 |
| Cebuano | 0.108 ± 0.008 | 0.113 ± 0.027 | 0.120 ± 0.013 | 0.052 ± 0.008 |
| Filipino | 0.118 ± 0.018 | 0.122 ± 0.020 | 0.132 ± 0.015 | 0.057 ± 0.010 |
| Indonesian | 0.064 ± 0.004 | 0.059 ± 0.015 | 0.071 ± 0.024 | 0.032 ± 0.016 |
| Lithuanian | 0.166 ± 0.004 | 0.202 ± 0.019 | 0.182 ± 0.029 | 0.082 ± 0.015 |
| Malay | 0.086 ± 0.007 | 0.113 ± 0.027 | 0.095 ± 0.012 | 0.044 ± 0.011 |
| Tamil | 0.157 ± 0.034 | 0.192 ± 0.051 | 0.170 ± 0.021 | 0.077 ± 0.015 |
| Thai | 0.082 ± 0.018 | 0.112 ± 0.024 | 0.090 ± 0.023 | 0.041 ± 0.017 |
| Ukrainian | 0.106 ± 0.018 | 0.133 ± 0.004 | 0.118 ± 0.013 | 0.052 ± 0.012 |
| Urdu | 0.139 ± 0.023 | 0.189 ± 0.025 | 0.155 ± 0.020 | 0.069 ± 0.018 |
| Avg (Mid) | 0.111 ± 0.016 | 0.138 ± 0.024 | 0.123 ± 0.037 | 0.055 ± 0.016 |
| Low-Resource | | | | |
| Amharic | 0.376 ± 0.056 | 0.201 ± 0.021 | 0.300 ± 0.036 | 0.225 ± 0.042 |
| Irish | 0.350 ± 0.034 | 0.348 ± 0.013 | 0.276 ± 0.049 | 0.208 ± 0.031 |
| Kyrgyz | 0.174 ± 0.024 | 0.227 ± 0.024 | 0.138 ± 0.043 | 0.107 ± 0.036 |
| Nepali | 0.111 ± 0.004 | 0.101 ± 0.015 | 0.089 ± 0.069 | 0.070 ± 0.034 |
| Malagasy | 0.262 ± 0.024 | 0.178 ± 0.131 | 0.210 ± 0.049 | 0.160 ± 0.030 |
| Sinhala | 0.254 ± 0.032 | 0.310 ± 0.101 | 0.202 ± 0.050 | 0.154 ± 0.018 |
| Pashto | 0.244 ± 0.011 | 0.235 ± 0.079 | 0.195 ± 0.029 | 0.149 ± 0.053 |
| Telugu | 0.138 ± 0.022 | 0.148 ± 0.012 | 0.111 ± 0.058 | 0.085 ± 0.037 |
| Yoruba | 0.472 ± 0.026 | 0.453 ± 0.066 | 0.372 ± 0.029 | 0.283 ± 0.048 |
| Zulu | 0.338 ± 0.016 | 0.464 ± 0.021 | 0.266 ± 0.056 | 0.201 ± 0.044 |
| Avg (Low) | 0.272 ± 0.028 | 0.266 ± 0.048 | 0.216 ± 0.089 | 0.164 ± 0.067 |
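The contrast drawn in F.4 between a KL-to-prior penalty and plain noise injection can be illustrated with a short sketch. It is an assumption-laden example rather than the paper's implementation: the encoder is taken to output a diagonal Gaussian, the prior is assumed to be a standard normal, and all function names and values are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (diagonal Gaussian encoder)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumes a standard normal prior; this term upper-bounds I(Z; X) in the
    usual variational information bottleneck objective.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def noise_only(h, sigma, rng):
    """Noise baseline z = h + eps: stochastic, but with no prior constraint."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in h]


rng = random.Random(0)
mu = [1.0, 0.0]          # hypothetical encoder mean
log_var = [0.0, 0.0]     # hypothetical encoder log-variance (unit variance)

z_vib = reparameterize(mu, log_var, rng)
z_noise = noise_only(mu, 1.0, rng)
kl_penalty = kl_to_standard_normal(mu, log_var)

print(z_vib, z_noise, kl_penalty)
```

Both variants inject Gaussian noise, but only the first exposes a differentiable penalty that pulls the encoder distribution toward the prior, which is the distinction the Noise baseline is meant to isolate.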
F.5. Ablation Study: Disentanglement Mechanisms

We evaluate the efficacy of our proposed cross-covariance penalty against other methods. Our primary hypothesis is that a cross-covariance-based constraint provides a computationally efficient proxy for independence, effectively minimizing mutual information without the overhead of complex density estimators.

Alternative Disentanglement Objectives. We compare our cross-covariance approach, $\mathcal{L}_{\mathrm{cov}}$, against three established baseline objectives:

• Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005): A kernel-based measure of dependence. While theoretically robust, its $O(n^2)$ complexity per batch is prohibitive for long-context LLM fine-tuning.
• Mutual Information Estimators (CLUB (Cheng et al., 2020)/MINE (Belghazi et al., 2018)): Variational bounds on Mutual Information (MI); CLUB provides an upper bound, whereas MINE provides a lower bound. These require auxiliary neural networks, increasing the parameter search space and training time.
• Orthogonality Constraint ($\mathcal{L}_{\mathrm{orth}}$): A first-order geometric constraint minimizing absolute cosine similarity:

$$\mathcal{L}_{\mathrm{orth}} = \mathbb{E}\left[\frac{|z_r^\top z_b|}{\|z_r\|_2 \, \|z_b\|_2}\right]. \tag{39}$$

Table 15. Bias severity by language on Belebele (Bandarkar et al., 2024) under the perturbed setting.

| Language | Base | Vanilla SFT | Vanilla IB | DIBJudge |
|---|---|---|---|---|
| High-Resource | | | | |
| Arabic | 0.100 ± 0.029 | 0.095 ± 0.016 | 0.058 ± 0.010 | 0.014 ± 0.006 |
| English | 0.072 ± 0.005 | 0.063 ± 0.004 | 0.043 ± 0.014 | 0.011 ± 0.005 |
| Finnish | 0.086 ± 0.015 | 0.067 ± 0.017 | 0.049 ± 0.023 | 0.012 ± 0.010 |
| Hindi | 0.091 ± 0.011 | 0.089 ± 0.011 | 0.052 ± 0.012 | 0.013 ± 0.011 |
| Japanese | 0.074 ± 0.009 | 0.069 ± 0.012 | 0.047 ± 0.008 | 0.012 ± 0.006 |
| Korean | 0.084 ± 0.013 | 0.078 ± 0.018 | 0.050 ± 0.011 | 0.013 ± 0.004 |
| Russian | 0.065 ± 0.010 | 0.058 ± 0.010 | 0.041 ± 0.021 | 0.010 ± 0.009 |
| Turkish | 0.072 ± 0.012 | 0.070 ± 0.010 | 0.045 ± 0.017 | 0.011 ± 0.009 |
| Vietnamese | 0.081 ± 0.006 | 0.080 ± 0.008 | 0.048 ± 0.013 | 0.012 ± 0.009 |
| Avg (High) | 0.081 ± 0.012 | 0.074 ± 0.011 | 0.048 ± 0.005 | 0.012 ± 0.001 |
| Mid-Resource | | | | |
| Bengali | 0.094 ± 0.014 | 0.091 ± 0.015 | 0.066 ± 0.021 | 0.014 ± 0.005 |
| Greek | 0.083 ± 0.016 | 0.075 ± 0.019 | 0.060 ± 0.013 | 0.012 ± 0.009 |
| Hebrew | 0.088 ± 0.021 | 0.079 ± 0.019 | 0.062 ± 0.012 | 0.013 ± 0.006 |
| Georgian | 0.091 ± 0.016 | 0.085 ± 0.018 | 0.063 ± 0.019 | 0.013 ± 0.006 |
| Kazakh | 0.088 ± 0.012 | 0.081 ± 0.014 | 0.062 ± 0.014 | 0.013 ± 0.010 |
| Tamil | 0.102 ± 0.018 | 0.095 ± 0.022 | 0.072 ± 0.030 | 0.015 ± 0.014 |
| Thai | 0.080 ± 0.012 | 0.074 ± 0.014 | 0.057 ± 0.028 | 0.012 ± 0.006 |
| Ukrainian | 0.093 ± 0.018 | 0.089 ± 0.019 | 0.068 ± 0.027 | 0.014 ± 0.009 |
| Urdu | 0.099 ± 0.020 | 0.093 ± 0.019 | 0.073 ± 0.019 | 0.015 ± 0.009 |
| Malay | 0.090 ± 0.014 | 0.084 ± 0.016 | 0.062 ± 0.019 | 0.013 ± 0.013 |
| Avg (Mid) | 0.091 ± 0.016 | 0.085 ± 0.017 | 0.065 ± 0.005 | 0.013 ± 0.001 |
| Low-Resource | | | | |
| Amharic | 0.122 ± 0.025 | 0.111 ± 0.023 | 0.185 ± 0.049 | 0.093 ± 0.041 |
| Burmese | 0.136 ± 0.029 | 0.119 ± 0.028 | 0.205 ± 0.055 | 0.103 ± 0.031 |
| Guarani | 0.098 ± 0.021 | 0.091 ± 0.019 | 0.147 ± 0.029 | 0.074 ± 0.024 |
| Kannada | 0.142 ± 0.029 | 0.132 ± 0.027 | 0.218 ± 0.051 | 0.109 ± 0.019 |
| Khmer | 0.129 ± 0.026 | 0.117 ± 0.024 | 0.197 ± 0.053 | 0.099 ± 0.034 |
| Kyrgyz | 0.111 ± 0.022 | 0.104 ± 0.020 | 0.170 ± 0.054 | 0.085 ± 0.035 |
| Punjabi | 0.115 ± 0.023 | 0.109 ± 0.021 | 0.179 ± 0.040 | 0.089 ± 0.023 |
| Pashto | 0.168 ± 0.031 | 0.156 ± 0.029 | 0.255 ± 0.058 | 0.129 ± 0.025 |
| Zulu | 0.123 ± 0.024 | 0.114 ± 0.022 | 0.188 ± 0.044 | 0.094 ± 0.034 |
| Avg (Low) | 0.126 ± 0.026 | 0.117 ± 0.025 | 0.194 ± 0.031 | 0.097 ± 0.016 |
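To illustrate how the objectives compared in F.5 differ, the sketch below implements the orthogonality loss of (39) next to one plausible form of a cross-covariance penalty. The exact definition of the paper's cross-covariance loss is not given in this section, so the squared Frobenius norm of the batch cross-covariance (with 1/n normalization) is an assumption, as are the function names and the toy data.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a batch of vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def cross_covariance_penalty(z_r, z_b):
    """Squared Frobenius norm of the batch cross-covariance of z_r and z_b.

    Assumed form of a cross-covariance penalty (1/n normalization).
    """
    n = len(z_r)
    mu_r, mu_b = mean_vector(z_r), mean_vector(z_b)
    penalty = 0.0
    for i in range(len(mu_r)):
        for j in range(len(mu_b)):
            cov_ij = sum((z_r[s][i] - mu_r[i]) * (z_b[s][j] - mu_b[j])
                         for s in range(n)) / n
            penalty += cov_ij ** 2
    return penalty


def orthogonality_loss(z_r, z_b):
    """Eq. (39): batch mean of the absolute cosine similarity."""
    total = 0.0
    for u, v in zip(z_r, z_b):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        total += abs(dot) / (norm_u * norm_v)
    return total / len(z_r)


# Toy batch: the bias branch is either a copy or a 90-degree rotation of z_r.
z_r = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
z_b_copy = [row[:] for row in z_r]
z_b_rotated = [[-y, x] for x, y in z_r]

orth_copy = orthogonality_loss(z_r, z_b_copy)
orth_rotated = orthogonality_loss(z_r, z_b_rotated)
cov_copy = cross_covariance_penalty(z_r, z_b_copy)
cov_rotated = cross_covariance_penalty(z_r, z_b_rotated)

print(orth_copy, orth_rotated, cov_copy, cov_rotated)
```

In this toy batch the rotated bias vectors are orthogonal to the robust vectors sample by sample, so the orthogonality loss vanishes, yet the bias branch is still a deterministic linear function of the robust branch and the cross-covariance penalty stays non-zero.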
Quantitative Analysis of Efficiency vs. Robustness. The trade-off between disentanglement strength and computational overhead is summarized in Table 20. Although variational estimators (CLUB, MINE) and kernel-based methods (HSIC) theoretically offer tighter bounds on independence, their integration into LLM architectures is bottlenecked by the high dimensionality of the hidden states. Specifically, the auxiliary network in CLUB increases training latency by 112% (2.12× in Table 20) because of the additional forward-backward passes required for the critic update. In contrast, our cross-covariance approach, L_cov, achieves a Bias Severity of 0.142, which is competitive with the 0.138 achieved by HSIC, at a fraction of the computational cost (1.18× vs. 2.78× latency). Simple orthogonality (L_orth) suffers from significant “information leakage,” as evidenced by its high Bias Severity (0.215); this indicates that first-order geometric constraints are insufficient to capture the complex, non-linear correlations inherent in LLM representations. By minimizing the cross-covariance, we impose a second-order constraint that serves as a “practical optimum”: it sufficiently decorrelates the robust and bias subspaces without the prohibitive O(n^2) complexity or training instability of higher-order estimators.

F.6. Quantitative Analysis: Linear Probing for Information Leakage
To quantify the degree of disentanglement achieved by DIBJUDGE, we employ linear probing (Alain & Bengio, 2016). Our hypothesis is that a truly disentangled robust representation, z_r, should be invariant to the text origin (human-authored vs. machine-translated).
Conversely, the bias representation, z_b, should explicitly encode these “translationese” artifacts.

Table 16. Bias severity by language on XL-Sum (Hasan et al., 2021) under the perturbed setting.
Language | Base | Vanilla SFT | Vanilla IB | DIBJudge
High-Resource
Arabic | 0.079 ± 0.031 | 0.100 ± 0.030 | 0.084 ± 0.029 | 0.025 ± 0.008
English | 0.060 ± 0.006 | 0.076 ± 0.008 | 0.063 ± 0.022 | 0.019 ± 0.005
French | 0.054 ± 0.010 | 0.095 ± 0.019 | 0.069 ± 0.009 | 0.021 ± 0.005
Hindi | 0.101 ± 0.010 | 0.126 ± 0.006 | 0.083 ± 0.029 | 0.024 ± 0.006
Japanese | 0.067 ± 0.009 | 0.066 ± 0.026 | 0.064 ± 0.025 | 0.020 ± 0.008
Korean | 0.081 ± 0.009 | 0.096 ± 0.037 | 0.075 ± 0.022 | 0.022 ± 0.012
Russian | 0.077 ± 0.013 | 0.072 ± 0.012 | 0.067 ± 0.017 | 0.021 ± 0.004
Turkish | 0.060 ± 0.016 | 0.065 ± 0.010 | 0.062 ± 0.020 | 0.019 ± 0.012
Vietnamese | 0.069 ± 0.005 | 0.081 ± 0.029 | 0.081 ± 0.016 | 0.026 ± 0.006
Avg (High) | 0.071 ± 0.014 | 0.086 ± 0.020 | 0.072 ± 0.009 | 0.022 ± 0.003
Mid-Resource
Azerbaijani | 0.116 ± 0.024 | 0.145 ± 0.037 | 0.046 ± 0.012 | 0.012 ± 0.010
Bengali | 0.080 ± 0.011 | 0.075 ± 0.015 | 0.040 ± 0.012 | 0.011 ± 0.008
Indonesian | 0.058 ± 0.012 | 0.078 ± 0.009 | 0.036 ± 0.023 | 0.010 ± 0.005
Tamil | 0.109 ± 0.006 | 0.123 ± 0.030 | 0.045 ± 0.022 | 0.012 ± 0.012
Thai | 0.074 ± 0.007 | 0.083 ± 0.025 | 0.041 ± 0.029 | 0.011 ± 0.010
Ukrainian | 0.092 ± 0.015 | 0.100 ± 0.022 | 0.043 ± 0.020 | 0.011 ± 0.013
Urdu | 0.091 ± 0.020 | 0.117 ± 0.014 | 0.047 ± 0.029 | 0.012 ± 0.012
Uzbek | 0.141 ± 0.012 | 0.159 ± 0.009 | 0.052 ± 0.012 | 0.013 ± 0.005
Avg (Mid) | 0.095 ± 0.016 | 0.123 ± 0.022 | 0.044 ± 0.005 | 0.012 ± 0.001
Low-Resource
Amharic | 0.303 ± 0.022 | 0.271 ± 0.036 | 0.110 ± 0.042 | 0.088 ± 0.020
Burmese | 0.157 ± 0.028 | 0.081 ± 0.028 | 0.087 ± 0.025 | 0.067 ± 0.019
Hausa | 0.215 ± 0.022 | 0.207 ± 0.015 | 0.098 ± 0.052 | 0.076 ± 0.040
Kyrgyz | 0.160 ± 0.021 | 0.165 ± 0.035 | 0.081 ± 0.060 | 0.063 ± 0.016
Marathi | 0.115 ± 0.016 | 0.124 ± 0.018 | 0.079 ± 0.044 | 0.061 ± 0.031
Nepali | 0.084 ± 0.021 | 0.068 ± 0.018 | 0.070 ± 0.027 | 0.054 ± 0.040
Pashto | 0.176 ± 0.034 | 0.170 ± 0.060 | 0.091 ± 0.063 | 0.071 ± 0.052
Sinhala | 0.246 ± 0.042 | 0.248 ± 0.022 | 0.105 ± 0.021 | 0.082 ± 0.055
Telugu | 0.095 ± 0.013 | 0.121 ± 0.001 | 0.076 ± 0.027 | 0.059 ± 0.056
Welsh | 0.163 ± 0.026 | 0.225 ± 0.028 | 0.084 ± 0.025 | 0.066 ± 0.038
Avg (Low) | 0.172 ± 0.024 | 0.168 ± 0.026 | 0.088 ± 0.013 | 0.069 ± 0.011

Experimental Setup. We freeze the DIBJUDGE encoder and train a linear classifier (the probe) on the extracted representations. The probe is optimized to distinguish the two domains (human-authored vs. machine-translated) via binary classification. We report probing accuracy on a held-out test set; an accuracy of 50% (random chance) signifies perfect invariance, whereas higher accuracy indicates information leakage.

Results. As shown in Table 21, the probe trained on the bias representation z_b achieves near-perfect accuracy (96.1%), confirming that DIBJUDGE successfully isolates translation artifacts. More importantly, the probe trained on the robust representation z_r yields an accuracy of 50.3%, essentially chance level, indicating that the features used for judgment carry almost no domain-specific signal. In contrast, standard embeddings from the baseline SFT model exhibit substantial leakage (82.4%), highlighting that standard fine-tuning fails to decouple semantic content from stylistic artifacts.
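The probing protocol described above can be illustrated with a minimal sketch. It is not the authors' implementation: the features are synthetic stand-ins for frozen encoder outputs, and the names `train_linear_probe`, `probe_accuracy`, `synthetic_features`, and `run_probe`, as well as the learning rate, epoch count, and feature dimension, are assumptions made for illustration. The sketch only shows the logic of the test: a linear probe reaches high accuracy when the text origin is linearly decodable and stays near 50% when it is not.

```python
import numpy as np


def train_linear_probe(features, labels, lr=0.5, epochs=300):
    """Fit a logistic-regression probe on frozen features via gradient descent."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        logits = np.clip(features @ w + b, -30.0, 30.0)
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # gradient of the cross-entropy w.r.t. the logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(features, labels, w, b):
    """Fraction of examples whose origin the probe predicts correctly."""
    preds = (features @ w + b > 0).astype(int)
    return float((preds == labels).mean())


def synthetic_features(n, d, shift, rng):
    """Hypothetical stand-in for frozen encoder outputs.

    Labels: 0 = human-authored, 1 = machine-translated. `shift` controls how
    strongly the first feature dimension encodes the label (0 = not at all).
    """
    labels = rng.integers(0, 2, size=n)
    feats = rng.normal(size=(n, d))
    feats[:, 0] += shift * (2 * labels - 1)
    return feats, labels


def run_probe(shift, rng, n_train=2000, n_test=1000, d=16):
    """Train the probe on one split and report accuracy on a held-out split."""
    x_tr, y_tr = synthetic_features(n_train, d, shift, rng)
    x_te, y_te = synthetic_features(n_test, d, shift, rng)
    w, b = train_linear_probe(x_tr, y_tr)
    return probe_accuracy(x_te, y_te, w, b)


rng = np.random.default_rng(0)
acc_leaky = run_probe(shift=2.0, rng=rng)      # origin is linearly decodable
acc_invariant = run_probe(shift=0.0, rng=rng)  # origin is absent from the features
print(f"leaky features:     {acc_leaky:.3f}")
print(f"invariant features: {acc_invariant:.3f}")
```

In the paper's setting, the synthetic arrays would be replaced by z_b, z_r, or the baseline embedding h extracted from the frozen model, with the probe retrained separately for each.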
Table 17. Impact of Back-Translation Systems. We report BLEU scores of the generated back-translations and the resulting Bias Severity.
Translation System | BLEU ↑ | Bias Severity ↓
NLLB-200-3.3B (Costa-jussà et al., 2022) (Default) | 52.4 | 0.15
Qwen3-4B (Yang et al., 2025) | 54.1 | 0.16
Gemma-3-4B (Team et al., 2025) | 55.8 | 0.14
Llama-3.1-8B-Instruct (Grattafiori et al., 2024) | 56.2 | 0.18
Gemini-2.5-Flash (Comanici et al., 2025) | 57.9 | 0.15
GPT-4o (Hurst et al., 2024) | 58.5 | 0.17
Google Translate | 59.1 | 0.14

Table 18. Comparison of Heuristic Metrics. Ablation of the metric used for Bin Classification. NLL performs slightly better, but all metrics provide similar bias mitigation.
Metric | Bias Severity ↓
Type-Token Ratio (TTR) | 0.17
Perplexity (PPL) | 0.16
Negative Log-Likelihood (NLL) | 0.15

[Figure 7 plot: win rate (y-axis) vs. binned CAD (x-axis), with the high-CAD regime highlighted. Fitted slopes: DIBJudge +0.07 (smallest), Base +0.16, Vanilla SFT +0.28, Vanilla IB +0.15.]
Figure 7. Sensitivity of Win Rate to Cross-Lingual Alignment Discrepancy (CAD). Win rates of Translationese outputs are plotted against binned CAD values. Baseline judges (dashed lines) display a strong positive dependence on CAD, reflecting an English-anchoring bias toward structurally aligned translations. In contrast, DIBJUDGE (solid line) maintains a near-constant win rate across CAD regimes, including high-CAD regions, indicating robustness to cross-lingual structural divergence.
Table 19. Comparison of Bottleneck Mechanisms. Effect of different latent constraints on utility (m-RewardBench accuracy) and bias mitigation (Bias Severity). Lower Bias Severity is better.
Method | Constraint Type | m-RB Acc. ↑ | Bias Sev. ↓
Low-Rank | Deterministic (Capacity) | 88.2 | 0.185
Noise | Stochastic (Additive) | 85.5 | 0.182
VQ | Discrete (Structural) | 84.9 | 0.082
Ours | Variational (Information) | 89.3 | 0.091

[Figure 8 plot: density (y-axis) vs. Sequence Surprisal Ratio (SSR, x-axis) for Vanilla SFT and DIBJudge, with the peak of each curve marked.]
Figure 8. Bias Spectrum Shift via Sequence Surprisal Ratio (SSR). Kernel density estimates of SSR for instances where the judge selects the machine-generated output (Machine Wins). Lower SSR values correspond to spurious correlations with high predictive confidence (i.e., low perplexity), indicating a preference for distributionally simple text. Values near 1 reflect invariance to confidence-related artifacts. While the baseline judge concentrates mass at low SSR values, DIBJUDGE shifts the spectrum toward 1, demonstrating reduced reliance on predictive-confidence shortcuts.

Table 20. Efficiency–bias trade-off of disentanglement mechanisms on Llama-3-8B. Training latency is normalized to standard fine-tuning (1.0×). Bias Severity is defined in §2, where lower is better.
Method | Computational Cost | Latency ↑ | Bias Severity ↓
Baseline (No disentanglement) | – | 1.00× | 0.284
L_orth (Orthogonality) | O(d) | 1.05× | 0.215
MINE (Belghazi et al., 2018) | O(aux. net) | 2.42× | 0.145
CLUB (Cheng et al., 2020) | O(aux. net) | 2.12× | 0.141
HSIC (Gretton et al., 2005) | O(n^2) | 2.78× | 0.138
L_cov (Ours) | O(d^2) | 1.18× | 0.142

Table 21. Linear probing for domain classification. Higher accuracy indicates stronger domain information. Effective disentanglement yields high accuracy for bias representations and low accuracy for robust representations.
Model | Probed Representation | Accuracy (%)
Baseline (SFT) | Standard embedding h | 82.4
DIBJUDGE (Ours) | Bias representation z_b | 96.1
DIBJUDGE (Ours) | Robust representation z_r | 50.3
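For reference, a cross-covariance penalty of the kind listed as L_cov in Table 20 can be sketched as follows. The exact loss used by the authors is not given in this section, so the squared-Frobenius-norm form, the function name `cross_covariance_penalty`, and the synthetic batches below are assumptions for illustration only. The cross-covariance matrix has one entry per pair of (robust, bias) dimensions, which is consistent with the O(d^2) cost reported in the table.

```python
import numpy as np


def cross_covariance_penalty(z_r, z_b):
    """Squared Frobenius norm of the batch cross-covariance between z_r and z_b.

    z_r: (n, d_r) robust representations; z_b: (n, d_b) bias representations.
    Assumed form of the penalty, not necessarily the one used in the paper.
    """
    n = z_r.shape[0]
    zr_centered = z_r - z_r.mean(axis=0, keepdims=True)
    zb_centered = z_b - z_b.mean(axis=0, keepdims=True)
    cross_cov = zr_centered.T @ zb_centered / (n - 1)  # shape (d_r, d_b)
    return float(np.sum(cross_cov ** 2))


# Hypothetical batches: one bias branch independent of z_r, one that copies it.
rng = np.random.default_rng(0)
n, d = 4096, 8
z_r = rng.normal(size=(n, d))
z_b_independent = rng.normal(size=(n, d))
z_b_leaky = z_r + 0.1 * rng.normal(size=(n, d))

penalty_independent = cross_covariance_penalty(z_r, z_b_independent)
penalty_leaky = cross_covariance_penalty(z_r, z_b_leaky)
print(f"independent branches: {penalty_independent:.4f}")
print(f"leaky branches:       {penalty_leaky:.4f}")
```

Because the cross-covariance is a second-order statistic, driving it to zero removes linear dependence between the two branches but does not by itself guarantee full statistical independence.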