XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics
Summary
This paper introduces XQ-MEval, a multilingual dataset designed to benchmark automatic translation evaluation metrics by providing parallel-quality instances across nine language directions. The dataset is semi-automatically constructed by injecting MQM-defined errors into high-quality translations, filtering them with native speakers, and merging errors to create pseudo-translations with controlled quality levels. These are paired with source and reference texts to form triplets for metric evaluation. Experiments on nine metrics reveal inconsistencies between averaging scores across languages and human judgment, providing the first empirical evidence of cross-lingual scoring bias. The authors propose a normalization strategy, Language-specific Global Normalization (LGN), which aligns score distributions across languages, improving fairness and reliability in multilingual evaluation. The dataset covers English-to-Chinese, Japanese, Lao, Vietnamese, Indonesian, French, Spanish, Sinhala, and German translations, enabling systematic analysis of metric biases. The work highlights the need for fairer evaluation practices in multilingual machine translation.
XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

Jingxuan Liu†1, Zhi Qu†1, Jin Tei1, Hidetaka Kamigaito1, Lemao Liu2, Taro Watanabe1
† These authors contributed equally to this work.
1 Nara Institute of Science and Technology, Japan. 2 Fudan University, China.
jingxuan.liu.jm2@naist.ac.jp

Abstract

Automatic evaluation metrics are essential for building multilingual translation systems. The common practice for evaluating these systems is to average metric scores across languages, yet this practice is questionable, since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we automatically inject MQM-defined errors into gold translations, have native speakers filter them for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with their corresponding sources and references to form the triplets used in assessing the quality of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.[1]
1 Introduction

With the growing demand for multilingual translation systems, comprehensive and reliable evaluation has become critical (Kocmi et al., 2024). In human evaluation, Multidimensional Quality Metrics (MQM) largely achieves cross-lingually comparable evaluation through standardized error categories and hierarchical deduction (Lommel et al., 2013; Freitag et al., 2021). However, as evaluation scales up, automatic evaluation metrics are essential due to their efficiency and scalability (Popović, 2015, 2017; Post, 2018; Goyal et al., 2022). Therefore, MQM-driven automatic metrics have recently become the primary tools, e.g., COMET (Rei et al., 2020) and MetricX (Juraska et al., 2023).

[1] The code and dataset are available at: https://github.com/zhiqu22/XQ-MEval.

[Figure 1: A clue for this study, showing the inconsistency between human evaluation, i.e., MQM, and automatic metrics, e.g., COMET. For the source "He is skeptical about whether diabetes can be cured.", a zh translation and two de translations each contain one major error, thus sharing the same MQM score of -5, yet COMET assigns them notably different scores (85.62, 86.17, and 94.42), with larger gaps across languages.]

In multilingual translation evaluation, the common practice is to evaluate each language direction with a metric and then average the metric scores to compute a system-level score[2] (Chen et al., 2023; Cao et al., 2024; Qu et al., 2025). However, this average strategy may be problematic because it implicitly assumes that different languages are scored on the same scale for a similar error.

[2] The computational procedure of the average strategy is described in Appendix A with pseudocode.
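Appendix A gives the authors' pseudocode for this procedure; the following minimal sketch (our own illustration, with made-up names and scores, not the paper's code) captures what the average strategy computes:

# Illustrative sketch of the cross-lingual average strategy (the authors'
# actual pseudocode is in Appendix A of the paper).
from statistics import mean

def system_level_score(scores_by_direction: dict[str, list[float]]) -> float:
    # Average each direction's triplet scores, then average the
    # per-direction means with equal weight across languages.
    per_direction = [mean(scores) for scores in scores_by_direction.values()]
    return mean(per_direction)

# Two directions whose translations are equally good by MQM can still pull
# the system-level average apart if the metric scales them differently.
print(system_level_score({"en-de": [94.4, 93.8], "en-zh": [85.6, 86.2]}))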
In fact, cross-lingual scoring bias is indeed observed, as illustrated in Figure 1. To quantify and verify this potential problem, a benchmark is needed that provides parallel quality across languages, ensuring that cross-lingual comparisons are made on the same grounds, i.e., that similar errors are quantified equally across different languages. Due to the unaffordable cost of expert-level annotation, no such benchmark currently exists.

In this work, we propose a novel semi-automatic pipeline that injects MQM-defined errors into gold translations and filters them with native speakers, ensuring reliability and cross-lingual consistency. By merging individual errors, we generate pseudo translations with controllable quality, which are then paired with gold sources and references to form triplets. Based on this, we construct a dataset for evaluating metrics with cross-lingual parallel quality, namely XQ-MEval. This dataset covers nine languages[3], i.e., Chinese, Japanese, Lao, Vietnamese, Indonesian, French, Spanish, Sinhala, and German, as translation directions from English, and provides parallel-quality triplets for fair metric comparison across languages.

[3] Appendix B shows the details of language selection.

Based on XQ-MEval, we conduct experiments on nine representative automatic metrics. The results reveal a clear inconsistency between averaging and human evaluation, and provide the first empirical evidence of cross-lingual scoring bias. This bias has two manifestations: (1) systems of equal quality receive different scores across languages; (2) the decline of metric scores with decreasing quality is inconsistent across languages.
Building on this finding, we propose a simple strategy based on normalization (García et al., 2015), i.e., Language-specific Global Normalization (LGN), to calibrate multilingual evaluation metrics. Our experiments show that, compared to the average strategy, LGN effectively reduces score range disparities and improves the fairness and reliability of multilingual metric evaluation.

We make the following threefold contributions in this study:

• We present XQ-MEval, the first multilingual dataset with parallel-quality triplets across nine translation directions, enabling benchmarking of automatic evaluation metrics.

• We evaluate representative metrics to reveal the inconsistency between the average strategy and human judgment, and provide the first analysis of cross-lingual scoring bias.

• We introduce and verify LGN, a normalized average strategy that calibrates metrics in evaluating multilingual translation systems.

2 Related Work

The evaluation of bilingual translation systems relies on discrete scoring schemes (Koehn and Monz, 2006; Vilar et al., 2007; Callison-Burch et al., 2007; Denkowski and Lavie, 2010), but these suffer from low inter-annotator agreement. Although Graham et al. (2013) and Bojar et al. (2016, 2017) introduced the continuous rating scale to mitigate this variability, subjectivity-related biases persisted across annotators. Building upon the Multidimensional Quality Metrics (MQM) proposed by Lommel et al. (2013), Freitag et al. (2021) developed a framework that reduces annotator inconsistency through standardized error categories and hierarchical deduction.
Specifically, each sentence is assumed to have perfect quality initially, and points are deducted according to error type, e.g., accuracy and fluency, and severity, e.g., 1 point for a minor error and 5 points for a major one. This makes the evaluation cross-lingually comparable, because sentences with the same errors are expected to receive the same score across languages.

To complement costly and inconsistent human-based evaluation, automatic evaluation metrics have been proposed to approximate human judgments of translation quality efficiently. They can be broadly categorized into three types: (1) Regression-based metrics frame evaluation as a supervised task that directly predicts scalar quality scores, including both models trained explicitly for evaluation, e.g., COMET (Rei et al., 2020, 2022a; Guerreiro et al., 2024) and MetricX (Juraska et al., 2023, 2024), and LLMs converted into evaluators, e.g., ReMedy (Tan and Monz, 2025). (2) Sequence-based metrics evaluate translations by comparing candidate translations with gold references, primarily relying on surface-level similarity[4], e.g., BLEU (Papineni et al., 2002; Post, 2018) and chrF (Popović, 2015, 2017). (3) Reference-free metrics, also known as quality estimation (QE), extend regression-based methods to evaluate translations directly against the source without requiring references, e.g., COMET-kiwi (Rei et al., 2021, 2023). In parallel, recent work has explored using LLMs as human evaluators by prompting them to follow explicit assessment agreements such as MQM, thereby approximating human judgment behavior at inference time (Kocmi and Federmann, 2023).

[4] Although metrics like BLEURT (Sellam et al., 2020) are regression-based, a metric that depends on embeddings of sequence information should be classified as sequence-based.
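To make the deduction scheme concrete, here is a minimal sketch of MQM-style scoring; it is our illustration, not the authors' code, using the severity weights above and the 25-point per-sentence cap from Google's MQM guideline cited in Section 3:

# Illustrative MQM-style deduction (not the authors' code): severity weights
# follow the paper (minor = 1, major = 5) with a 25-point cap per sentence.
PENALTY = {"minor": 1, "major": 5}

def mqm_sentence_score(error_severities: list[str]) -> int:
    # Each sentence starts from perfect quality; points are deducted
    # per error, and the total deduction is capped at 25.
    deduction = sum(PENALTY[s] for s in error_severities)
    return -min(deduction, 25)

print(mqm_sentence_score(["major"]))       # -5, the score in Figure 1
print(mqm_sentence_score(["major"] * 6))   # capped at -25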
[Figure 2: The illustration of our pipeline. Stages (a) phrase level (an error pool built by introducing errors with GPT-4o and filtering by humans), (b) sentence level (merging candidates into a pseudo translation pool), and (c) system level (combining (source, pseudo translation, reference) triplets into pseudo systems) show the data construction, whose product is a set of pseudo translation systems with predetermined scores. Stage (d) demonstrates the use of the pseudo systems to assess automatic metrics against the answer, i.e., the predetermined score.]
These metrics are widely applied in multilingual translation evaluation, but the practice of averaging scores across languages (Zhang et al., 2021; Qu and Watanabe, 2022; Chen et al., 2023; Cao et al., 2024; Qu et al., 2025) may hinder system-level evaluation, since it is unclear whether a similar error is measured consistently across languages. Lyu et al. (2025) showed that, in error span detection, alignment with human judgments can vary with different decoding strategies. Relatedly, Von Däniken et al. (2025) showed that metrics fail to align with human evaluation even in a single translation direction. Thus, benchmarks are needed to expose cross-lingual scoring bias and guide metric improvement. However, constructing them incurs costs similar to MQM, where each instance requires expert-level annotation. Fortunately, using LLMs with human filtering can simplify this process (Li et al., 2023; Kwan et al., 2024; Bai et al., 2024; Wang et al., 2025), providing a practical avenue for benchmark construction.

3 Pipeline of Dataset Construction

We present a multilingual dataset, XQ-MEval, for benchmarking automatic evaluation metrics, covering nine translation directions, i.e., en-zh, en-ja, en-lo, en-vi, en-id, en-fr, en-es, en-si, and en-de, and comprising both high-resource and low-resource languages[5]. Constructing such a dataset following MQM is challenging due to the high cost of expert annotation, which greatly limits language coverage. To address this, we employ a semi-automatic approach, formatting each sample as a triplet and rigorously controlling quality to ensure cross-lingual parallelism. This design enables flexible sampling to simulate systems with predetermined quality levels for metric benchmarking.

[5] Languages are represented by ISO 639-1 codes; details of language selection are given in Appendix B.
Specifically, we introduce a novel pipeline for benchmark construction, illustrated in Figure 2, that enables systematic and cost-effective analysis of metric biases; it comprises phrase-level, sentence-level, and system-level stages of different granularity. Automatic evaluation metrics operate on a triplet comprising a source, a translation, and a reference. We begin with a high-quality translation corpus, where each translation pair forms the source and reference of a triplet. At the phrase-level stage, a major-severity error is introduced into each reference. Then, at the sentence-level stage, we merge 0 to 5 errors from such candidates[6] to generate pseudo translations[7] with six distinct quality levels. Finally, at the system-level stage, pseudo systems are constructed by assembling triplets across different quality levels, thereby emulating translation systems with predetermined performance.

[6] The choice of 5 follows Google's MQM guideline, where each sentence can lose at most 25 points and each major error accounts for 5 points (Freitag et al., 2021).
[7] Annotators' feedback indicates that although the combined errors may appear unnatural, they remain objectively valid.

Nevertheless, we acknowledge that XQ-MEval instances are synthesized rather than produced by real translation systems, and may thus differ from real-world scenarios. We have conducted preliminary experiments on usable real-world MQM datasets and validated our approach in Appendix C.
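Concretely, an XQ-MEval instance can be pictured as a small record; the sketch below (the field names are ours, not the released dataset schema) mirrors the triplet structure described above, using the toy example from Figure 2:

# Sketch of the triplet structure used throughout the pipeline; field names
# are our own shorthand rather than the released schema.
from dataclasses import dataclass

@dataclass
class Triplet:
    source: str              # gold source s
    pseudo_translation: str  # reference with 0-5 injected errors
    reference: str           # gold reference r
    n_errors: int            # quality level: 0 (perfect) to 5

example = Triplet(
    source="我昨天在玩游戏。",
    pseudo_translation="I watched movies and played games yesterday.",
    reference="I played games yesterday.",
    n_errors=1,
)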
3.1 Phrase-level Construction

XQ-MEval is built on Flores[8], a high-quality multilingual translation dataset, denoted as F, with 102 instances used in our experiments[9]. Flores is particularly suitable because its translations are semantically parallel and are carefully validated by multiple native speakers (NLLB Team, 2022).

[8] https://huggingface.co/datasets/openlanguagedata/flores_plus
[9] We manually excluded very short sentences that cannot accommodate multiple injected errors.

Table 1: Examples used to assist in explaining Figure 2 (the Part labels are used for convenient reference).

Part 1 — Error Pool. Operation: introduce a single error into the reference with GPT-4o, then filter by native speakers.
  de: Der Klage zufolge wurde der Abfall aus dem UN-Lager nicht ordnungsgemäß gesäubert, was dazu führte, dass Bakterien in den Zufluss des Artibonit-Flusses, einem der größten Flüsse Haitis, <v>sowie in andere Gewässer</v> gelangten. (Error type: Addition; human judgment: ✔)
  ja: 訴訟によれば、国連キャンプの廃棄物が適切に消毒されていなかったため、ハイチ最大級の<v></v>に細菌が侵入したとのことです。 (Error type: Omission; human judgment: ✔)
  zh: 诉讼材料显示,联合国营地未能对废弃物进行<v>彻底的焚烧</v>,因而导致细菌进入阿蒂博尼特河的支流,这是海地最大河流之一。 (Error type: Mistranslation; human judgment: ✔)
  zh: 诉讼材料显示,联合国营地未能对废弃物进行适当的消毒处理,因而导致细菌进入<v>Artibonite River</v>的支流,这是海地最大河流之一。 (Error type: Untranslated; human judgment: ✘)

Part 2 — Pseudo Translation Pool. Operation: merge several candidates whose error spans do not conflict to get a pseudo translation.
  ja: 訴訟によれば、国連キャンプの<v>食料供給</v>が適切に消毒されていなかったため、ハイチ最大級のアルティボナイト川の支流に細菌が侵入したとのことです。 (Error span: 14-18)
  ja: 訴訟によれば、国連キャンプの廃棄物が適切に消毒されていなかったため、<v>さらに多くの問題が発生し、</v>ハイチ最大級のアルティボナイト川の支流に細菌が侵入したとのことです。 (Error span: 34-35)
  ja: 訴訟によれば、国連キャンプの<v>食料供給</v>が適切に消毒されていなかったため、<v>さらに多くの問題が発生し、</v>ハイチ最大級のアルティボナイト川の支流に細菌が侵入したとのことです。 (Pseudo translation with 2 errors)

Part 3 — Triplet Pool. Operation: each triplet fed into the metrics combines a source, a translation, and a reference.
  en (source): According to the lawsuit, waste from the UN camp was not properly sanitized, causing bacteria to enter the tributary of the Artibonite River, one of Haiti's largest.
  ja (pseudo translation): 訴訟によれば、国連キャンプの<v>食料供給</v>が適切に消毒されていなかったため、<v>さらに多くの問題が発生し、</v>ハイチ最大級のアルティボナイト川の支流に細菌が侵入したとのことです。
  ja (reference): 訴訟によれば、国連キャンプの廃棄物が適切に消毒されていなかったため、ハイチ最大級のアルティボナイト川の支流に細菌が侵入したとのことです。
As shown in Figure 2, we define each translation instance in F as (s, r), where s represents the source in en and r represents its reference. We employ GPT-4o[10] (OpenAI, 2024) to inject an MQM-defined error of major severity into r, producing a temporary error candidate r̂ comprising a single error segment with an identification tag.

[10] Version: gpt-4o-2024-11-20.

We introduce the following four error types, which dominate existing MQM datasets[11] and are conducive to cross-lingual comparability, as they are purely semantic (Haspelmath, 2010; Cristofaro, 2009): (1) Addition, where extraneous information is inserted into the translation; (2) Omission, where a part of the source is left out; (3) Mistranslation, where the meaning is distorted or incorrect; (4) Untranslated, where source text is left untranslated.

[11] These four types account for 46.3% of all MQM errors.

Because each pseudo translation r̃ may contain up to five errors in our settings, we allow multiple instances of the same error type to be injected separately into the first and second halves of the sentence, which are first
divided and explicitly tagged to guide GPT-4o to introduce error segments into the corresponding parts. Thus, a single (s, r) can yield up to eight temporary error candidates r̂ = {r̂1, r̂2, ..., r̂8}. Applying this process to the entire dataset produces a temporary error pool $\hat{R} = \bigcup_{i=1}^{n} \hat{r}_i$.[12]

[12] Prompts are carefully designed and listed in Appendix E.
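The actual prompts are carefully designed and listed in Appendix E, which we do not reproduce here; the sketch below is therefore a hypothetical reconstruction of the injection call, where only the model version follows the paper and the instruction text is our own stand-in:

# Hypothetical sketch of the error-injection step; the real prompts are in
# Appendix E of the paper, and only the model version is taken from it.
from openai import OpenAI

client = OpenAI()

def inject_error(source: str, reference: str, error_type: str, half: str) -> str:
    # Ask GPT-4o for one major-severity error of `error_type` ("Addition",
    # "Omission", "Mistranslation", or "Untranslated") in the tagged `half`
    # ("first" or "second") of the reference, marked with <v>...</v> tags.
    prompt = (
        f"Insert exactly one major {error_type} error into the {half} half "
        f"of the translation and wrap the edited span in <v></v> tags.\n"
        f"Source: {source}\nTranslation: {reference}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 4 error types x 2 sentence halves -> up to 8 candidates per (s, r) pair.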
Then, native speakers of the nine target languages review and filter R̂. In practice, two independent reviewers are engaged, but for si, lo, and vi, only one reviewer is available due to resource constraints. Finally, only candidates r̂ unanimously approved by both annotators are retained to construct the final error pool R̂_filtered. Part 1 of Table 1 demonstrates this process.
To ensure consistency, we provide detailed annotation guidelines in Appendix D that explain the four MQM errors and specify filtering conditions regarding completeness, locality, and severity. Table 2 summarizes the number of sentences generated by GPT-4o and retained by annotators for each error type.
        en-zh  en-lo  en-ja  en-vi  en-id  en-fr  en-es  en-si  en-de
Add.    194    196    194    201    196    200    196    200    203
Omit.   200    191    196    199    197    196    197    188    194
Mist.   201    200    202    200    201    196    197    197    194
Untr.   181    172    183    171    188    183    181    181    183

Table 2: The number of candidates generated by GPT-4o and filtered by annotators for each error type. Error-type abbreviations: Addition (Add.), Omission (Omit.), Mistranslation (Mist.), and Untranslated (Untr.).
               en-zh  en-ja  en-fr  en-es  en-de  en-id
Agreement (%)  98.16  96.45  97.79  97.30  97.67  96.45
Table 3: The annotation agreement between the two native speakers during the manual screening process.
Also, to assess annotation reliability, we compute inter-annotator agreement between the two native speakers. As shown in Table 3, agreement is consistently high, reflecting the effectiveness of our guidelines. We further validate robustness through a second round of independent screening on 200 randomly sampled en-zh and en-ja instances. The alignment rates between the two rounds are 99% for en-zh and 98% for en-ja, confirming the stability of the annotation process. These results demonstrate that the constructed dataset is both reliable and reproducible, establishing a solid foundation for subsequent stages.
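Assuming the agreement in Table 3 is raw percent agreement over candidates (which the paper does not spell out, but which its near-99% figures suggest), the computation reduces to a simple proportion; a minimal sketch:

# Minimal sketch of inter-annotator agreement as we read Table 3: the share
# of candidates on which the two reviewers made the same accept/reject call.
def agreement(judgments_a: list[bool], judgments_b: list[bool]) -> float:
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100.0 * matches / len(judgments_a)

# e.g. two reviewers agreeing on 3 of 4 candidates -> 75.0
print(agreement([True, True, False, True], [True, False, False, True]))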
3.2 Sentence-level Construction
Based on R̂_filtered, we generate each pseudo translation r̃ by merging k single-error candidates r̂, where k ∈ {0, 1, 2, 3, 4, 5}, all of which come from the same r̂_filtered, i.e., the candidates filtered for each pair (s, r), as illustrated in Figure 2. r̃ is a variant of r containing between 0 and 5 errors, thus covering six distinct quality levels in the MQM framework. Part 2 of Table 1 provides an example, where two non-overlapping r̂ are merged to form an r̃ with two errors. In addition, a special case is that of 0 errors, corresponding to the reference itself.[13]

[13] In this case, a metric should assign a full score to the triplet, as the translation matches the gold reference.
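A minimal sketch of this merging step, assuming each filtered candidate carries its error span as character offsets (our own representation; overlap handling follows the discarding rule described next):

# Sketch of sentence-level merging: combine k single-error candidates into a
# pseudo translation, keeping only combinations whose error spans do not
# overlap; the span bookkeeping is our own simplification.
from itertools import combinations

def spans_disjoint(spans: list[tuple[int, int]]) -> bool:
    ordered = sorted(spans)
    return all(prev[1] <= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))

def mergeable_sets(candidates: list[tuple[str, tuple[int, int]]], k: int):
    # candidates: (edited sentence, error span) pairs filtered for one (s, r);
    # each yielded combination is a recipe for a k-error pseudo translation.
    for combo in combinations(candidates, k):
        if spans_disjoint([span for _, span in combo]):
            yield combo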
By merging candidates, we can flexibly produce pseudo translations with the desired scores. However, candidates may contain overlapping error spans, which compromise the locality of each error. Such overlapping combinations are simply discarded, so the actual number of pseudo translations is smaller than the theoretical maximum.
As a result, each triplet yields a set of pseudo translations that cover different quality levels. Table 4 reports the minimum and maximum numbers of pseudo translations generated per triplet for each language direction, reflecting the constraints imposed by overlap and sentence structure.

      en-zh  en-lo  en-ja  en-vi  en-id  en-fr  en-es  en-si  en-de
max   176    176    150    218    176    139    139    139    176
min   19     8      11     21     7      8      11     11     15

Table 4: The maximum and minimum numbers of pseudo translations generated for each triplet in different translation directions.

3.3 System-level Construction and Final Evaluation

As shown in Part 3 of Table 1, an instance is formed as a triplet (s, r̃, r). By iterating over the entire dataset, we obtain the triplet pool D, which constitutes the final dataset of XQ-MEval.

Figure 2 further illustrates how D enables systematic benchmarking of automatic metrics. We assume the existence of a translation system with a given MQM score, derived from the number of error spans, and then construct a pseudo system by sampling triplets that reflect this target performance. This procedure is both flexible and powerful because it allows us to generate arbitrary pseudo systems tailored to different evaluation scenarios. Based on pseudo systems with predefined performance, we evaluate them using automatic metrics and measure the alignment between metric scores and the predefined scores as a proxy for consistency with human judgments.[14]

[14] Appendix A exhibits the process of computing system-level metric scores and shows how they are compared to the predefined scores, i.e., human evaluations.
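As an illustration of this sampling (our own sketch; the paper's exact procedure is given with pseudocode in Appendix A), a pseudo system with a predetermined score can be assembled by fixing how many triplets to draw from each quality level:

# Illustrative sketch of assembling a pseudo system with a predetermined
# score; pool[k] holds triplets containing k injected major errors.
import random

def build_pseudo_system(pool, level_counts, seed=0):
    # level_counts maps error count k (0-5) to the number of triplets drawn
    # from that level; each major error deducts 5 MQM points per sentence.
    rng = random.Random(seed)
    sampled = []
    for level, count in level_counts.items():
        sampled.extend(rng.sample(pool[level], count))
    total = sum(level_counts.values())
    predetermined = -5 * sum(k * c for k, c in level_counts.items()) / total
    return sampled, predetermined  # metric scores are later compared to this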
4 Experimental Setup

Based on XQ-MEval from Section 3, we perform a large-scale, multilingual analysis of existing automatic evaluation metrics[15], as follows.

[15] We primarily focus on metrics within the categories defined in Section 2. However, we also analyze LLM-based approaches, including LLM-adapted regression metrics and MQM-style LLM-as-judge evaluation, in Appendix J.

Sequence-based: (1) spBLEU (Goyal et al., 2022), a variant of BLEU that unifies tokenization across languages through a SentencePiece tokenizer (Kudo and Richardson, 2018); (2) chrF++ (Popović, 2017), which assesses character-level overlap and balances precision with recall; (3) BLEURT-20 (Sellam et al., 2020), a BERT-based metric trained on human-annotated data to better align with human judgments.

(a) System-level Kendall-τ
Num. of Lang.  BLR.   COM.   xCOM.  MX-r.  KW22.  KW23.  MX-q.
3              0.89   0.88   0.90   0.80   0.88   0.86   0.82
6              0.88   0.88   0.89   0.84   0.90   0.89   0.83
9              0.87   0.89   0.90   0.83   0.90   0.89   0.83

(b) Triplet-level Kendall-τ
Num. of Lang.  BLR.   COM.   xCOM.  MX-r.  KW22.  KW23.  MX-q.
3              0.50   0.46   0.38   0.35   0.44   0.39   0.32
6              0.46   0.44   0.42   0.38   0.45   0.39   0.33
9              0.48   0.42   0.44   0.38   0.44   0.38   0.32

Table 5: The system-level and triplet-level Kendall-τ correlations between averaged metric scores and human judgments on pseudo systems. "Num. of Lang." denotes the number of involved languages: 3 means the system is sampled from zh, lo, and de; 6 from zh, lo, de, id, ja, and si; 9 from all languages. Metric abbreviations: BLEURT (BLR.), COMET (COM.), xCOMET (xCOM.), MX-reg (MX-r.), KIWI22 (KW22.), KIWI23 (KW23.), and MX-qe (MX-q.).
Regression-based: (1) COMET-22 (Rei et al., 2022a), which integrates source, hypothesis, and reference embeddings to predict quality scores; (2) xCOMET-XL (Guerreiro et al., 2024), which improves interpretability by detecting errors explicitly; (3) MetricX-23 (Juraska et al., 2023), abbreviated as MX-reg, initialized from mT5 (Xue et al., 2021) and fine-tuned on MQM data.

Reference-free: (1) COMET-KIWI-22 (Rei et al., 2022b), abbreviated as KIWI22, a reference-free variant of COMET-22; (2) COMET-KIWI-23 (Rei et al., 2023), abbreviated as KIWI23, an extended version of KIWI22; (3) MetricX-23-QE (Juraska et al., 2023), abbreviated as MX-qe, the reference-free variant of MetricX-23.

5 Analysis on the Average Strategy

5.1 Verification

To verify the consistency between the average strategy and human evaluation in multilingual MT evaluation, we assemble 10 pseudo systems to approximate real-world translation systems. Following the procedure of Section 3.3, each pseudo system is built by aggregating 102 triplets sampled per language pair from multiple languages to meet the predetermined scores. After scoring each triplet, system-level metric scores are computed by averaging the respective scores across directions, followed by calculating their correlation with human evaluation to assess agreement. This procedure is repeated 100 times for stability, and the average correlation across these repetitions is reported. We rely on the Kendall-τ coefficient (Kendall, 1938), a statistical measure of rank correlation, to quantify the consistency between the rankings induced by the metrics and by the predetermined scores, where higher values indicate stronger consistency, and vice versa.
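The consistency check itself is standard rank correlation; assuming SciPy is available, it reduces to the following (the scores below are made up purely for illustration):

# Rank-correlation check between averaged metric scores and the
# predetermined scores of the pseudo systems (illustrative values only).
from scipy.stats import kendalltau

metric_scores = [88.1, 85.4, 83.9, 80.2, 77.5]  # system-level averages
human_scores = [0, -5, -10, -15, -20]           # predetermined MQM scores

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall-tau = {tau:.2f}")  # 1.0 would mean perfectly consistent rankings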
Table 5(a) reports the system-level correlation results under three settings with 3, 6, and all 9 languages, where the subsets of 3 and 6 were selected to maximize linguistic diversity. Although correlations appear high across settings, this is expected in our simplified evaluation setup, where instance quality is divided into five coarse-grained levels with large gaps, making quality differences easier for metrics to distinguish. As a result, such high correlations may be inflated by the evaluation setup and should be interpreted with caution.

To further examine whether this apparent consistency holds at a finer granularity, we analyze metric behavior at the triplet level. Since pseudo systems are constructed from triplets, we group all possible triplets across languages to form test systems. Table 5(b) presents the resulting triplet-level correlations, which are substantially lower and indicate pronounced inconsistency. These results shed light on the concerns raised by the system-level analysis and point to potential cross-lingual inconsistencies in metric scoring behavior.

5.2 Analysis

To analyze inconsistencies between metrics and human evaluations, we construct pseudo monolingual systems, each restricted to a single translation direction and quality level. Unlike multilingual systems, this setting isolates metric behavior within one language and enables direct cross-language comparison at the same quality level. Moreover, to address imbalances in triplet counts across quality levels[16], we randomly sample 102 triplets per system and repeat this procedure 10 times to ensure robustness.[17]
[16] Appendix F counts and lists the triplet distribution across the different languages.
[17] We further report tests with 5, 10, and 25 repetitions in Appendix G to support our design choices.

Quality Level  spBLEU  chrF   BLEURT  COMET  xCOMET  MX-reg  KIWI22  KIWI23  MX-qe
1              1.53    7.56   1.62    2.51   9.95    22.80   2.38    6.05    23.61
2              3.16    9.84   4.46    3.84   14.77   23.74   2.39    8.55    22.38
3              5.04    12.00  7.26    5.09   20.51   23.75   2.66    11.23   20.57
4              7.25    14.22  10.07   6.27   26.01   24.23   3.29    14.42   18.94
5              10.01   16.23  13.79   7.61   28.67   24.03   3.76    18.86   15.40

Table 6: The cross-lingual CV (%) of scores for nine automatic metrics measured at five quality levels.

[Figure 3: Visualization of the nine metric scores (spBLEU, chrF, BLEURT, COMET, xCOMET, MX-reg, KIWI22, KIWI23, MX-qe) across the nine translation directions at varying quality levels (x-axis: number of errors, 1 to 5; y-axis: metric score), with en-all denoting the average metric scores among all directions.]

At the same quality level. Table 6 reports cross-lingual coefficients of variation (CV) for the nine metrics across five quality levels, corresponding to translations with the number of errors ranging from 1 to 5. For each quality level, CV is computed from the mean
and standard deviation of metric scores across the nine monolingual systems. CV measures score inconsistency across languages at the same quality level, indicating whether metrics provide consistent judgments as the translation direction varies, with ideal values close to zero. The results show inconsistencies for most metrics, with CV increasing as translation quality decreases. This indicates that metrics assign divergent scores to translations of comparable quality, deviating from human evaluation and reflecting cross-lingual bias in the scoring behavior of metrics.
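We read the CV here as the usual ratio of standard deviation to mean across the nine per-direction scores, expressed in percent; a minimal sketch under that assumption:

# Cross-lingual coefficient of variation at one quality level: std over mean
# of the per-direction metric scores (in %), as we read Table 6.
import statistics

def cross_lingual_cv(scores_per_direction: list[float]) -> float:
    mu = statistics.mean(scores_per_direction)
    sigma = statistics.pstdev(scores_per_direction)
    return 100.0 * sigma / abs(mu)  # abs() guards metrics on negative scales

# e.g. hypothetical COMET-style scores for nine directions at one level
print(cross_lingual_cv([85.6, 83.1, 86.2, 84.0, 82.7, 85.9, 84.8, 80.2, 86.4]))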
Across different quality levels. Figure 3 plots metric scores across translation directions at varying quality levels to examine whether score trends remain consistent as quality varies.[18] Curves across directions should overlap, with similar scores and trends across quality levels. In contrast, two phenomena are observed.[19]

[18] Specific values are provided in Appendix H.
[19] Appendix I describes the differences across directions and across metrics in detail.

(a) System-level Kendall-τ
Num. of Lang.  BLR.   COM.   xCOM.  MX-r.  KW22.  KW23.  MX-q.
3              0.90   0.89   0.91   0.88   0.90   0.88   0.85
6              0.92   0.91   0.90   0.91   0.92   0.90   0.86
9              0.91   0.93   0.92   0.88   0.91   0.91   0.86

(b) Triplet-level Kendall-τ
Num. of Lang.  BLR.   COM.   xCOM.  MX-r.  KW22.  KW23.  MX-q.
3              0.51   0.48   0.47   0.41   0.45   0.41   0.34
6              0.49   0.48   0.48   0.42   0.46   0.41   0.35
9              0.50   0.48   0.49   0.41   0.45   0.40   0.34

Table 7: Kendall-τ correlations at the system level and triplet level with LGN applied, corresponding to Table 5. All settings and abbreviations follow Table 5. Bold values indicate improvements of LGN over the average strategy; the improvements are modest in magnitude but statistically significant, with significance tests reported in Appendix K.
First, metric scores differ across directions even at the same quality level. Second, as quality decreases, score reduction rates vary across directions, leading to widening gaps between the curves. Consistent with the analysis in Table 6, these variations confirm the existence of cross-lingual scoring bias in automatic translation metrics, posing a challenge for metrics to align with human evaluations in multilingual settings, where uniformity across directions is expected.

6 Normalization-based Scoring

6.1 Methodology

The analysis in Section 5.2 reveals substantial variation in metric score ranges across translation directions. Figure 4 further illustrates this issue using COMET, where the distribution of scores for different target languages diverges even when the human score is fixed at 15, corresponding to 3 errors in each translation. It is evident that different languages occupy distinct numerical scales, making metric scores inconsistent even when human quality is comparable.

To address this problem, we propose Language-specific Global Normalization (LGN), which adopts z-score normalization to unify score scales across languages via the mean and standard deviation. LGN computes the mean and standard deviation of triplet scores for each translation direction across all quality levels. For a given direction, 102 triplets are randomly sampled per quality level (including error-free translations) and pooled to calculate the global mean and standard deviation.[20] This process is repeated 10 times, and the final values are obtained by averaging across repetitions.
[20] Appendix A describes the computational procedure with pseudocode.

[Figure 4: The COMET score distribution across different translation directions (en-zh, en-lo, en-ja, en-vi, en-id, en-es, en-fr, en-si, en-de) under fixed human evaluation scores. The bar sections represent the mean ± standard deviation, while the whiskers indicate the maximum and minimum values.]

By normalizing scores, LGN effectively reduces discrepancies between score ranges by narrowing the gaps in score distributions. The general formula for normalization is as follows, with μ and σ being the direction-wise mean and standard deviation:

z = (score − μ) / σ.   (1)
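At scoring time, LGN is just Equation (1) applied per direction; the sketch below is our own rendering (the authors' pseudocode is in Appendix A), with μ and σ estimated from each direction's pooled triplet scores:

# Sketch of Language-specific Global Normalization (LGN): z-score each
# direction with its own global mean/std before cross-lingual averaging.
import statistics

def lgn(scores_by_direction: dict[str, list[float]]) -> dict[str, list[float]]:
    normalized = {}
    for direction, scores in scores_by_direction.items():
        mu = statistics.mean(scores)       # direction-wise global mean
        sigma = statistics.pstdev(scores)  # direction-wise global std
        normalized[direction] = [(s - mu) / sigma for s in scores]
    return normalized

# The normalized scores then replace the raw ones in the average strategy
# of Section 5.1, putting all directions on a shared scale.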
6.2 Experiments and Results

We evaluate LGN by applying it before cross-lingual score averaging, following the same experimental setup as in Table 5. The results in Table 7 show that LGN consistently improves the correlation between automatic metrics and human evaluations in multilingual settings. Although the absolute gains are moderate, partly because correlations are already high under the original setup, the paired-sample t-tests reported in Appendix K confirm that the improvements are statistically significant. This also reflects the concern raised in the system-level verification of Section 5.1, where the values shown in Table 5(a) are high but still suboptimal due to cross-lingual scoring bias. By reducing disparities in score ranges, LGN improves cross-lingual consistency at both the system and triplet levels.[21]

[21] We also reproduce the analysis of Figure 3 after applying LGN in Appendix L.
This directly addresses the concern raised in the system-level analysis: without normalization, averaging scores across directions is unreliable, as some languages may be systematically over- or under-estimated. Our results suggest that applying LGN before aggregation provides a more reliable basis for multilingual system evaluation. While the generalizability of LGN warrants further investigation, these findings offer initial evidence that normalization-based scoring can mitigate cross-lingual bias in automatic evaluation metrics.

7 Conclusion

In this work, we introduce XQ-MEval, the first multilingual dataset designed to achieve parallel quality across languages for benchmarking automatic evaluation metrics. Based on this benchmark, we identify limitations in the commonly used practice of averaging metric scores across translation directions to represent system-level performance. Specifically, we reveal that cross-lingual scoring bias, caused by metrics exhibiting different scoring ranges across languages, is a key factor contributing to the misalignment between metrics and human evaluation in multilingual settings. Building on this observation, we propose a normalization-based strategy that mitigates cross-lingual scoring bias by narrowing the distances between score ranges. Experimental results show that the LGN strategy significantly improves consistency with human evaluations and highlight the importance of aligning score ranges across languages to a unified scale before averaging.
Limitations

Human evaluation remains a major bottleneck in machine translation research, as large-scale multilingual annotation, especially expert-level annotation, is costly and resource-intensive. Although our semi-automatic pipeline alleviates this reliance and makes benchmark construction more efficient, the current version covers only nine translation directions. Nevertheless, the pipeline is highly flexible and can be extended to more languages in future work.

While the MQM framework provides a comprehensive set of error categories, we focus on only four purely semantic error types in our work. However, as discussed in Section 3.1, these error types are better suited to achieving cross-lingual comparability and represent the most prominent categories in existing MQM datasets, accounting for approximately 46.3% of all errors. Although our pipeline can incorporate additional error types, doing so first requires careful linguistic justification to ensure that the added types remain comparable across languages.

Given that this is the first work to discuss fairness in evaluating multilingual translation systems, it raises further questions for future research. For instance, are metrics equally sensitive to different error types, or do they respond unevenly? More intriguingly, does this sensitivity vary across languages? We leave these fine-grained investigations for future work.

Ethics Statement

In this work, we construct the XQ-MEval dataset based on Flores, a public dataset, combined with manual filtering to enhance its quality. We recruit eligible students from our institution to assist with the human annotation tasks, and the compensation provided complies with local standards.
All human-involved steps during construction are carefully designed to ensure that no personal information is involved. The manual annotation process adheres strictly to the ethical guidelines of our institution and the ACL ethics policy, and the recruitment and annotation were approved by the ethics review committee of our affiliation. Generally, this benchmark can be applied in real-world scenarios, supporting the evaluation of automatic evaluation metrics in multilingual settings.

Flores is released under the CC BY-SA 4.0 license[22], which explicitly permits adaptation and sharing. To fully comply with these terms, we release XQ-MEval under CC BY-SA 4.0 as well. Moreover, XQ-MEval is created using GPT-4o and is therefore subject to OpenAI's terms of use[23]; OpenAI assigns to us all rights, titles, and interests in and to the output.

[22] https://huggingface.co/datasets/openlanguagedata/flores_plus
[23] https://openai.com/policies/terms-of-use

Use of AI Assistance

During the preparation of this paper, we used ChatGPT to assist with proofreading and polishing. The model was employed solely to improve the clarity, grammar, and readability of the manuscript; all ideas, experimental designs, analyses, and conclusions come from the authors. The authors carefully reviewed and verified all AI-assisted edits to ensure correctness and faithfulness to the intended meaning.

References

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. MT-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7421–7454, Bangkok, Thailand.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, and 2 others. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Prague, Czech Republic. Association for Computational Linguistics.

Zhe Cao, Zhi Qu, Hidetaka Kamigaito, and Taro Watanabe. 2024. Exploring intrinsic language-specific subspaces in fine-tuning multilingual neural machine translation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21142–21157, Miami, Florida, USA.
Liang Chen, Shuming Ma, Dongdong Zhang, Furu Wei, and Baobao Chang. 2023. On the off-target problem of zero-shot multilingual neural machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9542–9558, Toronto, Canada.

Sonia Cristofaro. 2009. Grammatical categories and relations: universality vs. language-specificity and construction-specificity. Language and Linguistics Compass, 3(1):441–479.

Michael Denkowski and Alon Lavie. 2010. Choosing the right evaluation for machine translation: an examination of annotator and automatic metric performance on human judgment tasks. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers, Denver, Colorado, USA. Association for Machine Translation in the Americas.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Salvador García, Julián Luengo, and Francisco Herrera. 2015. Data Preprocessing in Data Mining, volume 72. Springer Cham.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André F. T. Martins. 2024. xCOMET: Transparent machine translation evaluation through fine-grained error detection. Transactions of the Association for Computational Linguistics, 12:979–995.

Martin Haspelmath. 2010. Comparative concepts and descriptive categories in crosslinguistic studies. Language, 86(3):663–687.

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. MetricX-24: The Google submission to the WMT 2024 metrics shared task. In Proceedings of the Ninth Conference on Machine Translation, pages 492–504, Miami, Florida, USA.

Juraj Juraska, Mara Finkelstein, Daniel Deutsch, Aditya Siddhant, Mehdi Mirzazadeh, and Markus Freitag. 2023. MetricX-23: The Google submission to the WMT 2023 metrics shared task. In Proceedings of the Eighth Conference on Machine Translation, pages 756–767, Singapore.

Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1-2):81–93.

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popović, and 3 others. 2024. Findings of the WMT24 general machine translation shared task: The LLM era is here but MT is not solved yet. In Proceedings of the Ninth Conference on Machine Translation, pages 1–46, Miami, Florida, USA.
Tom Kocmi and Christian Federmann. 2023. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation, pages 768–775, Singapore.

Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings on the Workshop on Statistical Machine Translation, pages 102–121, New York City. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. MT-eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, Miami, Florida, USA.

Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, Singapore.

Arle Richard Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2013. Multidimensional quality metrics: a flexible system for assessing translation quality. In Proceedings of Translating and the Computer 35, London, UK. Aslib.
Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Kotaro Funakoshi, and Manabu Okumura. 2025. Minimum Bayes risk decoding for error span detection in reference-free automatic machine translation evaluation. Preprint, arXiv:2512.07540.

NLLB Team. 2022. No language left behind: Scaling human-centered machine translation. Preprint, arXiv:2207.04672.

OpenAI. 2024. GPT-4 technical report. Preprint, arXiv:2303.08774.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium.

Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, and Taro Watanabe. 2025. Registering source tokens to target language spaces in multilingual neural machine translation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21687–21706, Vienna, Austria.
Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, and Taro Watanabe. 2025. Registering source tokens to target language spaces in multilingual neural machine translation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21687–21706, Vienna, Austria.
Zhi Qu and Taro Watanabe. 2022. Adapting to non-centered languages for zero-shot multilingual translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5251–5265, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022a. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.
Ricardo Rei, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa Coheur, José G. C. de Souza, and André F. T. Martins. 2023. Scaling up CometKiwi: Unbabel-IST 2023 submission for the quality estimation shared task. In Proceedings of the Eighth Conference on Machine Translation, pages 841–848, Singapore.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online.
Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022b. CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online.
Shaomu Tan and Christof Monz. 2025. ReMedy: Learning machine translation evaluation from human preferences with reward modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4370–4387, Suzhou, China.
David Vilar, Gregor Leusch, Hermann Ney, and Rafael E. Banchs. 2007. Human evaluation of machine translation through binary system comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 96–103, Prague, Czech Republic. Association for Computational Linguistics.
Pius Von Däniken, Jan Milan Deriu, and Mark Cieliebak. 2025. A measure of the system dependence of automated metrics. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 87–99, Vienna, Austria.
Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Chen Luo, Sheikh Muhammad Sarwar, Yang Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, and Yangqiu Song. 2025. EcomScriptBench: A multi-task benchmark for E-commerce script planning via step-wise intention-driven product association. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–22, Vienna, Austria.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online.
Biao Zhang, Ankur Bapna, Rico Sennrich, and Orhan Firat. 2021. Share or not? Learning to schedule language-specific capacity for multilingual translation. In International Conference on Learning Representations.

A Computational Procedure
Algorithm 1 formalizes the average strategy described in Section 1, which evaluates multilingual MT systems by first computing metric scores for each triplet (Step 5) and then averaging scores across all translation directions to obtain a system-level score (Step 14). Two highlighted components further clarify key aspects of our evaluation setup. Step 15 computes the corresponding human score, which serves as the predefined performance used to benchmark metrics against human judgments, as discussed in Section 3.3. In addition, Step 7 presents the normalization-based LGN strategy proposed in Section 6.1, where triplet-level metric scores are normalized after computation.
Algorithm 1 Evaluation with Average Strategy
 1: Input: number of language pairs N; number of triplets per language pair I; metric scoring function METRIC(r̃); human scoring function HUMAN(r̃); normalization flag USE_LGN; normalization function LGN(s_m)
 2: Output: overall metric score S_M; overall human score S_H
 3: for i ← 1 to N do                          ▷ language pairs
 4:     for j ← 1 to I do                      ▷ triplets
 5:         s_m^(j) ← METRIC(r̃_{i,j})
 6:         if USE_LGN then
 7:             s_m^(j) ← LGN(s_m^(j))
 8:         end if
 9:         s_h^(j) ← HUMAN(r̃_{i,j})
10:     end for
11:     s̄_m^(i) ← (1/I) Σ_{j=1}^{I} s_m^(j)
12:     s̄_h^(i) ← (1/I) Σ_{j=1}^{I} s_h^(j)
13: end for
14: S_M ← (1/N) Σ_{i=1}^{N} s̄_m^(i)
15: S_H ← (1/N) Σ_{i=1}^{N} s̄_h^(i)
16: return S_M, S_H
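To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1. METRIC and HUMAN are passed in as callables, and LGN is instantiated as a per-direction standardization for illustration only: the paper specifies LGN as a normalization that aligns score distributions across languages, so the z-score form and the (mu, sigma) statistics used here are assumptions, not the released implementation.

    from statistics import mean

    def evaluate_average_strategy(triplets, metric_fn, human_fn, lgn_stats=None):
        # triplets: dict mapping a direction (e.g., "en-zh") to a list of
        # (source, pseudo_translation, reference) triplets.
        # lgn_stats: optional dict mapping a direction to (mu, sigma); None
        # corresponds to USE_LGN = False in Algorithm 1.
        per_dir_metric, per_dir_human = [], []
        for direction, items in triplets.items():       # Steps 3-13
            metric_scores, human_scores = [], []
            for triplet in items:
                s_m = metric_fn(triplet)                # Step 5
                if lgn_stats is not None:               # Steps 6-8
                    mu, sigma = lgn_stats[direction]
                    s_m = (s_m - mu) / sigma            # assumed z-score form of LGN
                metric_scores.append(s_m)
                human_scores.append(human_fn(triplet))  # Step 9
            per_dir_metric.append(mean(metric_scores))  # Step 11
            per_dir_human.append(mean(human_scores))    # Step 12
        # Steps 14-15: average the per-direction means into system-level scores.
        return mean(per_dir_metric), mean(per_dir_human)

In this sketch, lgn_stats would be estimated once per direction from the benchmark's score distribution and then reused at evaluation time.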
B Language Selection in Benchmark Construction
In constructing the benchmark, we select nine target languages paired with English, resulting in nine translation directions: en-zh, en-lo, en-ja, en-vi, en-id, en-es, en-fr, en-si, and en-de. This selection aims to ensure a comprehensive evaluation across high-resource and low-resource languages. As discussed in Section 2, most widely used metrics are driven by MQM-style training, i.e., fine-tuned on MQM-annotated data. However, MQM annotations are only available for high-resource languages, resulting in an imbalanced data distribution. Intuitively, this imbalance may lead MQM-driven metrics to exhibit stronger biases when evaluating translations in low-resource languages. In addition, practical constraints such as the availability of native-speaking volunteers for filtering pseudo translations also influence our language choices. Taking these factors into account, we determine that the selected translation directions strike a reasonable balance between linguistic diversity and feasibility, making the benchmark both representative and manageable. Among the selected languages, those supported by MQM training data are zh, de, es, ja, and fr; those without MQM support are lo, vi, id, and si.

C Verification on MQM
As mentioned in Section 1, investigating cross-lingual scoring bias requires instances with strictly parallel semantics and quality. However, MQM datasets cover only a limited number of language pairs, among which only en-de and en-ru satisfy this requirement. For these two directions, we partition instances into five MQM score ranges: 0, (0, 5], (5, 10], (10, 15], and (15, 25], merging the highest range due to data sparsity. We evaluate these instances using BLEURT, XCOMET, and COMETKIWI-23 (spanning all metric types).
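The partitioning step just described can be expressed directly; the helper below is an illustrative sketch (the function and variable names are ours, not from the released code) that maps an MQM penalty to the five ranges above.

    def mqm_bucket(penalty):
        # Ranges used above: 0, (0, 5], (5, 10], (10, 15], and (15, 25],
        # where the highest range is merged due to data sparsity.
        if penalty == 0:
            return "0"
        for upper, label in ((5, "(0,5]"), (10, "(5,10]"),
                             (15, "(10,15]"), (25, "(15,25]")):
            if penalty <= upper:
                return label
        return None  # penalties above 25 fall outside the studied ranges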
The results in Figure 5 show that translations of comparable quality in different language pairs are assigned different scores by the metrics, particularly XCOMET, even when only two language pairs are involved. Moreover, the results demonstrate that cross-lingual scoring bias exists in MQM data and follows a trend similar to that observed in Figure 3, thereby validating our synthetic instances.
In addition, we conduct further experiments on real MQM datasets using the factors learned from our benchmark. Specifically, we select the mqm_generalMT2024_ende and mqm_generalMT2024_enes datasets, as these two language pairs overlap with those covered in our benchmark. We observe a severe imbalance in the en-es data, where triplets with multiple errors are rare. To improve comparability, we restrict en-de to triplets with few errors. The datasets contain outputs from different systems, which we rank using averaged MQM scores across language pairs. As no references are available, we use the reference-free metric COMETKIWI-23 for evaluation. We compute system-level rankings using both original and LGN-calibrated scores. Calibration improves the correlation with MQM rankings from 45.05 to 46.81, indicating better alignment with human evaluation. Despite the limited setting, these results provide further evidence for the effectiveness of LGN, corroborating the validity of our synthetic instances.

Figure 5: Visualization of three metrics' scores across two directions at varying translation quality levels on the MQM dataset.

D Annotation Guidelines
To ensure that native speakers acquire a clear understanding of the purpose of our experiment and the definition of MQM, thereby enabling them to more accurately identify and filter error candidates that meet the required criteria, we compile an instruction document that provides the necessary background information and operational guidelines. It is included in the following:

Background
In the evaluation of translation quality, a human-centric framework known as Multidimensional Quality Metrics (MQM) (https://themqm.org/) is widely used. Specifically, MQM classifies translation quality based on a standardized error taxonomy, resulting in a scoring system that is both low in subjectivity and high in comparability. This framework significantly facilitates both production and research efforts.
However, MQM annotation is inherently inefficient and costly, as it heavily depends on the manual work of expert annotators. While, in theory, advanced artificial intelligence could act as expert-level annotators, such a substitution is not entirely trustworthy because we cannot verify whether the AI has truly reached expert proficiency.
Fortunately, and interestingly, our task is NOT to evaluate a machine translation system in the MQM style. Instead, we aim to obtain MQM-style scores. Specifically, this means we can use advanced AI systems to disrupt a set of perfect translations by introducing errors defined under MQM. Then, we simply ask native speakers to verify whether the disruption was successful. This approach allows us to obtain reliable MQM scores on a given dataset.
Task
Each volunteer will be provided with four files, named en-{lang}-{error}.tsv, where {lang} points to each volunteer's native language, and {error} refers to four common and easily quantified types of errors in machine translation: Addition, Omission, Mistranslation, and Untranslated. In each file, there are three parts that should be noticed:
• src: The source sentence in English.
• ref: The correct (perfect) translation of the source sentence in the volunteer's native language.
• mt: The sentence that has been disrupted by using GPT-4o. Specifically, GPT-4o introduced an error into each ref.
Please note, the error in mt is marked by <v> </v>. Now, you should check the quality of mt, and judge whether the error marked by <v> </v> indeed disrupts ref without any change in the rest part of ref. If the answer is YES, you don't need to take any action; otherwise, you should write T in the reject column to indicate that the disruption is not acceptable.
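Downstream of this filtering step, the volunteer files can be consumed as in the sketch below, which assumes only the layout described above: tab-separated columns src, ref, and mt, plus a reject column. The example file name is hypothetical.

    import csv
    import re

    V_SPAN = re.compile(r"<v>(.*?)</v>", re.DOTALL)

    def load_accepted_candidates(path):
        # Keep only the error candidates that the native speaker did not reject.
        accepted = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if row.get("reject", "").strip() == "T":
                    continue  # disruption judged unacceptable
                match = V_SPAN.search(row["mt"])
                if match is None:
                    continue  # no marked error span to check
                accepted.append({"src": row["src"], "ref": row["ref"],
                                 "mt": row["mt"], "error_span": match.group(1)})
        return accepted

    # Example (hypothetical file name):
    # candidates = load_accepted_candidates("en-fr-Addition.tsv")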
Criteria
The following are the evaluation criteria for each type of error:

Addition
The error in mt marked by <v></v> introduces additional semantics into ref.
• If the error indeed presents additional semantics in the ref without any change in the rest part of ref, then this mt is acceptable, i.e., you don't need to take any action.
• Otherwise, please write T in the reject column.
• Note that the key is whether the semantics are changed, i.e., a change in the adverb of degree is considered as reject.

Omission
The mt has a missing part compared to the ref, and the missing part is marked by <v></v>.
• If the missing part in the mt causes a change in meaning, then this mt is acceptable, i.e., you don't need to take any action.
• Note, omission could make the sentence unreadable. However, the unique criterion is that the part outside of the labeling (<v></v>) is not changed.
• In languages using spaces as intervals, some words could be labeled. However, the following case caused by changes of punctuation is also acceptable:
ref: ..., en particulier les affaires de voitures volées, avec l'intention...
mt: ..., en <v>particulier,</v> avec l'intention...
Here, les affaires de voitures volées is omitted, and the label is caused by the change of particulier → particulier,.
• Otherwise, e.g., the missing part changes the part outside of <v></v> or the marked part is not actually missing, please write T in the reject column.
Mistranslation
The error in mt marked by <v></v> is a mistranslation from src.
• Given that ref is a ground-truth translation from src, you can simply compare ref and mt. If the error in mt conveys different words or semantics compared to ref, this mt is acceptable, i.e., you don't need to take any action.
• Otherwise, please write T in the reject column.

Untranslated
The error in mt marked by <v></v> has not been translated and remains in the original English.
• Simply copying from src or changing words but remaining in English is recognized as acceptable, i.e., you don't need to take any action.
• If the untranslated words are person's names or place names, please write T in the reject column.

Overall
Changes in the content of the mt may result in grammatical errors in the overall sentence. This is acceptable as long as the part marked with <v></v> in the mt indeed causes a change in meaning without changes in the part outside of <v></v>; in that case, the mt is acceptable.

E Prompt Design
To instruct GPT-4o to introduce addition, omission, mistranslation, and untranslated errors into references, thereby obtaining temporary error candidates containing one error segment, we design a specific prompt for each error type. Figure 6 shows the details of the prompts.
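As a sketch of how such a prompt could be issued, the snippet below uses the OpenAI Python client. The system/user message split, the model identifier string, and the tag value passed for {position} are our assumptions rather than the paper's exact configuration, and the prompt body is abbreviated (see Figure 6 for the full instructions).

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    SYSTEM_MSG = "Your task is to introduce errors to disrupt the quality."
    ADDITION_TEMPLATE = (
        "Given a sentence, your task is to add an addition error to disrupt the quality.\n"
        "Please do the following instructions step by step:\n"
        "1. You should select a sub-part of the sentence in the part enclosed by "
        "<{position}> and </{position}>, then output the sub-part you selected.\n"
        "...\n"  # steps 2-4 exactly as in Figure 6
        "Sentence: {sentence}"
    )

    def inject_addition_error(reference, position="part1"):
        # `position` names the tag pair delimiting the region eligible for
        # disruption; the concrete tag value here is a placeholder.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_MSG},
                {"role": "user", "content": ADDITION_TEMPLATE.format(
                    position=position, sentence=reference)},
            ],
        )
        return response.choices[0].message.content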
F Triplets Count Distribution
Table 8 shows the triplet count distribution across the five quality levels for each language pair. As shown in the table, triplets at quality levels 2 and 3 are the most frequent, while triplets at level 5 are fewer. This is because the quality level reflects the number of errors in pseudo translations; as the error count increases, overlapping error spans reduce the number of generated triplets.

Quality Level   en-zh   en-lo   en-ja   en-vi   en-id   en-fr   en-es   en-si   en-de
1                 776     753     775     771     782     775     771     765     774
2               2,109   2,053   2,078   2,056   2,095   1,992   2,016   2,064   2,049
3               2,548   2,627   2,441   2,420   2,421   2,068   2,233   2,489   2,337
4               1,466   1,704   1,324   1,387   1,311     957   1,069   1,432   1,234
5                 406     558     340     428     312     198     203     361     313

Table 8: The triplet count distribution across the five quality levels for each language pair.
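The span-overlap effect just described can be made concrete: building a quality-level-k pseudo translation requires k validated single-error edits whose spans do not collide. The bookkeeping below is our illustration, not the released pipeline.

    import random

    def merge_errors(reference, edits, k, seed=0):
        # edits: list of (start, end, replacement) character-span edits against
        # `reference`, each taken from one validated single-error candidate.
        # Returns a level-k pseudo translation, or None when k non-overlapping
        # edits cannot be found; this is why higher quality levels yield fewer
        # triplets (see Table 8).
        rng = random.Random(seed)
        pool = list(edits)
        rng.shuffle(pool)
        chosen = []
        for start, end, repl in pool:
            if all(end <= s or start >= e for s, e, _ in chosen):
                chosen.append((start, end, repl))
            if len(chosen) == k:
                break
        if len(chosen) < k:
            return None
        text = reference
        for start, end, repl in sorted(chosen, reverse=True):  # apply right-to-left
            text = text[:start] + repl + text[end:]
        return text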
G Discussion on Repeated Sampling
To examine the effect of repeated sampling on evaluation stability, we test three metrics, i.e., BLEURT, xCOMET, and KIWI23, on monolingual systems for en-zh, en-ja, and en-de at five quality levels. For each system, 102 triplets are sampled, and the procedure is repeated 5, 10, and 25 times. Table 10 reports the means and variances across these settings. As the sampling iterations increase, the mean scores shown in the table remain stable. Although the variance fluctuates to some extent, this is because the variance is scaled to the square of the scoring scale, since the scores are amplified by a factor of 100. Consequently, the variance remains within a small range, and we consider these fluctuations acceptable. Ultimately, we adopt the approach of repeating the process 10 times in our main experiments.
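A sketch of this stability check, with synthetic scores standing in for real metric outputs:

    import random
    from statistics import mean, pvariance

    def sampling_stability(scores, sample_size=102, iterations=10, seed=0):
        # Repeatedly draw `sample_size` triplet scores, average each draw, and
        # summarize the per-iteration averages by their mean and variance.
        rng = random.Random(seed)
        averages = [mean(rng.sample(scores, sample_size))
                    for _ in range(iterations)]
        return mean(averages), pvariance(averages)

    rng = random.Random(1)
    scores = [rng.uniform(40.0, 90.0) for _ in range(2000)]  # stand-in scores (x100 scale)
    for iters in (5, 10, 25):
        m, v = sampling_stability(scores, iterations=iters)
        print(f"{iters:>2} iterations: mean={m:.2f}, variance={v:.2f}")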
H Detailed Scores
Table 11 presents the detailed scores of Figure 3.

I Score Reduction across Directions and Metrics
Figure 3 reveals that as translation quality declines, the rate of score reduction differs across translation directions, highlighting the varying sensitivities of metrics to quality changes across languages. This variation exacerbates score inconsistencies across directions at the same quality level, particularly for lower-quality translations. The widening gaps in the decline patterns further illustrate this trend.
Similarly, score reduction patterns differ across metrics. For spBLEU, scores are close at high quality but diverge as quality decreases due to different decline rates across directions. chrF shows more consistent decline trends, though its score ranges vary substantially across directions, with zh and ja exhibiting systematically lower ranges. BLEURT exhibits behavior similar to spBLEU, but with larger cross-lingual discrepancies in score reduction as translation quality deteriorates. For COMET and xCOMET, score reduction trends exhibit similar patterns across directions; however, COMET assigns direction-specific score ranges with limited overlap, whereas xCOMET produces more aligned score ranges for most directions, except lo, si, and de. In contrast, KIWI22 and KIWI23 come closer to the desired properties of an ideal metric, as they exhibit more closely aligned score ranges and score reduction trends, although KIWI23 still shows noticeable score range discrepancies for certain directions. By comparison, the MetricX variants display substantial cross-lingual inconsistency in both score ranges and reduction patterns, with the regression-based MetricX exhibiting pronounced inconsistencies.
(System) Your task is to introduce errors to disrupt the quality.
"Addition": \n
Given a sentence, your task is to add an addition error to disrupt the quality.
Please do the following instructions step by step:
1. You should select a sub-part of the sentence in the part enclosed by <{position}> and </{position}>, then output the
sub-part you selected.
2. You should disrupt the sub-part by adding some words which includes information not present in the selected sub-
part. Please keep the rest the same. Then output the disrupted sub-part.
3. Replace the selected sub-part by the disrupted sub-part to get the updated sentence.
4. Finally, output the updated sentence.
"Omission": \n
Given a sentence, your task is to add an omission error to disrupt the quality.
Please do the following instructions step by step:
1. You should select a sub-part of the sentence in the part enclosed by <{position}> and </{position}>, then output the
sub-part you selected.
2. You should select a segment containing some important information in the sub-part. Note that a segment means
some words or a phrase rather than a clause. Then output the segment you selected.
3. You should delete the segment from the sub-part to get the disrupted sub-part, make sure that you just delete one
segment.
4. Replace the sub-part in the sentence by the disrupted sub-part to get the updated sentence.
5. Finally, output the updated sentence.
"Mistranslation": \n
Given a sentence, your task is to add a mistranslation error to disrupt the quality.
Please do the following instructions step by step:
1. You should select a sub-part of the sentence in the part enclosed by <{position}> and </{position}>, then output the
sub-part you selected.
2. You should select a segment containing some important information in the sub-part. Ensure the segment is a natural
and coherent phrase rather than fragments of different sentences or clauses. And the segment is typically a short
phrase that conveys a key idea without unnecessary details. Then output the segment you selected.
3. You should replace the segment you selected in the sub-part, with alternatives that change the meaning of that part
to get the disrupted segment. Do NOT perform simple substitutions, such as replacing "1" with "2" or "good" with
"bad". Use descriptive phrases or reframe the meaning to introduce different information. Then output the disrupted
segment.
4. Replace the selected segment in the selected sub-part by the disrupted segment to get the disrupted sub-part, then
output the disrupted sub-part.
5. Replace the selected sub-part in the sentence by the disrupted sub-part to get the updated sentence.
6. Finally, output the updated sentence.
"Untranslated": \n
Given a source sentence and target sentence, your task is to add an untranslation error to disrupt the translation quality.
Please do the following instructions step by step:
1. You should select a sub-part of the target sentence in the part enclosed by <{position}> and </{position}>.
Note that, a sub-part means a word or a phrase instead of a clause. Then output the sub-part you selected.
2. Given that our objective is to create an untranslation error, the selected sub-part should be in {language} instead of
English and is not present in the source sentence. Please validate it. If it cannot meet our requirement, please select
another sub-part in {language}.
3. You should find the corresponding part from the source sentence. Then output the corresponding source part.
4. Replace the selected sub-part by the corresponding source part to get the updated target sentence, finally output the
updated sentence.
Figure 6: The prompt for different error types to guide GPT-4o to introduce errors to references.
J Experiment on LLM-based Metrics
We investigate two LLM-based evaluation approaches: ReMedy (Tan and Monz, 2025), a trainable evaluation metric fine-tuned from LLMs, and GEMBA-MQM (Kocmi and Federmann, 2023), which prompts LLMs to simulate human annotators by following MQM guidelines. Using these evaluators, we assess translation triplets at varying quality levels across three language pairs, en-zh, en-es, and en-ja, reflecting the language coverage of ReMedy in our study. The results in Figure 7 reveal substantial variation across language pairs, indicating that LLM-based evaluators remain susceptible to cross-lingual bias.
        BLEURT                              xCOMET                              KIWI23
mean     1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh  81.98  70.37  62.36  56.45  51.85  82.90  68.44  55.50  45.20  36.66  66.38  56.09  46.43  40.23  33.39
en-ja  81.65  69.15  60.18  53.47  48.18  80.85  63.93  49.20  38.78  32.51  67.15  57.28  49.18  43.50  39.67
en-de  80.05  67.82  59.01  52.02  47.85  91.89  83.21  72.83  64.83  56.65  61.92  49.16  38.19  29.52  25.43
var      1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh   0.36   0.36   1.49   0.40   0.62   0.55   0.85   1.27   1.21   0.74   0.33   0.69   1.04   0.07   1.68
en-ja   0.41   0.58   0.10   0.80   0.83   0.35   2.22   2.99   3.39   1.60   0.27   1.44   0.89   0.40   1.43
en-de   0.39   1.94   1.22   1.83   0.37   0.19   0.26   2.71   2.57   4.01   0.94   1.38   3.70   1.43   1.68

(a) Mean and variance for 5 iterations of sampling.
        BLEURT                              xCOMET                              KIWI23
mean     1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh  81.90  70.60  62.42  56.36  52.09  82.93  68.49  56.06  45.32  37.41  66.28  56.12  46.95  39.87  33.83
en-ja  81.97  69.27  60.36  53.74  48.21  81.12  64.38  48.79  38.62  32.48  67.66  57.38  49.36  43.73  39.54
en-de  79.63  67.99  59.62  52.23  47.35  91.45  83.20  73.64  64.12  55.40  60.92  48.82  38.95  29.87  24.84
var      1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh   0.22   0.42   1.00   0.36   0.62   0.67   0.60   2.36   0.85   1.02   0.29   0.41   2.05   0.17   1.66
en-ja   0.45   0.43   0.36   0.89   0.58   0.79   1.56   2.26   2.87   1.28   0.72   0.76   0.69   0.59   0.94
en-de   0.71   1.33   1.52   1.18   0.64   0.80   0.57   4.47   2.24   6.12   1.84   1.54   4.35   0.99   1.95

(b) Mean and variance for 10 iterations of sampling.

        BLEURT                              xCOMET                              KIWI23
mean     1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh  81.80  70.52  62.39  56.21  51.91  82.56  68.64  55.88  45.51  37.23  66.10  55.66  46.72  39.93  33.59
en-ja  81.98  69.71  60.58  53.65  47.92  81.34  64.51  49.80  38.56  32.04  67.72  57.55  49.73  43.68  39.23
en-de  79.63  68.04  59.47  52.91  47.29  91.41  83.23  73.69  64.79  55.16  60.89  48.54  38.51  30.84  24.78
var      1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh   0.34   0.53   1.07   0.64   0.78   0.77   1.52   2.67   1.37   0.88   0.43   1.49   1.82   1.01   1.10
en-ja   0.42   0.49   0.81   0.73   0.71   1.87   1.86   3.87   2.37   1.77   0.90   0.56   1.89   0.79   1.25
en-de   0.50   0.99   1.25   1.49   0.78   0.59   1.01   2.48   3.04   3.75   1.49   1.92   3.15   2.95   1.31

(c) Mean and variance for 25 iterations of sampling.

Table 10: Mean and variance for 5, 10, and 25 iterations of sampling. Note that the scores are amplified by a factor of 100, so the scale of the variance corresponds to the square of the scoring scale.

Figure 7: Visualization of LLM-based evaluation scores across three directions at varying translation quality levels.
K Significance Test
We conduct paired samples t-tests on the improvements obtained with the LGN strategy in Table 7. As shown in Table 9, all p-values are below 0.05, indicating that although the improvements are small in magnitude, they are statistically significant.

Num.   System-level   Triplet-level
3      0.03           0.03
6      0.009          0.003
9      0.001          0.005

Table 9: Paired samples t-test results (p-values) for system-level and triplet-level improvements obtained with the LGN strategy.
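The test itself is standard; the sketch below uses SciPy's paired-samples t-test, with placeholder values standing in for the real paired results without and with LGN (the numbers are not the paper's data).

    from scipy.stats import ttest_rel

    # Placeholder paired measurements for the same systems, scored without and
    # with LGN; substitute the actual per-system values behind Table 9.
    without_lgn = [41.2, 43.5, 40.8, 44.1]
    with_lgn = [42.0, 44.1, 41.9, 44.7]

    t_stat, p_value = ttest_rel(with_lgn, without_lgn)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant when p < 0.05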
L Results under the LGN Strategy
Figure 8 shows the normalized scores of nine metrics across translation directions at varying quality levels. As illustrated in the figure, the LGN strategy effectively narrows score range disparities across language pairs, as evidenced by the reduced distances between curves. After applying LGN, translations of comparable quality from different language pairs receive consistent metric scores, and the score degradation trends as translation quality decreases become more consistent across directions.

        spBLEU                              chrF                                BLEURT
         1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh  83.85  70.50  60.36  51.84  45.16  74.46  62.94  54.34  46.97  41.70  81.90  70.81  62.60  56.89  52.61
en-lo  84.99  73.74  64.88  55.64  48.21  87.33  77.31  69.93  63.03  57.35  82.87  73.03  65.26  57.72  52.88
en-ja  81.13  66.94  55.20  45.00  35.78  75.03  65.13  57.05  50.47  44.38  81.97  69.74  61.05  54.29  47.59
en-vi  84.21  72.71  62.93  56.48  51.55  90.44  82.54  75.98  71.39  67.88  81.85  71.86  63.54  57.29  52.25
en-id  83.10  70.02  60.12  51.25  44.98  90.83  82.89  76.89  71.49  67.46  83.15  74.56  68.42  64.55  62.11
en-fr  84.03  71.60  62.46  55.21  48.60  90.41  82.03  75.74  70.37  65.28  80.96  67.37  56.23  47.83  40.07
en-es  84.30  71.72  62.35  53.94  46.73  90.80  82.63  76.49  71.52  67.16  83.37  71.68  62.18  54.59  48.96
en-si  86.18  75.42  67.07  59.80  53.27  91.40  83.41  77.29  71.65  66.73  84.38  78.44  72.83  68.57  64.73
en-de  84.09  72.25  62.61  54.31  48.01  90.93  83.13  76.58  71.38  66.47  79.63  68.03  59.56  53.64  46.88

(a) Sequence-based metrics.

        COMET                               xCOMET                              MX-reg
         1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh  90.42  85.97  82.17  78.46  75.10  82.93  68.65  55.82  45.86  37.67   2.40   3.58   4.75   5.63   6.86
en-lo  88.15  83.31  79.07  74.60  71.16  63.72  48.64  36.11  27.04  21.01   3.46   5.49   7.24   8.81  10.22
en-ja  92.63  88.90  85.18  81.85  77.12  81.12  64.52  49.83  39.70  31.86   1.85   2.91   3.87   4.91   6.03
en-vi  89.81  85.61  81.59  78.68  75.26  81.83  68.44  55.56  46.71  36.54   3.22   5.05   6.55   8.05   9.71
en-id  91.72  87.93  83.95  80.52  77.35  84.43  70.58  55.69  45.30  36.19   2.52   3.89   5.53   6.65   7.58
en-fr  86.49  80.36  74.45  69.70  64.79  82.18  66.13  51.27  40.37  30.00   2.64   4.04   5.51   6.81   8.40
en-es  87.15  81.37  75.73  70.81  66.54  86.06  72.62  59.45  47.80  37.82   2.44   3.90   5.40   6.37   7.28
en-si  92.51  89.30  85.92  82.84  79.64  69.67  52.42  37.07  26.85  20.31   2.30   4.08   6.08   7.67   9.83
en-de  87.39  81.18  75.75  70.89  64.36  91.45  83.13  73.95  66.14  55.02   1.53   2.23   2.87   3.40   4.17

(b) Regression-based metrics.

        KIWI22                              KIWI23                              MX-qe
         1      2      3      4      5      1      2      3      4      5      1      2      3      4      5
en-zh  76.40  69.10  63.34  58.68  54.74  66.28  55.69  46.63  40.37  34.33   1.70   2.32   3.00   3.54   4.26
en-lo  74.50  68.24  63.82  60.12  57.88  61.41  52.93  46.59  40.60  36.18   2.71   3.87   4.83   5.81   6.62
en-ja  79.37  72.80  67.69  63.90  60.15  67.66  57.27  49.42  44.60  38.96   1.38   2.04   2.65   3.25   4.19
en-vi  76.38  70.37  65.30  61.81  58.60  64.94  55.71  48.44  43.27  38.43   2.04   2.89   3.62   4.43   5.11
en-id  76.09  69.15  63.20  58.60  55.54  66.37  56.34  47.66  41.96  37.25   2.23   3.12   4.16   4.94   5.54
en-fr  78.60  70.98  64.87  60.42  55.42  60.82  47.49  38.08  30.50  24.20   1.97   2.83   3.85   4.70   5.94
en-es  76.16  68.48  61.98  57.04  53.43  64.96  53.30  43.91  36.34  30.66   1.97   2.84   3.87   4.63   5.26
en-si  80.42  72.71  65.64  60.72  56.20  74.13  64.18  55.82  49.46  44.87   1.25   1.89   2.74   3.72   4.93
en-de  75.62  68.56  62.28  58.17  53.91  60.92  48.52  38.47  31.73  24.32   1.44   2.11   2.71   3.35   4.21

(c) Regression-free metrics.

Table 11: The detailed scores of nine metrics when evaluating different languages at various quality levels.
Figure 8: Visualization of nine metrics' scores under the LGN strategy across nine directions at varying translation quality levels. en-all denotes the average metric scores among all directions.