MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation
Summary
This paper introduces MERIT, a framework for improving Chinese-centric low-resource machine translation (LRL→Chinese) for five Southeast Asian languages: Vietnamese, Burmese, Lao, Tagalog, and Indonesian. The challenge stems from the scarcity of clean parallel corpora and pervasive noise in existing data, which limits model training and widens performance gaps compared to high-resource language pairs. MERIT addresses this through a unified framework combining language-specific token prefixing (LTP), supervised fine-tuning (SFT), and a novel group relative policy optimization (GRPO) guided by a semantic alignment reward (SAR). The framework also introduces CALT, a Chinese-centric benchmark derived from the ALT corpus, eliminating English-pivot bias for direct LRL→Chinese evaluation. Experiments show that MERIT-3B, using only 22.8% of the original data, significantly outperforms larger baselines like NLLB-200 3.3B, demonstrating that high-quality data and reward-guided optimization are more effective than model scaling in low-resource settings. The study highlights the importance of data curation and targeted fine-tuning for improving translation quality in under-resourced languages.
MERIT: Multilingual Expert-Reward Informed Tuning for Chinese-Centric Low-Resource Machine Translation
Zhixiang Lu, Chong Zhang, Chenyu Xue, Angelos Stefanidis, Chong Li, Jionglong Su, Zhengyong Jiang
Xi'an Jiaotong-Liverpool University
zhengyong.jiang02@xjtlu.edu.cn

Abstract
Neural machine translation (NMT) from Chinese to low-resource Southeast Asian languages remains severely constrained by the extreme scarcity of clean parallel corpora and the pervasive noise in existing mined data. This chronic shortage not only impedes effective model training but also sustains a large performance gap with high-resource directions, leaving millions of speakers of languages such as Lao, Burmese, and Tagalog with persistently low-quality translation systems despite recent advances in large multilingual models. We introduce Multilingual Expert-Reward Informed Tuning (MERIT), a unified translation framework that transforms the traditional English-centric ALT benchmark into a Chinese-centric evaluation suite for five Southeast Asian low-resource languages (LRLs). Our framework combines language-specific token prefixing (LTP) with supervised fine-tuning (SFT) and a novel group relative policy optimization (GRPO) guided by the semantic alignment reward (SAR). Our experiments confirm that, in LRL→Chinese translation, targeted data curation and reward-guided optimization dramatically outperform mere model scaling.

1 Introduction
The vision of Neural Machine Translation (NMT) is to provide equitable access to information for speakers of over 7,000 languages worldwide. However, the benefits brought by recent advances in large-scale models are highly uneven. High-resource language pairs such as English–French have achieved near-human BLEU scores (Papineni et al., 2002), while many low-resource languages remain almost entirely untranslatable due to severe data scarcity (NLLB Team et al., 2024).
This disparity is particularly marked between Chinese and low-resource languages (LRLs) such as Burmese and Lao. These languages suffer from scarce parallel and monolingual data, limited annotators, and sometimes non-standardized orthographies, deepening the digital divide. While multilingual models like mBART-50 (Liu et al., 2020) and mT5 (Xue et al., 2021) show zero-shot gains, they still underperform on these LRLs. Even with NLLB-200's expanded coverage (NLLB Team et al., 2024), LRL-Chinese performance continues to trail English-pivoted directions.

Moreover, the lack of publicly available, high-quality evaluation benchmarks hinders objective progress measurement. An ideal benchmark should: (i) cover multiple LRL→Chinese directions directly to avoid pivot bias, (ii) maintain sufficient balance in scale and domain coverage, and (iii) avoid dependence on English as a pivot. Without these features, improvements in Chinese-centric translation are difficult to reproduce or attribute reliably.

To address these challenges, this paper makes the following three contributions:
• Benchmark Construction: We introduce CALT, the first Chinese-centric benchmark for Southeast Asian languages. By reconstructing the ALT corpus to eliminate English-pivot bias, we establish a rigorous, pivot-free standard for evaluating direct LRL-Chinese translation.
• Methodological Innovation: We propose MERIT, a framework that combines LTP, SFT, and a novel GRPO method with SAR for efficient data filtering and model refinement.
• Empirical Validation: Experiments show that MERIT-3B, using only 22.8% of the original data, significantly outperforms much larger baselines, showing that high-quality data and reward-guided optimization trump model scale in low-resource translation.
2 Related Work
LRL→Chinese Corpora. The CCMT shared tasks released fewer than 200k sentence pairs for ZH–UG and ZH–MN (Xu et al., 2021), while Wiki-based mining typically yields only a few thousand pairs for ZH–LO and ZH–FIL (Artetxe and Schwenk, 2018). The ALT corpus (Thu et al., 2016) extends coverage to 13 LRLs but remains English-centric and lacks direct LRL→Chinese alignment.

LLM-based Machine Translation. In recent years, the integration of LLMs into machine translation has witnessed rapid and substantial progress. Related research focuses mainly on low-resource scenarios (Zebaze et al., 2025; Lu et al., 2025), context and prompt optimization (Feng et al., 2024; Zaranis et al., 2024), self-refining mechanisms (Feng et al., 2025; Wang et al., 2024), terminology control (Kim et al., 2024), dialect and historical language translation (Abdelaziz et al., 2024; Volk et al., 2024), document-level translation (Wang et al., 2023), simultaneous translation (Koshkin et al., 2024), and fusion with traditional MT methods (Hoang et al., 2024). These works demonstrate that LLMs significantly outperform traditional methods in various challenging scenarios through a combination of translation, structured reasoning, self-refinement, and targeted fine-tuning techniques, while providing efficient and feasible solutions for low-resource languages, specialized domains, and real-time translation.
Multilingual Pretraining Models. Multilingual pretrained models aim to learn shared representations across tens to hundreds of languages within a single LLM, enabling cross-lingual transfer in zero-shot/few-shot settings (Lauscher et al., 2020). Early representative works include mBART-50 (Liu et al., 2020), mT5 (Xue et al., 2021), and DeltaLM (Ma et al., 2021), which cover between 50 and 101 languages. NLLB-200 (NLLB Team et al., 2024), released by Meta in 2022, achieves high-quality translation among 200 languages for the first time. Through technologies such as massively parallel data mining, back-translation, and self-supervised denoising, its average BLEU score on the FLORES-200 benchmark improved by 44% over the best system at the time (NLLB Team et al., 2024). Additionally, the XSTS manual evaluation index was introduced, which aligns more closely with human judgment (NLLB Team et al., 2024). However, our fine-grained analysis reveals that a significant quality gap still exists in the LRL-to-Chinese direction, underscoring the importance of parallel data quality, domain distribution, and the cross-lingual alignment strategy, as much as model size itself.

3 Methodology
3.1 Supervised Fine-Tuning
We fine-tune open-source models (Qwen-2.5-0.5B and 3B) on the CALT benchmark described in Section 3.5 using SFT (Fan et al., 2020). The objective of SFT is to maximize the conditional probability of producing the target sequence $Y = (y_1, \dots, y_M)$ given the source sequence $X = (x_1, \dots, x_N)$, following the standard sequence-to-sequence formulation (Sutskever et al., 2014). During training, we employ a teacher-forcing strategy (Williams and Zipser, 1989; Lu et al., 2026b), conditioning the model on the ground-truth previous tokens $y_{<t}$ to predict the next token $y_t$.
The model is trained by minimizing the cross-entropy loss. To prevent over-confidence and improve generalization, we incorporate label smoothing (Szegedy et al., 2015; Lu et al., 2026a). The token-level loss is defined as:

$$\mathcal{L}_{\mathrm{SFT}}(y_t, \hat{P}_t) = -\sum_{k=1}^{|V|} q'(k \mid y_t^*) \log P(k \mid X, y_{<t}; \theta), \qquad q'(k \mid y_t^*) = (1 - \varepsilon_{ls})\,\mathbb{I}(k = y_t^*) + \frac{\varepsilon_{ls}}{|V|} \tag{1}$$

Fine-tuning is performed separately for each LRL-Chinese pair to respect language-specific morphology. This combination of filtering and task-specific adaptation yields substantial gains over zero-shot baselines, echoing findings on targeted adaptation for massively multilingual models (Arivazhagan et al., 2019).
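As a concrete reference, the label-smoothed token-level loss of Equation (1) can be written in a few lines of PyTorch. This is a minimal sketch; the smoothing value eps_ls = 0.1 and the tensor shapes are illustrative assumptions, not values taken from the paper:

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits: torch.Tensor, targets: torch.Tensor,
                        eps_ls: float = 0.1) -> torch.Tensor:
    """Token-level label-smoothed cross-entropy, Equation (1).

    logits:  (batch, seq, |V|) raw scores for P(k | X, y_<t; theta)
    targets: (batch, seq) gold token ids y*_t
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # (1 - eps_ls) weight on the gold token y*_t ...
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # ... plus eps_ls / |V| weight spread uniformly over the vocabulary
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - eps_ls) * nll + eps_ls * smooth).mean()
```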
Figure 1: Overview of the MERIT framework. The pipeline consists of (a) heuristic data selection utilizing the Elite Parallel Data Sampler (EPDS, Algorithm 1) and Data Integrity Validation (DIV, Algorithm 2); (b) translation scoring via a QE agent trained on expert cross-reviewed quality data using GRPO-SAR; (c) reward-based data distillation, which uses the QE agent to rescore the entire training set and filter for high-quality translation data; and (d) final model optimization using SFT-LTP on this distilled dataset.

3.2 Language-specific Token Prefixing
We introduce a Language-specific Token Prefixing (LTP) strategy to improve language discrimination in many-to-one multilingual generation. This method injects a language-identifier token into both the tokenizer vocabulary and the SFT prompt, enabling consistent language conditioning. Given a sample consisting of a source sequence $X$ (from an LRL) and a target sequence $Y$ (Chinese), we prepend the language token corresponding to the source, denoted $[\mathrm{lang}]$, to the source input $X$ to create a modified source $X'$, as shown in Equation 2:

$$X' = [\mathrm{lang}] \oplus X = [\mathrm{lang}, x_1, x_2, \dots, x_n] \tag{2}$$

A prompt instruction $P(l)$ is defined for the target language $l$. An example structure for such a prompt is:

$$P(l) = \text{"Translate [lang] language into Chinese: ..."} \tag{3}$$

The final input to the model is then constructed by combining the prompt and the original source input:

$$\mathrm{Input} = [p_1, \dots, p_r, \mathrm{lang}, x_1, \dots, x_n] \tag{4}$$

where $[p_1, \dots, p_r]$ are the tokens derived from the prompt $P(l)$ and $[\mathrm{lang}]$ is the language-identifier token explicitly prepended before the source tokens $X$. The training objective is to minimize the negative log-likelihood of the target sequence $Y$, conditioned on the prompt $P(l)$ and the source input $X$:

$$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t=1}^{M} \log P(y_t \mid y_{<t}, P(l), X; \theta) \tag{5}$$

where $y_{<t}$ represents the preceding ground-truth target tokens and $\theta$ denotes the model parameters. This LTP method extends the target-language prefixing idea introduced by Johnson et al. (2017) and adapts the prompt-based control framework of Qu et al. (2025), combining tokenizer-level symbolic conditioning with prompt-level natural-language alignment, tailored for unified training in a many-to-one multilingual setup.
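A minimal sketch of the LTP input construction (Equations 2–4), using the Hugging Face tokenizers API; the base checkpoint, the token spellings such as `<vi>`, and the exact prompt wording are assumptions for illustration:

```python
from transformers import AutoTokenizer

# Base checkpoint and language-token spellings are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

LANG_TOKENS = {"vi": "<vi>", "my": "<my>", "lo": "<lo>", "fil": "<fil>", "id": "<id>"}
tokenizer.add_special_tokens(
    {"additional_special_tokens": list(LANG_TOKENS.values())}
)
# After extending the vocabulary, the embedding matrix must be resized:
# model.resize_token_embeddings(len(tokenizer))

def build_input(src_text: str, lang: str) -> str:
    """Compose Input = [p_1..p_r, lang, x_1..x_n] (Equations 3-4)."""
    prompt = f"Translate {lang} language into Chinese: "  # prompt P(l), Eq. (3)
    return prompt + LANG_TOKENS[lang] + " " + src_text    # [lang] prefix, Eq. (2)
```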
3.3 Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO) is a reinforcement learning strategy that refines model outputs using reward feedback. Inspired by Reinforcement Learning from Human Feedback techniques (Ouyang et al., 2022), GRPO operates on mini-batches of candidate translations in this task, assigning scalar rewards based on SAR scores. Subsequently, the model learns to maximize the expected reward via policy-gradient updates. Unlike conventional pointwise objective functions, GRPO introduces an intra-batch comparison mechanism and normalizes rewards using a moving baseline. This approach helps to reduce the variance of gradient estimates, thereby enhancing the stability of the training process (Lu et al., 2026b). Our experiments indicate that GRPO is effective for translation evaluation, enabling the selection and improvement of translation quality in datasets.
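The paper describes GRPO only in prose. The following sketch shows just the intra-group comparison step it refers to, normalizing scalar SAR rewards against a group baseline before the policy-gradient update; all names and values are illustrative:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize scalar SAR rewards for a group of candidate translations
    of the same source sentence against the group baseline (mean)."""
    baseline = rewards.mean()                      # group/moving baseline
    return (rewards - baseline) / (rewards.std() + eps)

# Four sampled candidates scored by the SAR-trained QE agent (toy values):
advantages = group_relative_advantages(torch.tensor([2.0, 1.0, 0.0, 1.0]))
# A policy-gradient step then weights each candidate's log-probability:
# loss = -(advantages.detach() * candidate_logprobs).mean()
```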
3.4 Semantic Alignment Reward Function
We introduce the Semantic Alignment Reward (SAR) to align the Quality Estimation (QE) agent's predictions with human expert judgments. SAR incentivizes the agent to generate scalar quality scores that precisely reproduce ground-truth expert values, ensuring the reliability of the reward signal used in downstream policy optimization.
Given the QE agent's generated evaluation log $c_i$, we extract potential quality scores using a predefined regular expression $R$ (e.g., capturing patterns like "Score: [0-100]"). Let $M(c_i)$ denote the set of all integer matches found within $c_i$:

$$M(c_i) = \{\, m \in \mathbb{Z} \mid P_R(m, c_i) \,\} \tag{6}$$

where $P_R(m, c_i)$ is true if integer $m$ matches $R$ in $c_i$. To resolve ambiguity in generated logs, we employ a conservative extraction function $E(c_i)$ that selects the minimum valid integer found, thereby penalizing outputs containing conflicting lower scores. If no match exists, a penalty indicator of $-1$ is assigned:

$$s_i = E(c_i) = \begin{cases} \min M(c_i), & \text{if } M(c_i) \neq \emptyset \\ -1, & \text{if } M(c_i) = \emptyset \end{cases} \tag{7}$$
Let $a_i$ represent the ground-truth expert score. We define the reward based on the absolute deviation $d = |s_i - a_i|$. To allow for minor subjective variation in human annotation while enforcing strict accuracy, we employ a stepwise mapping $\phi(d)$:

$$\phi(d) = \begin{cases} 2.0, & \text{if } d = 0 \\ 1.0, & \text{if } 1 \le d \le 10 \\ 0.0, & \text{otherwise} \end{cases} \tag{8}$$

This mapping grants maximum reward for exact matches and partial reward for acceptable estimates (within a 10-point tolerance), while zeroing out significant misjudgments. The final alignment reward $r_i$ is computed by applying this mapping to valid extractions. Instances where the agent fails to generate a parseable score ($s_i < 0$) receive zero reward:

$$r_i = \begin{cases} \phi(|s_i - a_i|), & \text{if } s_i \ge 0 \\ 0.0, & \text{if } s_i < 0 \end{cases} \tag{9}$$

This mechanism provides a dense supervision signal, effectively steering the QE agent to approximate the distribution of human expert preferences (Xu et al., 2024).
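Equations (6)–(9) translate directly into a short score-extraction and reward routine. This is a sketch under the assumption that the pattern $R$ matches strings of the form "Score: <integer>"; the exact regular expression used in the paper is not specified:

```python
import re

SCORE_RE = re.compile(r"Score:\s*(\d{1,3})")  # pattern R; exact regex is an assumption

def extract_score(log_text: str) -> int:
    """Conservative extraction E(c_i): minimum valid match, else -1 (Eqs. 6-7)."""
    matches = [int(m) for m in SCORE_RE.findall(log_text) if int(m) <= 100]
    return min(matches) if matches else -1

def sar_reward(log_text: str, expert_score: int) -> float:
    """Stepwise alignment reward r_i = phi(|s_i - a_i|) (Eqs. 8-9)."""
    s = extract_score(log_text)
    if s < 0:
        return 0.0                 # unparseable log: zero reward
    d = abs(s - expert_score)
    if d == 0:
        return 2.0                 # exact agreement with the expert score
    if d <= 10:
        return 1.0                 # within the 10-point tolerance
    return 0.0                     # significant misjudgment

# Conflicting scores in one log: the conservative minimum is kept.
assert extract_score("Score: 85 ... on reflection, Score: 70") == 70
assert sar_reward("Score: 78", expert_score=80) == 1.0
```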
3.5 Dataset
We construct a new test suite, CALT, based on the Asian Language Treebank (ALT) corpus (Thu et al., 2016). ALT is an English-centric multilingual corpus that already provides sentence-level alignment for several Southeast Asian languages: Vietnamese (vi), Burmese (my), Lao (lo), Tagalog (fil), and Indonesian (id); details are shown in Table 1. Although Chinese is included as a target aligned with English, no direct LRL→Chinese alignment exists. We therefore re-index sentences that share the same alt_id and semantic source to form direct LRL-Chinese sentence pairs. In the resulting test set, Chinese can serve as either the source or the reference language.

Languages         Speaker Population¹   Filtered Subset²   LRL
Chinese (zh)      1180M                 ✗                  ✗
Hindi (hi)        345M                  ✗                  ✗
Bengali (bn)      234M                  ✗                  ✗
Japanese (ja)     121M                  ✗                  ✗
Vietnamese (vi)   86M                   10K                ✓
Indonesian (id)   43M                   10K                ✓
Burmese (my)      33M                   10K                ✓
Tagalog (fil)     24M                   10K                ✓
Thai (th)         20M                   ✗                  ✗
Malay (ms)        18M                   ✗                  ✗
Khmer (km)³       16M                   –                  ✓
Lao (lo)          4.3M                  10K                ✓

Table 1: ALT corpus statistics sorted by L1 speaker population (Eberhard et al., 2023). All counts refer to L1 speakers and are rounded to the nearest million (M). ¹Speaker numbers derive from the most recent national censuses or Ethnologue reports (2023–2025) and are expressed in millions (M). ²Each ALT language contains approximately 20k aligned sentence pairs from a shared English source; see https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/. ³No L2 speaker community (Eberhard et al., 2023).
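A sketch of the alt_id re-indexing that forms direct LRL–Chinese pairs; the input format (one dictionary per language mapping alt_id to a sentence) is an assumption about how the ALT files are parsed:

```python
def realign_pairs(lrl_by_id: dict[str, str], zh_by_id: dict[str, str]) -> list[tuple[str, str]]:
    """Join LRL and Chinese sentences sharing the same alt_id to form
    direct LRL-Chinese pairs; each dict maps alt_id -> sentence."""
    shared_ids = sorted(set(lrl_by_id) & set(zh_by_id))
    return [(lrl_by_id[i], zh_by_id[i]) for i in shared_ids]

# Toy example with hypothetical alt_ids:
pairs = realign_pairs(
    {"ALT0001": "Xin chào thế giới", "ALT0002": "..."},
    {"ALT0001": "你好，世界", "ALT0003": "..."},
)  # -> [("Xin chào thế giới", "你好，世界")]
```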
3.6 Language Selection
We deliberately focus on five Southeast Asian low-resource languages (Vietnamese, Burmese, Lao, Tagalog, and Indonesian) for four data-driven reasons. All five appear in ALT with reliable English alignments, making high-quality LRL-Chinese re-alignment feasible. We retain Indonesian instead of Malay because the two belong to the same Malayic subgroup with 90% lexical overlap and are often treated as a single "Malay macrolanguage" in international surveys (Lim and Poedjosoedarmo, 2016); including both would introduce redundancy. Malay already has over 1 million clean En–Ms sentence pairs (e.g., MT-Wiki and multiple OPUS sub-corpora), placing it in the mid-resource tier (Stefanescu and Ion, 2013), whereas Indonesian still lacks sizable Chinese parallel data (<50k pairs in total, with ALT contributing only 20k) and remains low-resource according to the FLORES-200 and NLLB benchmarks (Goyal et al., 2022). Thai has aggregated corpora exceeding one million pairs and dedicated WMT/IWSLT tracks (Lowphansirikul et al., 2021), so it no longer meets a strict LRL definition. Khmer parallel resources are both small and highly noisy, with the WMT20 corpus-filtering task highlighting the need for extensive cleaning (Koehn et al., 2020). Moreover, Ethnologue reports virtually no L2 speaker community for Khmer (Eberhard et al., 2023), which would require a language-specific filtering pipeline and undermine comparability.

This reconstructed benchmark complements existing resources such as FLORES-200 (NLLB Team et al., 2024), particularly for LRL-Chinese directions in mainland and maritime Southeast Asia. Unlike pivot-based benchmarks, our test set avoids the semantic distortion introduced by intermediate English, thereby enabling more realistic, stable, and reproducible evaluation of Chinese-centric multilingual translation systems.

3.7 Model Overview
We evaluate a series of representative LLMs, spanning both proprietary and open-source systems:
• Qwen-2.5 (Ghosal et al., 2025; Cui et al., 2025): Chinese-English bilingual models fine-tuned for multilingual transfer, evaluated on several LRLs.
• GPT-4o (Huang et al., 2025): OpenAI's flagship model, tested on 16 languages, including several low-resource directions such as En–Te and En–Sw.
• Claude-3.5 (Enis and Hopkins, 2024): A multilingual LLM from Anthropic, evaluated via MQM metrics on pairs like En–Yoruba and En–Amharic.
• Gemini-2.5 (DeepMind Gemini Team, 2025): While lacking peer-reviewed benchmarks on LRL→Chinese tasks, its predecessor covers ultra-low-resource translation (e.g., En–Kalamang).
• DeepSeek (Huang et al., 2025; Larionov et al., 2025): A competitive open-source model evaluated in the BenchMAX suite alongside GPT-4o.
Zero-shot prompting was applied exclusively to the representative closed-source LLMs. For the Qwen-2.5 models, both SFT and GRPO-enhanced SFT were applied.
Additional tests, as illustrated in the Data Distilling module of Figure 1, were conducted with SFT and GRPO-enhanced SFT using enhanced data. This data was derived from closed-source model outputs, filtered by the QE agent, which was trained on expert-rated development sets and optimized with SAR to select only high-quality translations. This comparative design allows us to isolate the specific contributions of reward-guided data distillation, verifying whether high-quality data can effectively compensate for limited model scale. Model performance is assessed using standard metrics: BLEU-4 (Papineni et al., 2002), sacreBLEU (Post, 2018), chrF (Popović, 2015), ROUGE-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and BERTScore (Zhang et al., 2020).

3.8 Scoring and Selection
To construct high-quality parallel corpora for low-resource translation, we design a three-stage scoring and filtering pipeline that integrates interpretable statistical features with semantic evaluation, followed by reference-free quality estimation and threshold-based selection.
Stage I: Statistical and Semantic Feature-Based Scoring. In the first stage, we perform multi-dimensional scoring to capture both surface-level consistency and deeper semantic alignment between source and target sentences. Specifically, we extract five statistical features: sentence length ratio, token count ratio, punctuation frequency difference, digit proportion difference, and lexical diversity difference. These features have proven effective in prior work on noisy parallel data detection and alignment assessment (Munteanu and Marcu, 2005; Sánchez-Cartagena et al., 2018), particularly for identifying mismatches caused by language divergence or formatting inconsistencies.

To further enhance robustness, we incorporate two semantic signals. The first is conditional perplexity, which measures the fluency and naturalness of each sentence and is widely used for data selection in machine translation (Moore and Lewis, 2010; Junczys-Dowmunt, 2018). The second is instruction-following discrepancy, which evaluates whether the target sentence faithfully follows the communicative intent of the source, inspired by recent advances in instruction tuning (Li et al., 2024). All features are normalized and integrated through a weighted combination, enabling the identification of superficially aligned but semantically mismatched sentence pairs (Esplà-Gomis et al., 2020). Implementation details and ablation analysis are provided in Appendix A.5.
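For concreteness, illustrative versions of the five Stage-I statistical features are sketched below; the paper's exact feature definitions, normalization, and combination weights live in its Appendix A.5 and are not reproduced here:

```python
import string

def _punct_freq(s: str) -> float:
    return sum(ch in string.punctuation for ch in s) / max(len(s), 1)

def _digit_prop(s: str) -> float:
    return sum(ch.isdigit() for ch in s) / max(len(s), 1)

def _lex_diversity(tokens: list[str]) -> float:
    return len(set(tokens)) / max(len(tokens), 1)

def surface_features(src: str, tgt: str, src_tok, tgt_tok) -> dict[str, float]:
    """Five Stage-I statistical features (illustrative definitions).
    src_tok/tgt_tok are language-appropriate tokenizer callables, e.g.
    str.split for space-delimited LRLs and a segmenter for Chinese."""
    s_toks, t_toks = src_tok(src), tgt_tok(tgt)
    return {
        "len_ratio":   len(src) / max(len(tgt), 1),
        "token_ratio": len(s_toks) / max(len(t_toks), 1),
        "punct_diff":  abs(_punct_freq(src) - _punct_freq(tgt)),
        "digit_diff":  abs(_digit_prop(src) - _digit_prop(tgt)),
        "lexdiv_diff": abs(_lex_diversity(s_toks) - _lex_diversity(t_toks)),
    }
```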
Stage II: Reference-Free Quality Estimation. In the second stage, we adopt a quality estimation (QE) agent trained on expert-annotated development sets. Inspired by COMET-QE, this model provides reference-free assessments of semantic adequacy, fluency, and alignment quality. The QE agent refines the results of the initial feature-based scoring by evaluating translation candidates from a semantic perspective, without requiring ground-truth references (Rei et al., 2020; Freitag et al., 2021).

Stage III: Threshold-Based Distilling. In the final stage, we apply a filtering threshold to the QE scores. The threshold is empirically calibrated to balance high recall and quality retention, based on agreement with human evaluation on a validation set. Only sentence pairs exceeding the threshold are retained. This process ensures that the final training corpus is not only large in scale but also high in quality, making it suitable for fine-tuning compact models in low-resource scenarios.
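Stage III then reduces to a one-line filter; the threshold value below is a placeholder, since the paper calibrates it against human judgments on a validation set:

```python
def distill(pairs: list, qe_scores: list[float], threshold: float = 70.0) -> list:
    """Stage III: retain only pairs whose QE score clears the threshold.
    The value 70.0 is illustrative, not the paper's calibrated setting."""
    return [pair for pair, score in zip(pairs, qe_scores) if score >= threshold]
```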
4 Experiments and Analysis
4.1 Evaluation Method
We evaluate translation performance using both overlap-based and semantic-aware metrics.
Strict-Overlap: BLEU-4 (Papineni et al., 2002), sacreBLEU (Post, 2018), and ROUGE-L (Lin, 2004) assess lexical match and n-gram precision, which are crucial for evaluating surface-level accuracy and fluency.
Semantic-Friendly: chrF (Popović, 2015), METEOR (Banerjee and Lavie, 2005), and BERTScore (Zhang et al., 2020) measure semantic similarity and fluency robustness, capturing aspects that n-gram overlap alone might miss.
Each metric is computed on the reconstructed ALT test suite for five LRL→Chinese pairs. We report averages across directions, comparing zero-shot prompting, SFT, and GRPO-enhanced regimes.
To provide a balanced evaluation that captures both lexical precision and semantic adequacy, we propose a composite metric, BLEU-chrF, which integrates the Strict-Overlap and Semantic-Friendly categories by taking the arithmetic mean of the BLEU-4 and chrF scores:

$$\text{BLEU-chrF} = \frac{\text{BLEU-4} + \text{chrF}}{2} \tag{10}$$

By averaging these two widely used metrics, one emphasizing n-gram precision and the other character n-gram recall and F-score, we aim to achieve a more holistic assessment of translation quality, particularly for tasks where both lexical fidelity and semantic resemblance are important (see Table 5).
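Using sacrebleu's corpus-level scorers, the composite metric of Equation (10) can be computed as follows; whether these defaults match the paper's exact BLEU-4 and chrF configurations is an assumption:

```python
import sacrebleu

def bleu_chrf(hypotheses: list[str], references: list[str]) -> float:
    """Composite metric of Equation (10): mean of corpus BLEU and chrF."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return (bleu + chrf) / 2.0
```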
4.2 Main Result
Table 2 presents our evaluation results across five LRL→Chinese directions, categorizing metrics into Strict-Overlap (Papineni et al., 2002; Post, 2018; Lin, 2004) and Semantic-Friendly (Popović, 2015; Banerjee and Lavie, 2005; Zhang et al., 2020). Among leading closed-source models, Gemini-2.5 Flash consistently achieves top scores in BLEU-4 and chrF across multiple languages, such as Filipino (BLEU-4: 49.26, chrF: 42.68) and Indonesian (BLEU-4: 48.73, chrF: 41.93). Claude-3.5 Sonnet excels in ROUGE-L for Lao (14.00) and Burmese (24.44).

Figure 2: Performance–Scale Trade-offs of MERIT-3B and Baseline Models on Chinese-Centric Multilingual Translation. Comparison of BLEU-chrF scores against model size (log scale) across MERIT-3B, open-source, and estimated closed-source models.

The proposed MERIT framework demonstrates notable strengths. Specifically, MERIT-3B significantly outperforms the similarly sized open-source NLLB-200 3.3B model across several metrics for Filipino, Indonesian, and Vietnamese (see Figures 2 and 6). For instance, on Filipino, MERIT-3B achieves 29.71 BLEU-4 and 49.88 METEOR, compared to NLLB-200's 25.05 and 46.10, respectively. MERIT-3B also shows substantial gains over smaller open-source baselines on particularly challenging low-resource language pairs, and it demonstrates substantial advantages over DeepSeek-R1 7B, consistently outperforming it on lexical similarity metrics such as BLEU-4 and chrF across all five evaluated languages. Furthermore, when benchmarked against Qwen-2.5 7B, MERIT-3B, with only approximately 42.9% of its parameters, achieves highly competitive translation quality: on Filipino, MERIT-3B reaches 99.7% of Qwen-2.5 7B's ROUGE-L score (31.91 vs. 31.99) and over 98.3% of its BLEU-4 score (29.71 vs. 30.21); similar competitiveness is observed for Indonesian, where MERIT-3B attains approximately 95.7% of Qwen-2.5 7B's BLEU-4 score (34.73 vs. 36.28) and 93.2% of its ROUGE-L score (26.63 vs. 28.56).

Strict-Overlap
Model               BLEU-4                              sacreBLEU                           ROUGE-L                             FT  OS
                    fil    id     lo     my     vi      fil    id     lo     my     vi      fil    id     lo     my     vi
GPT-4o              45.26  47.77  28.10  30.48  45.03   43.60  45.81  27.38  29.53  44.50   33.62  31.40  12.33  23.57  28.25   ✗   ✗
Claude-3.5 Sonnet   42.97  47.35  32.20  30.92  45.14   43.38  46.54  32.65  30.14  44.55   35.03  31.17  14.00  24.44  28.45   ✗   ✗
Gemini-2.5 Flash    49.26  48.73  35.79  36.91  39.24   47.48  47.20  35.03  36.07  38.63   35.89  30.83  13.53  23.92  24.62   ✗   ✗
DeepSeek-V3         46.19  41.83  25.57  26.68  41.29   44.12  41.38  25.25  26.29  40.90   35.15  30.06  12.75  22.95  28.27   ✗   ✓
Qwen-2.5 32B        43.56  47.87  23.27  20.43  46.23   42.01  46.61  22.48  19.50  44.63   34.97  31.85  11.84  20.61  29.08   ✗   ✓
DeepSeek-R1 32B     37.98  43.29  12.97  8.87   42.57   37.14  42.27  12.71  8.43   41.45   32.40  29.92  10.04  16.31  28.11   ✗   ✓
Qwen-2.5 7B         30.21  36.28  6.17   6.43   35.22   29.75  36.41  5.67   5.42   35.07   31.99  28.56  9.86   16.41  27.81   ✗   ✓
DeepSeek-R1 7B      14.77  20.94  0.79   0.37   12.53   15.31  24.17  0.86   0.46   16.16   24.58  23.32  8.67   5.67   18.52   ✗   ✓
NLLB-200 3.3B       25.05  25.27  15.86  20.83  25.30   24.21  23.64  15.19  20.13  22.97   31.45  25.73  11.49  18.68  25.51   ✓   ✓
DeepSeek-R1 1.5B    0.07   0.08   0.05   1.06   0.05    0.03   0.06   0.01   0.11   0.02    0.77   0.90   1.80   3.80   0.55    ✗   ✓
M2M-100 1.2B        2.13   9.53   0.05   0.00   3.33    1.32   9.47   0.01   0.00   3.19    9.88   21.65  4.69   0.00   13.01   ✓   ✓
MERIT-3B (Ours)     29.71  34.73  5.15   4.56   31.20   27.20  33.16  4.28   3.54   29.25   31.91  26.63  8.40   13.68  21.18   ✓   ✓
Semantic-Friendly
Model               chrF                                METEOR                              BERTScore                           FT  OS
                    fil    id     lo     my     vi      fil    id     lo     my     vi      fil    id     lo     my     vi
GPT-4o              39.36  41.13  24.20  26.76  39.62   67.80  70.34  53.66  56.59  69.44   68.59  70.84  56.27  57.04  70.41   ✗   ✗
Claude-3.5 Sonnet   38.71  41.30  28.38  28.79  39.65   67.52  70.29  58.53  58.88  69.23   68.28  71.55  60.72  57.68  70.40   ✗   ✗
Gemini-2.5 Flash    42.68  41.93  30.61  32.09  34.66   70.14  70.87  60.21  61.85  60.31   71.67  72.77  63.93  62.39  60.88   ✗   ✗
DeepSeek-V3         39.80  36.95  22.37  24.43  36.86   68.04  67.10  51.41  53.66  67.43   70.02  68.81  54.82  54.21  68.87   ✗   ✓
Qwen-2.5 32B        37.77  41.35  20.36  18.13  39.80   66.61  70.57  46.72  43.29  69.76   67.73  71.73  50.50  45.09  71.13   ✗   ✓
DeepSeek-R1 32B     33.62  37.62  12.77  9.81   36.97   61.60  66.75  34.07  27.37  66.95   63.32  68.54  36.09  27.36  68.65   ✗   ✓
Qwen-2.5 7B         27.88  33.68  7.39   7.95   33.28   54.51  62.56  20.84  23.25  63.00   54.95  62.82  22.24  23.35  63.28   ✗   ✓
DeepSeek-R1 7B      15.43  21.60  2.20   1.35   14.87   36.02  49.35  8.45   6.46   37.61   34.22  49.07  1.46   4.10   36.54   ✗   ✓
NLLB-200 3.3B       22.48  23.57  14.92  18.69  22.43   46.10  45.28  36.27  42.28  45.73   48.61  54.34  44.80  48.93  52.59   ✓   ✓
DeepSeek-R1 1.5B    0.33   0.38   0.36   1.87   0.24    1.99   2.34   2.17   3.80   1.43    27.97  26.07  20.48  19.68  29.48   ✗   ✓
M2M-100 1.2B        5.16   18.21  0.26   0.01   6.49    4.65   14.00  0.57   0.11   6.10    16.71  1.68   10.04  16.34  9.51    ✓   ✓
MERIT-3B (Ours)     25.52  30.22  5.81   5.53   26.98   49.88  56.46  16.55  16.93  50.50   46.70  45.58  11.45  16.12  32.87   ✓   ✓

Table 2: Evaluation on five Southeast Asian languages. Strict-Overlap metrics include BLEU-4, sacreBLEU, and ROUGE-L; Semantic-Friendly metrics include chrF, METEOR, and BERTScore. In the original table, bold values indicate the highest score and underlined values the second highest in each metric column. FT: Fine-tuned; OS: Open Source.
For distinct scripts such as Lao and Burmese, models ≤1.5B fail catastrophically: instruction-following deficits lead to hallucinations or source copying, yielding negligible utility. These results underscore the efficacy of our reward-informed filtering and specialized fine-tuning approach, particularly in improving performance for low-resource languages and achieving competitive results within the open-source landscape, given model scale.

4.3 Module Comparison
To investigate the contribution of each component, we conduct an ablation study on Qwen-2.5-0.5B and Qwen-2.5-3B across four distinct setups: zero-shot (serving as our baseline), Supervised Fine-Tuning with Language-Token Prefixing (SFT-LTP), SFT-LTP followed by reward-enhanced tuning using Group Relative Policy Optimization with Semantic Alignment Reward (GRPO-SAR), and finally our full SFT-LTP + GRPO-SAR with an additional LLMs-Enhanced (LLME) stage.
As detailed in Table 3, initial SFT-LTP yields substantial improvements in both BLEU-4 and chrF scores over the zero-shot baselines across all languages for both model sizes. For instance, Qwen-2.5 3B sees its overall BLEU-chrF score increase from 9.10 to 15.53 after SFT-LTP. Introducing GRPO-SAR provides further consistent gains. Notably, for the Qwen-2.5 3B model, GRPO-SAR significantly boosts performance on low-resource pairs like Lao→Chinese, improving BLEU-4 from 1.17 (SFT-LTP) to 4.39 and chrF from 2.22 (SFT-LTP) to 5.17. Even with its limited capacity, the Qwen-2.5 0.5B model benefits remarkably from GRPO-SAR, achieving an overall BLEU-chrF score of 4.39, nearly a 40-fold increase (a 3890% relative improvement) over its zero-shot baseline score of 0.11. This underscores the efficacy of reward modeling, consistent with findings in instruction tuning (Ouyang et al., 2022; Xu et al., 2024).

Our proposed LLMs-Enhanced (LLME) stage demonstrates further advancements, particularly for the larger Qwen-2.5 3B model. With LLME, the Qwen-2.5 3B model achieves the highest overall BLEU-chrF score of 19.94, representing a 10.84 absolute point improvement (a 119% relative increase) over its zero-shot baseline. This highlights the synergistic benefits of our full pipeline.
Model          BLEU-4                                                       chrF                                                          Overall (BLEU-chrF)
               fil          id           lo          my          vi         fil          id            lo          my          vi
Qwen2.5-0.5B   0.03         0.03         0.02        0.01        0.01       0.16         0.12          0.40        0.25        0.06       0.11
+ SFT-LTP      1.86↑1.83    4.02↑4.00    0.25↑0.22   0.15↑0.14   3.12↑3.11  4.85↑4.69    10.38↑10.26   1.41↑1.00   1.07↑0.82   8.55↑8.50  3.57↑3.46
+ GRPO-SAR     2.31↑2.28    4.32↑4.29    0.26↑0.24   0.16↑0.16   6.29↑6.27  5.00↑4.84    10.64↑10.52   1.38↑0.98   1.22↑0.96   12.36↑12.30  4.39↑4.28
+ LLME         0.32↑0.29    0.45↑0.42    0.08↑0.06   0.07↑0.06   0.79↑0.78  1.20↑1.04    1.66↑1.54     0.45↑0.05   1.25↑1.00   2.82↑2.76  0.91↑0.80
Avg.           1.13         2.21         0.15        0.10        2.55       2.80         5.70          0.91        0.95        5.95       2.25
Qwen2.5-3B     5.80         14.25        1.11        1.83        17.06      8.83         17.71         2.05        3.03        19.35      9.10
+ SFT-LTP      23.01↑17.21  26.00↑11.75  1.17↑0.05   2.53↑0.69   27.94↑10.87  20.14↑11.31  24.25↑6.54  2.22↑0.17   3.53↑0.50   24.55↑5.20  15.53↑6.43
+ GRPO-SAR     25.58↑19.78  29.11↑14.86  4.39↑3.28   2.77↑0.93   32.54↑15.48  23.62↑14.78  27.83↑10.12  5.17↑3.12  3.45↑0.42   27.88↑8.53  18.23↑9.13
+ LLME         29.71↑23.91  34.73↑20.48  5.15↑4.04   4.56↑2.73   31.20↑14.14  25.52↑16.69  30.22↑12.51  5.81↑3.76  5.53↑2.50   26.98↑7.63  19.94↑10.84
Avg.           21.03        26.02        2.96        2.92        27.19      19.53        25.00         3.81        3.89        24.69      15.70

Table 3: Evaluation results of Qwen-2.5-0.5B and 3B on five Southeast Asian languages. All values are rounded to two decimal places. Improvements over the zero-shot baseline (the underlined rows in the original table) are shown with arrows.

While the LLME stage yields more modest gains for the Qwen-2.5 0.5B model in the current setup (overall BLEU-chrF of 0.91), the substantial cumulative improvements from SFT-LTP and GRPO-SAR on this smaller model, together with the peak performance achieved by the 3B model with LLME, collectively validate the effectiveness and scalability of our modular tuning strategy in significantly enhancing translation quality.
4.4 Effect of Data Distillation on Performance
We assess the impact of our quality filtering approach by comparing full-scale Supervised Fine-Tuning with Language-Token Prefixing (SFT-LTP) against subsequent reward-informed filtering and tuning via GRPO-SAR, using our MERIT-3B model. Table 4 details the number of retained training instances per language and the corresponding overall BLEU-chrF scores for these configurations.

The SFT-LTP stage utilizes the full set of 40,000 training instances. In contrast, the GRPO-SAR stage strategically curates this data, drastically reducing the volume to only 9,126 instances. This results in an average data reduction of 77.2%, with the most significant reduction observed for Vietnamese, where the training data was reduced by 87.8% (from 8,000 to 976 instances). Remarkably, despite this substantial pruning, the overall BLEU-chrF score actually improves from 15.53 (SFT-LTP on 40,000 instances) to 18.23 (GRPO-SAR on the reduced dataset), a relative gain of approximately 17.4%, signifying the efficient retention of highly informative samples.

These findings underscore the efficacy of our reward-based filtering as a data-efficient strategy that simultaneously reduces training data requirements and enhances model performance. This offers a compelling alternative to training on larger, potentially noisier, unfiltered datasets. The benefits of leveraging reward signals for targeted data curation align with effective strategies observed in other generative AI tasks, such as summarization and dialogue tuning (Ouyang et al., 2022).
5 Discussion
Future work should incorporate robust script normalization or transliteration to mitigate encoding inconsistencies in Lao and Burmese. The current GRPO-SAR reward may overemphasize adequacy; multi-objective rewards that balance adequacy and fluency are needed. Zero-shot baselines of large closed-source LLMs may be underestimated; combining our data distillation with advanced prompting or in-context learning on larger models remains a promising direction. The full discussion appears in Section A.1.

6 Conclusion
This work bridges a critical gap in Chinese-centric low-resource translation by introducing MERIT, a comprehensive framework comprising the CALT benchmark and a data-efficient training paradigm. By synergizing Language-specific Token Prefixing (LTP), Supervised Fine-Tuning (SFT), and our novel Group Relative Policy Optimization (GRPO) guided by the Semantic Alignment Reward (SAR), we demonstrate that model performance can be significantly decoupled from sheer parameter scale. Notably, our MERIT-3B model surpasses much larger baselines, such as NLLB-200 3.3B and M2M-100, while utilizing only 22.8% of the original training data. These findings underscore that in low-resource regimes, expert-aligned data distillation and reward-guided optimization offer a more sustainable and effective path than brute-force scaling, providing a reproducible blueprint for narrowing the multilingual digital divide.

Limitations
Despite the encouraging results achieved by our proposed framework and the MERIT-3B model, this work has several limitations that warrant discussion and offer avenues for future improvement.
Limited Linguistic Coverage: Despite constructing direct LRL-Chinese sentence pairs, the current benchmark is restricted to only five Southeast Asian languages, leaving important low-resource languages such as Tibetan, Uyghur, and Kazakh unaddressed due to the lack of high-quality parallel corpora.

Residual English-Centric Bias in the Test Suite: Although the ALT-based test suite has been realigned for LRL-Chinese evaluation using shared alt_id indexing, it remains inherently constrained by its original English-centric design, potentially retaining subtle domain-specific or stylistic artifacts that affect translation assessment.

Insufficient Scale of Human Validation: While the QE agent and the statistical-semantic SAR function rely on automatic filtering, the volume of human validation, particularly the expert-annotated data employed in reward model training, remains substantially limited. This constraint may compromise the robustness and alignment of the SAR model with diverse human preferences.

Fixed Decoding and Prompting Strategies: Although multiple LLMs were evaluated under zero-shot, SFT, and GRPO settings, decoding hyperparameters (e.g., beam size, temperature) and prompt formats were uniformly fixed for fair comparison, potentially masking model-specific performance differences that could be uncovered through more extensive hyperparameter and prompt engineering.
Restricted Model Scale and Missing Efficiency Analysis: Due to computational constraints, all experiments, including the MERIT-3B model and the proposed SFT-LTP and GRPO-SAR frameworks, were limited to models of up to 3B parameters. The approach has not yet been validated on larger (over 7B) or state-of-the-art billion-scale models, and a comprehensive efficiency analysis of training time, inference latency, and overall computational cost remains absent, leaving scalability and practical deployability unassessed.

Ethics Statements
This work presents a Chinese-centric multilingual translation benchmark targeting five Southeast Asian low-resource languages (LRLs), constructed from publicly available corpora and evaluated under reproducible protocols. We aim to support responsible research in multilingual NLP by releasing rigorous evaluation resources while proactively addressing ethical concerns related to data provenance, model fairness, environmental impact, and potential misuse.

Data Privacy and Consent: All data is derived from the publicly available Asian Language Treebank (ALT), which includes multilingual translations of government and news texts. While the dataset is openly licensed, the original collection did not explicitly document consent procedures or procedures for removing personally identifiable information (PII). To mitigate this, we apply a multi-stage filtering process to exclude named entities, explicit language, and potentially sensitive content. Nonetheless, due to the limitations of automated and manual filtering, some residual risk may remain. We follow the data statements framework (Bender and Friedman, 2018) and document licensing, provenance, and usage constraints in the appendix.
Bias and Fairness: Despite the use of a three-stage filtering pipeline and expert-rated supervision, the training data may still encode latent cultural, linguistic, or regional bias, particularly due to its English-pivoted design and limited coverage of dialectal variations or non-standard orthographies. Annotators are bilingual graduate students, and while they are experienced, demographic diversity is limited. Future work will prioritize the inclusion of more diverse annotators and typologically broader sources to mitigate such representational imbalances. Our work aligns with global AI ethics principles of fairness, transparency, and non-maleficence (Gebru et al., 2021).

Environmental Impact: Model training and inference were conducted on two NVIDIA RTX 3090 GPUs (24 GB). We log training FLOPs and wall-clock runtime for both the SFT and GRPO stages. While the GRPO procedure improves data efficiency through reward-based filtering, it introduces additional computational cost. We estimate that the total training corresponds to a typical single-node compute workload and plan to explore more lightweight reward models or compute-efficient alternatives to reduce carbon impact in future iterations (EMNLP 2023 Program Committee, 2023; Xu et al., 2026; Li and Lu, 2026).

Intended Use and Misuse Risks: The benchmark is designed to support objective evaluation and supervised training for LRL→Chinese translation tasks. It is intended for academic research and language technology development, particularly in regions underrepresented in NLP. However, misuse is possible, such as generating misinformation or content targeting marginalized communities. We explicitly discourage such applications and recommend that any downstream use include fairness auditing, risk controls, and human oversight (Mitchell et al., 2019).
References

AhmedElmogtaba Abdelmoniem Ali Abdelaziz, Ashraf Hatim Elneima, and Kareem Darwish. 2024. LLM-based MT data creation: Dialectal to MSA translation shared task. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 112–116, Torino, Italia. ELRA and ICCL.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. Preprint, arXiv:1907.05019.

Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, and Bin Wang. 2025. Multilingual machine translation with open large language models at practical scale: An empirical study. Preprint, arXiv:2502.02481.
DeepMind Gemini Team. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Technical Report v2.5, Google DeepMind. Accessed: 2025-12.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2023. Ethnologue: Languages of the World, 26th edition. SIL International, Dallas, TX.

EMNLP 2023 Program Committee. 2023. EMNLP 2023 ethics FAQ. https://2023.emnlp.org/ethics/faq. Accessed May 2024.

Maxim Enis and Mark Hopkins. 2024. From LLM to NMT: Advancing low-resource machine translation with Claude. Preprint, arXiv:2404.13813.

Miquel Esplà-Gomis, Víctor M. Sánchez-Cartagena, Jaume Zaragoza-Bernabeu, and Felipe Sánchez-Martínez. 2020. Bicleaner at WMT 2020: Universitat d'Alacant-Prompsit's submission to the parallel corpus filtering shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 952–958, Online. Association for Computational Linguistics.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2020. Beyond English-centric multilingual machine translation. Preprint, arXiv:2010.11125.

Zhaopeng Feng, Ruizhe Chen, Yan Zhang, Zijie Meng, and Zuozhu Liu. 2024. Ladder: A model-agnostic framework boosting LLM-based machine translation to the next level. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15377–15393, Miami, Florida, USA. Association for Computational Linguistics.
Zhaopeng Feng, Yan Zhang, Hao Li, Bei Wu, Jiayu Liao, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, and Zuozhu Liu. 2025. TEaR: Improving LLM-based machine translation with systematic self-refinement. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3922–3938, Albuquerque, New Mexico. Association for Computational Linguistics.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Preprint, arXiv:1803.09010.

Soumya Suvra Ghosal, Soumyabrata Pal, Koyel Mukherjee, and Dinesh Manocha. 2025. PromptRefine: Enhancing few-shot performance on low-resource Indic languages with example selection from related example banks. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 351–365, Albuquerque, New Mexico. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
Hieu Hoang, Huda Khayrallah, and Marcin Junczys-Dowmunt. 2024. On-the-fly fusion of large language models and machine translation. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 520–532, Mexico City, Mexico. Association for Computational Linguistics.

Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. 2025. BenchMAX: A comprehensive multilingual evaluation suite for large language models. Preprint, arXiv:2502.07346.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Preprint, arXiv:1611.04558.

Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Belgium, Brussels. Association for Computational Linguistics.

Sejoon Kim, Mingi Sung, Jeonghwan Lee, Hyunkuk Lim, and Jorge Gimenez Perez. 2024. Efficient terminology integration for LLM-based translation in specialized domains. In Proceedings of the Ninth Conference on Machine Translation, pages 636–642, Miami, Florida, USA. Association for Computational Linguistics.

Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, pages 726–742, Online. Association for Computational Linguistics.
Chunk 28 · 1,997 chars
20 shared task on parallel corpus filtering and alignment. In Proceedings of the Fifth Conference on Machine Translation, pages 726–742, Online. Association for Computational Linguistics. Roman Koshkin, Katsuhito Sudoh, and Satoshi Naka- mura. 2024. TransLLaMa: LLM-based simultaneous translation system. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 461–476, Miami, Florida, USA. Association for Com- putational Linguistics. Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yan- ran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, and Steffen Eger. 2025. Deepseek-r1 vs. o3-mini: How well can reasoning llms evaluate mt and summarization? Preprint, arXiv:2504.08120. Anne Lauscher, Vinit Ravishankar, Ivan Vuli´c, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot language transfer with mul- tilingual Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 4483–4499, On- line. Association for Computational Linguistics. Linxiao Li and Zhixiang Lu. 2026. Ecothink: A green adaptive inference framework for sustainable and accessible agents. Preprint, arXiv:2603.25498. Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2024. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. Preprint, arXiv:2308.12032. Beng Soon Lim and Gloria R. Poedjosoedarmo. 2016. Bahasa indonesia and bahasa melayu: Convergence and divergence of the official languages in contem- porary southeast asia. In Gerhard Leitner, Azirah Hashim, and Hans-Georg Wolf, editors, Communi- cating with Asia: The Future of English as a Global Language, pages 170–187. Cambridge University Press, Cambridge. Chin-Yew Lin. 2004. ROUGE: A package for auto- matic evaluation of summaries. In Text Summariza- tion Branches Out, pages 74–81, Barcelona, Spain. Association for Computational
Chunk 29 · 1,988 chars
rg Wolf, editors, Communi- cating with Asia: The Future of English as a Global Language, pages 170–187. Cambridge University Press, Cambridge. Chin-Yew Lin. 2004. ROUGE: A package for auto- matic evaluation of summaries. In Text Summariza- tion Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre- training for neural machine translation. Preprint, arXiv:2001.08210. Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, and Sarana Nutanong. 2021. A large en- glish–thai parallel corpus from the web and machine- generated text. Language Resources and Evaluation, 56(2):477–499. Zhixiang Lu, Peichen Ji, Yulong Li, Ding Sun, Chenyu Xue, Haochen Xue, Mian Zhou, Angelos Stefanidis, Jionglong Su, and Zhengyong Jiang. 2025. Advanc- ing low-resource machine translation: A unified data selection and scoring optimization framework. In International Conference on Intelligent Computing, pages 482–493. Springer. Zhixiang Lu, Yulong Li, Feilong Tang, Zhengyong Jiang, Chong Li, Mian Zhou, Tenglong Li, and Jion- glong Su. 2026a. Deepgb-tb: A risk-balanced cross- attention gradient-boosted convolutional network for -- 11 of 17 -- rapid, interpretable tuberculosis screening. Proceed- ings of the AAAI Conference on Artificial Intelligence, 40(46):38989–38997. Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Ste- fanidis, Anh Nguyen, Imran Razzak, Jionglong Su, and Zhengyong Jiang. 2026b. Sage: Sustainable agent-guided expert-tuning for culturally attuned translation in low-resource southeast asia. Preprint, arXiv:2603.19931. Shuming Ma, Li Dong, Shaohan Huang, Dong- dong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. Deltalm: Encoder-decoder pre-training for language generation and translation by augment- ing pretrained multilingual encoders.
Chunk 30 · 1,997 chars
esource southeast asia. Preprint, arXiv:2603.19931. Shuming Ma, Li Dong, Shaohan Huang, Dong- dong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. 2021. Deltalm: Encoder-decoder pre-training for language generation and translation by augment- ing pretrained multilingual encoders. Preprint, arXiv:2106.13736. Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Account- ability, and Transparency, FAT* ’19, page 220–229. ACM. Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Pro- ceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden. Association for Computational Linguistics. Dragos Stefan Munteanu and Daniel Marcu. 2005. Im- proving machine translation performance by exploit- ing non-parallel corpora. Computational Linguistics, 31(4):477–504. NLLB Team, Marta R. Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef- fernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, and 20 others. 2024. Scaling neural machine translation to 200 languages. Nature, 630(8018):841–846. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- roll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: a method for automatic evalu- ation of machine translation. In Proceedings of the 40th
Chunk 31 · 1,999 chars
er Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: a method for automatic evalu- ation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computa- tional Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics. Maja Popovi´c. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics. Matt Post. 2018. A call for clarity in reporting bleu scores. Preprint, arXiv:1804.08771. Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, and Taro Watanabe. 2025. Registering source tokens to target language spaces in multilingual neural machine translation. Preprint, arXiv:2501.02979. Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics. Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez. 2018. Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 955–962, Belgium, Brussels. Association for Com- putational Linguistics. Dan C. Stefanescu and Radu Ion. 2013. Parallel-wiki: A collection of parallel sentences extracted from wikipedia. Res. Comput. Sci., 70:145–156. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Preprint, arXiv:1409.3215. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Re- thinking the inception
Chunk 32 · 1,994 chars
entences extracted from wikipedia. Res. Comput. Sci., 70:145–156. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Preprint, arXiv:1409.3215. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Re- thinking the inception architecture for computer vi- sion. Preprint, arXiv:1512.00567. Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2016. Introducing the Asian language treebank (ALT). In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1574– 1578, Portorož, Slovenia. European Language Re- sources Association (ELRA). Martin Volk, Dominic Philipp Fischer, Lukas Fischer, Patricia Scheurer, and Phillip Benjamin Ströbel. 2024. LLM-based machine translation and summarization for Latin. In Proceedings of the Third Workshop on Language Technologies for Historical and An- cient Languages (LT4HALA) @ LREC-COLING- 2024, pages 122–128, Torino, Italia. ELRA and ICCL. Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. Document-level machine translation with large lan- guage models. In Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing, pages 16646–16661, Singapore. Association for Computational Linguistics. -- 12 of 17 -- Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song, and Ying Wei. 2024. Mitigating the language mismatch and repetition issues in LLM-based ma- chine translation via model editing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15681–15700, Miami, Florida, USA. Association for Computational Linguistics. Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neu- ral networks. Neural Computation, 1(2):270–280. Dehong Xu, Liang Qiu, Minseok Kim, Faisal Lad- hak, and Jaeyoung Do. 2024. Aligning
Chunk 33 · 1,996 chars
s 15681–15700, Miami, Florida, USA. Association for Computational Linguistics. Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neu- ral networks. Neural Computation, 1(2):270–280. Dehong Xu, Liang Qiu, Minseok Kim, Faisal Lad- hak, and Jaeyoung Do. 2024. Aligning large lan- guage models via fine-grained supervision. Preprint, arXiv:2406.02756. Minggang Xu, Xihai Tang, Jian Sun, Chong Li, Jong- long Su, and Zhixiang Lu. 2026. Attention-based hy- brid deep learning framework for modelling the com- pressive strength of ultra-high-performance geopoly- mer concrete. Results in Engineering, page 109288. Zhiwang Xu, Huibin Qin, and Yongzhu Hua. 2021. Research on uyghur-chinese neural machine trans- lation based on the transformer at multistrategy seg- mentation granularity. Mobile Information Systems, 2021:1–7. Open access article. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mt5: A massively multilin- gual pre-trained text-to-text transformer. Preprint, arXiv:2010.11934. Emmanouil Zaranis, Nuno M Guerreiro, and Andre Martins. 2024. Analyzing context contributions in LLM-based machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14899–14924, Miami, Florida, USA. Association for Computational Linguistics. Armel Randy Zebaze, Benoît Sagot, and Rachel Baw- den. 2025. Compositional translation: A novel LLM- based approach for low-resource machine translation. In Findings of the Association for Computational Lin- guistics: EMNLP 2025, pages 22328–22357, Suzhou, China. Association for Computational Linguistics. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher De- wan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi- haylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt:
Chunk 34 · 1,989 chars
r Computational Linguistics. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher De- wan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi- haylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models. Preprint, arXiv:2205.01068. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. Preprint, arXiv:1904.09675. -- 13 of 17 -- A Appendix A.1 Further Discussion Our work, culminating in the MERIT framework, demonstrates the significant potential of combin- ing data filtering techniques, such as the Semantic Alignment Reward (SAR) driven GRPO, with ef- ficient fine-tuning strategies like Language-Token Prefixing (LTP) for multilingual translation, espe- cially into low-resource languages (LRLs). The proposed BLEU-chrF composite metric has also provided a balanced view of lexical and semantic performance. While MERIT-3B exhibits strong performance relative to its scale and against com- parable open-source models, several limitations persist and pave the way for future exploration. First, script-related challenges, particularly for Lao and Burmese, can introduce encoding incon- sistencies. These not only affect the performance of QE agents used in SAR but also potentially skew standard evaluation metrics. Future iterations could incorporate more robust character normalization or transliteration techniques at the data preprocessing stage, or develop QE models less sensitive to such variations. Second, the current reward model underlying GRPO-SAR, while effective, may inadvertently pri- oritize adequacy (accuracy of content, as captured by our specific SAR function focusing on numer- ical or key information matching) sometimes at the expense of optimal fluency. This can occasion- ally lead to subtle grammatical artifacts in
Future work could investigate multi-objective reward functions that explicitly balance adequacy, fluency, and other aspects such as style or register, potentially drawing on more diverse human feedback signals beyond simple ratings.

As noted by Zhang et al. (2022), many LLMs, including some of the baselines we compare against, are often evaluated in zero-shot or few-shot settings for translation. This may not fully reveal their capabilities, which could be significantly enhanced with more sophisticated prompting strategies or in-context learning techniques tailored specifically to translation. Exploring how our data filtering and fine-tuning methods can synergize with advanced prompting for even larger LLMs is a promising direction.

A.2 Supplementary Experiments

All experiments were conducted on a local workstation equipped with two NVIDIA RTX 3090 GPUs (24 GB each). Under a 2×2 parallel configuration, the per-GPU batch size was set to 8 with a gradient accumulation step of 2, yielding an effective total batch size of 32. The maximum input sequence length was set to 1,024 tokens, and the initial learning rate to 2e-4. The system environment comprised Ubuntu 20.04, CUDA 12.1, and Python 3.10, with PyTorch 2.1 and Transformers v4.49 as the core libraries. All training used standard mixed-precision (fp16) computation via custom training scripts. Due to hardware limitations, the batch size was carefully tuned to fit within the available GPU memory, and no experiments were conducted with larger-parameter models. To ensure reproducibility, all random seeds were fixed and detailed runtime logs were maintained for each experiment.
A.3 Reward Function

In this study, we introduce the Semantic Alignment Reward (SAR) function as a key component of the reward mechanism for aligning the Quality Estimation (QE) agent's predictions with human expert judgments. Based on the absolute deviation $d = |s_i - a_i|$ between the predicted score $s_i$ and the ground-truth expert score $a_i$, we design a step-wise reward strategy that operates as follows:

• A reward of 2.0 is assigned if the model's predicted score exactly matches the human expert score ($d = 0$).

• A reward of 1.0 is assigned if the predicted score deviates from the expert score by an acceptable margin (specifically, $1 \le d \le 10$).

• A reward of 0.0 is assigned if the deviation exceeds the tolerance threshold ($d > 10$) or if the score is invalid.

This step-wise formulation distinguishes SAR from the binary rewards typical of exact-matching tasks, and is driven by two key rationales:

(1) Accommodating Evaluation Subjectivity. Unlike mathematical reasoning with unique solutions, QE involves inherent subjectivity: human annotators may assign varying scalar scores to identical quality levels. Consequently, enforcing exact matches is overly rigid. Our tolerance mechanism acknowledges this by treating proximal scores (within a 10-point window) as valid, thereby avoiding penalties for minor variances that do not reflect genuine quality disparities.
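The step-wise scheme above maps directly to a small scoring function. A minimal sketch follows; the function name and the treatment of invalid predictions as `None` are our own illustrative choices.

```python
def sar_reward(predicted: float | None, expert: float) -> float:
    """Step-wise Semantic Alignment Reward (Appendix A.3).

    d = |s_i - a_i| is the absolute deviation between the QE agent's
    predicted score s_i and the human expert score a_i. Scores are
    assumed to be integral, as in the tiered definition above.
    """
    if predicted is None:       # invalid / unparseable prediction
        return 0.0
    d = abs(predicted - expert)
    if d == 0:
        return 2.0              # exact match with the expert score
    if 1 <= d <= 10:            # acceptable tolerance window
        return 1.0
    return 0.0                  # deviation beyond the threshold
```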
| Method | fil | id | lo | my | vi | Overall (Size / BLEU-chrF) |
|---|---|---|---|---|---|---|
| MERIT-3B + SFT-LTP | 8,000 | 8,000 | 8,000 | 8,000 | 8,000 | 40,000 / 15.53 |
| + GRPO-SAR | 1,851 (↓76.9%) | 1,779 (↓77.7%) | 2,058 (↓74.2%) | 2,462 (↓69.2%) | 976 (↓87.8%) | 9,126 (↓77.2%) / 18.23 (↑17.4%) |
| + LLME | 2,891 (↓63.9%) | 3,104 (↓61.2%) | 3,300 (↓58.8%) | 3,764 (↓53.0%) | 2,193 (↓72.6%) | 15,252 (↓61.9%) / 19.94 (↑28.4%) |

Table 4: Training size comparison across five low-resource languages for MERIT-3B. The Overall column shows total training data (with percentage reduction relative to the initial 40,000) and the BLEU-chrF score (with percentage improvement relative to the SFT-LTP stage).

[Figure 3: Training loss and reward evolution across SFT and GRPO strategies. Left panel: training loss vs. steps for Qwen2.5-0.5B and Qwen2.5-3B under SFT-LTP, SFT-LTP+GRPO-SAR, and SFT-LTP+GRPO-SAR+LLMs. Right panel: training loss and reward vs. steps.]

| Combination | Spearman ρ ↑ | σ² ↓ |
|---|---|---|
| (BLEU + chrF)/2 (equal) | 0.986 | 1.02 |
| √(BLEU × chrF) (geometric) | 0.985 | 1.14 |
| 0.4·BLEU + 0.6·chrF | 0.985 | 1.19 |
| 0.6·BLEU + 0.4·chrF | 0.984 | 1.23 |

Note: Metrics were computed with sacreBLEU 2.4.2.

Table 5: Empirical comparison of different combination methods for BLEU and chrF, evaluated on WMT22 test sets (5 languages × 16 systems). While all variants exhibit very similar rank correlations with human judgments, the simple arithmetic mean yields the lowest system-level variance.
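Since Table 5 adopts the equal-weight arithmetic mean, the composite can be computed directly from sacreBLEU's corpus-level metrics (the note above reports sacreBLEU 2.4.2). A minimal sketch, assuming single-reference data; `bleu_chrf` is an illustrative helper name, not an API from the paper.

```python
# Equal-weight BLEU-chrF composite from Table 5, via sacreBLEU.
import sacrebleu

def bleu_chrf(hypotheses: list[str], references: list[str]) -> float:
    """Corpus-level composite: (BLEU + chrF) / 2."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return 0.5 * (bleu + chrf)

# Example with a toy segment (identical hypothesis and reference):
# score = bleu_chrf(["这是一个测试。"], ["这是一个测试。"])
```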
(2) Providing Dense Supervision Signals. Sparse binary rewards often hinder optimization stability. By granting partial rewards for approximate matches, SAR provides finer-grained feedback even when exact alignment is not achieved. This dense signal facilitates smoother convergence during GRPO, guiding the agent to approximate the distribution of expert preferences rather than overfitting to rigid scalar values.

A.4 Recruitment And Payment

To ensure the accuracy and objectivity of human evaluation, we recruited ten annotators with academic backgrounds in the target Southeast Asian languages. All annotators were either language instructors or graduate students from relevant universities. For each target language, three annotators were assigned, and a cross-review protocol was adopted to enhance annotation quality and consistency. All participants had formal training in translation or linguistics and possessed strong language comprehension and evaluative capabilities. Annotators were compensated at a rate of 1 RMB per evaluated sample. Before the evaluation began, all participants received detailed instructions and training on the annotation guidelines. Participation was voluntary, and compensation was provided in proportion to the amount of completed work.

| Metric | Mean ∆ (vs. NLLB-200) | p-value |
|---|---|---|
| BLEU-4 | +9.46 | 0.003 |
| chrF2 | +8.36 | 0.006 |
| ROUGE-L | +1.18 | 0.041 |
| METEOR | +2.65 | 0.038 |
| COMET-22 | +2.80 | <0.001 |

Note: Mean ∆ is the score difference averaged across all test sets; some metrics are reported as percentages. p-values indicate a statistically significant improvement (p < 0.05).

Table 6: Statistical significance of improvements over the baseline. We conduct pairwise bootstrap resampling (1,000 iterations) to compare our final model (MERIT-3B) against the NLLB-200 baseline. All improvements are statistically significant.
Since the dataset contains no personally identifiable information (PII) and the task involves only linguistic quality assessment, the annotation process entails no ethical risks and does not require institutional ethics approval.
A.5 Data Filtering Methodology
To ensure the quality of the Chinese-centric low-resource corpora, we implement a hybrid filtering pipeline combining statistical heuristics and Large Language Model (LLM) based metrics. Let $D_{\text{raw}} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the initial noisy parallel corpus.
A.5.1 Statistical Feature Extraction
For each pair $(x_i, y_i)$, we extract a feature vector $f_i \in \mathbb{R}^5$ measuring surface alignment:

1. Length Ratio ($R_{\text{len}}$): defined as $\min(|y_i|/|x_i|, |x_i|/|y_i|)$ to penalize extreme length mismatches.

2. Token Ratio ($R_{\text{tok}}$): calculated on whitespace-separated tokens.

3. Punctuation Divergence ($D_{\text{punct}}$): absolute difference in punctuation ratios, adjusted by language-specific regex patterns.

4. Digit Divergence ($D_{\text{digit}}$): absolute difference in digit proportions.

5. Lexical Diversity Diff ($D_{\text{uniq}}$): difference in Type-Token Ratios (TTR).
The base statistical score is a weighted sum with normalizing functions $\phi_k$:

$$S_{\text{base}}(x_i, y_i) = \sum_{k=1}^{5} w_k \cdot \phi_k\big(f_i^{(k)}\big) \tag{11}$$
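The five features and Eq. (11) can be sketched as follows. The punctuation set, whitespace tokenization, and the uniform weights and identity normalizers in `s_base` are all illustrative placeholders: the paper uses language-specific regex patterns and does not specify $w_k$ or $\phi_k$ here.

```python
import string

# Simplified surface features (Appendix A.5.1). Whitespace tokens are
# a crude stand-in for Chinese, which would need word segmentation.
PUNCT = set(string.punctuation) | set("。，！？；：、“”‘’")

def _ratio(a: int, b: int) -> float:
    return min(a / b, b / a) if a and b else 0.0

def _share(s: str, pred) -> float:
    return sum(pred(c) for c in s) / max(len(s), 1)

def _ttr(s: str) -> float:
    toks = s.split()
    return len(set(toks)) / max(len(toks), 1)

def surface_features(x: str, y: str) -> list[float]:
    return [
        _ratio(len(y), len(x)),                                # R_len
        _ratio(len(y.split()), len(x.split())),                # R_tok
        abs(_share(x, PUNCT.__contains__)
            - _share(y, PUNCT.__contains__)),                  # D_punct
        abs(_share(x, str.isdigit) - _share(y, str.isdigit)),  # D_digit
        abs(_ttr(x) - _ttr(y)),                                # D_uniq
    ]

def s_base(x: str, y: str, w=(0.2,) * 5, phi=lambda v: v) -> float:
    # Eq. (11): weighted sum of normalized features. Uniform weights
    # and an identity normalizer are used purely for illustration.
    return sum(wk * phi(fk) for wk, fk in zip(w, surface_features(x, y)))
```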
Algorithm 1: Elite Parallel Data Sampler
Input: $D_{\text{raw}}$, target size $K$, LLM $M$
Output: $D_{\text{clean}}$
1  $D_{\text{valid}} \leftarrow \emptyset$
2  foreach $(x, y) \in D_{\text{raw}}$ do
3      if $I_{\text{filter}}(x, y) = 1$ then
4          $f_{\text{stat}} \leftarrow \text{ExtractFeatures}(x, y)$
5          $S_{\text{base}} \leftarrow w^{\top} \cdot \Phi(f_{\text{stat}})$
6          $D_{\text{valid}} \leftarrow D_{\text{valid}} \cup \{(x, y, S_{\text{base}})\}$
7  foreach batch $B \subset D_{\text{valid}}$ do
8      $S_{\text{PPL}}, S_{\text{IFD}} \leftarrow M.\text{Forward}(B)$
9      foreach $(x, y) \in B$ do
10         $S_{\text{final}}^{(x,y)} \leftarrow S_{\text{base}} + S_{\text{PPL}}^{(x,y)} + S_{\text{IFD}}^{(x,y)}$
11 $D_{\text{sorted}} \leftarrow \text{Sort}(D_{\text{valid}}, \text{key} = S_{\text{final}})$
12 $D_{\text{clean}} \leftarrow \{d \in D_{\text{sorted}} \mid \text{rank}(d) \leq K\}$
13 return $D_{\text{clean}}$
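A direct Python rendering of Algorithm 1 is sketched below, with `is_valid`, `stat_score`, and `llm_score_batch` standing in for $I_{\text{filter}}$, $w^{\top}\Phi(\text{ExtractFeatures}(\cdot))$, and $M.\text{Forward}$ respectively; all three are caller-supplied placeholders, not the authors' implementation.

```python
def elite_parallel_sampler(d_raw, K, is_valid, stat_score,
                           llm_score_batch, batch_size=64):
    """Sketch of Algorithm 1 (Elite Parallel Data Sampler).

    llm_score_batch must return one (S_PPL, S_IFD) pair per input pair.
    """
    d_valid = []
    for x, y in d_raw:                      # lines 2-6: filter + S_base
        if is_valid(x, y):
            d_valid.append((x, y, stat_score(x, y)))

    scored = []
    for i in range(0, len(d_valid), batch_size):   # lines 7-10
        batch = d_valid[i:i + batch_size]
        llm_scores = llm_score_batch([(x, y) for x, y, _ in batch])
        for (x, y, s_b), (s_ppl, s_ifd) in zip(batch, llm_scores):
            scored.append((x, y, s_b + s_ppl + s_ifd))

    scored.sort(key=lambda t: t[2], reverse=True)  # lines 11-12
    return [(x, y) for x, y, _ in scored[:K]]
```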
Algorithm 2: Data Integrity Validation
Input: $D_{\text{clean}}$, targets $\{N_{\text{train}}, N_{\text{dev}}, N_{\text{test}}\}$
Output: splits $\{S_{\text{train}}, S_{\text{dev}}, S_{\text{test}}\}$
1  $G \leftarrow \{(k, \{s \mid d(s) = k\})\}_{k=1}^{M}$
2  $P[k] \leftarrow |G[k]| / |D_{\text{clean}}|, \ \forall k$
3  foreach split $T \in \{\text{train}, \text{dev}, \text{test}\}$ do
4      $S_T \leftarrow \emptyset$
5      foreach domain $k \in \text{Keys}(G)$ do
6          $q_k \leftarrow \lfloor N_T \cdot P[k] \rfloor$
7          $B_k \leftarrow \text{Sample}(G[k], q_k)$
8          $S_T \leftarrow S_T \cup B_k$; $G[k] \leftarrow G[k] \setminus B_k$
9      $\Delta \leftarrow N_T - |S_T|$
10     if $\Delta > 0$ then
11         $U \leftarrow \bigcup_k G[k]$; $S_{\text{supp}} \leftarrow \text{Resample}(U, \Delta)$
12         $S_T \leftarrow S_T \cup S_{\text{supp}}$
13     else if $|S_T| > N_T$ then
14         $S_T \leftarrow S_T[1 : N_T]$
15 if $\exists T, |S_T| \neq N_T$ then return Error
16 return $\{S_{\text{train}}, S_{\text{dev}}, S_{\text{test}}\}$
A.5.2 LLM-based Semantic Scoring

We employ Qwen-2.5-0.5B to compute semantic metrics. The Perplexity Score ($S_{\text{PPL}}$) verifies fluency via conditional likelihood, and the Instruction-Following Discrepancy ($S_{\text{IFD}}$) measures context reliance:

$$S_{\text{PPL}} = \frac{1}{1 + \sigma(\text{PPL}(y \mid x, I))} \tag{12}$$

$$S_{\text{IFD}} = \min\left(\frac{\text{PPL}(y \mid \emptyset)}{\text{PPL}(y \mid x, I)}, \ \tau\right) \cdot \tau^{-1} \tag{13}$$

where $I$ is the instruction prompt, $\sigma$ is a scaling factor, and $\tau$ is a normalization threshold.
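Eqs. (12)-(13) can be prototyped with any causal LM. A minimal sketch, assuming a Hugging Face-style model and tokenizer and reading $\sigma$ as a multiplicative scaling factor (the text does not pin down its exact form); the $\sigma$ and $\tau$ values below are illustrative.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, prompt: str, target: str) -> float:
    """PPL of `target` given `prompt`; pass prompt="" for PPL(y | ∅).
    The loss is averaged over target tokens only (prompt is masked)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100          # ignore prompt positions
    loss = model(full_ids, labels=labels).loss
    return math.exp(loss.item())

def s_ppl(ppl_cond: float, sigma: float = 0.01) -> float:
    # Eq. (12), with sigma as a multiplicative scaling factor
    # (assumption; value illustrative).
    return 1.0 / (1.0 + sigma * ppl_cond)

def s_ifd(ppl_uncond: float, ppl_cond: float, tau: float = 2.0) -> float:
    # Eq. (13): capped ratio of unconditional to conditional PPL,
    # normalized into [0, 1] by tau (value illustrative).
    return min(ppl_uncond / ppl_cond, tau) / tau
```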
A.5.3 Composite Filtering Algorithm

The final score combines the three signals with weights $(\alpha, \beta, \gamma) = (0.3, 0.3, 0.4)$:

$$S_{\text{final}} = \alpha S_{\text{base}} + \beta S_{\text{PPL}} + \gamma S_{\text{IFD}} \tag{14}$$
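In code, the composite reduces to a one-line weighted combination (an illustrative helper mirroring Eq. (14) with the reported weights):

```python
def s_final(s_base: float, s_ppl: float, s_ifd: float) -> float:
    """Eq. (14) with (alpha, beta, gamma) = (0.3, 0.3, 0.4)."""
    return 0.3 * s_base + 0.3 * s_ppl + 0.4 * s_ifd
```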
A.6 Data Integrity Validation
To ensure the reliability of our benchmark, we propose the Data Integrity Validation (DIV) algorithm. This mechanism addresses two critical challenges in low-resource corpus construction: (1) Distributional Integrity, ensuring that the domain distribution (e.g., news, government documents) in the training and test sets strictly mirrors the original corpus; and (2) Size Integrity, enforcing exact sample counts (e.g., 8,000/1,000/1,000) despite the rounding errors inherent in stratified sampling.

Let $D_{\text{clean}}$ be the filtered corpus. For each sentence pair $s$, we extract its domain identifier $d(s)$ from the metadata.
A.6.1 Distribution-Preserving Allocation

We first compute the global domain distribution vector $P = \{p_1, \ldots, p_M\}$, where $p_k = N_k / N_{\text{total}}$ denotes the proportion of domain $k$. For a target split $T$ (e.g., Train) with required size $N_{\text{target}}$, the allocated quota for domain $k$ is:

$$q_T^{(k)} = \lfloor N_{\text{target}} \cdot p_k \rfloor \tag{15}$$

We then sample $q_T^{(k)}$ distinct sentences from domain $k$ to form the initial split.
A.6.2 Integrity Compensation

Since the floor operation $\lfloor \cdot \rfloor$ may cause the total count to fall short of $N_{\text{target}}$ (i.e., $\sum_k q_T^{(k)} < N_{\text{target}}$), or data scarcity in specific domains may prevent a quota from being filled, DIV performs an Integrity Compensation step. We calculate the deficit $\Delta = N_{\text{target}} - |S_T|$ and perform weighted resampling from the available pool to fill $\Delta$, ensuring that the final dataset size strictly equals $N_{\text{target}}$ without violating domain diversity.
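Putting A.6.1 and A.6.2 together, the DIV procedure can be sketched as follows. This is a minimal sketch: `domain_of` is an assumed accessor for the metadata identifier $d(s)$, and the paper's weighted resampling is simplified here to uniform sampling.

```python
import random
from collections import defaultdict

def div_split(d_clean, domain_of, targets, seed=42):
    """Sketch of DIV (Algorithm 2 / Appendix A.6)."""
    rng = random.Random(seed)
    groups = defaultdict(list)                 # G[k]: per-domain pools
    for s in d_clean:
        groups[domain_of(s)].append(s)
    p = {k: len(v) / len(d_clean) for k, v in groups.items()}  # P[k]

    splits = {}
    for name, n_target in targets.items():
        split = []
        for k in list(groups):                 # floor quotas, Eq. (15)
            quota = min(int(n_target * p[k]), len(groups[k]))
            rng.shuffle(groups[k])
            split, groups[k] = split + groups[k][:quota], groups[k][quota:]
        deficit = n_target - len(split)        # integrity compensation
        if deficit > 0:                        # uniform, not weighted,
            pool = [s for v in groups.values() for s in v]  # resampling
            extra = rng.sample(pool, deficit)
            split += extra
            for s in extra:
                groups[domain_of(s)].remove(s)
        splits[name] = split[:n_target]        # trim any overshoot

    assert all(len(splits[k]) == n for k, n in targets.items())
    return splits

# Example usage with the paper's split sizes:
# splits = div_split(pairs, lambda s: s["domain"],
#                    {"train": 8000, "dev": 1000, "test": 1000})
```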