Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Summary

This paper introduces an approach to expanding large language models (LLMs) to low-resource languages using semantic rewards in reinforcement learning (RL), addressing the "alignment tax" problem, where improvements in a target language degrade general capabilities. Traditional supervised fine-tuning (SFT) enforces token-level imitation on limited data, which can cause catastrophic forgetting. The proposed method uses Group Relative Policy Optimization (GRPO) to align models in semantic space with embedding-level rewards, prioritizing meaning preservation over rigid surface-form imitation. This allows flexible realizations while limiting interference with pretrained knowledge. Experiments on Tibetan–Chinese machine translation and Tibetan headline generation show that semantic RL reduces alignment tax, preserves general competence, and produces outputs that are semantically stronger and more transferable than those of SFT. Training proceeds in two stages: a lightweight SFT step for initial competence, followed by RL with semantic and language-consistency rewards. The results indicate that semantic-space alignment offers a safer and more effective pathway for low-resource language expansion.

Reinforcement Learning with Semantic Rewards Enables Low-Resource
Language Expansion without Alignment Tax
Zeli Su1,2∗ Ziyin Zhang3∗ Zhou Liu5∗ Xuexian Song6 Zhankai Xu2
Longfei Zheng2 Xiaolu Zhang2 Rong Fu4 Guixian Xu1,7† Wentao Zhang5†
1 Minzu University of China 2 Ant Group 3 Shanghai Jiao Tong University 4 University of Macau 5 Peking University
6 Institute of Automation, Chinese Academy of Sciences 7 Hainan International College, Minzu University of China
{rickamorty, guixian_xu}@muc.edu.cn daenerystargaryen@sjtu.edu.cn
zhouliu25@stu.pku.edu.cn songxuexian5@gmail.com {xuzhankai.xzk, zlf206411}@antgroup.com
yueyin.zxl@antfin.com mc46603@um.edu.mo wentao.zhang@pku.edu.cn
Abstract
Extending large language models (LLMs) to
low-resource languages often incurs an “align-
ment tax”: improvements in the target lan-
guage come at the cost of catastrophic forget-
ting in general capabilities. We argue that this
trade-off arises from the rigidity of supervised
fine-tuning (SFT), which enforces token-level
surface imitation on narrow and biased data
distributions. To address this limitation, we
propose a semantic-space alignment paradigm
powered by Group Relative Policy Optimiza-
tion (GRPO), where the model is optimized us-
ing embedding-level semantic rewards rather
than likelihood maximization. This objective
encourages meaning preservation through flex-
ible realizations, enabling controlled updates
that reduce destructive interference with pre-
trained knowledge. We evaluate our approach
on Tibetan–Chinese machine translation and Ti-
betan headline generation. Experiments show
that our method acquires low-resource capa-
bilities while markedly mitigating alignment
tax, preserving general competence more effec-
tively than SFT. Despite producing less rigid
surface overlap, semantic RL yields higher se-
mantic quality and preference in open-ended
generation, and few-shot transfer results indi-
cate that it learns more transferable and ro-
bust representations under limited supervision.
Overall, our study demonstrates that reinforce-
ment learning with semantic rewards provides
a safer and more reliable pathway for inclusive
low-resource language expansion.
1 Introduction
Large language models (LLMs) have achieved re-
markable performance across a wide range of tasks
and languages through large-scale pretraining and
post-training alignment (DeepSeek-AI et al., 2025;
Yang et al., 2025a; Team et al., 2025; OpenAI,
2025; Comanici et al., 2025). However, their ca-
pabilities remain highly uneven across languages:
many languages are weakly supported, particularly
for generation and reasoning-intensive tasks that
require semantic abstraction rather than surface
pattern matching. Improving model performance
in such low-resource language settings therefore
remains an important and challenging problem.
Figure 1: Token-level alignment versus semantic-space
alignment in low-resource language expansion.
Token-level supervised fine-tuning enforces surface-form
imitation under teacher forcing, often causing catastrophic
forgetting and high alignment tax. In contrast,
semantic-space alignment optimizes meaning preservation with
constrained reinforcement learning policy updates,
allowing flexible realizations while preserving pretrained
knowledge.
A common approach to improving low-resource
language performance is further training on
language-specific data, including continual pre-
training and supervised instruction fine-tuning. De-
spite differences in data sources and training pro-
tocols, these methods share a common optimiza-
tion paradigm: they rely on teacher-forced learning
with token-level likelihood objectives, aligning the
model to a target data distribution through surface-
form imitation. Figure 1 illustrates the contrast
between token-level alignment via surface-form
imitation and semantic-space alignment based on
meaning preservation, highlighting how different
objectives lead to fundamentally different update
behaviors under data scarcity. Much recent work
on language expansion and adaptation follows this
paradigm, using supervised or weakly supervised
finetuning to inject new linguistic capabilities into
pretrained models (Csaki et al., 2024; Zhao et al.,
2024). Related efforts based on continued pretraining
or language-specific model construction
similarly perform strong distribution-level updates
on limited and domain-constrained data (Almeida
et al., 2025; Bari et al., 2025).
arXiv:2605.14366v1 [cs.CL] 14 May 2026
While effective under abundant and diverse su-
pervision, token-level distribution matching be-
comes problematic in low-resource language set-
tings. Available training data are often limited
in size, narrow in domain, and distributionally
biased. Optimizing likelihood on such data en-
courages overly confident and rigid parameter up-
dates, amplifying overfitting and interfering with
representations learned during pretraining. Empir-
ically, this interference often manifests as catas-
trophic forgetting: improvements in the target lan-
guage are accompanied by degradation in existing
high-resource language capabilities, a phenomenon
that has been systematically observed in multilin-
gual fine-tuning and low-resource representation
learning settings (Liu and Niehues, 2025; Schmidt,
2025).
We argue that this phenomenon is not merely an
optimization artifact but a structural outcome of
the alignment objective itself. When alignment is
equated with surface-form imitation on narrow dis-
tributions, representational capacity is aggressively
reallocated, leading to an alignment tax: gains in
low-resource language performance achieved at the
expense of general competence. This issue persists
regardless of the adaptation method (e.g., contin-
ual pretraining or instruction tuning) as long as the
objective enforces token-level matching on sparse
data (Yamaguchi et al., 2025).
Consequently, we propose shifting perspective to
view language expansion not merely as adaptation,
but as an alignment problem under sparse super-
vision. To address this, we introduce a semantic-
space alignment paradigm that prioritizes mean-
ing preservation over rigid surface-form imitation.
We operationalize this framework using Group
Relative Policy Optimization (GRPO) (Shao
et al., 2024), employing embedding-level seman-
tic similarity as the primary reward signal. Unlike
teacher-forced training, this approach encourages
the model to explore diverse linguistic realizations
that maintain semantic equivalence. Crucially, by
optimizing based on relative rewards within a sam-
pled group, our method inherently incorporates
the stability constraints of trust-region optimiza-
tion (Schulman et al., 2017; Rafailov et al., 2023),
enabling the acquisition of low-resource capabili-
ties while strictly limiting the destructive interfer-
ence typical of unconstrained likelihood maximiza-
tion.
We evaluate our approach on Tibetan–Chinese
machine translation (MT) and Tibetan headline
generation (HG). Empirical results demonstrate
that semantic-reward-driven GRPO achieves a su-
perior trade-off between adaptation and preserva-
tion. In the MT task, our method substantially re-
duces alignment tax, outperforming the strong SFT
baseline by +5.15 points on the dominant-language
CMRC benchmark. Similarly, in the HG task, de-
spite lower n-gram overlap, our model is preferred
by LLM-based judges with a +16.1% higher win
rate compared to SFT. These findings suggest that
semantic-space alignment offers a safer and more
robust paradigm for improving low-resource lan-
guage performance under data scarcity.
In summary, the main contributions of this paper
are:
• We propose a semantic-space alignment
paradigm that utilizes Group Relative Policy
Optimization (GRPO) with embedding-level re-
wards to decouple meaning preservation from
surface-form imitation.
• We demonstrate that this approach virtually elimi-
nates the alignment tax, enabling significant low-
resource gains while maintaining the model’s
general capabilities and pretrained knowledge.
• We show that our method produces semantically
superior outputs that are preferred by LLM
judges over SFT baselines, despite having lower
n-gram overlap with rigid references.
• We validate that semantic RL yields more
transferable representations, as evidenced by
stronger few-shot generalization to downstream
tasks compared to supervised methods.
2 Related Work
2.1 Low-Resource Language Adaptation and
Expansion
Post-training language expansion and adaptation
typically start from a pretrained foundation model
and further train on language-specific data. Prior
work studies supervised fine-tuning or instruction
tuning for transferring language capabilities and
scaling with data and model size (Csaki et al., 2024;
Zhao et al., 2024), as well as continued pretraining
and language-specific model construction that em-
phasize corpus selection and composition (Almeida
et al., 2025; Bari et al., 2025). Across these set-
tings, selective adaptation on narrow distributions
can induce catastrophic forgetting, degrading per-
formance on non-target languages or tasks (Liu and
Niehues, 2025; Schmidt, 2025).
Several mitigation strategies focus on constrain-
ing update magnitude or protecting previously
learned capabilities, such as parameter-efficient
finetuning (e.g., LoRA-style low-rank updates) and
source-shielded adaptation (Yang et al., 2025b; Ya-
maguchi et al., 2025). While these methods can
reduce interference, they generally retain teacher-
forced token-level likelihood objectives. Our work
is complementary: we reconsider the alignment
objective itself by optimizing semantic consistency
via embedding-level rewards, aiming to improve
weak-language capability with lower alignment
tax.
2.2 Reinforcement Learning for LLM
Alignment
Reinforcement learning (RL) is commonly used
in LLM alignment when optimization objectives
are sequence-level or non-differentiable, enabling
learning beyond token-level supervised imitation
(Christiano et al., 2017; Ouyang et al., 2022). A
core advantage of RL-based alignment methods
is the use of constrained policy updates, such as
trust-region or KL-regularized optimization, which
limits drift from pretrained representations and im-
proves stability (Schulman et al., 2015, 2017).
Recent variants retain this constrained-update
principle while improving efficiency or flexibility,
including direct preference optimization (Rafailov
et al., 2023) and group-based policy optimization
methods (Shao et al., 2024), which we adopt in
this work to enable controlled alignment towards
semantic-level objectives.
3 Method
3.1 Problem Formulation: Semantic-Space
Alignment
We study low-resource language expansion as an
alignment problem. Given a pretrained instruction-
following language model $\pi_{\text{base}}$, our goal is to acquire
new capabilities in a low-resource language
while preserving existing competencies in domi-
nant, high-resource languages.
Conventional supervised fine-tuning aligns mod-
els by maximizing token-level likelihood under a
target data distribution. In low-resource settings,
where the distribution is narrow and biased, this
objective enforces surface-form imitation and often
leads to overconfident updates and catastrophic for-
getting. We instead frame alignment as semantic-
space alignment: model outputs are considered cor-
rect if they preserve meaning, regardless of their
specific surface realization. This formulation ex-
plicitly decouples semantic adequacy from token-
level matching and allows multiple valid expres-
sions of the same intent.
Under this perspective, alignment is defined
by semantic consistency rather than distribution
matching. Our objective is therefore to optimize
the model to produce outputs that are semantically
equivalent to reference texts, while limiting inter-
ference with representations learned during pre-
training.
3.2 Two-Stage Training Paradigm
To operationalize semantic-space alignment in low-
resource settings, we adopt a two-stage training
paradigm.
Stage 1: Cold-start supervised fine-tuning. We
first perform a lightweight supervised fine-tuning
step on a small subset of low-resource data to obtain
an initial policy $\pi_{\text{init}}$. Specifically, we fine-tune
the base model on 5k training instances for two
epochs. The goal of this stage is not to achieve
strong task performance, but to bootstrap minimal
output competence in the target language, such
as producing text in the correct script and main-
taining basic language consistency. This cold-start
initialization allows the model to reliably generate
non-degenerate outputs in the low-resource lan-
guage, ensuring that subsequent semantic rewards
are meaningful and that reinforcement learning
does not collapse into uninformative exploration.
Stage 2: Reinforcement learning with semantic
rewards. Starting from $\pi_{\text{init}}$, we perform reinforcement
learning to align the model in semantic
space. In this stage, we utilize the remaining
training data to drive learning through semantic
rewards rather than token-level supervision. Rein-
forcement learning is conducted for a single epoch,
during which the model is encouraged to explore
diverse surface realizations while preserving se-
mantic equivalence to reference texts. Constrained
policy optimization is applied throughout training
to control update magnitude, enabling the model to
acquire low-resource language capabilities while
minimizing destructive interference with pretrained
representations.
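The data split behind the two stages can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper name `split_cold_start`, the random selection, and the fixed seed are assumptions made here, and only the 5k cold-start size comes from the text.

```python
import random

def split_cold_start(examples, cold_start_size=5000, seed=0):
    """Split training data into a small SFT subset and an RL remainder.

    Hypothetical helper: the text states only that 5k instances are used
    for cold-start SFT and the remaining data for RL. Random selection
    with a fixed seed is an assumption of this sketch.
    """
    if cold_start_size > len(examples):
        raise ValueError("cold_start_size exceeds dataset size")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    sft_subset = [examples[i] for i in indices[:cold_start_size]]
    rl_subset = [examples[i] for i in indices[cold_start_size:]]
    return sft_subset, rl_subset

# Toy usage with placeholder strings standing in for training instances.
data = [f"example-{i}" for i in range(20)]
sft, rl = split_cold_start(data, cold_start_size=5)
```

Keeping the two subsets disjoint means the RL stage sees references that were never used for token-level imitation.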
3.3 Reinforcement Learning for Semantic
Alignment
Optimizing semantic alignment is inherently a
sequence-level problem over discrete outputs. The
embedding-based semantic rewards used in our
framework are not directly differentiable with re-
spect to model parameters, making standard super-
vised learning objectives unsuitable. Reinforce-
ment learning therefore provides a natural and
principled framework for optimizing such non-
differentiable, sequence-level objectives, allowing
direct optimization of semantic consistency rather
than token-level likelihood.
Beyond enabling optimization of semantic re-
wards, reinforcement learning also plays a criti-
cal role in preserving pretrained knowledge during
low-resource adaptation. In contrast to supervised
fine-tuning, which performs unconstrained likeli-
hood maximization on narrow data distributions,
constrained reinforcement learning methods explic-
itly limit policy updates. This controlled optimiza-
tion is crucial for reducing destructive interference
and mitigating catastrophic forgetting, making re-
inforcement learning particularly well-suited for
semantic-space alignment under data scarcity.
We instantiate reinforcement learning using
Group Relative Policy Optimization (GRPO, Shao
et al., 2024), a value-free variant of PPO-style con-
strained optimization. For each input prompt x,
we sample a group of candidate outputs $\{y^{(k)}\}_{k=1}^{K}$
from the current policy $\pi_\theta(\cdot \mid x)$, compute their
corresponding rewards, and update the policy based
on relative comparisons within the group. GRPO
inherits the key stabilization mechanisms of PPO,
including trust-region-style constraints that limit
policy drift between updates, while avoiding the
need for an explicit value function. These prop-
erties make GRPO a practical and stable choice
for semantic alignment, enabling effective learning
from semantic rewards while maintaining existing
language capabilities.
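The "relative comparisons within the group" can be illustrated with the group-normalized advantage used in GRPO (Shao et al., 2024), where each reward is centered by the group mean and scaled by the group standard deviation. The sketch below assumes that formulation applies here; the function name, the use of the population standard deviation, and the `eps` stabilizer are choices of this example rather than details confirmed by the text.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group.

    Outputs above the group mean get a positive advantage and outputs
    below it a negative one, so no learned value function is needed.
    eps (an assumption) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for a group of K = 4 candidates of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

When every candidate in a group receives the same reward, all advantages are zero and the group contributes no policy-gradient signal.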
3.4 Semantic Reward Design
A central component of our framework is a se-
mantic reward that explicitly defines the alignment
objective for low-resource language expansion. Un-
like supervised fine-tuning, which implicitly aligns
models through token-level likelihood, our goal is
to directly guide learning toward semantic consis-
tency. The reward therefore serves not merely as an
optimization signal, but as the primary mechanism
that determines what the model is encouraged to
learn.
Semantic embedding model and suitability for
reinforcement learning. To instantiate the se-
mantic reward, we employ a multilingual sentence-
level embedding model trained under a contrastive
bilingual alignment objective. The model is
adapted on parallel sentence pairs, where trans-
lations are treated as positive examples and in-
batch samples provide implicit negatives. Rather
than training an encoder from scratch or enforc-
ing surface-form similarity, this adaptation strategy
emphasizes meaning preservation and fine-grained
semantic discriminability across different linguistic
realizations.
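A contrastive objective with in-batch negatives is commonly written as a softmax cross-entropy over the similarity matrix of a batch. The sketch below is a plain-Python illustration of that general idea, not the authors' training code; the temperature value of 0.05 and the function name are assumptions, and input vectors are assumed to be non-zero.

```python
import math

def in_batch_contrastive_loss(src_vecs, tgt_vecs, temperature=0.05):
    """Mean cross-entropy where tgt_vecs[i] is the positive for src_vecs[i]
    and every other target in the batch acts as an implicit negative."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    src = [normalize(v) for v in src_vecs]
    tgt = [normalize(v) for v in tgt_vecs]
    total = 0.0
    for i, s in enumerate(src):
        # Cosine similarities to every target, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in tgt]
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the positive
    return total / len(src)

# Toy 2-d "embeddings": aligned pairs vs. deliberately mismatched pairs.
aligned = in_batch_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = in_batch_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is small when each source is closest to its own translation and large when another sentence in the batch is closer.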
We empirically observe that bilingual contrastive
training yields stronger semantic structure than
monolingual adaptation, improving both cross-
lingual alignment and intra-language separability.
This property is particularly important for reinforce-
ment learning, where the reward signal must reflect
graded semantic differences rather than coarse top-
ical similarity. As a sanity check, we construct a
small diagnostic set of sentence pairs spanning dif-
ferent degrees of semantic equivalence and find that
the embedding model assigns similarity scores that
consistently correlate with these graded relation-
ships. This indicates that the resulting embedding
space is suitable for use as a semantic reward for
guiding alignment in reinforcement learning.
3.4.1 Embedding-Level Semantic Similarity
Reward
The primary learning signal in our framework is an
embedding-level semantic similarity reward. Let
f (·) denote the sentence embedding model de-
scribed above, which maps text to normalized vec-
tor representations. Given a generated output y
and a reference text y∗, we compute their semantic
similarity using cosine similarity:
$$s(y, y^*) = \cos\big(f(y), f(y^*)\big) \tag{1}$$
This reward directly reflects our desired learning
direction: outputs are encouraged to preserve mean-
ing, regardless of surface realization. In contrast
to token-level likelihood objectives, this formula-
tion treats semantically equivalent paraphrases as
equally valid and explicitly avoids overfitting to
reference form. To stabilize optimization and focus
learning on meaningful improvements beyond min-
imal adequacy, we apply a threshold-and-rescale
shaping function:
$$R_{\text{sim}}(y, y^*) =
\begin{cases}
0, & s(y, y^*) \le \tau, \\
\dfrac{s(y, y^*) - \tau}{1 - \tau}, & s(y, y^*) > \tau,
\end{cases} \tag{2}$$
where τ corresponds to a minimal semantic ade-
quacy level achieved after cold-start fine-tuning.
This shaping ensures that reinforcement learning
primarily refines semantic quality rather than am-
plifying noise from low-quality generations.
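Equations (1) and (2) translate directly into code. In the sketch below the embeddings are passed in as plain vectors, standing in for the outputs of the embedding model f; the threshold value 0.5 in the usage lines is purely hypothetical, since this section does not state the value of τ.

```python
import math

def cosine_similarity(u, v):
    """Equation (1) applied to two precomputed embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def shaped_similarity_reward(similarity, tau):
    """Equation (2): zero at or below tau, linearly rescaled to (0, 1] above."""
    if similarity <= tau:
        return 0.0
    return (similarity - tau) / (1.0 - tau)

# Hypothetical threshold; the value of tau is not given in this section.
tau = 0.5
below = shaped_similarity_reward(0.4, tau)      # at or below tau gives 0.0
halfway = shaped_similarity_reward(0.75, tau)   # (0.75 - 0.5) / (1 - 0.5)
perfect = shaped_similarity_reward(1.0, tau)
```

Rescaling by 1 − τ stretches the range above the threshold back to the unit interval, so differences among already-adequate outputs are not compressed.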
3.4.2 Language Consistency Reward
Because the embedding model is multilingual, op-
timizing semantic similarity alone may reward
mixed-language or partially off-target outputs. To
prevent this reward hacking behavior, we intro-
duce a language consistency reward based on a
rule-based script check using Unicode ranges and
regular expressions:
$$R_{\text{lang}}(y) =
\begin{cases}
0, & \text{language mixed}, \\
1, & \text{language consistent}.
\end{cases} \tag{3}$$
This acts as a hard constraint, ensuring that seman-
tic optimization is carried out strictly within the
target low-resource language space.
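A rule-based script check of this kind might look like the sketch below. The exact rules are not given in the text, so everything here is an assumption: the Tibetan Unicode block (U+0F00 to U+0FFF), the CJK Unified Ideographs block (U+4E00 to U+9FFF), the treatment of Latin letters as off-target, and the `target` parameter are illustrative choices.

```python
import re

# Unicode ranges used for the script check (assumed, not from the paper).
TIBETAN = re.compile(r"[\u0F00-\u0FFF]")
CJK = re.compile(r"[\u4E00-\u9FFF]")
LATIN = re.compile(r"[A-Za-z]")

def language_consistency_reward(text, target="tibetan"):
    """Return 1.0 if the text uses only the target script, else 0.0.

    Deliberately strict: any character from another checked script marks
    the output as mixed. Digits and punctuation are ignored.
    """
    has_tibetan = bool(TIBETAN.search(text))
    has_cjk = bool(CJK.search(text))
    has_latin = bool(LATIN.search(text))
    if target == "tibetan":
        consistent = has_tibetan and not has_cjk and not has_latin
    elif target == "chinese":
        consistent = has_cjk and not has_tibetan and not has_latin
    else:
        raise ValueError(f"unsupported target: {target}")
    return 1.0 if consistent else 0.0

tibetan_text = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"  # Tibetan script
chinese_text = "\u85cf\u6587"                                # Chinese script
```

A stricter or looser rule (for example, tolerating a small fraction of off-script characters) would change which outputs count as mixed.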
Final reward. The final reward combines seman-
tic similarity and language consistency:
$$R(y, y^*) = \lambda_{\text{sim}} R_{\text{sim}}(y, y^*) + \lambda_{\text{lang}} R_{\text{lang}}(y) \tag{4}$$
Together, these components define our semantic
alignment objective: the model is encouraged to
improve semantic adequacy while being strictly
constrained to produce linguistically consistent
outputs in the low-resource language. In practice,
we assign a larger weight to semantic similarity
($\lambda_{\text{sim}} = 1.5$) than to language consistency
($\lambda_{\text{lang}} = 1.0$), reflecting our design choice that
semantic preservation constitutes the primary learning
objective, while language consistency serves
as a necessary constraint to prevent degenerate or
off-language generations.
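As a worked example of the weighted sum, the sketch below combines the two reward components with the stated weights. The input values are hypothetical and chosen only to show the arithmetic; the function name is not from the paper.

```python
LAMBDA_SIM = 1.5   # weight on the shaped similarity reward (from the text)
LAMBDA_LANG = 1.0  # weight on the language consistency reward (from the text)

def final_reward(r_sim, r_lang, lambda_sim=LAMBDA_SIM, lambda_lang=LAMBDA_LANG):
    """Weighted sum of the shaped similarity and language rewards."""
    return lambda_sim * r_sim + lambda_lang * r_lang

# Hypothetical component values, for illustration only.
consistent_output = final_reward(r_sim=0.5, r_lang=1.0)  # 1.5 * 0.5 + 1.0
mixed_output = final_reward(r_sim=0.5, r_lang=0.0)       # 1.5 * 0.5 + 0.0
best_possible = final_reward(r_sim=1.0, r_lang=1.0)
```

With both components bounded in [0, 1], the combined reward lies in [0, 2.5], and a language-consistent output always scores exactly 1.0 higher than a mixed-language output with the same similarity.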
4 Experiments
We conduct a series of experiments to evalu-
ate whether semantic-reward-driven reinforcement
learning (RL) provides a better trade-off between
low-resource language adaptation and preservation
of existing capabilities compared to supervised fine-
tuning (SFT). Our experiments are designed to an-
swer three research questions: (1) whether RL ef-
fectively acquires low-resource language capabil-
ities, (2) how RL and SFT differ in the trade-off
between task performance and alignment tax, and
(3) whether RL learns more transferable represen-
tations under data scarcity.
4.1 Experimental Setup
Base model and adaptation. All experiments
are conducted on Qwen3-4B with parameter-
efficient fine-tuning via LoRA (Hu et al., 2022).
Unless otherwise specified, we apply LoRA to all
linear projection layers in self-attention and MLP
blocks. We use a LoRA rank of r = 64, scaling
factor α = 128, and dropout rate of 0.05.
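For reference, the LoRA settings above can be collected in a small configuration object. This is a library-independent sketch: the class name is hypothetical, and the module names in `target_modules` are an assumption about how the attention and MLP projections are named, since the text only says "all linear projection layers in self-attention and MLP blocks".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoraSettings:
    rank: int = 64         # r, from the text
    alpha: int = 128       # scaling factor alpha, from the text
    dropout: float = 0.05  # from the text
    # Assumed projection names; not stated in the paper.
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj",
                             "gate_proj", "up_proj", "down_proj")

    @property
    def scaling(self) -> float:
        """LoRA multiplies the low-rank update by alpha / r."""
        return self.alpha / self.rank

    def added_params(self, d_in: int, d_out: int) -> int:
        """Trainable parameters added to one d_in x d_out linear layer:
        an r x d_in matrix plus a d_out x r matrix."""
        return self.rank * (d_in + d_out)

settings = LoraSettings()
```

With r = 64 and α = 128 the low-rank update is scaled by a factor of 2.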
Supervised fine-tuning (SFT). SFT is trained
for three epochs in BF16 with a global batch size
of 32, using AdamW (Loshchilov and Hutter, 2019)
with learning rate $2 \times 10^{-5}$ and a cosine schedule
(warmup ratio 0.1).
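The learning-rate schedule can be sketched as a function of the training step. The text specifies a cosine schedule with warmup ratio 0.1 and a peak rate of 2 × 10^-5; the linear shape of the warmup and the decay to zero are assumptions of this sketch, chosen because they are a common default.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup (assumed) followed by cosine decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length of 1000 optimizer steps.
start = lr_at_step(0, 1000)
mid_warmup = lr_at_step(50, 1000)
peak = lr_at_step(100, 1000)
end = lr_at_step(1000, 1000)
```

The rate rises during the first 10% of steps, reaches the peak once warmup ends, and then decays smoothly for the remainder of training.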
Semantic reward model. The semantic reward
described in Section 3.4 is instantiated using a bilin-
gual sentence-embedding model built on top of
CINO (Yang et al., 2022), a Tibetan-enhanced ex-
tension of XLM-R (Conneau et al., 2020). We
adapt CINO into a sentence-level encoder using
SentenceTransformer, and further specialize
it on Chinese–Tibetan parallel data to produce
embedding-based semantic similarity scores. The
resulting encoder is used as a frozen reward model
during RL and is not jointly optimized with the
policy model.
Reinforcement learning (GRPO). Reinforce-
ment learning is performed with GRPO (Shao et al.,
2024) starting from the cold-start SFT checkpoint, trained for
one epoch in BF16 with AdamW and learning rate
$5 \times 10^{-7}$ (effective global batch size 32). For each
prompt, we sample 8 candidates with temperature
0.8 and top-p 0.9, using max prompt/completion
lengths of 256 tokens.
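To make the sampling settings concrete, the sketch below applies temperature scaling and top-p (nucleus) filtering to a single next-token distribution and draws a group of 8 samples. It illustrates one decoding step only; a real rollout repeats this per token inside the generation loop, and the function names and toy logits are hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token index with temperature scaling and nucleus filtering."""
    scaled = [l / temperature for l in logits]
    max_l = max(scaled)
    exps = [math.exp(l - max_l) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the kept probabilities implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def sample_group(logits, group_size=8, seed=0):
    """Draw group_size independent samples, mirroring 8 candidates per prompt."""
    rng = random.Random(seed)
    return [sample_top_p(logits, rng=rng) for _ in range(group_size)]

peaked_group = sample_group([5.0, 1.0, 0.0, -1.0])  # one token dominates
flat_group = sample_group([0.0, 0.0, 0.0, 0.0])     # all tokens equally likely
```

When one token already holds more than 90% of the probability mass, the nucleus contains only that token, so diversity within a group comes from positions where the distribution is flatter.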
Controlled comparison. This unified setup en-
sures that observed differences primarily reflect
the alignment strategy (semantic-reward-driven RL
vs. SFT), rather than mismatched optimization or
adaptation configurations. Training and hyperpa-
rameter details are provided in Appendix A.
4.1.1 Tasks and Datasets
We evaluate our approach on two representative Ti-
betan low-resource generation tasks: cross-lingual
machine translation (MT) and monolingual head-
line generation (HG).
Machine Translation (MT). For Tibetan–
Chinese machine translation, we use an internal
parallel corpus collected for training a vision–
language model (Wu et al., 2019). Specifically,
the corpus consists of Tibetan–Chinese sentence
pairs translated as part of the pretraining data
construction pipeline for the VLM, rather than
annotations produced by the model itself. We
repurpose this parallel data as supervised training
material for machine translation in our experiments.
As the corpus was originally curated to support
VLM pretraining, a large portion of the data
is grounded in visual descriptions, resulting
in a relatively narrow and domain-constrained
distribution. While this dataset does not aim to be
a comprehensive translation benchmark, it reflects
a realistic low-resource scenario with limited
domain diversity and is therefore suitable for
studying alignment behavior under data scarcity.
Headline Generation (HG). For Tibetan head-
line generation, we use the Tibetan subset of the
CMHG dataset (Xu et al., 2025). Due to the fine-
grained tokenization of Tibetan in the Qwen to-
kenizer, raw samples often result in excessively
long sequences and high memory consumption.
We therefore filter out samples exceeding 1024
tokens and retain shorter instances for both training
and evaluation. After filtering, the dataset con-
tains 16,449 training samples and 621 test samples,
which are used consistently across all headline gen-
eration experiments.
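The length filter amounts to a single pass over the data. In the sketch below, `tokenize` is a placeholder for whatever tokenizer is used (the text refers to the Qwen tokenizer); the whitespace tokenizer in the usage lines is a stand-in for illustration only, and which field of each sample is measured is not specified in the text.

```python
def filter_by_token_length(samples, tokenize, max_tokens=1024):
    """Keep samples whose tokenized length does not exceed max_tokens."""
    return [s for s in samples if len(tokenize(s)) <= max_tokens]

# Stand-in tokenizer (whitespace split) and a tiny limit for illustration.
toy_samples = ["a b c", "a a a a a"]
kept = filter_by_token_length(toy_samples, str.split, max_tokens=3)
```

Applying the same filter to both the training and test splits keeps the evaluation distribution consistent with what the model saw during training.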
4.1.2 Evaluation Protocols
We adopt a multi-dimensional evaluation proto-
col to capture both surface-level accuracy and se-
mantic quality. For task performance, we report
standard reference-based metrics (BLEU for MT
and ROUGE for HG) as well as embedding-based
semantic similarity. To assess semantic quality
beyond reference matching, we conduct blind pair-
wise evaluations using an LLM-as-a-Judge, where
judgments are produced by GPT-5.2 under a fixed
evaluation prompt; the full judging prompt and
evaluation policy are provided in Appendix B.
To quantify alignment tax, we evaluate all models
on a dominant-language benchmark (Chinese
CMRC, Cui et al., 2019) before and after adaptation
and report performance changes relative to the
base model.

Model            BLEU-4    Similarity
Cold-start SFT   0.3953    0.5593
RL (Ours)        0.4519    0.7164
Table 1: Experiment 1 results on Tibetan–Chinese machine translation.

Model            ROUGE-L   Similarity
Cold-start SFT   0.2204    0.5774
RL (Ours)        0.2530    0.6404
Table 2: Experiment 1 results on Tibetan headline generation.
4.2 Experiment 1: Effectiveness of
Semantic-Reward RL
We first evaluate whether semantic-reward-driven
reinforcement learning (RL) effectively acquires
low-resource language capabilities beyond mini-
mal supervised initialization. Specifically, we com-
pare RL against the cold-start SFT model on both
Tibetan–Chinese machine translation (MT) and Ti-
betan headline generation (HG).
Machine Translation. Table 1 reports the results
on Tibetan–Chinese MT. Starting from the same
cold-start SFT checkpoint trained on 5k parallel
sentence pairs, RL is further trained on approxi-
mately 90k additional samples using semantic re-
wards. Compared to the cold-start baseline, RL
yields consistent improvements in both reference-
based accuracy and semantic similarity. BLEU-4
increases from 0.3953 to 0.4519, while semantic
similarity improves substantially from 0.5593 to
0.7164.
Headline Generation. We observe a similar
trend on Tibetan headline generation. As shown
in Table 2, the RL model trained on approximately
15k samples consistently outperforms the cold-start
SFT baseline in both ROUGE-L and semantic sim-
ilarity. In particular, ROUGE-L improves from
0.2204 to 0.2530, while semantic similarity in-
creases from 0.5774 to 0.6404.
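For readers unfamiliar with ROUGE-L, the following sketch shows the longest-common-subsequence computation behind it. It is a simplified illustration, not the scoring script used in the paper: whitespace tokenization, the F1 form of the score, and the English toy sentences are assumptions.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: list, reference: list) -> float:
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example with whitespace tokens (assumed tokenization).
cand = "the cat sat on mat".split()
ref = "the cat is on the mat".split()
score = rouge_l_f1(cand, ref)
```

Since the score rewards shared token order with a single reference, a headline that is valid but phrased differently can score low, which is the limitation the semantic similarity metric is meant to offset.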
Analysis. Across both translation and generation
tasks, semantic-reward-driven RL consistently im-
proves performance over the cold-start SFT base-
line. Notably, the improvements are particularly
pronounced in semantic similarity, suggesting that
RL primarily refines meaning preservation rather
than merely increasing surface-level overlap with
references. These results confirm that embedding-
level semantic rewards constitute a sufficiently in-
formative alignment signal, enabling effective low-
resource language learning beyond minimal super-
vised initialization.
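The size of these improvements can be checked with a short calculation over the scores in Tables 1 and 2. The snippet is only a worked arithmetic aid; the dictionary layout and function name are illustrative.

```python
# Scores copied from Tables 1 and 2: (cold-start SFT, RL).
results = {
    "MT BLEU-4": (0.3953, 0.4519),
    "MT Similarity": (0.5593, 0.7164),
    "HG ROUGE-L": (0.2204, 0.2530),
    "HG Similarity": (0.5774, 0.6404),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain, relative gain) of `after` over `before`."""
    absolute = after - before
    return absolute, absolute / before


summary = {name: gains(*pair) for name, pair in results.items()}
for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain:.4f} absolute, +{rel_gain:.1%} relative")
```

On these numbers the absolute gain in similarity exceeds the absolute gain in the overlap metric on both tasks (0.1571 vs. 0.0566 for MT, 0.0630 vs. 0.0326 for HG).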
4.3 Experiment 2: Trade-off Between Task
Performance and Alignment Tax
In this experiment, we compare semantic-reward-
driven RL with a Strong SFT baseline to charac-
terize the trade-off between task performance and
preservation of existing general capabilities (i.e.,
alignment tax). Unlike the cold-start SFT model
used in Experiment 1 — which is trained only on
a small 5k subset of low-resource data and serves
solely as the initialization policy for RL — the
Strong SFT model is trained on the full available
training data (i.e., the same combined dataset
used by cold-start SFT + RL) under the same
optimization and LoRA configuration, represent-
ing the best-effort supervised adaptation outcome.
Table 3 reports results on Tibetan–Chinese machine
translation (MT) and Tibetan headline generation
(HG), including task metrics, semantic similarity,
and dominant-language performance on CMRC as
a proxy for alignment tax. We additionally report
LLM-based preference as a reference-free measure
of semantic quality.
Task 1: Tibetan–Chinese Machine Transla-
tion (MT). On MT, Strong SFT achieves higher
reference-based scores, improving BLEU from
0.4519 (RL) to 0.6006 and semantic similarity from
0.7164 to 0.8282, reflecting stronger surface align-
ment to references. However, this advantage is less
pronounced under LLM-based judgment: Strong
SFT is preferred in 59.2% of cases, while the RL-
aligned model still wins 33.5%, indicating compet-
itive semantic quality despite lower BLEU.
These metric gains come with a substantial
alignment tax. After adaptation, SFT suffers
marked degradation on CMRC (41.82 Avg / 62.99
F1), whereas RL preserves general capability sig-
nificantly better (46.97 Avg / 65.79 F1). Overall,
token-level imitation inflates reference-based MT
metrics at the cost of forgetting, while constrained
semantic alignment via RL yields safer updates
with lower alignment tax.
Task 2: Tibetan Headline Generation (HG).
On HG, Strong SFT again achieves higher
reference-based scores (ROUGE-L 0.3095 vs.
0.2530 for RL), while the semantic similarity gap
remains small (0.6499 vs. 0.6404). Both methods
largely preserve dominant-language performance,
with only minor differences in CMRC. However,
under LLM-based judgment, RL is strongly pre-
ferred: it wins 51.2% of pairwise comparisons ver-
sus 35.1% for SFT (+16.1 points). This suggests
that in open-ended generation, semantic-reward-
driven RL learns generation behaviors that go be-
yond reference imitation, capturing alternative yet
semantically appropriate realizations that the LLM judge
prefers despite lower n-gram overlap.
Overall analysis. Across both tasks, Table 3 re-
veals a consistent pattern: supervised fine-tuning
excels at maximizing reference-based metrics,
while semantic-reward-driven RL better preserves
general capabilities and improves semantic qual-
ity under preference-based evaluation. In MT,
this trade-off manifests primarily as alignment tax,
where SFT’s metric gains coincide with substantial
forgetting. In HG, where multiple valid realizations
exist, RL is consistently preferred by LLM judges
despite lower ROUGE, indicating that it learns gen-
eration patterns not anchored to a single reference
form.
Together, these results suggest a fundamental
mismatch between reference-based metrics and
true semantic quality in low-resource settings. By
aligning models in semantic space rather than en-
forcing surface imitation, constrained RL enables
the acquisition of alternative, semantically valid
generation paradigms that are poorly reflected by n-gram
metrics but are favored under LLM-based preference evaluation.
4.4 Experiment 3: Few-Shot Transferability
Finally, we examine whether the stronger MT met-
rics observed for SFT in Experiment 2 translate
into better cross-task generalization. While SFT
achieves higher reference-based scores on MT, Ex-
periment 2 also reveals a clear mismatch between
such metrics and semantic quality, as reflected by
alignment tax and LLM-based judgment. This
raises a natural question: do higher MT scores in-
dicate genuinely stronger Tibetan representations,
or do they primarily capture task- and reference-
specific surface patterns that are unlikely to trans-
fer?
To answer this, we design a few-shot transfer test from MT to HG.
Concretely, we take the best MT checkpoints produced by Strong SFT and by RL,
and fine-tune each of them on the HG task using only 1,000 training samples
under identical training settings. This setting stresses representation reuse:
with limited HG supervision, a model that learns more general and semantically
grounded Tibetan features during MT should adapt more effectively than a model
whose gains are dominated by task-specific imitation.

               Task Performance        General Capability (Alignment Tax)   Semantic Quality
Model          Metric     Similarity   CMRC Avg     CMRC F1                  LLM-Judge Win (%)
Task 1: Tibetan–Chinese Machine Translation (MT), Metric = BLEU-4
Strong SFT     0.6006     0.8282       41.82        62.99                    59.2
RL (Ours)      0.4519     0.7164       46.97        65.79                    33.5
Gap (RL vs. SFT)  -0.1487  -0.1118     +5.15        +2.80                    -25.7
Task 2: Tibetan Headline Generation (HG), Metric = ROUGE-L
Strong SFT     0.3095     0.6499       44.20        65.30                    35.1
RL (Ours)      0.2530     0.6404       45.10        65.20                    51.2
Gap (RL vs. SFT)  -0.0565  -0.0095     +0.90        -0.10                    +16.1
Table 3: Trade-off Analysis: Task Performance vs. Alignment Tax. For machine translation, Strong SFT achieves
higher task metrics but incurs a heavy alignment tax, reflected by a significant drop in CMRC performance. In
contrast, RL preserves general language capabilities with substantially higher CMRC scores while sacrificing
surface-level metrics. For headline generation, both methods exhibit comparable general capability preservation,
but RL significantly outperforms SFT in semantic quality as measured by LLM-based judgment, despite lower
n-gram-based scores.

Initialization   ROUGE-L   Similarity
Base Model       0.1585    0.4695
MT-SFT           0.1935    0.5456
MT-RL (Ours)     0.1918    0.5690
Table 4: Few-shot transfer from MT to HG with 1,000 HG training samples.
Table 4 reports the results. Both MT-adapted
models improve substantially over the base model,
confirming that MT training provides useful Ti-
betan signal for downstream generation. However,
despite MT-SFT’s strong MT performance, it does
not retain a corresponding advantage in transfer.
The RL-initialized model achieves a higher seman-
tic similarity score (0.5690 vs. 0.5456), while main-
taining comparable ROUGE-L (0.1918 vs. 0.1935).
This indicates that the MT-SFT model’s improve-
ments are, at least in part, tied to MT-specific sur-
face alignment and do not generalize as strongly
to a different open-ended generation task, whereas
semantic-reward-driven RL yields representations
that transfer better under limited supervision.
Overall, this experiment supports our central
claim that semantic-space alignment provides

Chunk 20 · 1,998 chars

ative log-likelihood and KL divergence
to the base model on a fixed CMRC evaluation
set. The results are consistent with our main find-
ings and suggest that semantic RL yields more
controlled distributional adaptation than SFT.
Overall, this experiment supports our central
claim that semantic-space alignment provides a
safer and more effective adaptation paradigm in
low-resource settings: while SFT can produce
larger in-task metric gains, RL achieves more ro-
bust generalization across tasks, consistent with
the practical needs of low-resource language ex-
pansion. To complement these downstream results,
we further provide a mechanistic analysis of for-
getting in Appendix C, where we examine OOD
token-level negative log-likelihood and KL diver-
gence to the base model on a fixed CMRC evalua-
tion set. The results are consistent with our main
findings and suggest that semantic RL yields more
controlled distributional adaptation than SFT.
4.5 Reward Ablation
To further understand the role of reward design un-
der the same reinforcement learning framework,
we conduct a reward ablation study on the unified
Tibetan–Chinese machine translation task. Specif-
ically, under identical model, data, and training
settings, we compare several reward combinations
to isolate how different reward components affect
semantic alignment performance.
Table 5 shows that the proposed reward combination, consisting of embedding
similarity and language consistency, achieves the best semantic similarity
among all configurations. This result suggests that the performance gain does
not come from reinforcement learning alone, but depends critically on how the
reward is defined.

Reward                    Similarity
Embedding + LC (Ours)     0.7164
BLEU + LC                 0.6375
BLEU + Embedding + LC     0.6175
BLEU + Embedding          0.2312
Table 5: Reward ablation results on Tibetan–Chinese machine translation under
matched settings. LC denotes the language consistency reward.
First, language consistency is necessary for sta-
ble semantic optimization in the multilingual set-
ting. Without language consistency, the model fre-
quently produces mixed Tibetan–Chinese outputs
during exploration, indicating that semantic simi-
larity alone is insufficient to constrain generation
into the target low-resource language space.
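One way a language-consistency signal of this kind could be computed is by measuring the share of characters that fall in the expected script. The sketch below is an assumption, not the paper's implementation: the function names, the 0.9 threshold, and the binary reward are hypothetical. Only the Unicode ranges are standard (Tibetan block U+0F00 to U+0FFF, basic CJK Unified Ideographs U+4E00 to U+9FFF).

```python
TIBETAN = (0x0F00, 0x0FFF)  # Tibetan Unicode block
CJK = (0x4E00, 0x9FFF)      # CJK Unified Ideographs (basic block)


def script_ratio(text: str, lo: int, hi: int) -> float:
    """Fraction of non-whitespace characters whose code point lies in [lo, hi]."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    inside = sum(lo <= ord(c) <= hi for c in chars)
    return inside / len(chars)


def language_consistency_reward(text: str, block=TIBETAN, threshold: float = 0.9) -> float:
    """Hypothetical binary reward: 1.0 if the output is mostly in the target script."""
    return 1.0 if script_ratio(text, *block) >= threshold else 0.0


pure = "བོད་ཡིག"            # Tibetan only
mixed = "བོད་ཡིག中文翻译"     # Tibetan mixed with Chinese

print(language_consistency_reward(pure))   # mostly Tibetan script
print(language_consistency_reward(mixed))  # mixed-script output
```

A check of this form penalizes the mixed Tibetan–Chinese outputs described above without constraining which Tibetan words the model chooses.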
Second, BLEU-based rewards consistently
weaken performance. As a surface-form overlap
objective, BLEU introduces token-level pressure
that restricts semantic exploration and partially re-
stores the rigidity of supervised imitation. Even
when combined with embedding reward and lan-
guage consistency, it still degrades performance rel-
ative to the simpler Embedding + LC formulation.
This suggests that token-overlap rewards are not
well aligned with the objective of semantic-space
alignment, where multiple surface realizations may
preserve the same meaning.
We also tested an additional length-constraint
reward for translation. However, it did not im-
prove generation quality and reduced CMRC per-
formance by approximately 2 points, indicating
that excessive output constraints may further harm
both capability retention and semantic alignment.
Overall, these results show that a lightweight se-
mantic reward, together with a necessary target-
language constraint, is more effective than stacking
additional surface-form objectives.
5 Conclusion
This paper argues that low-resource language ex-
pansion should be treated as an alignment prob-
lem, where the core objective is semantic con-
sistency rather than token-level imitation. We
propose a semantic-space alignment paradigm in-
stantiated with reinforcement learning driven by
embedding-level semantic similarity and a strict
language-consistency constraint. Experiments on
Tibetan–Chinese machine translation and Tibetan
headline generation show that semantic-reward-
driven RL acquires low-resource language capabili-
ties while substantially reducing alignment tax, pre-
serving dominant-language competence with near-
zero forgetting. We further observe a consistent
mismatch between reference-based metrics and se-
mantic quality: despite weaker n-gram overlap, RL
is often preferred by LLM-based judges in open-ended generation and yields
more transferable representations under few-shot transfer. Taken together,
these findings suggest that semantic-space align-
ment offers a scalable path for extending LLMs
to weakly supported languages under data scarcity,
shifting low-resource adaptation from distribution
matching toward meaning-centered alignment.
Limitations
While modern LLMs nominally support many lan-
guages, identifying a language with weak model
performance often implies severe data scarcity, as
in the case of Tibetan. The limited and domain-
narrow nature of available data (e.g., translation cor-
pora) may cause supervised fine-tuning to achieve
artificially high in-domain metrics that do not fully
reflect real-world generalization, making some de-
gree of overfitting unavoidable.
Ethical Considerations
This work promotes inclusive language modeling
by extending LLMs to low-resource languages such
as Tibetan. All data are publicly available or inter-
nally licensed and contain no personal or sensitive
information. No human participants were involved,
and all evaluations were performed automatically
using LLM-based judges. While our method re-
duces overfitting and potential bias from narrow
supervision, residual pretrained biases may persist.
Future research should further assess fairness and
bias in low-resource settings.
Acknowledgments
This work was supported by the Hainan Provincial
Joint Project of the Li’an International Education
Innovation Pilot Zone (Grant No. 624LALH006).
References
Thales Sales Almeida, Rodrigo Nogueira, and Hélio
Pedrini. 2025. Curió-edu 7b: Examining data selection
impacts in llm continued pretraining. Preprint,
arXiv:2512.12770.
M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani,
Nouf M. Alotaibi, Hisham Abdullah Alyahya, Sultan
AlRashed, Faisal Abdulrahman Mirza, Shaykhah Z.
Alsubaie, Hassan A. Alahmed, Ghadah Alabduljab-
bar, Raghad Alkhathran, Yousef Almushayqih, Ra-
neem Alnajim, Salman Alsubaihi, Maryam Al Man-
sour, Saad Amin Hassan, Majed Alrubaian, Ali Alam-
mari, Zaki Alawami, and 8 others. 2025. Allam:
Large language models for arabic and english. In In-
ternational Conference on Learning Representations
(ICLR) 2025. Poster.
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan
Martic, Shane Legg, and Dario Amodei. 2017. Deep
reinforcement learning from human preferences. In
Advances in Neural Information Processing Systems.
Gheorghe Comanici, Eric Bieber, Mike Schaekermann,
Ice Pasupat, and 1 others. 2025. Gemini 2.5: Pushing
the frontier with advanced reasoning, multimodality,
long context, and next generation agentic capabilities.
Preprint, arXiv:2507.06261.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Lin-
guistics.
Zoltan Csaki, Bo Li, Jonathan Lingjie Li, Qiantong Xu,
Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao,
Changran Hu, and Urmish Thakker. 2024. Sam-
baLingo: Teaching large language models new lan-
guages. In Proceedings of the Fourth Workshop on
Multilingual Representation Learning (MRL 2024),
pages 1–21, Miami, Florida, USA. Association for
Computational Linguistics.
Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng
Chen, Wentao Ma, Shijin Wang, and Guoping Hu.
2019. A span-extraction dataset for Chinese ma-
chine reading comprehension. In Proceedings of
the 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 5886–5891, Hong Kong,
China. Association for Computational Linguistics.
DeepSeek-AI and 1 others. 2025. Deepseek-v3.2:
Pushing the frontier of open large language models.
Preprint, arXiv:2512.02556.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan
Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and
Weizhu Chen. 2022. Lora: Low-rank adaptation of
large language models. In The Tenth International
Conference on Learning Representations, ICLR 2022,
Virtual Event, April 25-29, 2022. OpenReview.net.
Danni Liu and Jan Niehues. 2025. Conditions for catas-
trophic forgetting in multilingual translation. In Pro-
ceedings of the 5th Workshop on Multilingual Rep-
resentation Learning (MRL 2025), pages 347–359.
Association for Computational Linguistics.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In 7th International
Conference on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6-9, 2019. OpenRe-
view.net.
OpenAI. 2025. Introducing gpt-4.1 in the api. OpenAI
Blog. Accessed: 2025-12-25.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
roll L. Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, and
1 others. 2022. Training language models to follow
instructions with human feedback. In Advances in
Neural Information Processing Systems.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano
Ermon, Christopher D. Manning, and Chelsea Finn.
2023. Direct preference optimization: Your language
model is secretly a reward model. arXiv preprint
arXiv:2305.18290.
Fabian David Schmidt. 2025. Robust and Scalable
Cross-Lingual Transfer. Ph.D. thesis, Bayerische
Julius-Maximilians-Universitaet Wuerzburg (Ger-
many).
John Schulman, Sergey Levine, Philipp Moritz,
Michael I. Jordan, and Pieter Abbeel. 2015.
Trust region policy optimization. arXiv preprint
arXiv:1502.05477.
John Schulman, Filip Wolski, Prafulla Dhariwal,
Alec Radford, and Oleg Klimov. 2017. Proxi-
mal policy optimization algorithms. arXiv preprint
arXiv:1707.06347.
Zhihong Shao and 1 others. 2024. Deepseekmath: Push-
ing the limits of mathematical reasoning in open lan-
guage models. arXiv preprint arXiv:2402.03300.
Kimi Team, Yifan Bai, and 1 others. 2025.
Kimi k2: Open agentic intelligence. Preprint,
arXiv:2507.20534.
Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming
Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen
Lin, Yanwei Fu, and 1 others. 2019. Large-scale
datasets for going deeper in image understanding. In
2019 IEEE International Conference on Multimedia
and Expo (ICME), pages 1480–1485. IEEE.
Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, Xu Han,
Ting Zhang, and Yushuang Dong. 2025. CMHG: A
dataset and benchmark for headline generation of
minority languages in China. In Proceedings of the
2025 Conference on Empirical Methods in Natural
Language Processing, pages 12350–12357, Suzhou,
China. Association for Computational Linguistics.


Atsuki Yamaguchi, Terufumi Morishita, Aline Villav-
icencio, and Nikolaos Aletras. 2025. Mitigating
catastrophic forgetting in target language adapta-
tion of llms via source-shielded updates. Preprint,
arXiv:2512.04844.
An Yang and 1 others. 2025a. Qwen3 technical report.
Preprint, arXiv:2505.09388.
Menglin Yang, Jialin Chen, Yifei Zhang, Jiahong Liu,
Jiasheng Zhang, Qiyao Ma, Harshit Verma, Qianru
Zhang, Min Zhou, Irwin King, and Rex Ying. 2025b.
Low-rank adaptation for foundation models: A com-
prehensive review. CoRR, abs/2501.00365.
Ziqing Yang, Zihang Xu, Yiming Cui, Baoxin Wang,
Min Lin, Dayong Wu, and Zhigang Chen. 2022.
CINO: A Chinese minority pre-trained language
model. In Proceedings of the 29th International Con-
ference on Computational Linguistics, pages 3937–
3949, Gyeongju, Republic of Korea. International
Committee on Computational Linguistics.
Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui,
and Xuanjing Huang. 2024. Llama beyond english:
An empirical study on language capability transfer.
CoRR, abs/2401.01055.

A Training and Hyperparameter Details
A.1 Base model and LoRA configuration
We adopt Qwen3-4B-Instruct as the base model
and apply LoRA for all SFT and RL experiments.
Unless otherwise specified, LoRA adapters are in-
serted into:
• Self-attention projections: q_proj, k_proj,
v_proj, o_proj
• MLP projections: gate_proj, up_proj,
down_proj
We use LoRA rank r = 64, scaling factor α = 128,
and dropout 0.05 throughout all experiments.
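For reference, the LoRA settings above can be collected into a small configuration object. This is a sketch in plain Python rather than the authors' training code; the class, field, and method names are hypothetical, and the layer dimensions in the usage line are placeholders. The derived scale α/r follows the standard LoRA formulation (Hu et al., 2022).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container for the LoRA hyperparameters listed in A.1."""
    rank: int = 64
    alpha: int = 128
    dropout: float = 0.05
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",   # self-attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    )

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank

    def extra_params(self, d_in: int, d_out: int) -> int:
        # A is (rank x d_in) and B is (d_out x rank), so r * (d_in + d_out) weights.
        return self.rank * (d_in + d_out)


cfg = LoraSettings()
print(cfg.scaling)                     # update scale alpha / r
print(cfg.extra_params(1024, 1024))    # placeholder layer size, not Qwen3-4B's
```

With r = 64 and α = 128 the update scale is 2, and the added parameter count per adapted layer grows linearly in the layer's input and output widths.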
A.2 Supervised fine-tuning (SFT) details
• Initialization: Qwen3-4B-Instruct
• Precision: BF16
• Training length: 3 epochs
• Optimizer: AdamW
• Learning rate: 2 × 10^-5
• Scheduler: cosine decay with warmup ratio
0.1
• Global batch size: 32 (2 GPUs × per-device
batch size 8 × gradient accumulation 2)
• Sequence length: determined by the data and
model defaults, typically 1024–2048 tokens
A.3 Reinforcement learning (GRPO) details
We perform reinforcement learning using
GRPO (Shao et al., 2024) starting from the
cold-start SFT checkpoint.
• Initialization: SFT checkpoint (cold-start)
• Precision: BF16
• Training length: 1 epoch
• Optimizer: AdamW
• Learning rate: 5 × 10^-7
• Effective global batch size: 32 (per-device
batch size 16, gradient accumulation 2)
• Group size: 8 sampled candidate generations
per input prompt
• Max prompt length: 256 tokens
• Max completion length: 256 tokens
• Sampling temperature: 0.8
• Nucleus sampling: top-p = 0.9
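The group size matters because GRPO scores each sampled completion relative to the other samples drawn for the same prompt. A minimal sketch of that normalization, following the general form described by Shao et al. (2024), is given below; the reward values are made up, and the function name and the epsilon term are assumptions rather than details from this paper.

```python
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within each group (one row per prompt)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # eps avoids division by zero when all samples in a group get the same reward.
    return (rewards - mean) / (std + eps)


# One prompt with 8 sampled completions and made-up reward values.
rewards = np.array([[0.62, 0.71, 0.55, 0.80, 0.67, 0.49, 0.74, 0.70]])
adv = group_relative_advantages(rewards)
print(np.round(adv, 3))
```

Completions scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.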
A.4 Rationale for unified configuration
Across SFT and RL, we keep the base model,
LoRA configuration, precision, and batch scale
as aligned as possible. This reduces confounding
factors and supports attributing performance differ-
ences to the alignment paradigm itself.
B LLM-judge Prompt
B.1 LLM-judge Prompt for Headline
Generation Task
For the Tibetan headline generation task (HG), we
used the prompt in Figure 2 to evaluate two candidate
headlines generated for a Tibetan news article. The
evaluation was conducted using GPT-5.2 as the
LLM-judge.
B.2 LLM-judge Prompt for Machine
Translation Task
For the Tibetan-Chinese machine translation task
(MT), we used the prompt in Figure 3 to evaluate
two candidate translations. This evaluation was
also conducted using GPT-5.2 as the LLM-judge.
In both tasks, the evaluation was conducted blind: the judge saw only the
source text and the two anonymized candidates, with no indication of which
model produced each one, supporting an unbiased comparison of the
candidate results.
C OOD Log-Likelihood and KL
Divergence Analysis
To complement the downstream evaluation in the
main text, we further analyze forgetting from
a more mechanistic perspective using an out-of-
distribution (OOD) evaluation set. Specifically,
we use passages from the Chinese Machine Read-
ing Comprehension benchmark (CMRC) as a fixed
OOD corpus without supervision signals, repre-
senting general language understanding ability out-
side the low-resource language-expansion training
distribution. On this set, we evaluate token-level
negative log-likelihood (NLL) and KL divergence
relative to the base model. These measurements
help characterize how different training strategies
affect distributional drift beyond downstream task
metrics.
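The two quantities can be made concrete with a small numerical sketch. It assumes access to per-token logits from the base and adapted models over the same token positions; the array shapes, the random toy data, and the function names are hypothetical. The direction KL(adapted ∥ base) is inferred from the "∥ Base" notation in Table 6, and whether statistics are pooled per token or per passage is not stated, so per-token pooling here is an assumption.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_nll(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Negative log-likelihood of each target token."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets]


def token_kl(logits_p: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """KL(p || q) at each token position."""
    logp, logq = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)


def summarize(x: np.ndarray) -> dict:
    """Mean, median, 10th and 90th percentile, as reported in Table 6."""
    return {
        "mean": float(np.mean(x)),
        "median": float(np.median(x)),
        "p10": float(np.percentile(x, 10)),
        "p90": float(np.percentile(x, 90)),
    }


# Toy data: 50 token positions, vocabulary of 100 (placeholders only).
rng = np.random.default_rng(0)
T, V = 50, 100
base_logits = rng.normal(size=(T, V))
adapted_logits = base_logits + 0.1 * rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)

nll_stats = summarize(token_nll(adapted_logits, targets))
kl_stats = summarize(token_kl(adapted_logits, base_logits))
```

Reporting the 90th percentile alongside the mean is what allows the analysis to separate uniform drift from a small number of positions with large shifts.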

LLM Judge Prompt for headline generation
You are an expert linguist specializing in Tibetan-Chinese journalism.
Your task is to evaluate two candidate headlines (Candidate 1 and Candidate 2) generated for a
Tibetan news article.
### Article:
[Source Text]:
{src}
[Candidate 1]:
{cand_1}
[Candidate 2]:
{cand_2}
### Task:
Compare the two candidates.
- If Candidate 1 is significantly better, output: [[1]]
- If Candidate 2 is significantly better, output: [[2]]
- If both are equally good or bad, output: [[0]]
Provide a brief reason (in Chinese) before your decision.
### Output Format:
Reason: <brief explanation>
Decision: [[1]] or [[2]] or [[0]]
Figure 2: Prompt for headline generation evaluation.
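A judge response in the format above can be parsed and aggregated into win rates along the following lines. The sketch is illustrative: the regular expression, function names, and sample responses are assumptions, and it presumes the caller records which system was shown as Candidate 1 in each comparison.

```python
import re
from collections import Counter

# Matches the final line of the prompt's output format, e.g. "Decision: [[2]]".
DECISION_RE = re.compile(r"Decision:\s*\[\[([012])\]\]")


def parse_decision(response: str):
    """Return 1, 2, or 0 (tie) from a judge response, or None if malformed."""
    match = DECISION_RE.search(response)
    return int(match.group(1)) if match else None


def win_rates(records) -> dict:
    """records: iterable of (response_text, system_shown_as_1, system_shown_as_2)."""
    counts = Counter()
    for response, cand_1, cand_2 in records:
        decision = parse_decision(response)
        if decision is None:
            counts["invalid"] += 1
        elif decision == 0:
            counts["tie"] += 1
        else:
            counts[cand_1 if decision == 1 else cand_2] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Made-up judge outputs; candidate order varies between comparisons.
records = [
    ("Reason: (omitted)\nDecision: [[1]]", "RL", "SFT"),
    ("Reason: (omitted)\nDecision: [[1]]", "SFT", "RL"),
    ("Reason: (omitted)\nDecision: [[2]]", "SFT", "RL"),
    ("Reason: (omitted)\nDecision: [[0]]", "RL", "SFT"),
]
rates = win_rates(records)
```

Counting ties separately explains why the two win percentages for a task need not sum to 100.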
OOD token-level negative log-likelihood and KL
divergence. We first compare token-level NLL
on the CMRC OOD set. Lower NLL indicates
that the adapted model remains closer to the base
model’s general language modeling behavior on
unseen out-of-domain data. We then examine KL
divergence between the adapted models and the
base model on the same OOD set, which provides
a complementary view of distributional drift.
As shown in Table 6(a), RL leads to substantially
smaller degradation on OOD data than SFT. Com-
pared to the base model, RL increases mean NLL
by only +0.24, whereas SFT increases it by +0.64.
The difference is even more pronounced in the tail:
the 90th-percentile NLL rises by +0.62 under RL
but by +1.43 under SFT. This suggests that forget-
ting under SFT disproportionately affects harder
OOD examples, while semantic RL preserves more
stable behavior across the distribution.
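The increases quoted above follow directly from Table 6(a); the short calculation below reproduces them and also gives the median and 10th-percentile shifts. The dictionary layout and function name are illustrative.

```python
# Values copied from Table 6(a).
nll = {
    "Base": {"mean": 2.6097, "median": 1.4285, "p10": 0.3636, "p90": 5.8844},
    "RL":   {"mean": 2.8533, "median": 1.5543, "p10": 0.3848, "p90": 6.5000},
    "SFT":  {"mean": 3.2504, "median": 1.7390, "p10": 0.4636, "p90": 7.3188},
}


def delta_vs_base(model: str) -> dict:
    """Increase of each NLL statistic over the base model, rounded to 4 decimals."""
    return {k: round(nll[model][k] - nll["Base"][k], 4) for k in nll["Base"]}


rl_delta = delta_vs_base("RL")
sft_delta = delta_vs_base("SFT")
for name, delta in (("RL", rl_delta), ("SFT", sft_delta)):
    print(name, delta)
```

For both models the shift grows from the 10th percentile to the 90th, and the SFT shift is larger than the RL shift at every reported statistic.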
Table 6(b) shows that RL and SFT have com-
parable mean KL divergence to the base model,
indicating that the overall magnitude of adaptation
is similar. However, RL yields consistently lower
median and tail KL values than SFT. In particular,
the 90th-percentile KL is lower for RL (0.0839)
than for both cold-start SFT (0.0912) and final SFT
(0.0932). This suggests that semantic RL does not
merely reduce the total amount of learning, but
instead leads to more uniform distributional adapta-
tion and avoids localized large shifts that are more
characteristic of catastrophic forgetting.
Summary. Taken together, the OOD NLL and
KL analyses provide complementary mechanistic
evidence for the main findings in the paper. Com-
pared with supervised fine-tuning, semantic-reward
RL induces smaller degradation in token-level like-
lihood on unseen OOD data, while also producing
more controlled and less heavy-tailed divergence
from the base model. These observations support
our interpretation that semantic RL mitigates alignment tax not by
suppressing learning altogether, but by encouraging more uniform and less
destructive adaptation.

LLM Judge Prompt for machine translation
You are an expert linguist specializing in Tibetan-Chinese translation.
Your task is to evaluate two candidate translations (Candidate 1 and Candidate 2) for a Tibetan text.
### Article:
[Source Text]:
{src}
[Candidate 1]:
{cand_1}
[Candidate 2]:
{cand_2}
### Task:
Compare the two translations.
- If Candidate 1 is significantly better, output: [[1]]
- If Candidate 2 is significantly better, output: [[2]]
- If both are equally good or bad, output: [[0]]
Provide a brief reason (in English) before your decision.
### Output Format:
Reason: <brief explanation>
Decision: [[1]] or [[2]] or [[0]]
Figure 3: Prompt for machine translation evaluation.

(a) OOD token-level negative log-likelihood on the fixed CMRC evaluation set
Model          Mean NLL   Median NLL   P10 NLL   P90 NLL
Base Model     2.6097     1.4285       0.3636    5.8844
RL (final)     2.8533     1.5543       0.3848    6.5000
SFT (final)    3.2504     1.7390       0.4636    7.3188

(b) KL divergence to the base model on the fixed CMRC OOD set
Model                     Mean KL   Median KL   P10 KL   P90 KL
SFT (cold-start) ∥ Base   0.0471    0.0099      0.0000   0.0912
SFT (final) ∥ Base        0.0404    0.0113      0.0000   0.0932
RL (final) ∥ Base         0.0410    0.0073      0.0000   0.0839

Table 6: Mechanistic OOD analysis on a fixed CMRC evaluation set. Panel (a) reports token-level negative
log-likelihood (NLL), where lower values indicate better preservation of the base model’s out-of-domain language
modeling behavior. Panel (b) reports KL divergence to the base model, characterizing distributional drift after
adaptation.
