Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Summary
This paper proposes expanding large language models (LLMs) to low-resource languages through reinforcement learning (RL) with semantic rewards, targeting the "alignment tax": gains in a target language that come at the cost of general capabilities. The authors argue that supervised fine-tuning (SFT) enforces token-level imitation on limited data and thereby causes catastrophic forgetting. Their method instead aligns the model in semantic space, using Group Relative Policy Optimization (GRPO) with embedding-level rewards so that meaning preservation, rather than surface-form imitation, drives the update. This permits flexible realizations while limiting interference with pretrained knowledge. Training has two stages: a lightweight cold-start SFT for initial competence, followed by RL with semantic-similarity and language-consistency rewards. On Tibetan–Chinese machine translation and Tibetan headline generation, the paper reports that semantic RL incurs less alignment tax than SFT, better preserves general competence, transfers better in few-shot settings, and, in headline generation, produces outputs that LLM judges prefer despite lower n-gram overlap.
Reinforcement Learning with Semantic Rewards Enables Low-Resource
Language Expansion without Alignment Tax
Zeli Su1,2 Ziyin Zhang3 Zhou Liu5 Xuexian Song6 Zhankai Xu2
Longfei Zheng2 Xiaolu Zhang2 Rong Fu4 Guixian Xu1,7 Wentao Zhang5
1 Minzu University of China 2 Ant Group 3 Shanghai Jiao Tong University 4 University of Macau 5 Peking University
6 Institute of Automation, Chinese Academy of Sciences 7 Hainan International College, Minzu University of China
{rickamorty, guixian_xu}@muc.edu.cn daenerystargaryen@sjtu.edu.cn
zhouliu25@stu.pku.edu.cn songxuexian5@gmail.com {xuzhankai.xzk, zlf206411}@antgroup.com
yueyin.zxl@antfin.com mc46603@um.edu.mo wentao.zhang@pku.edu.cn
Abstract
Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan–Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
arXiv:2605.14366v1 [cs.CL] 14 May 2026
1 Introduction
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks and languages through large-scale pretraining and post-training alignment (DeepSeek-AI et al., 2025; Yang et al., 2025a; Team et al., 2025; OpenAI, 2025; Comanici et al., 2025). However, their capabilities remain highly uneven across languages: many languages are weakly supported, particularly for generation and reasoning-intensive tasks that require semantic abstraction rather than surface pattern matching. Improving model performance in such low-resource language settings therefore remains an important and challenging problem.
Figure 1: Token-level alignment versus semantic-space alignment in low-resource language expansion. Token-level supervised fine-tuning enforces surface-form imitation under teacher forcing, often causing catastrophic forgetting and high alignment tax. In contrast, semantic-space alignment optimizes meaning preservation with constrained reinforcement learning policy updates, allowing flexible realizations while preserving pretrained knowledge.
A common approach to improving low-resource language performance is further training on language-specific data, including continual pretraining and supervised instruction fine-tuning. Despite differences in data sources and training protocols, these methods share a common optimization paradigm: they rely on teacher-forced learning with token-level likelihood objectives, aligning the model to a target data distribution through surface-form imitation. Figure 1 illustrates the contrast between token-level alignment via surface-form imitation and semantic-space alignment based on meaning preservation, highlighting how different objectives lead to fundamentally different update behaviors under data scarcity. Much recent work on language expansion and adaptation follows this paradigm, using supervised or weakly supervised fine-tuning to inject new linguistic capabilities into pretrained models (Csaki et al., 2024; Zhao et al., 2024). Related efforts based on continued pretraining or language-specific model construction similarly perform strong distribution-level updates on limited and domain-constrained data (Almeida et al., 2025; Bari et al., 2025).
While effective under abundant and diverse supervision, token-level distribution matching becomes problematic in low-resource language settings. Available training data are often limited in size, narrow in domain, and distributionally biased. Optimizing likelihood on such data encourages overly confident and rigid parameter updates, amplifying overfitting and interfering with representations learned during pretraining. Empirically, this interference often manifests as catastrophic forgetting: improvements in the target language are accompanied by degradation in existing high-resource language capabilities, a phenomenon that has been systematically observed in multilingual fine-tuning and low-resource representation learning settings (Liu and Niehues, 2025; Schmidt, 2025).
We argue that this phenomenon is not merely an optimization artifact but a structural outcome of the alignment objective itself. When alignment is equated with surface-form imitation on narrow distributions, representational capacity is aggressively reallocated, leading to an alignment tax: gains in low-resource language performance achieved at the expense of general competence. This issue persists regardless of the adaptation method (e.g., continual pretraining or instruction tuning) as long as the objective enforces token-level matching on sparse data (Yamaguchi et al., 2025).
Consequently, we propose shifting perspective to view language expansion not merely as adaptation, but as an alignment problem under sparse supervision. To address this, we introduce a semantic-space alignment paradigm that prioritizes meaning preservation over rigid surface-form imitation. We operationalize this framework using Group Relative Policy Optimization (GRPO) (Shao et al., 2024), employing embedding-level semantic similarity as the primary reward signal. Unlike teacher-forced training, this approach encourages the model to explore diverse linguistic realizations that maintain semantic equivalence. Crucially, by optimizing based on relative rewards within a sampled group, our method inherently incorporates the stability constraints of trust-region optimization (Schulman et al., 2017; Rafailov et al., 2023), enabling the acquisition of low-resource capabilities while strictly limiting the destructive interference typical of unconstrained likelihood maximization.
We evaluate our approach on Tibetan–Chinese machine translation (MT) and Tibetan headline generation (HG). Empirical results demonstrate that semantic-reward-driven GRPO achieves a superior trade-off between adaptation and preservation. In the MT task, our method substantially reduces alignment tax, outperforming the strong SFT baseline by +5.15 points on the dominant-language CMRC benchmark. Similarly, in the HG task, despite lower n-gram overlap, our model is preferred by LLM-based judges with a +16.1% higher win rate compared to SFT. These findings suggest that semantic-space alignment offers a safer and more robust paradigm for improving low-resource language performance under data scarcity.
In summary, the main contributions of this paper are:
• We propose a semantic-space alignment paradigm that utilizes Group Relative Policy Optimization (GRPO) with embedding-level rewards to decouple meaning preservation from surface-form imitation.
• We demonstrate that this approach virtually eliminates the alignment tax, enabling significant low-resource gains while maintaining the model's general capabilities and pretrained knowledge.
• We show that our method produces semantically superior outputs that are preferred by LLM judges over SFT baselines, despite having lower n-gram overlap with rigid references.
• We validate that semantic RL yields more transferable representations, as evidenced by stronger few-shot generalization to downstream tasks compared to supervised methods.
2 Related Work
2.1 Low-Resource Language Adaptation and Expansion
Post-training language expansion and adaptation typically start from a pretrained foundation model and further train on language-specific data. Prior work studies supervised fine-tuning or instruction tuning for transferring language capabilities and scaling with data and model size (Csaki et al., 2024; Zhao et al., 2024), as well as continued pretraining and language-specific model construction that emphasize corpus selection and composition (Almeida et al., 2025; Bari et al., 2025). Across these settings, selective adaptation on narrow distributions can induce catastrophic forgetting, degrading performance on non-target languages or tasks (Liu and Niehues, 2025; Schmidt, 2025).
Several mitigation strategies focus on constraining update magnitude or protecting previously learned capabilities, such as parameter-efficient fine-tuning (e.g., LoRA-style low-rank updates) and source-shielded adaptation (Yang et al., 2025b; Yamaguchi et al., 2025). While these methods can reduce interference, they generally retain teacher-forced token-level likelihood objectives. Our work is complementary: we reconsider the alignment objective itself by optimizing semantic consistency via embedding-level rewards, aiming to improve weak-language capability with lower alignment tax.
2.2 Reinforcement Learning for LLM Alignment
Reinforcement learning (RL) is commonly used in LLM alignment when optimization objectives are sequence-level or non-differentiable, enabling learning beyond token-level supervised imitation (Christiano et al., 2017; Ouyang et al., 2022). A core advantage of RL-based alignment methods is the use of constrained policy updates, such as trust-region or KL-regularized optimization, which limits drift from pretrained representations and improves stability (Schulman et al., 2015, 2017). Recent variants retain this constrained-update principle while improving efficiency or flexibility, including direct preference optimization (Rafailov et al., 2023) and group-based policy optimization methods (Shao et al., 2024), which we adopt in this work to enable controlled alignment towards semantic-level objectives.
3 Method
3.1 Problem Formulation: Semantic-Space Alignment
We study low-resource language expansion as an alignment problem. Given a pretrained instruction-following language model $\pi_{\text{base}}$, our goal is to acquire new capabilities in a low-resource language while preserving existing competencies in dominant, high-resource languages.
Conventional supervised fine-tuning aligns models by maximizing token-level likelihood under a target data distribution. In low-resource settings, where the distribution is narrow and biased, this objective enforces surface-form imitation and often leads to overconfident updates and catastrophic forgetting. We instead frame alignment as semantic-space alignment: model outputs are considered correct if they preserve meaning, regardless of their specific surface realization. This formulation explicitly decouples semantic adequacy from token-level matching and allows multiple valid expressions of the same intent.
Under this perspective, alignment is defined by semantic consistency rather than distribution matching. Our objective is therefore to optimize the model to produce outputs that are semantically equivalent to reference texts, while limiting interference with representations learned during pretraining.
3.2 Two-Stage Training Paradigm
To operationalize semantic-space alignment in low-resource settings, we adopt a two-stage training paradigm.
Stage 1: Cold-start supervised fine-tuning. We first perform a lightweight supervised fine-tuning step on a small subset of low-resource data to obtain an initial policy $\pi_{\text{init}}$. Specifically, we fine-tune the base model on 5k training instances for two epochs. The goal of this stage is not to achieve strong task performance, but to bootstrap minimal output competence in the target language, such as producing text in the correct script and maintaining basic language consistency. This cold-start initialization allows the model to reliably generate non-degenerate outputs in the low-resource language, ensuring that subsequent semantic rewards are meaningful and that reinforcement learning does not collapse into uninformative exploration.
Stage 2: Reinforcement learning with semantic rewards. Starting from $\pi_{\text{init}}$, we perform reinforcement learning to align the model in semantic space. In this stage, we utilize the remaining training data to drive learning through semantic rewards rather than token-level supervision. Reinforcement learning is conducted for a single epoch, during which the model is encouraged to explore diverse surface realizations while preserving semantic equivalence to reference texts. Constrained policy optimization is applied throughout training to control update magnitude, enabling the model to acquire low-resource language capabilities while minimizing destructive interference with pretrained representations.
3.3 Reinforcement Learning for Semantic Alignment
Optimizing semantic alignment is inherently a sequence-level problem over discrete outputs. The embedding-based semantic rewards used in our framework are not directly differentiable with respect to model parameters, making standard supervised learning objectives unsuitable. Reinforcement learning therefore provides a natural and principled framework for optimizing such non-differentiable, sequence-level objectives, allowing direct optimization of semantic consistency rather than token-level likelihood.
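One standard way to see why a non-differentiable reward can still be optimized is the score-function (REINFORCE) identity, sketched below. It is textbook background rather than a formula stated in this paper; here $R$ stands for any sequence-level reward, such as the embedding-based one used in this work, and $\pi_\theta$ is the policy being trained.

```latex
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(y) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \big]
```

The gradient passes only through $\log \pi_\theta(y \mid x)$, so $R$ can be computed on sampled text by a frozen embedding model without being differentiable itself.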
Beyond enabling optimization of semantic rewards, reinforcement learning also plays a critical role in preserving pretrained knowledge during low-resource adaptation. In contrast to supervised fine-tuning, which performs unconstrained likelihood maximization on narrow data distributions, constrained reinforcement learning methods explicitly limit policy updates. This controlled optimization is crucial for reducing destructive interference and mitigating catastrophic forgetting, making reinforcement learning particularly well-suited for semantic-space alignment under data scarcity.
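A minimal sketch of how such a constraint can limit an update, assuming a PPO-style clipped objective with advantages normalized within a sampled group (the recipe associated with GRPO, Shao et al., 2024). The function names, the clip range of 0.2, and the example rewards are illustrative assumptions, not values reported in this paper.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group to zero mean and unit std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical rewards for a group of 4 candidates sampled for one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# The best candidate has a positive advantage; a large jump in its probability
# (ratio = e, about 2.72) is credited only up to the clipped ratio of 1.2.
best = advantages[2]
unclipped = math.exp(1.0) * best
capped = clipped_surrogate(logp_new=-1.0, logp_old=-2.0, advantage=best)
```

Because the objective stops rewarding probability changes beyond the clip range, each update stays close to the policy that generated the samples, which is the sense in which the update is constrained.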
We instantiate reinforcement learning using Group Relative Policy Optimization (GRPO, Shao et al., 2024), a value-free variant of PPO-style constrained optimization. For each input prompt $x$, we sample a group of candidate outputs $\{y^{(k)}\}_{k=1}^{K}$ from the current policy $\pi_\theta(\cdot \mid x)$, compute their corresponding rewards, and update the policy based on relative comparisons within the group. GRPO inherits the key stabilization mechanisms of PPO, including trust-region-style constraints that limit policy drift between updates, while avoiding the need for an explicit value function. These properties make GRPO a practical and stable choice for semantic alignment, enabling effective learning from semantic rewards while maintaining existing language capabilities.
3.4 Semantic Reward Design
A central component of our framework is a semantic reward that explicitly defines the alignment objective for low-resource language expansion. Unlike supervised fine-tuning, which implicitly aligns models through token-level likelihood, our goal is to directly guide learning toward semantic consistency. The reward therefore serves not merely as an optimization signal, but as the primary mechanism that determines what the model is encouraged to learn.
Semantic embedding model and suitability for reinforcement learning. To instantiate the semantic reward, we employ a multilingual sentence-level embedding model trained under a contrastive bilingual alignment objective. The model is adapted on parallel sentence pairs, where translations are treated as positive examples and in-batch samples provide implicit negatives. Rather than training an encoder from scratch or enforcing surface-form similarity, this adaptation strategy emphasizes meaning preservation and fine-grained semantic discriminability across different linguistic realizations.
We empirically observe that bilingual contrastive training yields stronger semantic structure than monolingual adaptation, improving both cross-lingual alignment and intra-language separability. This property is particularly important for reinforcement learning, where the reward signal must reflect graded semantic differences rather than coarse topical similarity. As a sanity check, we construct a small diagnostic set of sentence pairs spanning different degrees of semantic equivalence and find that the embedding model assigns similarity scores that consistently correlate with these graded relationships. This indicates that the resulting embedding space is suitable for use as a semantic reward for guiding alignment in reinforcement learning.
3.4.1 Embedding-Level Semantic Similarity Reward
The primary learning signal in our framework is an embedding-level semantic similarity reward. Let $f(\cdot)$ denote the sentence embedding model described above, which maps text to normalized vector representations. Given a generated output $y$ and a reference text $y^*$, we compute their semantic similarity using cosine similarity:
\[ s(y, y^*) = \cos\big(f(y), f(y^*)\big). \tag{1} \]
This reward directly reflects our desired learning direction: outputs are encouraged to preserve meaning, regardless of surface realization. In contrast to token-level likelihood objectives, this formulation treats semantically equivalent paraphrases as equally valid and explicitly avoids overfitting to reference form.
To stabilize optimization and focus learning on meaningful improvements beyond minimal adequacy, we apply a threshold-and-rescale shaping function:
\[ R_{\text{sim}}(y, y^*) = \begin{cases} 0, & s(y, y^*) \le \tau, \\ \dfrac{s(y, y^*) - \tau}{1 - \tau}, & s(y, y^*) > \tau, \end{cases} \tag{2} \]
where $\tau$ corresponds to a minimal semantic adequacy level achieved after cold-start fine-tuning. This shaping ensures that reinforcement learning primarily refines semantic quality rather than amplifying noise from low-quality generations.
3.4.2 Language Consistency Reward
Because the embedding model is multilingual, optimizing semantic similarity alone may reward mixed-language or partially off-target outputs. To prevent this reward hacking behavior, we introduce a language consistency reward based on a rule-based script check using Unicode ranges and regular expressions:
\[ R_{\text{lang}}(y) = \begin{cases} 0, & \text{language mixed}, \\ 1, & \text{language consistent}. \end{cases} \tag{3} \]
This acts as a hard constraint, ensuring that semantic optimization is carried out strictly within the target low-resource language space.
Final reward. The final reward combines semantic similarity and language consistency:
\[ R(y, y^*) = \lambda_{\text{sim}} R_{\text{sim}}(y, y^*) + \lambda_{\text{lang}} R_{\text{lang}}(y). \tag{4} \]
Together, these components define our semantic alignment objective: the model is encouraged to improve semantic adequacy while being strictly constrained to produce linguistically consistent outputs in the low-resource language. In practice, we assign a larger weight to semantic similarity ($\lambda_{\text{sim}} = 1.5$) than to language consistency ($\lambda_{\text{lang}} = 1.0$), reflecting our design choice that semantic preservation constitutes the primary learning objective, while language consistency serves as a necessary constraint to prevent degenerate or off-language generations.
4 Experiments
We conduct a series of experiments to evaluate whether semantic-reward-driven reinforcement learning (RL) provides a better trade-off between low-resource language adaptation and preservation of existing capabilities compared to supervised fine-tuning (SFT). Our experiments are designed to answer three research questions: (1) whether RL effectively acquires low-resource language capabilities, (2) how RL and SFT differ in the trade-off between task performance and alignment tax, and (3) whether RL learns more transferable representations under data scarcity.
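As a concrete reference for the reward used in these experiments, Eqs. 1 to 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the threshold value 0.5, the specific Unicode ranges, the rule that any CJK or Latin character counts as "mixed" for a Tibetan target, and all function names are hypothetical. Only the weights 1.5 and 1.0 come from the text. Toy 2-d vectors stand in for the frozen sentence encoder's embeddings.

```python
import math
import re

# Script patterns for the illustrative check (assumed; the paper does not list its rules).
TIBETAN = re.compile(r"[\u0F00-\u0FFF]")  # Tibetan Unicode block
CJK = re.compile(r"[\u4E00-\u9FFF]")      # CJK Unified Ideographs
LATIN = re.compile(r"[A-Za-z]")


def cosine(u, v):
    """Cosine similarity between two embedding vectors (Eq. 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shaped_similarity(s, tau):
    """Threshold-and-rescale shaping (Eq. 2): 0 at or below tau, linear up to 1 above it."""
    return 0.0 if s <= tau else (s - tau) / (1.0 - tau)


def language_reward(text, target=TIBETAN, foreign=(CJK, LATIN)):
    """Binary script check (Eq. 3): 1 only if the target script appears and no foreign script does."""
    has_target = bool(target.search(text))
    has_foreign = any(p.search(text) for p in foreign)
    return 1.0 if has_target and not has_foreign else 0.0


def total_reward(emb_out, emb_ref, text, tau=0.5, lam_sim=1.5, lam_lang=1.0):
    """Weighted sum of both rewards (Eq. 4); tau=0.5 is a placeholder value."""
    r_sim = shaped_similarity(cosine(emb_out, emb_ref), tau)
    return lam_sim * r_sim + lam_lang * language_reward(text)


tibetan_text = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"  # a short Tibetan-script string

# Identical embeddings give s = 1, so R_sim = 1 and the total is 1.5 + 1.0 = 2.5.
r_ok = total_reward([1.0, 0.0], [1.0, 0.0], tibetan_text)
# Appending Latin letters zeroes the language term, leaving 1.5.
r_mixed = total_reward([1.0, 0.0], [1.0, 0.0], tibetan_text + " abc")
```

Note that with an additive combination, a mixed-language output that is semantically close to the reference still earns the similarity term; the language term lowers its reward rather than vetoing it.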
4.1 Experimental Setup Base model and adaptation.
Chunk 12 Ā· 1,994 chars
ree research questions: (1) whether RL ef- fectively acquires low-resource language capabil- ities, (2) how RL and SFT differ in the trade-off between task performance and alignment tax, and (3) whether RL learns more transferable represen- tations under data scarcity. 4.1 Experimental Setup Base model and adaptation. All experiments are conducted on Qwen3-4B with parameter- efficient fine-tuning via LoRA (Hu et al., 2022). Unless otherwise specified, we apply LoRA to all linear projection layers in self-attention and MLP blocks. We use a LoRA rank of r = 64, scaling factor α = 128, and dropout rate of 0.05. Supervised fine-tuning (SFT). SFT is trained for three epochs in BF16 with a global batch size of 32, using AdamW (Loshchilov and Hutter, 2019) with learning rate 2 Ć 10ā5 and a cosine schedule (warmup ratio 0.1). Semantic reward model. The semantic reward described in Section 3.4 is instantiated using a bilin- gual sentence-embedding model built on top of CINO (Yang et al., 2022), a Tibetan-enhanced ex- tension of XLM-R (Conneau et al., 2020). We adapt CINO into a sentence-level encoder using SENTENCETRANSFORMER, and further special- ize it on ChineseāTibetan parallel data to produce embedding-based semantic similarity scores. The resulting encoder is used as a frozen reward model during RL and is not jointly optimized with the policy model. Reinforcement learning (GRPO). Reinforce- ment learning is performed with GRPO (Shao et al., 2024) starting from the SFT checkpoint, trained for one epoch in BF16 with AdamW and learning rate 5 Ć 10ā7 (effective global batch size 32). For each prompt, we sample 8 candidates with temperature 0.8 and top-p 0.9, using max prompt/completion lengths of 256 tokens. Controlled comparison. This unified setup en- sures that observed differences primarily reflect the alignment strategy (semantic-reward-driven RL vs. SFT), rather than mismatched optimization or adaptation configurations. Training and hyperpa- rameter details are
Chunk 13 Ā· 1,999 chars
0.9, using max prompt/completion lengths of 256 tokens. Controlled comparison. This unified setup en- sures that observed differences primarily reflect the alignment strategy (semantic-reward-driven RL vs. SFT), rather than mismatched optimization or adaptation configurations. Training and hyperpa- rameter details are provided in Appendix A. -- 5 of 15 -- 4.1.1 Tasks and Datasets We evaluate our approach on two representative Ti- betan low-resource generation tasks: cross-lingual machine translation (MT) and monolingual head- line generation (HG). Machine Translation (MT). For Tibetanā Chinese machine translation, we use an internal parallel corpus collected for training a visionā language model (Wu et al., 2019). Specifically, the corpus consists of TibetanāChinese sentence pairs translated as part of the pretraining data construction pipeline for the VLM, rather than annotations produced by the model itself. We repurpose this parallel data as supervised training material for machine translation in our experiments. As the corpus was originally curated to support VLM pretraining, a large portion of the data is grounded in visual descriptions, resulting in a relatively narrow and domain-constrained distribution. While this dataset does not aim to be a comprehensive translation benchmark, it reflects a realistic low-resource scenario with limited domain diversity and is therefore suitable for studying alignment behavior under data scarcity. Headline Generation (HG). For Tibetan head- line generation, we use the Tibetan subset of the CMHG dataset (Xu et al., 2025). Due to the fine- grained tokenization of Tibetan in the Qwen to- kenizer, raw samples often result in excessively long sequences and high memory consumption. We therefore filter out samples exceeding 1024 tokens and retain shorter instances for both training and evaluation. After filtering, the dataset con- tains 16,449 training samples and 621 test samples, which are used consistently across all headline
Chunk 14 Ā· 1,993 chars
en result in excessively long sequences and high memory consumption. We therefore filter out samples exceeding 1024 tokens and retain shorter instances for both training and evaluation. After filtering, the dataset con- tains 16,449 training samples and 621 test samples, which are used consistently across all headline gen- eration experiments. 4.1.2 Evaluation Protocols We adopt a multi-dimensional evaluation proto- col to capture both surface-level accuracy and se- mantic quality. For task performance, we report standard reference-based metrics (BLEU for MT and ROUGE for HG) as well as embedding-based semantic similarity. To assess semantic quality beyond reference matching, we conduct blind pair- wise evaluations using an LLM-as-a-Judge, where judgments are produced by GPT-5.2 under a fixed evaluation prompt; the full judging prompt and evaluation policy are provided in Appendix B. To quantify alignment tax, we evaluate all mod- els on a dominant-language benchmark (Chinese Model BLEU-4 Similarity Cold-start SFT 0.3953 0.5593 RL (Ours) 0.4519 0.7164 Table 1: Experiment 1 results on TibetanāChinese ma- chine translation. Model ROUGE-L Similarity Cold-start SFT 0.2204 0.5774 RL (Ours) 0.2530 0.6404 Table 2: Experiment 1 results on Tibetan headline gen- eration. CMRC, Cui et al., 2019) before and after adapta- tion and report performance changes relative to the base model. 4.2 Experiment 1: Effectiveness of Semantic-Reward RL We first evaluate whether semantic-reward-driven reinforcement learning (RL) effectively acquires low-resource language capabilities beyond mini- mal supervised initialization. Specifically, we com- pare RL against the cold-start SFT model on both TibetanāChinese machine translation (MT) and Ti- betan headline generation (HG). Machine Translation. Table 1 reports the results on TibetanāChinese MT. Starting from the same cold-start SFT checkpoint trained on 5k parallel sentence pairs, RL is further trained on approxi- mately 90k additional
Chunk 15 Ā· 1,999 chars
tart SFT model on both TibetanāChinese machine translation (MT) and Ti- betan headline generation (HG). Machine Translation. Table 1 reports the results on TibetanāChinese MT. Starting from the same cold-start SFT checkpoint trained on 5k parallel sentence pairs, RL is further trained on approxi- mately 90k additional samples using semantic re- wards. Compared to the cold-start baseline, RL yields consistent improvements in both reference- based accuracy and semantic similarity. BLEU-4 increases from 0.3953 to 0.4519, while semantic similarity improves substantially from 0.5593 to 0.7164. Headline Generation. We observe a similar trend on Tibetan headline generation. As shown in Table 2, the RL model trained on approximately 15k samples consistently outperforms the cold-start SFT baseline in both ROUGE-L and semantic sim- ilarity. In particular, ROUGE-L improves from 0.2204 to 0.2530, while semantic similarity in- creases from 0.5774 to 0.6404. Analysis. Across both translation and generation tasks, semantic-reward-driven RL consistently im- proves performance over the cold-start SFT base- line. Notably, the improvements are particularly pronounced in semantic similarity, suggesting that -- 6 of 15 -- RL primarily refines meaning preservation rather than merely increasing surface-level overlap with references. These results confirm that embedding- level semantic rewards constitute a sufficiently in- formative alignment signal, enabling effective low- resource language learning beyond minimal super- vised initialization. 4.3 Experiment 2: Trade-off Between Task Performance and Alignment Tax In this experiment, we compare semantic-reward- driven RL with a Strong SFT baseline to charac- terize the trade-off between task performance and preservation of existing general capabilities (i.e., alignment tax). Unlike the cold-start SFT model used in Experiment 1 ā which is trained only on a small 5k subset of low-resource data and serves solely as the initialization policy
for RL, the Strong SFT model is trained on the full available training data (i.e., the same combined dataset used by cold-start SFT + RL) under the same optimization and LoRA configuration, representing the best-effort supervised adaptation outcome. Table 3 reports results on Tibetan-Chinese machine translation (MT) and Tibetan headline generation (HG), including task metrics, semantic similarity, and dominant-language performance on CMRC as a proxy for alignment tax. We additionally report LLM-based preference as a reference-free measure of semantic quality.
Task 1: Tibetan-Chinese Machine Translation (MT). On MT, Strong SFT achieves higher reference-based scores, improving BLEU from 0.4519 (RL) to 0.6006 and semantic similarity from 0.7164 to 0.8282, reflecting stronger surface alignment to references. However, this advantage is less pronounced under LLM-based judgment: Strong SFT is preferred in 59.2% of cases, while the RL-aligned model still wins 33.5%, indicating competitive semantic quality despite lower BLEU.
These metric gains come with a substantial alignment tax. After adaptation, SFT suffers marked degradation on CMRC (41.82 Avg / 62.99 F1), whereas RL preserves general capability better (46.97 Avg / 65.79 F1, i.e., +5.15 Avg and +2.80 F1 relative to SFT). Overall, token-level imitation inflates reference-based MT metrics at the cost of forgetting, while constrained semantic alignment via RL yields safer updates with lower alignment tax.
Task 2: Tibetan Headline Generation (HG). On HG, Strong SFT again achieves higher reference-based scores (ROUGE-L 0.3095 vs. 0.2530 for RL), while the semantic similarity gap remains small (0.6499 vs. 0.6404). Both methods largely preserve dominant-language performance, with only minor differences in CMRC. However, under LLM-based judgment, RL is strongly preferred: it wins 51.2% of pairwise comparisons versus 35.1% for SFT (+16.1 points). This suggests that in open-ended generation, semantic-reward-driven RL learns generation behaviors that go beyond reference imitation, capturing alternative yet semantically appropriate realizations that the LLM judge, used here as a proxy for human preference, favors despite lower n-gram overlap.
Overall analysis. Across both tasks, Table 3 reveals a consistent pattern: supervised fine-tuning excels at maximizing reference-based metrics, while semantic-reward-driven RL better preserves general capabilities and, on HG, improves semantic quality under preference-based evaluation. In MT, this trade-off manifests primarily as alignment tax, where SFT's metric gains coincide with substantial forgetting. In HG, where multiple valid realizations exist, RL is consistently preferred by LLM judges despite lower ROUGE, indicating that it learns generation patterns not anchored to a single reference form. Together, these results suggest a fundamental mismatch between reference-based metrics and true semantic quality in low-resource settings. By aligning models in semantic space rather than enforcing surface imitation, constrained RL enables the acquisition of alternative, semantically valid generation paradigms that are poorly reflected by n-gram metrics but better capture human preferences.
| Model | Metric (BLEU for MT, ROUGE-L for HG) | Similarity | CMRC Avg | CMRC F1 | LLM-Judge Win (%) |
|---|---|---|---|---|---|
| Task 1: Tibetan-Chinese Machine Translation (MT) | | | | | |
| Strong SFT | 0.6006 | 0.8282 | 41.82 | 62.99 | 59.2 |
| RL (Ours) | 0.4519 | 0.7164 | 46.97 | 65.79 | 33.5 |
| Gap (RL vs. SFT) | -0.1487 | -0.1118 | +5.15 | +2.80 | -25.7 |
| Task 2: Tibetan Headline Generation (HG) | | | | | |
| Strong SFT | 0.3095 | 0.6499 | 44.20 | 65.30 | 35.1 |
| RL (Ours) | 0.2530 | 0.6404 | 45.10 | 65.20 | 51.2 |
| Gap (RL vs. SFT) | -0.0565 | -0.0095 | +0.90 | -0.10 | +16.1 |
Table 3: Trade-off Analysis: Task Performance vs. Alignment Tax. Metric and Similarity measure task performance, CMRC Avg and CMRC F1 measure general capability (alignment tax), and LLM-Judge Win measures semantic quality. For machine translation, Strong SFT achieves higher task metrics but incurs a heavy alignment tax, reflected by a significant drop in CMRC performance. In contrast, RL preserves general language capabilities with substantially higher CMRC scores while sacrificing surface-level metrics. For headline generation, both methods exhibit comparable general capability preservation, but RL significantly outperforms SFT in semantic quality as measured by LLM-based judgment, despite lower n-gram-based scores.
4.4 Experiment 3: Few-Shot Transferability
Finally, we examine whether the stronger MT metrics observed for SFT in Experiment 2 translate into better cross-task generalization. While SFT achieves higher reference-based scores on MT, Experiment 2 also reveals a clear mismatch between such metrics and semantic quality, as reflected by alignment tax and LLM-based judgment. This raises a natural question: do higher MT scores indicate genuinely stronger Tibetan representations, or do they primarily capture task- and reference-specific surface patterns that are unlikely to transfer? To answer this, we design a few-shot transfer test from MT to HG. Concretely, we take the best MT checkpoints
produced by Strong SFT and by RL, and fine-tune each of them on the HG task using only 1,000 training samples under identical training settings. This setting stresses representation reuse: with limited HG supervision, a model that learns more general and semantically grounded Tibetan features during MT should adapt more effectively than a model whose gains are dominated by task-specific imitation.
Table 4 reports the results. Both MT-adapted models improve substantially over the base model, confirming that MT training provides useful Tibetan signal for downstream generation. However, despite MT-SFT's strong MT performance, it does not retain a corresponding advantage in transfer. The RL-initialized model achieves a higher semantic similarity score (0.5690 vs. 0.5456), while maintaining comparable ROUGE-L (0.1918 vs. 0.1935). This indicates that the MT-SFT model's improvements are, at least in part, tied to MT-specific surface alignment and do not generalize as strongly to a different open-ended generation task, whereas semantic-reward-driven RL yields representations that transfer better under limited supervision.
| Initialization | ROUGE-L | Similarity |
|---|---|---|
| Base Model | 0.1585 | 0.4695 |
| MT-SFT | 0.1935 | 0.5456 |
| MT-RL (Ours) | 0.1918 | 0.5690 |
Table 4: Few-shot transfer from MT to HG with 1,000 HG training samples.
Overall, this experiment supports our central claim that semantic-space alignment provides a safer and more effective adaptation paradigm in low-resource settings: while SFT can produce larger in-task metric gains, RL achieves more robust generalization across tasks, consistent with the practical needs of low-resource language expansion. To complement these downstream results, we further provide a mechanistic analysis of forgetting in Appendix C, where we examine OOD token-level negative log-likelihood and KL divergence to the base model on a fixed CMRC evaluation set. The results are consistent with our main findings and suggest that semantic RL yields more controlled distributional adaptation than SFT.
4.5 Reward Ablation
To further understand the role of reward design under the same reinforcement learning framework, we conduct a reward ablation study on the unified Tibetan-Chinese machine translation task. Specifically, under identical model, data, and training settings, we compare several reward combinations to isolate how different reward components affect semantic alignment performance.
Table 5 shows that the proposed reward combination, consisting of embedding similarity and language consistency, achieves the best semantic similarity among all configurations. This result suggests that the performance gain does not come from reinforcement learning alone, but depends critically on how the reward is defined.
| Reward | Similarity |
|---|---|
| Embedding + LC (Ours) | 0.7164 |
| BLEU + LC | 0.6375 |
| BLEU + Embedding + LC | 0.6175 |
| BLEU + Embedding | 0.2312 |
Table 5: Reward ablation results on Tibetan-Chinese machine translation under matched settings. LC denotes the language consistency reward.
First, language consistency is necessary for stable semantic optimization in the multilingual setting. Without language consistency, the model frequently produces mixed Tibetan-Chinese outputs during exploration, and similarity collapses to 0.2312 for BLEU + Embedding in Table 5, indicating that semantic similarity alone is insufficient to constrain generation into the target low-resource language space.
Second, BLEU-based rewards consistently weaken performance. As a surface-form overlap objective, BLEU introduces token-level pressure that restricts semantic exploration and partially restores the rigidity of supervised imitation. Even when combined with embedding reward and language consistency, it still degrades performance relative to the simpler Embedding + LC formulation (0.6175 vs. 0.7164). This suggests that token-overlap rewards are not well aligned with the objective of semantic-space alignment, where multiple surface realizations may preserve the same meaning.
We also tested an additional length-constraint reward for translation. However, it did not improve generation quality and reduced CMRC performance by approximately 2 points, indicating that excessive output constraints may further harm both capability retention and semantic alignment. Overall, these results show that a lightweight semantic reward, together with a necessary target-language constraint, is more effective than stacking additional surface-form objectives.
5 Conclusion
This paper argues that low-resource language expansion should be treated as an alignment problem, where the core objective is semantic consistency rather than token-level imitation. We propose a semantic-space alignment paradigm instantiated with reinforcement learning driven by embedding-level semantic similarity and a strict language-consistency constraint. Experiments on Tibetan-Chinese machine translation and Tibetan headline generation show that semantic-reward-driven RL acquires low-resource language capabilities while substantially reducing alignment tax, preserving dominant-language competence with near-zero forgetting. We further observe a consistent mismatch between reference-based metrics and semantic quality: despite weaker n-gram overlap, RL is often preferred by LLM-based judges in open-ended generation and yields more transferable representations under few-shot transfer. Taken together, these findings suggest that semantic-space alignment offers a scalable path for extending LLMs to weakly supported languages under data scarcity, shifting low-resource adaptation from distribution matching toward meaning-centered alignment.
Limitations
While modern LLMs nominally support many languages, identifying a language with weak model performance often implies severe data scarcity, as in the case of Tibetan. The limited and domain-narrow nature of available data (e.g., translation corpora) may cause supervised fine-tuning to achieve artificially high in-domain metrics that do not fully reflect real-world generalization, making some degree of overfitting unavoidable.
Ethical Considerations
This work promotes inclusive language modeling by extending LLMs to low-resource languages such as Tibetan. All data are publicly available or internally licensed and contain no personal or sensitive information. No human participants were involved, and all evaluations were performed automatically using LLM-based judges. While our method
reduces overfitting and potential bias from narrow supervision, residual pretrained biases may persist. Future research should further assess fairness and bias in low-resource settings.
Acknowledgments
This work was supported by the Hainan Provincial Joint Project of the Li'an International Education Innovation Pilot Zone (Grant No. 624LALH006).
References
Thales Sales Almeida, Rodrigo Nogueira, and Hélio Pedrini. 2025. Curió-edu 7b: Examining data selection impacts in llm continued pretraining. Preprint, arXiv:2512.12770.
M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham Abdullah Alyahya, Sultan AlRashed, Faisal Abdulrahman Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Saad Amin Hassan, Majed Alrubaian, Ali Alammari, Zaki Alawami, and 8 others. 2025. Allam: Large language models for arabic and english. In International Conference on Learning Representations (ICLR) 2025. Poster.
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems.
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Preprint, arXiv:2507.06261.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Zoltan Csaki, Bo Li, Jonathan Lingjie Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, and Urmish Thakker. 2024. SambaLingo: Teaching large language models new languages. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 1–21, Miami, Florida, USA. Association for Computational Linguistics.
Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2019. A span-extraction dataset for Chinese machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5886–5891, Hong Kong, China. Association for Computational Linguistics.
DeepSeek-AI and 1 others. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. Preprint, arXiv:2512.02556.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
Danni Liu and Jan Niehues. 2025. Conditions for catastrophic forgetting in multilingual translation. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 347–359. Association for Computational Linguistics.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
OpenAI. 2025. Introducing gpt-4.1 in the api. OpenAI Blog. Accessed: 2025-12-25.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
Fabian David Schmidt. 2025. Robust and Scalable Cross-Lingual Transfer. Ph.D. thesis, Bayerische Julius-Maximilians-Universitaet Wuerzburg (Germany).
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. 2015. Trust region policy optimization. arXiv preprint arXiv:1502.05477.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Zhihong Shao and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
Kimi Team, Yifan Bai, and 1 others. 2025. Kimi k2: Open agentic intelligence. Preprint, arXiv:2507.20534.
Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, and 1 others. 2019. Large-scale datasets for going deeper in image understanding. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1480–1485. IEEE.
Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, Xu Han, Ting Zhang, and Yushuang Dong. 2025. CMHG: A dataset and benchmark for headline generation of minority languages in China. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12350–12357, Suzhou, China. Association for Computational Linguistics.
Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, and Nikolaos Aletras. 2025. Mitigating catastrophic forgetting in target language adaptation of llms via source-shielded updates. Preprint, arXiv:2512.04844.
An Yang and 1 others. 2025a. Qwen3 technical report. Preprint, arXiv:2505.09388.
Menglin Yang, Jialin Chen, Yifei Zhang, Jiahong Liu, Jiasheng Zhang, Qiyao Ma, Harshit Verma, Qianru Zhang, Min Zhou, Irwin King, and Rex Ying. 2025b. Low-rank adaptation for foundation models: A comprehensive review. CoRR, abs/2501.00365.
Ziqing Yang, Zihang Xu, Yiming Cui, Baoxin Wang, Min Lin, Dayong Wu, and Zhigang Chen. 2022. CINO: A Chinese minority pre-trained language model. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3937–3949, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024. Llama beyond english: An empirical study on language capability transfer. CoRR, abs/2401.01055.
A Training and Hyperparameter Details
A.1 Base model and LoRA configuration
We adopt Qwen3-4B-Instruct as the base model and apply LoRA for all SFT and RL experiments. Unless otherwise specified, LoRA adapters are inserted into:
• Self-attention projections: q_proj, k_proj, v_proj, o_proj
• MLP projections: gate_proj, up_proj, down_proj
We use LoRA rank r = 64, scaling factor α = 128, and dropout 0.05 throughout all experiments.
A.2 Supervised fine-tuning (SFT) details
• Initialization: Qwen3-4B-Instruct
• Precision: BF16
• Training length: 3 epochs
• Optimizer: AdamW
• Learning rate: 2 × 10^-5
• Scheduler: cosine decay with warmup ratio 0.1
• Global batch size: 32 (2 GPUs × per-device batch size 8 × gradient accumulation 2)
• Sequence length: determined by the data and model defaults, typically 1024–2048 tokens
A.3 Reinforcement learning (GRPO) details
We perform reinforcement learning using GRPO (Shao et al., 2024) starting from the cold-start SFT checkpoint.
• Initialization: SFT checkpoint (cold-start)
• Precision: BF16
• Training length: 1 epoch
• Optimizer: AdamW
• Learning rate: 5 × 10^-7
• Effective global batch size: 32 (per-device batch size 16, gradient accumulation 2)
• Group size: 8 sampled candidate generations per input prompt
• Max prompt length: 256 tokens
• Max completion length: 256 tokens
• Sampling temperature: 0.8
• Nucleus sampling: top-p = 0.9
A.4 Rationale for unified configuration
Across SFT and RL, we keep the base model,
LoRA configuration, precision, and batch scale
as aligned as possible. This reduces confounding
factors and supports attributing performance differ-
ences to the alignment paradigm itself.
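To make the GRPO settings in A.3 more concrete, the following is a minimal sketch of how one group of 8 sampled candidates could be scored and turned into group-relative advantages. It is an illustration under stated assumptions, not the training code used in this work: the script ranges used for the language check, the gating rule in `gated_reward`, the use of the population standard deviation, the `eps` term, and the example similarity values are all hypothetical choices. Only the group size and the idea of normalizing rewards within a group come from the text and from GRPO (Shao et al., 2024).

```python
import statistics

# Assumed script ranges: Unicode Tibetan block and the basic CJK ideograph block.
TIBETAN = ("\u0f00", "\u0fff")
CJK = ("\u4e00", "\u9fff")


def in_range(ch, bounds):
    return bounds[0] <= ch <= bounds[1]


def language_consistent(text, target=CJK, forbidden=TIBETAN):
    """True if the text contains target-script characters and none from the other script."""
    has_target = any(in_range(ch, target) for ch in text)
    has_forbidden = any(in_range(ch, forbidden) for ch in text)
    return has_target and not has_forbidden


def gated_reward(similarity, text):
    """Hypothetical combination: keep the embedding similarity only for consistent outputs."""
    return similarity if language_consistent(text) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, eight sampled translations into Chinese; similarities are made-up numbers.
candidates = [
    ("今天天气很好", 0.82),
    ("今天的天气不错", 0.78),
    ("天气好", 0.61),
    ("今天天气 \u0f42\u0f53\u0f58", 0.70),  # mixed Chinese and Tibetan output
    ("今日天气晴朗", 0.80),
    ("天气很好今天", 0.74),
    ("今天很好", 0.55),
    ("天气", 0.40),
]
rewards = [gated_reward(sim, text) for text, sim in candidates]
advantages = group_relative_advantages(rewards)
```

In this toy group the mixed-script candidate receives zero reward and therefore the most negative advantage, which is one simple way a language-consistency signal can steer sampling back toward the target language.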
B LLM-judge Prompt
B.1 LLM-judge Prompt for Headline
Generation Task
For the Tibetan headline generation task (HG), we
used the prompt in Figure 2 to evaluate two candidate
headlines generated for a Tibetan news article. The
evaluation was conducted using GPT-5.2 as the
LLM-judge.
B.2 LLM-judge Prompt for Machine
Translation Task
For the Tibetan-Chinese machine translation task
(MT), we used the prompt in Figure 3 to evaluate
two candidate translations. This evaluation was
also conducted using GPT-5.2 as the LLM-judge.
In both tasks, the evaluation was conducted blind: the
judge saw only the anonymized candidates (Candidate 1
and Candidate 2), with no indication of which model
produced each, to reduce bias in the comparison.
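As an illustration of how judge replies in the format of Figures 2 and 3 could be turned into win rates, the sketch below extracts the `[[1]]`, `[[2]]`, or `[[0]]` decision and tallies percentages. The helper names, the choice to read the last decision marker in a reply, and the choice to keep ties in the denominator are assumptions made for this example; the paper does not specify these details.

```python
import re
from collections import Counter

DECISION = re.compile(r"\[\[([012])\]\]")


def parse_decision(reply):
    """Return 1, 2, or 0 (tie) from a judge reply, or None if no decision marker is found."""
    matches = DECISION.findall(reply)
    return int(matches[-1]) if matches else None


def win_rates(decisions):
    """Percentage of Candidate 1 wins, Candidate 2 wins, and ties among parsed decisions."""
    counts = Counter(d for d in decisions if d is not None)
    total = sum(counts.values())
    if total == 0:
        return {1: 0.0, 2: 0.0, 0: 0.0}
    return {key: 100.0 * counts[key] / total for key in (1, 2, 0)}


# Made-up replies that follow the "Reason / Decision" output format of the prompts.
replies = [
    "Reason: Candidate 1 is more faithful.\nDecision: [[1]]",
    "Reason: Both miss the main event.\nDecision: [[0]]",
    "Reason: Candidate 2 is more concise.\nDecision: [[2]]",
    "Reason: Candidate 1 covers the key fact.\nDecision: [[1]]",
]
rates = win_rates([parse_decision(reply) for reply in replies])
```

Keeping ties in the denominator means the two win percentages need not sum to 100, which matches the shape of the win rates reported in Table 3.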
C OOD Log-Likelihood and KL
Divergence Analysis
To complement the downstream evaluation in the
main text, we further analyze forgetting from
a more mechanistic perspective using an out-of-
distribution (OOD) evaluation set. Specifically,
we use passages from the Chinese Machine Read-
ing Comprehension benchmark (CMRC) as a fixed
OOD corpus without supervision signals, repre-
senting general language understanding ability out-
side the low-resource language-expansion training
distribution. On this set, we evaluate token-level
negative log-likelihood (NLL) and KL divergence
relative to the base model. These measurements
help characterize how different training strategies
affect distributional drift beyond downstream task
metrics.
LLM Judge Prompt for headline generation
You are an expert linguist specializing in Tibetan-Chinese journalism.
Your task is to evaluate two candidate headlines (Candidate 1 and Candidate 2) generated for a
Tibetan news article.
### Article:
[Source Text]:
{src}
[Candidate 1]:
{cand_1}
[Candidate 2]:
{cand_2}
### Task:
Compare the two candidates.
- If Candidate 1 is significantly better, output: [[1]]
- If Candidate 2 is significantly better, output: [[2]]
- If both are equally good or bad, output: [[0]]
Provide a brief reason (in Chinese) before your decision.
### Output Format:
Reason: <brief explanation>
Decision: [[1]] or [[2]] or [[0]]
Figure 2: Prompt for headline generation evaluation.
OOD token-level negative log-likelihood and KL
divergence. We first compare token-level NLL
on the CMRC OOD set. Lower NLL indicates
that the adapted model remains closer to the base
model's general language modeling behavior on
unseen out-of-domain data. We then examine KL
divergence between the adapted models and the
base model on the same OOD set, which provides
a complementary view of distributional drift.
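The two quantities can be sketched for a single token position as follows. This is a toy illustration with a four-token vocabulary and made-up probabilities, not the evaluation code; the direction KL(adapted ∥ base) is assumed from the row labels of Table 6(b), and the function names are hypothetical.

```python
import math


def token_nll(probs, target_index):
    """Negative log-likelihood of the observed token under one next-token distribution."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Next-token distributions at one position; numbers are illustrative only.
base = [0.70, 0.20, 0.05, 0.05]
adapted = [0.60, 0.25, 0.10, 0.05]
observed = 0  # index of the token that actually appears in the OOD passage

nll_base = token_nll(base, observed)
nll_adapted = token_nll(adapted, observed)
kl_adapted_base = kl_divergence(adapted, base)
```

Here the adapted model assigns less probability to the observed token, so its NLL is higher than the base model's, and the KL term measures how far the whole distribution has moved regardless of which token was observed.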
As shown in Table 6(a), RL leads to substantially
smaller degradation on OOD data than SFT. Com-
pared to the base model, RL increases mean NLL
by only +0.24, whereas SFT increases it by +0.64.
The difference is even more pronounced in the tail:
the 90th-percentile NLL rises by +0.62 under RL
but by +1.43 under SFT. This suggests that forget-
ting under SFT disproportionately affects harder
OOD examples, while semantic RL preserves more
stable behavior across the distribution.
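The increases quoted above can be recomputed directly from the mean and 90th-percentile columns of Table 6(a). The short check below copies those values from the table; the dictionary layout and the helper name are only for illustration.

```python
# Mean and P90 NLL values copied from Table 6(a).
table_6a = {
    "Base Model": {"mean": 2.6097, "p90": 5.8844},
    "RL (final)": {"mean": 2.8533, "p90": 6.5000},
    "SFT (final)": {"mean": 3.2504, "p90": 7.3188},
}


def increase_over_base(model, stat):
    """Absolute NLL increase of a model over the base model, rounded to two decimals."""
    return round(table_6a[model][stat] - table_6a["Base Model"][stat], 2)


deltas = {
    model: {stat: increase_over_base(model, stat) for stat in ("mean", "p90")}
    for model in ("RL (final)", "SFT (final)")
}
```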
Table 6(b) shows that RL and SFT have com-
parable mean KL divergence to the base model,
indicating that the overall magnitude of adaptation
is similar. However, RL yields consistently lower
median and tail KL values than SFT. In particular,
the 90th-percentile KL is lower for RL (0.0839)
than for both cold-start SFT (0.0912) and final SFT
(0.0932). This suggests that semantic RL does not
merely reduce the total amount of learning, but
instead leads to more uniform distributional adapta-
tion and avoids localized large shifts that are more
characteristic of catastrophic forgetting.
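To illustrate why the mean alone can hide differences in how drift is distributed, the sketch below computes the same four summary statistics used in Table 6 for two synthetic samples with equal means. The data are invented, and the percentile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since the paper does not state which convention was used.

```python
import statistics


def summarize(values):
    """Mean, median, 10th and 90th percentiles of per-token values."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "p10": deciles[0],
        "p90": deciles[-1],
    }


# Two synthetic per-token KL samples with the same mean but different spread.
uniform_drift = [0.04] * 10
localized_drift = [0.01] * 8 + [0.16, 0.16]

uniform_stats = summarize(uniform_drift)
localized_stats = summarize(localized_drift)
```

Both samples have a mean of about 0.04, yet the second concentrates its drift in a few positions, which shows up as a lower median and a higher 90th percentile.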
Summary. Taken together, the OOD NLL and
KL analyses provide complementary mechanistic
evidence for the main findings in the paper. Com-
pared with supervised fine-tuning, semantic-reward
RL induces smaller degradation in token-level like-
lihood on unseen OOD data, while also producing
more controlled and less heavy-tailed divergence
from the base model. These observations support
our interpretation that semantic RL mitigates align-
ment tax not by suppressing learning altogether, but
LLM Judge Prompt for machine translation
You are an expert linguist specializing in Tibetan-Chinese translation.
Your task is to evaluate two candidate translations (Candidate 1 and Candidate 2) for a Tibetan text.
### Article:
[Source Text]:
{src}
[Candidate 1]:
{cand_1}
[Candidate 2]:
{cand_2}
### Task:
Compare the two translations.
- If Candidate 1 is significantly better, output: [[1]]
- If Candidate 2 is significantly better, output: [[2]]
- If both are equally good or bad, output: [[0]]
Provide a brief reason (in English) before your decision.
### Output Format:
Reason: <brief explanation>
Decision: [[1]] or [[2]] or [[0]]
Figure 3: Prompt for machine translation evaluation.
by encouraging more uniform and less destructive
adaptation.
(a) OOD token-level negative log-likelihood on the fixed CMRC evaluation set
| Model | Mean NLL | Median NLL | P10 NLL | P90 NLL |
|---|---|---|---|---|
| Base Model | 2.6097 | 1.4285 | 0.3636 | 5.8844 |
| RL (final) | 2.8533 | 1.5543 | 0.3848 | 6.5000 |
| SFT (final) | 3.2504 | 1.7390 | 0.4636 | 7.3188 |
(b) KL divergence to the base model on the fixed CMRC OOD set
| Model | Mean KL | Median KL | P10 KL | P90 KL |
|---|---|---|---|---|
| SFT (cold-start) ∥ Base | 0.0471 | 0.0099 | 0.0000 | 0.0912 |
| SFT (final) ∥ Base | 0.0404 | 0.0113 | 0.0000 | 0.0932 |
| RL (final) ∥ Base | 0.0410 | 0.0073 | 0.0000 | 0.0839 |
Table 6: Mechanistic OOD analysis on a fixed CMRC evaluation set. Panel (a) reports token-level negative log-likelihood (NLL), where lower values indicate better preservation of the base model's out-of-domain language modeling behavior. Panel (b) reports KL divergence to the base model, characterizing distributional drift after adaptation.