Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Summary
This study benchmarks the adaptation of three open-weight 7–8B language models—Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B—to Romanized Nepali, a Latin-script variant of the Nepali language used in informal digital communication. All models failed to generate Romanized Nepali in zero-shot settings, each exhibiting distinct failure modes linked to their tokenization strategies. After fine-tuning with QLoRA and rsLoRA on a 10,000-sample bilingual dataset, all models achieved BERTScore ≈ 0.75 and chrF++ > 23. Qwen3-8B was recommended as the overall best, leading in structural alignment metrics and producing zero residual errors in a qualitative case study. Llama-3.1-8B showed the largest absolute gains in PPL and BERTScore, supporting the adaptation headroom hypothesis: models with weaker zero-shot baselines can achieve greater improvement through fine-tuning. Mistral-7B-v0.1 had the lowest perplexity but retained semantic errors. The study establishes a reproducible baseline for Romanized Nepali adaptation, highlighting the effectiveness of parameter-efficient fine-tuning in low-resource settings.
BENCHMARKING LINGUISTIC ADAPTATION IN COMPARABLE-SIZED LLMs: A STUDY OF LLAMA-3.1-8B, MISTRAL-7B-v0.1, AND QWEN3-8B ON ROMANIZED NEPALI
Ananda Rimal, Dept. of Computer Science & Engineering, Nepal Engineering College, anandr022342@nec.edu.np
Adarsha Rimal, Central Dept. of CS and IT, Tribhuvan University, adarsharimal07@gmail.com
ABSTRACT
Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically under-resourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) [1] with Rank-Stabilized LoRA (rsLoRA) [2] at rank r = 32 on dual NVIDIA Tesla T4 GPUs, training only ≈ 1% of each model’s parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore ≈ 0.75 and chrF++ > 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (∆ = −49.77)
and BERTScore (∆ = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
1 INTRODUCTION
The proliferation of social media and mobile messaging platforms has established Romanized Nepali, the Nepali language written using the Latin alphabet, as the de facto standard for informal digital discourse in Nepal. Despite its ubiquity, Romanized Nepali remains significantly under-resourced in Natural Language Processing. Unlike formal Devanagari script, Romanized Nepali lacks a standardized orthography, leading to extreme phonetic variation where a single word may be transliterated in multiple ways, such as “khana” and “khaana” [3].
Nepali is the official language of Nepal, spoken by approximately 44.86% of the population [4]. The language is traditionally written in Devanagari script, comprising 36 consonants and 13 vowels. Current LLMs are predominantly trained on standardized scripts and formal datasets, leaving transliterated variants in a “linguistic shadow” [5]. This gap results in poor performance on downstream tasks such as sentiment analysis, machine translation, and conversational AI in the Nepali digital context.
This paper investigates the robustness and adaptation capabilities of three comparable-sized LLMs: Llama-3.1-8B [6], Mistral-7B-v0.1 [7], and Qwen3-8B [8]. We focus on this size class due to its increasing importance for on-device deployment and resource-efficient fine-tuning in low-resource environments. Through a
systematic benchmark, we evaluate how different architectures handle the phonetic noise and non-standard syntax inherent in Romanized Nepali, providing a foundation for future generative AI applications for the Nepali diaspora. To the best of our knowledge, this is the first comprehensive study focusing specifically on the transliteration resilience of comparable-sized LLMs on Romanized Nepali script.
2 LITERATURE REVIEW
2.1 Transformer-Based Language Models
The transformer architecture introduced by Vaswani et al. [9] marked a fundamental shift in NLP by replacing recurrence with self-attention, enabling efficient parallel processing and long-range dependency modeling. This underpins virtually all modern LLMs. Brown et al. [10] demonstrated that scaling transformer models to billions of parameters yields emergent few-shot and zero-shot generalization, establishing the paradigm of large-scale pre-training followed by lightweight task-specific adaptation that all three model families in this study follow.
2.2 Open-Weight Models in the 7–8B Class
Recent open-weight LLMs have made high-performance language modeling accessible for low-resource research. Touvron et al. [6] introduced the LLaMA family, demonstrating competitive performance at 7–65B parameters through careful data curation. Jiang et al. [7] released Mistral 7B, incorporating grouped-query and sliding window attention for improved inference efficiency. Bai et al. [8] presented the Qwen series with broader multilingual pre-training coverage, providing stronger cross-lingual transfer to non-English scripts. These three model families form the subject of the present study due to their comparable
size (≈7–8B parameters), open-weight availability, and suitability for on-device deployment.
2.3 Nepali NLP: History and Current State
Natural Language Processing for Nepali has a relatively short but accelerating history. Early work focused on rule-based and statistical approaches to morphological analysis and part-of-speech tagging, motivated by Nepali’s highly agglutinative morphology and complex sandhi rules that complicate standard tokenization [3]. Bam and Shahi [11] demonstrated that even basic preprocessing such as stopword removal and stemming requires language-specific adaptation, as standard multilingual tools fail to account for Devanagari’s orthographic properties. Koirala and Niraula [12] introduced word embedding models for Nepali, establishing distributional semantic resources for formal text but explicitly noting that Romanized Nepali remains outside their coverage. Timilsina et al. [13] adapted multilingual BERT via continued pre-training on a Devanagari corpus to produce NepBERTa, showing strong performance on formal NLP tasks. Most recently, Pudasaini et al. [14] developed NepaliGPT, the first autoregressive generative model for Nepali, again focused exclusively on Devanagari. Across this body of work, informal Romanized Nepali has remained consistently unaddressed; this is the principal gap this study targets.
2.4 Tokenization for Low-Resource and Romanized Scripts
Tokenization is a foundational bottleneck for LLM performance on low-resource languages. Rust et al. [15] conducted a systematic evaluation of multilingual BERT tokenizers across 99 languages, demonstrating that tokenizer quality is a strong predictor of
downstream task performance. Languages with high subword fertility experience longer effective sequence lengths, reduced context window utility, and higher inference cost [15]. Mielke et al. [16] showed that agglutinative languages and those with non-Latin scripts are disproportionately harmed by BPE-based vocabularies trained on predominantly English corpora. Romanized Nepali faces a compounded version of this problem: it is neither standard English nor Devanagari, placing it in a tokenization dead zone between the two primary training distributions of all three studied models. Kudo and Richardson [17] introduced SentencePiece, a language-agnostic subword tokenizer that treats text as a raw character sequence without whitespace assumptions, giving it better generalization to unseen scripts. Mistral-7B-v0.1 and Qwen3-8B both use SentencePiece [7, 8], while Llama-3.1-8B uses Tiktoken [6], a BPE implementation optimized for English-heavy data. The zero-shot failure analysis in this study directly reflects these tokenizer design differences.
2.5 Parameter-Efficient Fine-Tuning
Full fine-tuning of LLMs is computationally prohibitive in low-resource settings. Hu et al. [18] introduced LoRA, which injects trainable low-rank matrices into frozen transformer weights, reducing trainable parameters to under 1% of the base model. Dettmers et al. [1] extended this with QLoRA, combining 4-bit NF4 quantization with LoRA adapter training, enabling fine-tuning on consumer-grade hardware. Kalajdzievski [2] identified rank-dependent scaling instability in standard LoRA and proposed rsLoRA, replacing the α/r scaling factor with α/√r to stabilize training at r = 32,
the rank used in this work. Instruction fine-tuning [19] has been shown to be particularly effective for generalization to new task formats, training models to follow structured Instruction-Input-Output templates. The Alpaca dataset format [20] used in this study is a widely adopted instantiation of this approach.
2.6 Evaluation Metrics for Low-Resource Generation
Papineni et al. [21] introduced BLEU, the dominant n-gram precision metric for machine translation, though its reliance on exact surface-form matches makes it poorly suited to non-standardized scripts [22]. Popović [23] proposed chrF, a character n-gram F-score more robust to spelling variation. Zhang et al. [24] introduced BERTScore, evaluating semantic similarity via contextual embeddings. Perplexity, grounded in information theory [25], measures predictive confidence and serves as the primary checkpoint selection criterion in this work. Lin [26] proposed ROUGE-L for structural alignment via longest common subsequence matching.
2.7 Research Gap
Across the reviewed literature, three gaps are evident. First, all existing Nepali NLP work targets the formal Devanagari script, leaving informal Romanized Nepali unaddressed [3, 13, 14]. Second, while tokenizer quality is known to predict downstream performance for low-resource scripts [15, 16], no study has compared tokenizer design across competing comparable-sized LLM architectures for a non-standardized transliterated script. Third, no prior work has benchmarked the adaptation of comparable-sized open-weight LLMs to Romanized Nepali under a rigorous multi-metric framework. This study addresses all three gaps.
3 METHODOLOGY
This section describes the complete experimental pipeline consisting of seven sequential stages. Stage 1 covers source corpus selection from an existing Devanagari instruction-following dataset. Stage 2 applies bilingual transformation: the first 5,000 samples undergo selective English instruction translation while retaining Romanized Nepali Input and Output fields, and the remaining 5,000 samples undergo full phonetic transliteration of all three Alpaca fields to Romanized Nepali. Stage 3 partitions the transformed 10,000-sample corpus into a 9,000-sample training set and a 1,000-sample held-out test set. Stage 4 applies parameter-efficient fine-tuning via QLoRA [1] with rsLoRA [2] at rank r = 32 on dual NVIDIA Tesla T4 GPUs across all three architectures. Stage 5 conducts dual-stage evaluation comparing zero-shot and fine-tuned performance on the held-out test set. Stage 6 scores all outputs across five metrics spanning seven measurement dimensions: PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. Stage 7 performs qualitative case study analysis across 10 Golden Questions. Figure 1 provides a visual overview of the full pipeline.
Figure 1. Experimental pipeline: data preparation, parameter-efficient fine-tuning, and evaluation across three model architectures.
3.1 Dataset Construction and Transliteration
3.1.1 Source Corpus Selection
The primary data source is the Saugatkafley/alpaca-nepali-sft dataset [27], an instruction-following corpus originally in Devanagari script containing approximately 52,000 samples. We extracted a curated subset of 10,000 samples, maximising diversity across instruction types including logical reasoning, creative writing, and factual Q&A. Sequences exceeding 512 tokens were truncated for computational efficiency.
3.1.2 Bilingual Transformation Pipeline
All 10,000 samples originate in Devanagari and undergo one of two transformation paths depending on their partition assignment, as illustrated in Stage 2 of Figure 1:
1. Semantic Translation (5,000 samples). For the first 5,000 samples, only the Instruction field was translated from Devanagari into English using the Google Translate engine [28]. The corresponding Input and Output fields were converted to Romanized Nepali via phonetic transliteration using the Indic-Transliteration library [29]. This asymmetric design enables the model to learn the mapping between high-resource English prompting and low-resource Romanized Nepali responses, improving cross-lingual instruction following without sacrificing output-script consistency.
2. Full Phonetic Transliteration (5,000 samples). For the remaining 5,000 samples, all three Alpaca fields (Instruction, Input, and Output) were converted from Devanagari to Romanized Nepali using the Indic-Transliteration library [29]. Phonetic variations were intentionally permitted (e.g., mapping ‘chha’ to both ‘cha’ and ‘chha’) to replicate the non-standardized orthography of real-world digital Nepali communication.
The resulting dataset contains a 50/50 split of English-instructed and Romanized-instructed samples, with all Output fields consistently in Romanized Nepali across both partitions. All samples follow the Alpaca instruction-following schema [20] (Instruction, Input, Output).
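To make the transformation concrete, the sketch below illustrates the Stage 2 Devanagari-to-Romanized step under stated assumptions: it uses the Indic-Transliteration package cited as [29], but the target romanization scheme (ITRANS), the lower-casing, and the variant-injection rule are illustrative choices rather than the authors' exact pipeline.

```python
# Illustrative sketch of the Stage 2 Devanagari -> Romanized Nepali step.
# Assumptions: indic_transliteration [29]; ITRANS as the Latin target scheme and
# the variant-swap rule below are hypothetical, not the paper's exact settings.
import random
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def romanize(devanagari_text: str) -> str:
    """Convert a Devanagari string to a lower-cased Latin-script approximation."""
    return transliterate(devanagari_text, sanscript.DEVANAGARI, sanscript.ITRANS).lower()

def add_phonetic_variation(text: str, p: float = 0.3) -> str:
    """Randomly swap phonetically equivalent spellings (e.g. 'chha' vs 'cha')
    to mimic the non-standardized orthography described above."""
    for old, new in [("chha", "cha"), ("aa", "a")]:
        if random.random() < p:
            text = text.replace(old, new)
    return text

# Full phonetic transliteration path (partition 2): all three Alpaca fields.
sample = {"instruction": "नेपालको राजधानी के हो?", "input": "", "output": "नेपालको राजधानी काठमाडौं हो।"}
romanized = {k: add_phonetic_variation(romanize(v)) if v else v for k, v in sample.items()}
print(romanized["output"])
```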
3.1.3 Train/Test Partitioning
Following bilingual transformation, the full 10,000-sample corpus was partitioned into a 9,000-sample training set and a 1,000-sample held-out test set (the “1k test set”). The test set was sequestered before any training decision to prevent data leakage and is used exclusively for evaluation in Stages 5–7 of the pipeline.
3.2 Parameter-Efficient Fine-Tuning with QLoRA
To adapt all three models within the constraints of dual NVIDIA Tesla T4 GPUs (16 GB VRAM each), we employed QLoRA [1] accelerated via the Unsloth framework [30]. The fine-tuning stack combines 4-bit NF4 quantization, low-rank adapter injection, and rank-stabilized scaling.
3.2.1 4-bit NormalFloat (NF4) Quantization
All base model weights were frozen and compressed to 4-bit NF4 precision [1]. NF4 is an information-theoretically optimal 4-bit data type whose 2^4 = 16 quantization levels are non-uniformly spaced to minimise expected quantization error under a standard normal prior N(0, 1), unlike uniformly spaced INT4. Quantization is applied block-wise:

$$W_{\mathrm{NF4}} = \mathrm{quantize}_{\mathrm{NF4}}\!\left(\frac{W}{\max\lvert W \rvert}\right) \qquad (1)$$
Per-block scaling constants are stored in 32-bit float for dequantization at inference. Base weights receive no
gradient updates; only adapter matrices are trained.
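For reference, the sketch below shows a 4-bit NF4 loading configuration equivalent to the setup just described, expressed with Hugging Face Transformers and bitsandbytes; the paper itself loads models through Unsloth [30], so these calls and the model identifier are assumptions, not the authors' code.

```python
# Sketch of 4-bit NF4 loading (frozen base weights) via transformers + bitsandbytes.
# The model id and dtype choices are assumptions; the paper uses the Unsloth wrapper [30].
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # compress frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4: levels matched to an N(0, 1) prior
    bnb_4bit_compute_dtype=torch.float16,  # de-quantized matmuls run in fp16 on Tesla T4s
    bnb_4bit_use_double_quant=False,       # keep per-block scaling constants in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",             # same pattern applies to Mistral-7B-v0.1 and Qwen3-8B
    quantization_config=bnb_config,
    device_map="auto",
)
```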
3.2.2 Low-Rank Adapter Injection
Following Hu et al. [18], trainable low-rank matrices are injected alongside each target weight. For a frozen
weight $W_{\mathrm{NF4}} \in \mathbb{R}^{d \times d}$, the adapted forward pass becomes:

$$h = W_{\mathrm{NF4}}\,x + \frac{\alpha}{r}\,B A\,x \qquad (2)$$

where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are trainable adapter matrices, $r \ll d$ is the adapter rank, and α is a scaling hyperparameter. At initialisation, $A \sim \mathcal{N}(0, \sigma^2)$ and $B = 0$, ensuring zero adapter contribution at step 0.
Adapters were injected into seven projection layers per transformer block:
Target modules = {q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj}
All other weights remained frozen. We selected r = 32 and α = 64, higher than common defaults of
r ∈ {8, 16} [18], to provide sufficient capacity for the complex mapping from English-centric pre-training to
Romanized Nepali phonetics.
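This adapter injection maps directly onto a standard PEFT configuration. The sketch below is an assumed equivalent of the authors' Unsloth setup, with the stated rank, alpha, and seven target modules; the dropout value is a guess (it is not reported), and the use_rslora flag anticipates the rank-stabilized scaling introduced in the next subsection.

```python
# Adapter configuration matching r = 32, alpha = 64 and the seven projection modules.
# Sketch via the PEFT library; the paper's actual calls go through Unsloth [30].
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,        # assumption: dropout is not reported in the paper
    bias="none",
    task_type="CAUSAL_LM",
    use_rslora=True,         # alpha / sqrt(r) scaling (rsLoRA), see Section 3.2.3
)

model = get_peft_model(model, lora_config)  # 'model' is the 4-bit base from the previous sketch
model.print_trainable_parameters()          # roughly 1% trainable, cf. Table 6
```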
3.2.3 Rank-Stabilized Scaling (rsLoRA)
Standard LoRA’s α/r scaling progressively suppresses adapter influence as rank increases [2]. At r = 32
this is non-trivial. rsLoRA replaces 1/r with 1/√r:
$$h = W_{\mathrm{NF4}}\,x + \frac{\alpha}{\sqrt{r}}\,B A\,x \qquad (3)$$

For our configuration (r = 32, α = 64):

$$\frac{\alpha}{\sqrt{r}} = \frac{64}{\sqrt{32}} \approx 11.31 \quad \text{vs.} \quad \frac{\alpha}{r} = \frac{64}{32} = 2.0 \qquad (4)$$

This 5.7× increase in effective adapter contribution ensures high-rank adapters meaningfully influence the forward pass without gradient instability [2].
3.3 Experimental Protocol
The study follows a structured three-phase protocol: (i) Zero-Shot Baseline Assessment to quantify innate cross-lingual transfer, (ii) Supervised Fine-Tuning (SFT) on a curated bilingual Romanized Nepali instruction corpus using QLoRA with rsLoRA, and (iii) Dual-Stage Evaluation comprising quantitative benchmarking across five metrics spanning seven measurement dimensions and a qualitative case study across 10 Golden Questions. All three phases share the same held-out 1,000-sample test set to prevent data leakage, with all reported metrics reflecting best-checkpoint weights recovered via minimum validation loss.
3.3.1 Phase 1: Zero-Shot Baseline Assessment
Unmodified base weights of all three models were evaluated on the 1k test set with no system prompt, template, or few-shot examples. This zero-shot condition quantifies each model’s innate ability to process Romanized Nepali arising solely from pre-training [10], establishing the performance floor from which all fine-tuning gains ∆ are measured.
3.3.2 Phase 2: Supervised Fine-Tuning (SFT)
All three models were fine-tuned on the 9,000-sample training partition using the SFTTrainer from the TRL library [31] in Instruction – Output format. The QLoRA + rsLoRA configuration from Section 3.2 was applied identically across all architectures so that Phase 3 differences are attributable to architectural properties, not configuration asymmetry.
Optimizer and Learning Rate Schedule. An 8-bit AdamW optimizer [32] was used with a peak learning rate of 1 × 10⁻⁴. A 200-step linear warmup prevented gradient divergence during initial exposure to Romanized text, followed by cosine decay [33].
Batch Configuration. A micro-batch size of 2 per device with 4 gradient accumulation steps yields an effective batch size of:

$$B_{\mathrm{eff}} = B_{\mathrm{device}} \times N_{\mathrm{accum}} \times N_{\mathrm{GPU}} = 2 \times 4 \times 2 = 16 \qquad (5)$$

All models were trained for 3 epochs, corresponding to 3,375 optimizer steps.
Golden Point Checkpoint Recovery. Terminal-epoch overfitting is a documented risk in low-resource PEFT settings [1]. Checkpoints and validation evaluations were synchronised every 100 steps with a rolling window of 10 saved checkpoints. Upon completion, load_best_model_at_end restores the checkpoint of minimum validation loss L_eval. Since

$$\mathrm{PPL} = \exp(L_{\mathrm{eval}}) \qquad (6)$$

minimising L_eval directly optimises the primary metric, ensuring all Phase 3 scores reflect best-checkpoint weights.

Category | Parameter | Value
Optimization | Optimizer | AdamW (8-bit) [32]
Optimization | Learning Rate | 1 × 10⁻⁴
Optimization | Warmup Steps | 200
Batch Logic | Per-Device Batch Size | 2
Batch Logic | Effective Batch Size | 16 (2 × 4 × 2 GPUs)
Hardware | GPUs | Dual NVIDIA Tesla T4 (16 GB each)
Hardware | Quantization | 4-bit NF4 [1]
LoRA Adapters | Rank r | 32
LoRA Adapters | Alpha α | 64
LoRA Adapters | Target Modules | q, k, v, o_proj, gate, up, down_proj
LoRA Adapters | Scaling | rsLoRA (α/√r) [2]
Checkpointing | save_strategy | steps
Checkpointing | save_steps | 100
Checkpointing | save_total_limit | 10
Checkpointing | load_best_model_at_end | True
Checkpointing | metric_for_best_model | eval_loss

Table 1. Training, hardware, adapter, and checkpointing configuration applied identically to all three models.
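A plain Hugging Face rendering of Table 1 is sketched below using TRL's SFTTrainer [31] and standard TrainingArguments; the authors trained through Unsloth, so the argument names, dataset variables, and optimizer string are assumptions chosen to reproduce the reported settings.

```python
# Sketch of the Phase 2 configuration in Table 1 with TRL + transformers.
# train_ds / eval_ds are assumed dataset objects; argument names follow recent library versions.
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="romanized-nepali-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,     # 2 x 4 x 2 GPUs = effective batch size 16 (Eq. 5)
    num_train_epochs=3,                # 3,375 optimizer steps
    learning_rate=1e-4,
    warmup_steps=200,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",            # 8-bit AdamW [32]
    eval_strategy="steps",             # named evaluation_strategy in older transformers releases
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=10,
    load_best_model_at_end=True,       # "Golden Point" recovery of the minimum-eval-loss checkpoint
    metric_for_best_model="eval_loss",
)

trainer = SFTTrainer(
    model=model,                       # quantized base + LoRA adapters from the earlier sketches
    args=args,
    train_dataset=train_ds,            # 9,000-sample training partition (assumed variable)
    eval_dataset=eval_ds,              # validation split used for checkpoint selection (assumed)
)
trainer.train()
```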
3.3.3 Phase 3: Dual-Stage Evaluation
Stage A: Quantitative Benchmarking. All seven measurement dimensions across five metrics (PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, BLEU) were computed on the 1k test set for both zero-shot (Phase 1) and fine-tuned (Phase 2) weights. For each model m and metric M, the absolute fine-tuning gain is:

$$\Delta M_m = M_m^{\mathrm{fine\text{-}tuned}} - M_m^{\mathrm{zero\text{-}shot}} \qquad (7)$$

Stage B: Qualitative Case Study. A manual analysis of 10 Golden Questions spanning factual Q&A, creative writing, logical reasoning, and conversational response was conducted pre- and post-SFT. Full outputs are in Appendix A; analysis is in Section 4.6.

Phase | Name | Method | Output
1 | Zero-Shot | Unmodified weights; raw Romanized input; no prompting | Performance floor across five metrics, seven measurement dimensions
2 | SFT | QLoRA [1] + rsLoRA [2]; 4-bit NF4; 3 epochs; Golden Point recovery | Best-checkpoint adapters per model
3A | Quantitative | 5-metric, 7-dimension scoring; zero-shot vs. fine-tuned; ∆ per metric | Per-model gain table
3B | Qualitative | 10 Golden Questions; pre- and post-SFT | Linguistic failure analysis

Table 2. Summary of the three experimental phases.
3.4 Evaluation Metrics
Five complementary metrics spanning seven measurement dimensions cover fluency, surface similarity, and semantic alignment. ROUGE is reported across three variants (ROUGE-1, ROUGE-2, ROUGE-L) to capture unigram, bigram, and longest common subsequence alignment respectively. Together they capture orthographic, structural, and semantic dimensions of generation quality, a necessary multi-axis framework given the non-standardized orthography of Romanized Nepali, where surface-form variation is high but semantic intent is often preserved [3].
3.4.1 Perplexity (PPL)
PPL measures intrinsic language model fluency [25]. For a tokenized sequence $X = (x_1, \ldots, x_t)$:

$$\mathrm{PPL}(X) = \exp\!\left(-\frac{1}{t}\sum_{i=1}^{t}\log P_\theta(x_i \mid x_{<i})\right) \qquad (8)$$
Lower PPL indicates greater predictive confidence. PPL is the primary metric optimised by the Golden Point
checkpoint strategy, since PPL = exp(L_eval), where L_eval is the average negative log-likelihood over the held-out test set. Minimising L_eval therefore directly minimises PPL.
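In practice this means PPL can be read directly off the evaluation loss logged during training; a minimal sketch, assuming the trainer object from the training sketch in Section 3.3.2:

```python
# Eq. (8) reduces to the exponential of the mean negative log-likelihood: PPL = exp(L_eval).
import math

eval_metrics = trainer.evaluate()            # average NLL over the held-out set
ppl = math.exp(eval_metrics["eval_loss"])
print(f"Perplexity: {ppl:.3f}")              # post-SFT values land around 2.8-3.0 (Table 4)
```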
3.4.2 chrF++
chrF++ [23] computes a character n-gram F-score. It is preferred over word-level metrics such as BLEU
for transliterated scripts because character n-grams are robust to the orthographic variation inherent in
Romanized Nepali, where the same word may be spelled in multiple phonetically equivalent ways [3]:
$$\mathrm{chrF}_\beta = (1+\beta^2)\,\frac{\mathrm{Prec}\cdot\mathrm{Rec}}{\beta^2\cdot\mathrm{Prec} + \mathrm{Rec}} \qquad (9)$$
We set β = 2 to weight recall more heavily, as missing Nepali morphological suffixes are more costly than
spurious ones.
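A scoring sketch with the sacrebleu implementation is shown below; word_order=2 selects the "++" variant and beta=2 matches the recall weighting above, while the example strings are illustrative only.

```python
# chrF++ with sacrebleu: character n-grams plus word bigrams, recall-weighted (beta = 2).
from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2, beta=2)
hypotheses = ["maile khana khaye"]        # model output (illustrative)
references = [["maile khaana khaaye"]]    # reference stream; spelling differs, characters overlap
print(chrf.corpus_score(hypotheses, references).score)
```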
3.4.3 BERTScore
BERTScore [24] evaluates semantic similarity via contextual embeddings, capturing meaning beyond surface
token matching. This makes it particularly valuable for Romanized Nepali, where orthographic variants of
the same word should receive high similarity scores even when exact string matching fails. For reference r
and candidate c with unit-normalized contextual embeddings $\{x_i\}$ and $\{y_j\}$, the recall component is:

$$R_{\mathrm{BERT}} = \frac{1}{|r|}\sum_{x_i \in r}\max_{y_j \in c}\cos(x_i, y_j) \qquad (10)$$

where $\cos(x_i, y_j) = x_i^{\top} y_j / (\lVert x_i \rVert\,\lVert y_j \rVert)$ is the cosine similarity between embedding pairs. The final F1 score combines precision and recall analogously.
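The corresponding scoring call with the bert-score package is sketched below; the exact multilingual backbone is an assumption (the paper reports only a multilingual BERT scorer, whose limited Romanized Nepali exposure is noted in the Limitations section).

```python
# BERTScore: embedding-level similarity, tolerant of orthographic variants.
from bert_score import score

candidates = ["maile khana khaye"]
references = ["maile khaana khaaye"]
P, R, F1 = score(candidates, references, model_type="bert-base-multilingual-cased")
print(F1.mean().item())    # stays high despite the surface-form mismatch, unlike exact n-gram metrics
```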
3.4.4 ROUGE-L
ROUGE-L [26] assesses structural alignment via the Longest Common Subsequence (LCS), capturing fluency and word-order preservation without requiring contiguous n-gram matches:

$$F_{\mathrm{LCS}} = \frac{(1+\beta^2)\,P_{\mathrm{LCS}}\,R_{\mathrm{LCS}}}{\beta^2 P_{\mathrm{LCS}} + R_{\mathrm{LCS}}} \qquad (11)$$

ROUGE-1 (unigram overlap) and ROUGE-2 (bigram overlap) are reported alongside ROUGE-L, together constituting three of the seven measurement dimensions and providing a complete picture of structural alignment at multiple granularities.
3.4.5 BLEU
BLEU [21] is the standard 4-gram precision metric included for compatibility with prior NLP literature:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad w_n = \frac{1}{4} \qquad (12)$$

where BP is the brevity penalty and p_n is the n-gram precision for order n. Given the orthographic variance of Romanized Nepali, BLEU is expected to underestimate true generation quality [22] because it relies on exact surface-form matches. It is therefore interpreted alongside chrF++ and BERTScore [23, 24], which are more robust to spelling variation.
4 RESULTS
4.1 Zero-Shot Baseline Performance
Table 3 reports baseline performance. Qwen3-8B and Mistral-7B achieve comparable perplexity (27.89 and 27.53 respectively), both substantially outperforming Llama-3.1-8B (PPL = 52.79), whose higher perplexity suggests weaker innate cross-lingual transfer [34]. Qwen3-8B leads on semantic alignment with the highest BERTScore (0.5631) and chrF++ (13.10), consistent with its broader multilingual pre-training [8]. Mistral-7B records the lowest BERTScore (0.2327) and chrF++ (4.95), indicating it models token-level distributions competently but fails to produce semantically coherent output without adaptation.

Model | PPL↓ | BERT↑ | BLEU↑ | chrF++↑ | R-1↑ | R-2↑ | R-L↑
Llama-3.1-8B | 52.79 | 0.4224 | 0.0143 | 8.15 | 0.0732 | 0.0311 | 0.0681
Qwen3-8B | 27.89 | 0.5631 | 0.0099 | 13.10 | 0.0570 | 0.0195 | 0.0513
Mistral-7B | 27.53 | 0.2327 | 0.0105 | 4.95 | 0.0309 | 0.0134 | 0.0281

Table 3. Zero-shot baseline performance on the 1k test set. ↓ = lower is better; ↑ = higher is better.
4.2 Fine-Tuned Performance
Following three epochs of QLoRA + rsLoRA SFT with Golden Point recovery, all models converge to perplexity 2.81–3.02 (Table 4). Qwen3-8B achieves the highest chrF++ (27.47) and ROUGE scores. Llama-3.1-8B records the highest BERTScore (0.7511) [24]. Mistral-7B achieves the lowest perplexity (2.812) but trails on surface metrics.

Model | PPL↓ | BERT↑ | BLEU↑ | chrF++↑ | R-1↑ | R-2↑ | R-L↑
Llama-3.1-8B | 3.024 | 0.7511 | 0.0498 | 26.97 | 0.2756 | 0.1002 | 0.2359
Qwen3-8B | 2.946 | 0.7505 | 0.0550 | 27.47 | 0.2915 | 0.1162 | 0.2511
Mistral-7B | 2.812 | 0.7339 | 0.0404 | 23.95 | 0.2460 | 0.0842 | 0.2144

Table 4. Fine-tuned performance after QLoRA + rsLoRA SFT with Golden Point recovery.
4.3 Fine-Tuning Gain Analysis
Table 5 reports the absolute fine-tuning gain (∆ = fine-tuned − zero-shot) for each model across five metrics spanning seven measurement dimensions. Llama-3.1-8B records the largest PPL reduction (−49.77) despite its weakest zero-shot baseline, confirming the adaptation headroom hypothesis [1, 18]. Mistral-7B-v0.1 achieves the largest BERTScore gain (+0.5012) and chrF++ gain (+19.00) from its lowest zero-shot semantic baseline, a direct consequence of its near-zero semantic starting point amplifying absolute gain magnitude. Qwen3-8B leads on all three structural alignment metrics (ROUGE-1: +0.2345, ROUGE-2: +0.0967, ROUGE-L: +0.1998) and BLEU (+0.0451), owing to its stronger multilingual pre-training coverage [8].
Model | ∆PPL↓ | ∆BERT↑ | ∆BLEU↑ | ∆chrF++↑ | ∆R-1↑ | ∆R-2↑ | ∆R-L↑
Llama-3.1-8B | −49.77 | +0.3287 | +0.0355 | +18.82 | +0.2024 | +0.0691 | +0.1678
Qwen3-8B | −24.95 | +0.1874 | +0.0451 | +14.37 | +0.2345 | +0.0967 | +0.1998
Mistral-7B | −24.72 | +0.5012 | +0.0299 | +19.00 | +0.2151 | +0.0708 | +0.1863

Table 5. Absolute fine-tuning gain (∆ = fine-tuned − zero-shot) across five metrics spanning seven measurement dimensions. ↓ = lower is better; ↑ = higher is better.
4.4 Training and Validation Loss Analysis
Training and validation loss curves for all three models over 3 epochs (3,375 steps) are presented individually below. Checkpoints were saved every 100 steps and the best checkpoint was recovered using load_best_model_at_end based on minimum validation loss [1].
Figure 2. Training and validation loss for Llama-3.1-8B over 3,375 steps. The green dotted line marks the best checkpoint at step 2,200 (minimum validation loss = 1.1585).
Llama-3.1-8B exhibits the highest initial training loss (≈ 2.10) among all three models, consistent with its Tiktoken tokenizer over-fragmenting Romanized Nepali into low-frequency subword units [15]. Validation loss decreases steadily through Epochs 1 and 2, reaching a global minimum of 1.1585 at step 2,200 (end of Epoch 2). At the Epoch 3 boundary (step 2,200–2,300) a sharp spike in validation loss to ≈ 1.28 is observed, caused by the learning rate schedule resetting. The Golden Point recovery mechanism correctly identifies step 2,200 as the optimal checkpoint, preventing the Epoch 3 degradation from affecting the final reported metrics.
Figure 3. Training and validation loss for Mistral-7B-v0.1 over 3,375 steps. The green dotted line marks the best checkpoint at step 2,200 (minimum validation loss = 1.0930).
Mistral-7B-v0.1 begins with the lowest initial training loss (≈ 1.90) of the three models, reflecting its SentencePiece tokenizer producing more coherent subword units for Latin-script input than Tiktoken. The validation curve is the smoothest of all three architectures, declining consistently across both Epochs 1 and 2 with minimal oscillation. The global validation minimum of 1.0930 is achieved at step 2,200, the lowest among all three models. An epoch-boundary spike to ≈ 1.17 is visible at step 2,300 before Epoch 3 stabilizes, again confirming that Golden Point recovery at step 2,200 yields the best generalization state.
Figure 4. Training and validation loss for Qwen3-8B over 3,375 steps. The green dotted line marks the best checkpoint at step 2,200 (minimum validation loss = 1.1313).
Qwen3-8B starts with an intermediate initial training loss (≈ 2.20) and converges steadily through Epochs 1 and 2. The validation loss reaches its global minimum of 1.1313 at step 2,200. Notably, Qwen3-8B’s Epoch 3 validation loss plateaus at ≈ 1.22 rather than recovering toward the Epoch 2 minimum, suggesting the adapter reached a capacity ceiling earlier than the other two models, consistent with Qwen3-8B’s stronger multilingual pre-training baseline reducing available adaptation headroom [8]. The Epoch 3 spike is present but less pronounced than in Llama-3.1-8B.
Cross-Model Summary. All three models achieve their global validation minimum at step 2,200, empirically confirming that Golden Point checkpoint recovery was necessary and effective across all architectures. Mistral-7B-v0.1 achieves the lowest
absolute validation loss (1.0930), followed by Qwen3-8B (1.1313) and Llama-3.1-8B (1.1585). However, as shown in Tables 4 and 5, lower validation loss does not directly predict superior downstream metric performance: Llama-3.1-8B achieves the highest BERTScore (0.7511) despite its highest validation loss, supporting the adaptation headroom hypothesis discussed in Section 5.2.
4.5 Trainable Parameters and Training Cost

Property | Llama-3.1-8B | Qwen3-8B | Mistral-7B-v0.1
Model Scale: Total Parameters | 8,114,147,328 | 8,278,029,312 | 7,325,618,176
Model Scale: Trainable Parameters | 83,886,080 | 87,293,952 | 83,886,080
Model Scale: Trainable % | 1.03% | 1.05% | 1.15%
Training Cost: Wall-Clock Time | 8h 50m 51s | 8h 26m 01s | 9h 09m 31s
Training Cost: Total GPU Time | 26h 26m 23s (sum of all three runs) | |

Table 6. Trainable parameter counts and wall-clock training time from Unsloth [30] runtime logs. All models: 3 epochs / 3,375 steps, rank r = 32 [18], dual NVIDIA Tesla T4.
4.6 Qualitative Case Study
Full outputs for all 10 Golden Questions are in Appendix A (Tables 9–14).
Zero-Shot Failure Signatures. Three distinct architecture-specific failures emerge [3]. Llama-3.1-8B produced [no response] on 6 of 10 instructions, indicating Tiktoken over-fragmented Romanized Nepali into low-frequency subword units [15]. Mistral-7B-v0.1 produced [empty line] on 5 of 10 and responded off-task in English on the remaining 5. Qwen3-8B was the only model to produce non-empty output on all 10, yet exclusively in English or Devanagari, reflecting misdirected cross-lingual transfer [34].
Post-SFT Resolution. Following fine-tuning [1], all three models produced Romanized Nepali on all 10 instructions.
Llama-3.1-8B and Qwen3-8B generated fluent, semantically complete responses. Mistral-7B-v0.1 retained two residual semantic errors, consistent with its lower chrF++ (23.95) and ROUGE-L (0.2144) [23, 26].
Failure Taxonomy Summary. For Orthographic Robustness, all fine-tuned models correctly resolved phonetically variant inputs. For Morphological Integrity, Llama-3.1-8B and Qwen3-8B correctly generated Nepali verb endings and postpositions, while Mistral occasionally dropped them. For Latent Romanization Drift, all fine-tuned models maintained Romanized Nepali without drifting into English syntax [34].
5 DISCUSSION
5.1 Base Model Structural Collapse
All three base models were functionally unable to respond in Romanized Nepali at zero-shot (Table 7), each exhibiting a distinct failure mode rooted in its architecture and tokenizer design [3, 15]. These failures confirm that Romanized Nepali falls outside the confident generation distribution of all three pre-training regimes, consistent with the “linguistic shadow” framing of Bender et al. [5].

Model | Failure Mode | Technical Description
Llama-3.1-8B | Semantic Void | Early EOS Triggering [6]: Tiktoken over-fragments Romanized Nepali; the model immediately predicts End-of-Sequence, yielding null output on 6 of 10 instructions.
Mistral-7B-v0.1 | Newline Loop | Distributional Collapse [7]: SentencePiece produces plausible fragments but the model fails to initiate a response, defaulting to continuous \n generation.
Qwen3-8B | Script Drift | Latent Script Bias [8]: The model correctly identifies semantic intent but defaults to dominant pre-training scripts rather than the Romanized format.

Table 7. Taxonomy of structural failures in zero-shot base models.
5.2 The Adaptation Headroom Hypothesis
Llama-3.1-8B presents a compelling inversion of the expected performance ordering: despite recording the weakest zero-shot perplexity (PPL 52.79, alongside BERTScore 0.4224 and chrF++ 8.15), it achieves the largest absolute PPL reduction (∆ = −49.77), the highest post-SFT BERTScore (0.7511), and the second-largest chrF++ gain (∆ = +18.82) following fine-tuning. This pattern is not incidental but structural. We term this the adaptation headroom hypothesis: models with weaker innate cross-lingual representations possess greater representational plasticity for script-domain adaptation when exposed to high-quality PEFT data. The intuition is straightforward: a model that has formed no confident prior over Romanized Nepali phoneme sequences has more parameter space available for the adapter to reshape, whereas a model with a strong multilingual prior (such as Qwen3-8B) approaches a representational ceiling sooner, as evidenced by its Epoch 3 validation loss plateauing at ≈ 1.22 rather than recovering toward its Epoch 2 minimum [8].
Formally, for a model m with zero-shot performance floor $M_m^{\mathrm{zero\text{-}shot}}$ and post-SFT ceiling $M_m^{\mathrm{fine\text{-}tuned}}$, the adaptation headroom $H_m$ on metric M is:

$$H_m(M) = M_m^{\mathrm{fine\text{-}tuned}} - M_m^{\mathrm{zero\text{-}shot}} \qquad (13)$$

Under this definition, Llama-3.1-8B maximises H_m(PPL) by a factor of 2.0× over both Qwen3-8B (∆ = −24.95) and Mistral-7B-v0.1 (∆ = −24.72), confirming that the weakest pre-trained baseline produces the steepest adaptation trajectory under identical QLoRA + rsLoRA conditions [1, 2, 18].
This result carries a non-obvious practical implication: a strong zero-shot multilingual baseline does not guarantee the highest fine-tuning return. When the deployment scenario involves a high-quality curated corpus and iterative fine-tuning rounds, adaptation headroom is a more actionable model selection criterion than zero-shot performance. Practitioners who select models solely on zero-shot benchmarks risk systematically overlooking architectures with the highest long-term fine-tuning yield.
5.3 Limitations
Several limitations should be acknowledged. First, the Indic-Transliteration pipeline approximates but does not fully reproduce the orthographic diversity of naturally occurring user-generated Romanized Nepali, potentially underestimating real-world phonetic noise. Second, evaluation is on a single instruction-following dataset [27]; performance on social media text, customer service conversations, or domain-specific content is not assessed. Third, BERTScore uses a multilingual BERT model with limited Romanized Nepali exposure, potentially underestimating semantic similarity for high-quality outputs that diverge in surface form [24]. Fourth, all metrics reflect quantized (4-bit NF4) inference, which introduces a slight precision-performance trade-off whose magnitude on the reported values has not been independently quantified [1].
5.4 Broader Impact and Applications
Romanized Nepali is the de facto language of Nepali social media, messaging platforms, and youth discourse. Current NLP infrastructure, being Devanagari-centric, is effectively inaccessible to this register, creating barriers for content moderation, mental health chatbots, educational tools, and diaspora communication services for the approximately 3 million Nepali speakers abroad [4]. By demonstrating that parameter-efficient adaptation of comparable-sized models achieves BERTScore ≈ 0.75 with under 90M trainable parameters and roughly 26.5 GPU-hours on dual T4 hardware, this work establishes a practically replicable path for low-resource NLP practitioners and Nepali developer communities to build Romanized Nepali AI tools without access to large compute clusters [1, 18].
5.5 Overall Model Assessment
Drawing on the fine-tuned performance (Table 4), absolute gains across five metrics spanning seven measurement dimensions (Table 5), training cost (Table 6), and the qualitative case study (Section 4.6), we consolidate a dimension-wise post-SFT ranking across all three architectures in Table 8. Zero-shot baselines are reported separately in Table 3 and are not repeated here.

Dimension | Llama-3.1-8B | Qwen3-8B | Mistral-7B-v0.1 | Best
Post-SFT Quality (five metrics, seven dimensions) | | | |
Post-SFT PPL↓ | 3rd (3.024) | 2nd (2.946) | 1st (2.812) | Mistral
Post-SFT BERTScore↑ | 1st (0.7511) | 2nd (0.7505) | 3rd (0.7339) | Llama
Post-SFT chrF++↑ | 2nd (26.97) | 1st (27.47) | 3rd (23.95) | Qwen3-8B
Post-SFT ROUGE-1↑ | 2nd (0.2756) | 1st (0.2915) | 3rd (0.2460) | Qwen3-8B
Post-SFT ROUGE-2↑ | 2nd (0.1002) | 1st (0.1162) | 3rd (0.0842) | Qwen3-8B
Post-SFT ROUGE-L↑ | 2nd (0.2359) | 1st (0.2511) | 3rd (0.2144) | Qwen3-8B
Post-SFT BLEU↑ | 2nd (0.0498) | 1st (0.0550) | 3rd (0.0404) | Qwen3-8B
Adaptation & Cost | | | |
Adaptation Gain (∆PPL)↓ | 1st (−49.77) | 2nd (−24.95) | 3rd (−24.72) | Llama
Training Speed | 2nd (8h 51m) | 1st (8h 26m) | 3rd (9h 10m) | Qwen3-8B
Qualitative | | | |
Residual Errors (10 Golden Qs) | 2nd (2 content) | 1st (0 errors) | 3rd (2 semantic) | Qwen3-8B
Dimension wins | Llama: 2 | Qwen: 7 | Mistral: 1 |

Table 8. Dimension-wise post-SFT ranking across ten evaluation criteria.
Recommended Architecture: Qwen3-8B. Qwen3-8B is the recommended architecture for Romanized Nepali deployment across the widest range of practical settings. Post-SFT, it leads every structural alignment and fluency metric: chrF++ (27.47), ROUGE-1 (0.2915), ROUGE-2 (0.1162), ROUGE-L (0.2511), and BLEU (0.0550) [8]. Its qualitative case study produced zero residual errors against two content errors in Llama-3.1-8B and two semantic errors in Mistral-7B-v0.1, demonstrating superior factual accuracy and topic binding. It also converges fastest at 8h 26m on dual NVIDIA Tesla T4 hardware, making it the most resource-efficient choice among the three architectures (Table 6).
Scenario Exception: Iterative Fine-Tuning. Llama-3.1-8B is the preferred architecture when the development pipeline involves iterative data collection and repeated fine-tuning rounds. Despite its weakest zero-shot baseline (PPL 52.79), it achieves the largest absolute PPL reduction (∆PPL = −49.77) and the highest post-SFT BERTScore (0.7511), confirming the adaptation headroom hypothesis: a weaker pre-trained multilingual baseline translates into greater representational plasticity under PEFT [1, 18]. Practitioners building toward a production system over multiple training iterations on a growing Romanized Nepali corpus should consider Llama-3.1-8B the more tractable long-term investment.
Fluency-Only Deployment. Mistral-7B-v0.1 is recommended only for applications where grammatical fluency is the primary requirement and factual precision is secondary. It achieves the lowest post-SFT perplexity (2.812) and the largest BERTScore gain (∆BERTScore = +0.5012) from its near-zero semantic zero-shot baseline, as well as the smoothest validation loss curve across all three architectures (minimum L_eval = 1.0930 at step 2,200). However, it produced the two most severe qualitative failures post-SFT: a semantic hallucination on the Dashain festival task and a single-token non-answer on the categorical reasoning task [7]. Its lower total parameter count (7.3B vs. 8.1–8.3B) offers a marginal inference speed advantage on memory-constrained hardware where perplexity is the dominant deployment criterion.
6 CONCLUSION
This study presents the first systematic benchmarking of comparable-sized open-weight LLMs on Romanized Nepali, addressing a persistent and consequential gap in low-resource NLP for South Asian informal digital communication. We evaluated Llama-3.1-8B [6], Mistral-7B-v0.1 [7], and Qwen3-8B [8] across zero-shot and fine-tuned settings under a rigorous framework spanning five metrics across seven measurement dimensions (PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU), supplemented by an in-depth qualitative case study across 10 Golden Questions spanning factual, technical, grammatical, and creative generation tasks.
Four principal findings emerge from this work. First, all three base models are functionally unable to respond in Romanized Nepali at zero-shot, each exhibiting a distinct architecture-specific failure mode: early EOS triggering in Llama-3.1-8B, distributional collapse in Mistral-7B-v0.1, and latent script bias in Qwen3-8B. These failures collectively confirm that Romanized Nepali lies outside the confident generation distribution of all comparable-sized model pre-training regimes, consistent with the “linguistic shadow” characterisation of under-resourced transliterated scripts [3, 5].
Second, QLoRA [1] combined with rsLoRA [2] at rank r = 32 fully resolves all three failure modes within 3 epochs on 9,000 samples, training only ≈ 1% of each model’s parameters on dual NVIDIA Tesla T4 GPUs in under 27 total GPU-hours. Post-SFT, all three architectures achieve BERTScore ≈ 0.75 and chrF++ > 23, demonstrating that parameter-efficient fine-tuning is both necessary and sufficient for script-domain adaptation at this scale.
Third, the adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline (PPL 52.79), achieves the largest absolute PPL reduction (∆ = −49.77) and the highest post-SFT BERTScore (0.7511), establishing that a weak pre-trained multilingual baseline does not preclude strong fine-tuning returns and may in fact enable greater representational plasticity under PEFT [18]. Separately, Mistral-7B-v0.1 achieves the largest absolute BERTScore gain (∆ = +0.5012) and chrF++ gain (∆ = +19.00) from its near-zero semantic zero-shot baseline, a direct consequence of its lowest semantic starting point amplifying absolute gain magnitude [1].
Fourth, the dimension-wise assessment across ten evaluation criteria (Table 8) identifies Qwen3-8B as the overall recommended architecture for Romanized Nepali deployment, winning seven of ten dimensions. It leads all five structural and fluency metrics post-SFT (chrF++ 27.47, ROUGE-1 0.2915, ROUGE-2 0.1162, ROUGE-L 0.2511, BLEU 0.0550), converges fastest at 8h 26m, and produces zero residual errors across all 10 Golden Questions [8]. Llama-3.1-8B remains the preferred choice for iterative development pipelines where adaptation headroom is the primary selection criterion, while Mistral-7B-v0.1 is suited to fluency-only applications where its lowest post-SFT perplexity (2.812) is the dominant deployment criterion.
These results establish a reproducible, hardware-accessible baseline for Romanized Nepali generative NLP, replicable by practitioners without access to large compute clusters. Future work will pursue: (i) Direct Preference Optimization [35] to align fine-tuned models with colloquial Nepali slang and code-switching patterns prevalent in real-world social media; (ii) formal subword fertility analysis to quantify the per-architecture tokenization burden on Romanized Nepali and guide vocabulary extension strategies [15]; (iii) downstream task evaluation on sentiment analysis, machine translation, and hate speech detection; and (iv) expanded contrastive fine-tuning corpora to address the lexical semantic grounding gaps identified in categorical reasoning tasks.
REFERENCES
[1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[2] D. Kalajdzievski, “A rank stabilization scaling factor for fine-tuning with LoRA,” arXiv preprint arXiv:2312.03732, 2023.
[3] T. B. Shahi and C. Sitaula, “Natural language processing for Nepali text: A review,” Artificial Intelligence Review, vol. 55, no. 5, pp. 3401–3429, 2022.
[4] National Statistics Office (formerly Central Bureau of Statistics), National Population and Housing Census 2021, Government of Nepal, Kathmandu, 2021.
[5] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?,” in Proc. ACM Conference on Fairness, Accountability, and Transparency (FAccT), pp. 610–623, 2021.
[6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[7] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023.
[8] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008, 2017.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
[11] S. Bam and T. B. Shahi, “Named Entity Recognition for Nepali Text Using Support Vector Machines,” in Proc. International Conference on Communication and Information Technology (ICCIT), 2014.
[12] P. Koirala and N. Niraula, “NPVec1: Word embeddings for Nepali — construction and evaluation,” in Proc. 6th Workshop on Representation Learning for NLP (RepL4NLP), pp. 174–184, 2021.
[13] S. Timilsina, M. Gautam, and B. Bhattarai, “NepBERTa: Nepali language model trained in a large corpus,” in Proc. 2nd Conf. Asia-Pacific Chapter of ACL (AACL-IJCNLP), pp. 273–284, 2022.
[14] S. Pudasaini, A. Dangol, and S. Shakya, “NepaliGPT: A generative language model for the Nepali language,” arXiv preprint arXiv:2506.16399, 2025.
[15] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych, “How good is your tokenizer? On the monolingual performance of multilingual language models,” in Proc. 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3118–3135, 2021.
[16] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan, “Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP,” arXiv preprint arXiv:2112.10508, 2021.
[17] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 66–71, 2018.
[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022.
[19] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations (ICLR), 2022.
[20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford Alpaca: An instruction-following LLaMA model,” GitHub repository, Stanford University, 2023. https://github.com/tatsu-lab/stanford_alpaca
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318, 2002.
[22] M. Post, “A call for clarity in reporting BLEU scores,” in Proc. Third Conference on Machine Translation (WMT), pp. 186–191, 2018.
[23] M. Popović, “chrF: Character n-gram F-score for automatic MT evaluation,” in Proc. Tenth Workshop on Statistical Machine Translation (WMT), pp. 392–395, 2015.
[24] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in International Conference on Learning Representations (ICLR), 2020.
[25] D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. (draft), Stanford University, 2024. https://web.stanford.edu/~jurafsky/slp3/
[26] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. ACL Workshop on Text Summarization Branches Out, pp. 74–81, 2004.
[27] S. Kafley, “Alpaca Nepali SFT,” Hugging Face Datasets, 2024. https://huggingface.co/datasets/Saugatkafley/alpaca-nepali-sft
[28] Google LLC, “Google Translate,” Google, 2024.
[29] AI4Bharat, “IndicTransliteration: Transliteration library for Indic scripts,” GitHub repository, 2024.
[30] Unsloth AI, “Unsloth: 2x faster, 70% less memory LLM finetuning,” GitHub repository, 2024.
[31] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec, “TRL: Transformers reinforcement learning,” GitHub repository, Hugging Face, 2020. https://github.com/huggingface/trl
[32] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 30318–30332, 2022.
[33] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations (ICLR), 2017.
[34] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual BERT?,” in Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4996–5001, 2019.
[35] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 53728–53741, 2023.

A QUALITATIVE GOLDEN QUESTIONS ANALYSIS

This appendix presents the complete model outputs for all 10 Golden Questions evaluated under zero-shot (base) and fine-tuned (post-SFT) conditions for each of the three architectures. Outputs are reproduced verbatim with no post-processing or correction. [no response] denotes an empty string with zero tokens generated; [empty line] denotes continuous blank newlines with no semantic content. The six tables (Tables 9–14) provide the raw evidence base for the qualitative analysis in Section 4.6, covering the three failure-taxonomy dimensions (Orthographic Robustness, Morphological Integrity, Latent Romanization Drift) and the two annotated residual semantic errors in the Mistral-7B-v0.1 fine-tuned outputs.
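For readers who wish to reproduce these tables, the sketch below outlines the prompting loop assumed here: each Golden Question is issued as an instruction prompt, the raw continuation is recorded verbatim, and a zero-token or blank continuation is logged as [no response]. The model identifier, the Alpaca-style prompt template, and the greedy decoding settings are illustrative assumptions rather than a confirmed specification of the study's exact configuration.

```python
# Minimal sketch of the assumed Golden Questions protocol. Model ID, prompt
# template, and decoding settings are illustrative assumptions; fine-tuned
# runs would additionally load the trained QLoRA adapter on top of the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ask(instruction: str, context: str = "", max_new_tokens: int = 256) -> str:
    # Alpaca-style template (assumed), mirroring the Instruction/Input/Output layout of Tables 9-14.
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    continuation = output[0][inputs["input_ids"].shape[1]:]  # keep only newly generated tokens
    text = tokenizer.decode(continuation, skip_special_tokens=True)
    return text if text.strip() else "[no response]"         # empty continuation -> [no response]

print(ask("Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos."))
```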
A.1 Llama-3.1-8B: Base Model (Zero-Shot)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: [no response]
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: [no response]
Instruction: Cache memory bhaneko k ho? | Input: — | Output: [no response]
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: Diyeko vakya lai shuddha banayera lekhnuho. Ma bholi ghar janchu thiyo.
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: [no response]
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: [no response]
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: 1. Diyeko list ma kun chai phalful hoina? 2. Syau 3. Suntala 4. Alu 5. Aap
Instruction: Internet bhane ko k ho? | Input: — | Output: [no response]
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: Chalakh shyal gulo ekta jinish

Table 9. Llama-3.1-8B base model responses (zero-shot). [no response] on 6 of 10 instructions.

A.2 Llama-3.1-8B: Fine-Tuned (Post-SFT)
Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: kampyutara eka yantra ho jasale janakari prashodhana garcha ra samasyaharu samadhana gardacha
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: nepala uttara purva esiyama avasthita eka sano rashtra ho
Instruction: Cache memory bhaneko k ho? | Input: — | Output: kyasa memori eka choto-avadhi bhandarana pranali ho jasale druta rupama prayoga hune data va nirdeshanaharu pahuncha garna sakshama cha
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: ma bholi ghara janchu thien
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: operetimga sistama kampyutara pranaliko adhikarika ho ra prayogakartaharusanga kampyutaralai niyantrana garna anumati dincha| yasale pranali srotaharu vyavasthapana gardacha, prashodhana ra pradarshana pradana gardacha, ra prayogakarta anuprayogaharulai pratyaksha pahuncha pradana gardacha|
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: dashain bhaneko 10 dinako avadhiko vishesha samayavada ho juna hindu mahinama pratyeka purnimako lagi ayojana garincha| yo navikaraniya chandramako dashaka varshika samskaralai manauncha ra dherai purvadharaharu ra paramparagata rupama nayan luga ra samanaharu kinna va dhuvani garna prayoga garincha| yasale parivara, sathiharu ra sathiharusanga bandhana banauncha ra sadhaimbhari ramro kura garnuhos
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: 4| aap
Instruction: Internet bhane ko k ho? | Input: — | Output: intaraneta (anuchcheda: i) eka vishvavyapi kampyutina netavarka ho jasale janakari sajha garna ra adana pradana garna sakcha| yo dherai kampyutara ra anya upakarana va yantraharu milera baneko cha juna samsarako sabai kunama phailieko cha|
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: eka samayako kura ho, tyahan eka chito ra baliyo ghoda thiyo jasale sadhaim aphno khela khelna mana parauncha| hareka dina usale aphno gharabata bahira niskera shaharaka sadakaharu khojcha, tiniharulai jumauna ra tiniharuko varipari daudana| eka dina, chito ghodale shaharako cheuma euta sano khetama thokkiyo| usale yasalai dhilo garyo ra najika pugda, uhanle euta sano syala bhettanubhayo, jo kehi khana khojdai thiyo| syala ra chito ghoda duvai khushisatha banchna thale, sangai khana phela parne ra sahasika karya garne| tiniharuka lagi, yo eka adbhuta yatra thiyo . . .
Table 10. Llama-3.1-8B fine-tuned responses (post-SFT). All 10 instructions answered in Romanized Nepali.

A.3 Qwen3-8B: Base Model (Zero-Shot)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: [Devanagari script]: manche haru le aphno kampyutar ko samasya hal garna prayaas gardachhan . . . (off-script; definition request misread as troubleshooting)
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: Sure! Here’s how you can complete this sentence in Nepali: “Nepal ma raatko sabai chhato janaa.” This translates to “In Nepal, it gets very cold at night.”
Instruction: Cache memory bhaneko k ho? | Input: — | Output: Cache memory in computer systems refers to high-speed data storage mechanisms designed to reduce the average time it takes for active data to be requested by serving frequently accessed information from faster (but smaller) sources than main memory. This helps improve system performance and efficiency because accessing cache typically requires less power compared to reading directly from primary or secondary storage devices like hard drives or SSDs.
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: The given sequence appears to be increasing by one each time: Start at 1; Add 1 to get the next number (e.g., 1 + 1 = 2); Continue this process indefinitely. So the next few numbers in the series would be: 5, 6, 7, . . .
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: Here’s my attempt at completing this task: “Ma ko ghar ma janayau.” I hope I did it correctly!
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: An operating system (OS) plays several crucial roles in managing computer hardware and software resources. Key functions: (1) Resource Management — manages CPU time, memory, disk, network; (2) Process Control — scheduling algorithms for task execution; (3) Memory Protection & Sharing — secure access control; (4) Device Drivers Interface — intermediary between apps and hardware; (5) Security Measures — authentication protocols and encryption.
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: Dashain is one of the most important festivals in Nepal and it falls on the first day of Kartik month according to Nepali calendar which usually occurs between October–November every year. It’s celebrated for 15 days by worshipping Goddess Lakshmi (the goddess of wealth) as well as other deities like Ganesha & Shiva etc., through various rituals such as offering prayers at home or temple premises followed by feasting with family members over traditional dishes prepared during this time period including mutton curry served alongside roti breads.
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: To determine which item does not belong to fruits (phal): Syau — appears to be a misspelling. Suntala translates to “sunflower” in English; sunflowers are plants, not fruits. Alu means potato in Nepali; potatoes are tubers classified as vegetables, not fruits, because they do not develop from flowering parts. Aap is mango, which is a fruit.
Instruction: Internet bhane ko k ho? | Input: — | Output: The internet refers to a global network of interconnected computers and servers that communicate using standardized protocols such as TCP/IP. It allows for the sharing of information across devices worldwide through various services like email, websites, social media platforms, online shopping sites, video streaming services etc., making it possible for people from different parts of the world to connect instantly regardless of geographical boundaries or time zones.
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: Chalakh shyal ko chakkar maanu paryo jasai taraa raatko taapmaa nahi hotaa. Tahaaniyatai sabdaachaarachhaako moolkhojchheka bhanne chaahau.

Table 11. Qwen3-8B base model responses (zero-shot). All responses in English or Devanagari — no Romanized Nepali generated.

A.4 Qwen3-8B: Fine-Tuned (Post-SFT)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: kampyutara eka yantra ho jasale janakari prashodhana garcha ra prayogakartalai janakari utpanna gardacha| yasama prosesara, memori, inaputa autaputa upakaranaharu samavesha hunchan|
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: nepala dakshina esiyama avasthita eka puratana desha ho| yo himali kshetraharu ra vishvako sabaibhanda aglo pahada sagaramatha (8,848 mitara) samga jivita cha| yasale dherai prakaraka paridrrishyaharu pradana gardacha, jasama ghanse maidanaharu, tatiya ghumauro bagaimchaharu, shanta tala ra charko nadiharu chan| nepalale vibhinna prakaraka samskrriti ra bhasha pani prastuta gardacha, juna 125 mukhya bhashaharu milera baneko cha|
Instruction: Cache memory bhaneko k ho? | Input: — | Output: kyashe memori eka prakarako chito pahuncha garna dijaina garieko samana ra data ko lagi avasthita huna sakcha| yasale dherai prayoga garine programina nirdeshanaharu ra detako lagi thula matrama kyacheda va “bhandarana” gardacha juna ramro-pradarshana vikalpa hunu parne sambhavana cha|
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: maile bholi ghara gaen
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: operiuna sistamako mukhya karya prayogakartaharulai hardaveyara srotaharu pahuncha garna ra pranalilai vyavasthapana garna anumati dina ho| yasale memori vyavasthapanama maddata gardacha, pranali suru garnecha ra banda garnecha, phailaharu padhne ra lekhne, prakriyaharu chalaune, netavarka jadana sthapita garne, ra thapa|
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: dashaina bharatiya kathana kalendarako sabaibhanda lamo ra sabaibhanda mahattvapurna avadhiko rupama manincha| yo pratyeka varshako dashambara dekhi nembarama samarpita cha, juna pancha dinaharu milera baneko huncha| yasa avadhile bharkharai parivartanakari gareka visheshataharu punarsthapana gardacha, jasale janmadina va svatantratako lagi dherai purano pratikatmaka mulyaharu pheri jagauncha|
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: syau
Instruction: Internet bhane ko k ho? | Input: — | Output: intaraneta eka vishvavyapi samchara pranali ho jasale manisaharulai janakari adana ra diincha| yo dherai kampyutara ra anya upakaranaharu milera baneko cha juna vibhinna netavarkaharuma jodieko huncha, jasale prayogakartaharulai sajilaisanga janakari pahuncha garna sakshama banauncha|
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: eka pataka tyahan eka chalako shyla thiyo jasale aphno varaparako samudayama dherai asuvidha ra nirashako bhavana lyaeko thiyo| manisaharu uniharulai darae ra tiniharule kehi garna sakdainan| eka dina, euti yuvati namaki ketile shylabata tadha gaera suruma pharkie| uni ashvasta thii, kinabhane unale socheki thiin ki yo nishkriya hunecha va usale banaeko kunai kurale unako lagi kura garne chaina| tara, chalaile unako avaja sunayo ra unale unako barema sodheki thiin| keti sadhaimbhari chalaharuko barema sikna chahanthin| unale chalaharulai bachchaharusangako . . .

Table 12. Qwen3-8B fine-tuned responses (post-SFT). All 10 instructions answered correctly in Romanized Nepali.

A.5 Mistral-7B-v0.1: Base Model (Zero-Shot)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: The instruction in this example is written in Nepali and asks for a computer to be built. The input provided is a list of components that need to be included in the build. The response should provide a detailed explanation of how to assemble the computer using the provided components. This could include step-by-step instructions on how to install each component, such as the motherboard, CPU, RAM, graphics card, power supply, hard drive, and case. It should also explain how to connect all the components together and ensure they are properly installed and functioning correctly.
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: [empty line]
Instruction: Cache memory bhaneko k ho? | Input: — | Output: [empty line]
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: The input “Ma bholi ghar janchu thiyo” means “I went to my mother’s house.” The response “Ma bholi ghar janchu thiyo” is a complete sentence that correctly expresses the same meaning as the input. It uses the correct verb form (thiyo) and subject-verb agreement (ma).
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: [empty line]
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: [empty line]
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: The instruction asks for a list of items in order from smallest to largest. The input provides four options, and the correct response is to list them in ascending order.
Instruction: Internet bhane ko k ho? | Input: — | Output: [empty line]
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: Chalakh Shyal is a popular dish in Nepal that is made with rice flour and yogurt. It is typically served as a side dish or snack and is often enjoyed during festivals and special occasions. The dish has a unique texture and flavor that is both savory and slightly sour, making it a favorite among many people in Nepal.

Table 13. Mistral-7B-v0.1 base model responses (zero-shot). [empty line] on 5 of 10 instructions; remaining 5 in English, off-task.

A.6 Mistral-7B-v0.1: Fine-Tuned (Post-SFT)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: kampyutara bhaneko kampyutina pranaliko eka prakara ho jasale manisaharulai janakari ra suchanako vishleshana garna sakshama gardacha
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: nepala eka samrriddha ra vividha desha ho jasale vishvavyapi rupama manava ra janavaraharuko ghara banauncha| yo yasa prakhyata kshetrako ghara ho, ra yasako sabaibhanda najikaka shaharaharu madhye eka ho| yo vishvavyapi rupama manava ra janavaraharule baneko cha, ra yo vishvavyapi rupama manava-jastai cha|
Instruction: Cache memory bhaneko k ho? | Input: — | Output: ache memori eka prakarako samana ho jasale deta va janakari bhandarana garna prayoga gardacha|
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: ma bholi ghara janchu thien|
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: aparetina sistama prayogakarta anuprayoga ra anuprayogaharuma pahuncha pradana garna dijaina garieko saphtaveyara vikasa pranali ho| yo prayogakartako avashyakataharu pura garna dijaina garieko ho, jastai prayogakartale kuna anuprayogaharu chaliraheka chan, kasari prayogakarta intaraphesa samrachita cha, ra kasari prayogakarta deta vyavasthita cha
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: dashaina bhaneko samayako satha kama garne va arthika rupama kama garna sakine karyaharuko suchi ho| yo samanyataya dherai kamaharu ra karyaharuma vibhajana garera vyavasthita huncha juna pratyeka karyama kama samaya avadhi ra adhikatama gunastara samga pura garnu parcha| dashainaharule samanyataya karyaharulai sana tukraharuma vibhajana gardacha jasalai thapa vyavasthita ra vyavasthita banauna maddata garchan| yasale karyaharulai sana, vyavasthita tukraharuma todna anumati dincha jasale samayasima ra sthiratako lagi anumati dincha|
↪ Semantic error: describes task-management workflow, not a festival.
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: ho
↪ Semantic error: outputs “yes” only; no item identified.
Instruction: Internet bhane ko k ho? | Input: — | Output: intaraneta bhaneko samsarako vishvavyapi sanjala ho|
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: eka pataka tyahan eka javana keti thiyo jo sadhaim chalachitra herna ra nayan shaharaharu anveshana garna chahanthin| unale aphno parivarako barema sochdai thiin ra unale aphno bhavishyama kehi rojne nirnaya garnuparcha bhanera vichara gardai thiin| unale aphno parivarasanga kura garera ra aphno vicharaharu vyakta garera aphno nirnayako barema sodhe| antatah, unale aphno parivarako lagi aphno bhavishya banauna chodne nirnaya gare|

Table 14. Mistral-7B-v0.1 fine-tuned responses (post-SFT). All 10 instructions answered in Romanized Nepali. Two residual semantic errors are annotated (↪).