Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Summary
This study benchmarks the adaptation of three open-weight 7–8B language models—Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B—to Romanized Nepali, a Latin-script variant of the Nepali language used in informal digital communication. All models failed to generate Romanized Nepali in zero-shot settings, each exhibiting distinct failure modes linked to their tokenization strategies. After fine-tuning with QLoRA and rsLoRA on a 10,000-sample bilingual dataset, all models achieved BERTScore ≈ 0.75 and chrF++ > 23. Qwen3-8B was recommended as the overall best, leading in structural alignment metrics and producing zero residual errors in a qualitative case study. Llama-3.1-8B showed the largest absolute gains in PPL and BERTScore, supporting the adaptation headroom hypothesis: models with weaker zero-shot baselines can achieve greater improvement through fine-tuning. Mistral-7B-v0.1 had the lowest perplexity but retained semantic errors. The study establishes a reproducible baseline for Romanized Nepali adaptation, highlighting the effectiveness of parameter-efficient fine-tuning in low-resource settings.
BENCHMARKING LINGUISTIC ADAPTATION IN COMPARABLE-SIZED LLMs: A STUDY OF LLAMA-3.1-8B, MISTRAL-7B-v0.1, AND QWEN3-8B ON ROMANIZED NEPALI
Ananda Rimal, Dept. of Computer Science & Engineering, Nepal Engineering College, anandr022342@nec.edu.np
Adarsha Rimal, Central Dept. of CS and IT, Tribhuvan University, adarsharimal07@gmail.com
ABSTRACT
Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically under-resourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) [1] with Rank-Stabilized LoRA (rsLoRA) [2] at rank r = 32 on dual NVIDIA Tesla T4 GPUs, training only ≈ 1% of each model’s parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore ≈ 0.75 and chrF++ > 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (∆ = −49.77)
and BERTScore (∆ = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
1 INTRODUCTION
The proliferation of social media and mobile messaging platforms has established Romanized Nepali, the Nepali language written using the Latin alphabet, as the de facto standard for informal digital discourse in Nepal. Despite its ubiquity, Romanized Nepali remains significantly under-resourced in Natural Language Processing. Unlike formal Devanagari script, Romanized Nepali lacks a standardized orthography, leading to extreme phonetic variation where a single word may be transliterated in multiple ways, such as “khana” and “khaana” [3].
Nepali is the official language of Nepal, spoken by approximately 44.86% of the population [4]. The language is traditionally written in Devanagari script, comprising 36 consonants and 13 vowels. Current LLMs are predominantly trained on standardized scripts and formal datasets, leaving transliterated variants in a “linguistic shadow” [5]. This gap results in poor performance on downstream tasks such as sentiment analysis, machine translation, and conversational AI in the Nepali digital context.
This paper investigates the robustness and adaptation capabilities of three comparable-sized LLMs: Llama-3.1-8B [6], Mistral-7B-v0.1 [7], and Qwen3-8B [8]. We focus on this size class due to its increasing importance for on-device deployment and resource-efficient fine-tuning in low-resource environments. Through a
systematic benchmark, we evaluate how different architectures handle the phonetic noise and non-standard syntax inherent in Romanized Nepali, providing a foundation for future generative AI applications for the Nepali diaspora. To the best of our knowledge, this is the first comprehensive study focusing specifically on the transliteration resilience of comparable-sized LLMs on Romanized Nepali script.
2 LITERATURE REVIEW
2.1 Transformer-Based Language Models
The transformer architecture introduced by Vaswani et al. [9] marked a fundamental shift in NLP by replacing recurrence with self-attention, enabling efficient parallel processing and long-range dependency modeling. This underpins virtually all modern LLMs. Brown et al. [10] demonstrated that scaling transformer models to billions of parameters yields emergent few-shot and zero-shot generalization, establishing the paradigm of large-scale pre-training followed by lightweight task-specific adaptation that all three model families in this study follow.
2.2 Open-Weight Models in the 7–8B Class
Recent open-weight LLMs have made high-performance language modeling accessible for low-resource research. Touvron et al. [6] introduced the LLaMA family, demonstrating competitive performance at 7–65B parameters through careful data curation. Jiang et al. [7] released Mistral 7B, incorporating grouped-query and sliding window attention for improved inference efficiency. Bai et al. [8] presented the Qwen series with broader multilingual pre-training coverage, providing stronger cross-lingual transfer to non-English scripts. These three model families form the subject of the present study due to their comparable
size (≈7–8B parameters), open-weight availability, and suitability for on-device deployment.
2.3 Nepali NLP: History and Current State
Natural Language Processing for Nepali has a relatively short but accelerating history. Early work focused on rule-based and statistical approaches to morphological analysis and part-of-speech tagging, motivated by Nepali’s highly agglutinative morphology and complex sandhi rules that complicate standard tokenization [3]. Bam and Shahi [11] demonstrated that even basic preprocessing such as stopword removal and stemming requires language-specific adaptation, as standard multilingual tools fail to account for Devanagari’s orthographic properties. Koirala and Niraula [12] introduced word embedding models for Nepali, establishing distributional semantic resources for formal text but explicitly noting that Romanized Nepali remains outside their coverage. Timilsina et al. [13] adapted multilingual BERT via continued pre-training on a Devanagari corpus to produce NepBERTa, showing strong performance on formal NLP tasks. Most recently, Pudasaini et al. [14] developed NepaliGPT, the first autoregressive generative model for Nepali, again focused exclusively on Devanagari. Across this body of work, informal Romanized Nepali has remained consistently unaddressed; this is the principal gap this study targets.
2.4 Tokenization for Low-Resource and Romanized Scripts
Tokenization is a foundational bottleneck for LLM performance on low-resource languages. Rust et al. [15] conducted a systematic evaluation of multilingual BERT tokenizers across 99 languages, demonstrating that tokenizer quality is a strong predictor of
downstream task performance. Languages with high subword fertility experience longer effective sequence lengths, reduced context window utility, and higher inference cost [15]. Mielke et al. [16] showed that agglutinative languages and those with non-Latin scripts are disproportionately harmed by BPE-based vocabularies trained on predominantly English corpora. Romanized Nepali faces a compounded version of this problem: it is neither standard English nor Devanagari, placing it in a tokenization dead zone between the two primary training distributions of all three studied models. Kudo and Richardson [17] introduced SentencePiece, a language-agnostic subword tokenizer that treats text as a raw character sequence without whitespace assumptions, giving it better generalization to unseen scripts. Mistral-7B-v0.1 and Qwen3-8B both use SentencePiece [7, 8], while Llama-3.1-8B uses Tiktoken [6], a BPE implementation optimized for English-heavy data. The zero-shot failure analysis in this study directly reflects these tokenizer design differences.
2.5 Parameter-Efficient Fine-Tuning
Full fine-tuning of LLMs is computationally prohibitive in low-resource settings. Hu et al. [18] introduced LoRA, which injects trainable low-rank matrices into frozen transformer weights, reducing trainable parameters to under 1% of the base model. Dettmers et al. [1] extended this with QLoRA, combining 4-bit NF4 quantization with LoRA adapter training, enabling fine-tuning on consumer-grade hardware. Kalajdzievski [2] identified rank-dependent scaling instability in standard LoRA and proposed rsLoRA, replacing the α/r scaling factor with α/√r to stabilize training at r = 32,
the rank used in this work. Instruction fine-tuning [19] has been shown to be particularly effective for generalization to new task formats, training models to follow structured Instruction-Input-Output templates. The Alpaca dataset format [20] used in this study is a widely adopted instantiation of this approach.
2.6 Evaluation Metrics for Low-Resource Generation
Papineni et al. [21] introduced BLEU, the dominant n-gram precision metric for machine translation, though its reliance on exact surface-form matches makes it poorly suited to non-standardized scripts [22]. Popović [23] proposed chrF, a character n-gram F-score more robust to spelling variation. Zhang et al. [24] introduced BERTScore, evaluating semantic similarity via contextual embeddings. Perplexity, grounded in information theory [25], measures predictive confidence and serves as the primary checkpoint selection criterion in this work. Lin [26] proposed ROUGE-L for structural alignment via longest common subsequence matching.
2.7 Research Gap
Across the reviewed literature, three gaps are evident. First, all existing Nepali NLP work targets the formal Devanagari script, leaving informal Romanized Nepali unaddressed [3, 13, 14]. Second, while tokenizer quality is known to predict downstream performance for low-resource scripts [15, 16], no study has compared tokenizer design across competing comparable-sized LLM architectures for a non-standardized transliterated script. Third, no prior work has benchmarked the adaptation of comparable-sized open-weight LLMs to Romanized Nepali under a rigorous multi-metric framework. This study addresses all three gaps.
3 METHODOLOGY
This section describes the complete experimental pipeline consisting of seven sequential stages. Stage 1 covers source corpus selection from an existing Devanagari instruction-following dataset. Stage 2 applies bilingual transformation: the first 5,000 samples undergo selective English instruction translation while retaining Romanized Nepali Input and Output fields, and the remaining 5,000 samples undergo full phonetic transliteration of all three Alpaca fields to Romanized Nepali. Stage 3 partitions the transformed 10,000-sample corpus into a 9,000-sample training set and a 1,000-sample held-out test set. Stage 4 applies parameter-efficient fine-tuning via QLoRA [1] with rsLoRA [2] at rank r = 32 on dual NVIDIA Tesla T4 GPUs across all three architectures. Stage 5 conducts dual-stage evaluation comparing zero-shot and fine-tuned performance on the held-out test set. Stage 6 scores all outputs across five metrics spanning seven measurement dimensions: PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. Stage 7 performs qualitative case study analysis across 10 Golden Questions. Figure 1 provides a visual overview of the full pipeline.
Figure 1. Experimental pipeline: data preparation, parameter-efficient fine-tuning, and evaluation across three model architectures.
3.1 Dataset Construction and Transliteration
3.1.1 Source Corpus Selection
The primary data source is the Saugatkafley/alpaca-nepali-sft dataset [27], an instruction-following corpus originally in Devanagari script containing approximately 52,000 samples. We extracted a curated subset of 10,000 samples, maximising diversity across instruction types including logical reasoning, creative writing, and factual Q&A. Sequences exceeding 512 tokens were truncated for computational efficiency.
3.1.2 Bilingual Transformation Pipeline
All 10,000 samples originate in Devanagari and undergo one of two transformation paths depending on their partition assignment, as illustrated in Stage 2 of Figure 1:
1. Semantic Translation (5,000 samples). For the first 5,000 samples, only the Instruction field was translated from Devanagari into English using the Google Translate engine [28]. The corresponding Input and Output fields were converted to Romanized Nepali via phonetic transliteration using the Indic-Transliteration library [29]. This asymmetric design enables the model to learn the mapping between high-resource English prompting and low-resource Romanized Nepali responses, improving cross-lingual instruction following without sacrificing output-script consistency.
2. Full Phonetic Transliteration (5,000 samples). For the remaining 5,000 samples, all three Alpaca fields (Instruction, Input, and Output) were converted from Devanagari to Romanized Nepali using the Indic-Transliteration library [29]. Phonetic variations were intentionally permitted (e.g., mapping ‘chha’ to both ‘cha’ and ‘chha’) to replicate the non-standardized orthography of real-world digital Nepali communication.
The resulting dataset contains a 50/50 split of English-instructed and Romanized-instructed samples, with all Output fields consistently in Romanized Nepali across both partitions. All samples follow the Alpaca instruction-following schema [20] (Instruction, Input, Output).
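To make the transformation concrete, the sketch below illustrates the Stage 2 Devanagari-to-Romanized step under stated assumptions: it uses the Indic-Transliteration package cited as [29], but the target romanization scheme (ITRANS), the lower-casing, and the variant-injection rule are illustrative choices rather than the authors' exact pipeline.

```python
# Illustrative sketch of the Stage 2 Devanagari -> Romanized Nepali step.
# Assumptions: indic_transliteration [29]; ITRANS as the Latin target scheme and
# the variant-swap rule below are hypothetical, not the paper's exact settings.
import random
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def romanize(devanagari_text: str) -> str:
    """Convert a Devanagari string to a lower-cased Latin-script approximation."""
    return transliterate(devanagari_text, sanscript.DEVANAGARI, sanscript.ITRANS).lower()

def add_phonetic_variation(text: str, p: float = 0.3) -> str:
    """Randomly swap phonetically equivalent spellings (e.g. 'chha' vs 'cha')
    to mimic the non-standardized orthography described above."""
    for old, new in [("chha", "cha"), ("aa", "a")]:
        if random.random() < p:
            text = text.replace(old, new)
    return text

# Full phonetic transliteration path (partition 2): all three Alpaca fields.
sample = {"instruction": "नेपालको राजधानी के हो?", "input": "", "output": "नेपालको राजधानी काठमाडौं हो।"}
romanized = {k: add_phonetic_variation(romanize(v)) if v else v for k, v in sample.items()}
print(romanized["output"])
```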
3.1.3 Train/Test Partitioning
Following bilingual transformation, the full 10,000-sample corpus was partitioned into a 9,000-sample training set and a 1,000-sample held-out test set (the “1k test set”). The test set was sequestered before any training decision to prevent data leakage and is used exclusively for evaluation in Stages 5–7 of the pipeline.
3.2 Parameter-Efficient Fine-Tuning with QLoRA
To adapt all three models within the constraints of dual NVIDIA Tesla T4 GPUs (16 GB VRAM each), we employed QLoRA [1] accelerated via the Unsloth framework [30]. The fine-tuning stack combines 4-bit NF4 quantization, low-rank adapter injection, and rank-stabilized scaling.
3.2.1 4-bit NormalFloat (NF4) Quantization
All base model weights were frozen and compressed to 4-bit NF4 precision [1]. NF4 is an information-theoretically optimal 4-bit data type whose 2^4 = 16 quantization levels are non-uniformly spaced to minimise expected quantization error under a standard normal prior N(0, 1), unlike uniformly spaced INT4. Quantization is applied block-wise:

$$W_{\mathrm{NF4}} = \mathrm{quantize}_{\mathrm{NF4}}\!\left(\frac{W}{\max\lvert W \rvert}\right) \qquad (1)$$
Per-block scaling constants are stored in 32-bit float for dequantization at inference. Base weights receive no
gradient updates; only adapter matrices are trained.
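For reference, the sketch below shows a 4-bit NF4 loading configuration equivalent to the setup just described, expressed with Hugging Face Transformers and bitsandbytes; the paper itself loads models through Unsloth [30], so these calls and the model identifier are assumptions, not the authors' code.

```python
# Sketch of 4-bit NF4 loading (frozen base weights) via transformers + bitsandbytes.
# The model id and dtype choices are assumptions; the paper uses the Unsloth wrapper [30].
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # compress frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4: levels matched to an N(0, 1) prior
    bnb_4bit_compute_dtype=torch.float16,  # de-quantized matmuls run in fp16 on Tesla T4s
    bnb_4bit_use_double_quant=False,       # keep per-block scaling constants in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",             # same pattern applies to Mistral-7B-v0.1 and Qwen3-8B
    quantization_config=bnb_config,
    device_map="auto",
)
```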
3.2.2 Low-Rank Adapter Injection
Following Hu et al. [18], trainable low-rank matrices are injected alongside each target weight. For a frozen
weight $W_{\mathrm{NF4}} \in \mathbb{R}^{d \times d}$, the adapted forward pass becomes:

$$h = W_{\mathrm{NF4}}\,x + \frac{\alpha}{r}\,B A\,x \qquad (2)$$

where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are trainable adapter matrices, $r \ll d$ is the adapter rank, and α is a scaling hyperparameter. At initialisation, $A \sim \mathcal{N}(0, \sigma^2)$ and $B = 0$, ensuring zero adapter contribution at step 0.
Adapters were injected into seven projection layers per transformer block:
Target modules = {q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj}
All other weights remained frozen. We selected r = 32 and α = 64, higher than common defaults of
r ∈ {8, 16} [18], to provide sufficient capacity for the complex mapping from English-centric pre-training to
Romanized Nepali phonetics.
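This adapter injection maps directly onto a standard PEFT configuration. The sketch below is an assumed equivalent of the authors' Unsloth setup, with the stated rank, alpha, and seven target modules; the dropout value is a guess (it is not reported), and the use_rslora flag anticipates the rank-stabilized scaling introduced in the next subsection.

```python
# Adapter configuration matching r = 32, alpha = 64 and the seven projection modules.
# Sketch via the PEFT library; the paper's actual calls go through Unsloth [30].
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,        # assumption: dropout is not reported in the paper
    bias="none",
    task_type="CAUSAL_LM",
    use_rslora=True,         # alpha / sqrt(r) scaling (rsLoRA), see Section 3.2.3
)

model = get_peft_model(model, lora_config)  # 'model' is the 4-bit base from the previous sketch
model.print_trainable_parameters()          # roughly 1% trainable, cf. Table 6
```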
3.2.3 Rank-Stabilized Scaling (rsLoRA)
Standard LoRA’s α/r scaling progressively suppresses adapter influence as rank increases [2]. At r = 32
this is non-trivial. rsLoRA replaces 1/r with 1/√r:
$$h = W_{\mathrm{NF4}}\,x + \frac{\alpha}{\sqrt{r}}\,B A\,x \qquad (3)$$

For our configuration (r = 32, α = 64):

$$\frac{\alpha}{\sqrt{r}} = \frac{64}{\sqrt{32}} \approx 11.31 \quad \text{vs.} \quad \frac{\alpha}{r} = \frac{64}{32} = 2.0 \qquad (4)$$

This 5.7× increase in effective adapter contribution ensures high-rank adapters meaningfully influence the forward pass without gradient instability [2].
3.3 Experimental Protocol
The study follows a structured three-phase protocol: (i) Zero-Shot Baseline Assessment to quantify innate cross-lingual transfer, (ii) Supervised Fine-Tuning (SFT) on a curated bilingual Romanized Nepali instruction corpus using QLoRA with rsLoRA, and (iii) Dual-Stage Evaluation comprising quantitative benchmarking across five metrics spanning seven measurement dimensions and a qualitative case study across 10 Golden Questions. All three phases share the same held-out 1,000-sample test set to prevent data leakage, with all reported metrics reflecting best-checkpoint weights recovered via minimum validation loss.
3.3.1 Phase 1: Zero-Shot Baseline Assessment
Unmodified base weights of all three models were evaluated on the 1k test set with no system prompt, template, or few-shot examples. This zero-shot condition quantifies each model’s innate ability to process Romanized Nepali arising solely from pre-training [10], establishing the performance floor from which all fine-tuning gains ∆ are measured.
3.3.2 Phase 2: Supervised Fine-Tuning (SFT)
All three models were fine-tuned on the 9,000-sample training partition using the SFTTrainer from the TRL library [31] in Instruction – Output format. The QLoRA + rsLoRA configuration from Section 3.2 was applied identically across all architectures so that Phase 3 differences are attributable to architectural properties, not configuration asymmetry.
Optimizer and Learning Rate Schedule. An 8-bit AdamW optimizer [32] was used with a peak learning rate of 1 × 10⁻⁴. A 200-step linear warmup prevented gradient divergence during initial exposure to Romanized text, followed by cosine decay [33].
Batch Configuration. A micro-batch size of 2 per device with 4 gradient accumulation steps yields an effective batch size of:

$$B_{\mathrm{eff}} = B_{\mathrm{device}} \times N_{\mathrm{accum}} \times N_{\mathrm{GPU}} = 2 \times 4 \times 2 = 16 \qquad (5)$$

All models were trained for 3 epochs, corresponding to 3,375 optimizer steps.
Golden Point Checkpoint Recovery. Terminal-epoch overfitting is a documented risk in low-resource PEFT settings [1]. Checkpoints and validation evaluations were synchronised every 100 steps with a rolling window of 10 saved checkpoints. Upon completion, load_best_model_at_end restores the checkpoint of minimum validation loss L_eval. Since

$$\mathrm{PPL} = \exp(L_{\mathrm{eval}}) \qquad (6)$$

minimising L_eval directly optimises the primary metric, ensuring all Phase 3 scores reflect best-checkpoint weights.

Category | Parameter | Value
Optimization | Optimizer | AdamW (8-bit) [32]
Optimization | Learning Rate | 1 × 10⁻⁴
Optimization | Warmup Steps | 200
Batch Logic | Per-Device Batch Size | 2
Batch Logic | Effective Batch Size | 16 (2 × 4 × 2 GPUs)
Hardware | GPUs | Dual NVIDIA Tesla T4 (16 GB each)
Hardware | Quantization | 4-bit NF4 [1]
LoRA Adapters | Rank r | 32
LoRA Adapters | Alpha α | 64
LoRA Adapters | Target Modules | q, k, v, o_proj, gate, up, down_proj
LoRA Adapters | Scaling | rsLoRA (α/√r) [2]
Checkpointing | save_strategy | steps
Checkpointing | save_steps | 100
Checkpointing | save_total_limit | 10
Checkpointing | load_best_model_at_end | True
Checkpointing | metric_for_best_model | eval_loss

Table 1. Training, hardware, adapter, and checkpointing configuration applied identically to all three models.
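A plain Hugging Face rendering of Table 1 is sketched below using TRL's SFTTrainer [31] and standard TrainingArguments; the authors trained through Unsloth, so the argument names, dataset variables, and optimizer string are assumptions chosen to reproduce the reported settings.

```python
# Sketch of the Phase 2 configuration in Table 1 with TRL + transformers.
# train_ds / eval_ds are assumed dataset objects; argument names follow recent library versions.
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="romanized-nepali-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,     # 2 x 4 x 2 GPUs = effective batch size 16 (Eq. 5)
    num_train_epochs=3,                # 3,375 optimizer steps
    learning_rate=1e-4,
    warmup_steps=200,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",            # 8-bit AdamW [32]
    eval_strategy="steps",             # named evaluation_strategy in older transformers releases
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=10,
    load_best_model_at_end=True,       # "Golden Point" recovery of the minimum-eval-loss checkpoint
    metric_for_best_model="eval_loss",
)

trainer = SFTTrainer(
    model=model,                       # quantized base + LoRA adapters from the earlier sketches
    args=args,
    train_dataset=train_ds,            # 9,000-sample training partition (assumed variable)
    eval_dataset=eval_ds,              # validation split used for checkpoint selection (assumed)
)
trainer.train()
```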
3.3.3 Phase 3: Dual-Stage Evaluation
Stage A: Quantitative Benchmarking. All seven measurement dimensions across five metrics (PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, BLEU) were computed on the 1k test set for both zero-shot (Phase 1) and fine-tuned (Phase 2) weights. For each model m and metric M, the absolute fine-tuning gain is:

$$\Delta M_m = M_m^{\mathrm{fine\text{-}tuned}} - M_m^{\mathrm{zero\text{-}shot}} \qquad (7)$$

Stage B: Qualitative Case Study. A manual analysis of 10 Golden Questions spanning factual Q&A, creative writing, logical reasoning, and conversational response was conducted pre- and post-SFT. Full outputs are in Appendix A; analysis is in Section 4.6.

Phase | Name | Method | Output
1 | Zero-Shot | Unmodified weights; raw Romanized input; no prompting | Performance floor across five metrics, seven measurement dimensions
2 | SFT | QLoRA [1] + rsLoRA [2]; 4-bit NF4; 3 epochs; Golden Point recovery | Best-checkpoint adapters per model
3A | Quantitative | 5-metric, 7-dimension scoring; zero-shot vs. fine-tuned; ∆ per metric | Per-model gain table
3B | Qualitative | 10 Golden Questions; pre- and post-SFT | Linguistic failure analysis

Table 2. Summary of the three experimental phases.
3.4 Evaluation Metrics
Five complementary metrics spanning seven measurement dimensions cover fluency, surface similarity, and semantic alignment. ROUGE is reported across three variants (ROUGE-1, ROUGE-2, ROUGE-L) to capture unigram, bigram, and longest common subsequence alignment respectively. Together they capture orthographic, structural, and semantic dimensions of generation quality, a necessary multi-axis framework given the non-standardized orthography of Romanized Nepali, where surface-form variation is high but semantic intent is often preserved [3].
3.4.1 Perplexity (PPL)
PPL measures intrinsic language model fluency [25]. For a tokenized sequence $X = (x_1, \ldots, x_t)$:

$$\mathrm{PPL}(X) = \exp\!\left(-\frac{1}{t}\sum_{i=1}^{t}\log P_\theta(x_i \mid x_{<i})\right) \qquad (8)$$
Lower PPL indicates greater predictive confidence. PPL is the primary metric optimised by the Golden Point
checkpoint strategy, since PPL = exp(L_eval), where L_eval is the average negative log-likelihood over the held-out test set. Minimising L_eval therefore directly minimises PPL.
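In practice this means PPL can be read directly off the evaluation loss logged during training; a minimal sketch, assuming the trainer object from the training sketch in Section 3.3.2:

```python
# Eq. (8) reduces to the exponential of the mean negative log-likelihood: PPL = exp(L_eval).
import math

eval_metrics = trainer.evaluate()            # average NLL over the held-out set
ppl = math.exp(eval_metrics["eval_loss"])
print(f"Perplexity: {ppl:.3f}")              # post-SFT values land around 2.8-3.0 (Table 4)
```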
3.4.2 chrF++
chrF++ [23] computes a character n-gram F-score. It is preferred over word-level metrics such as BLEU
for transliterated scripts because character n-grams are robust to the orthographic variation inherent in
Romanized Nepali, where the same word may be spelled in multiple phonetically equivalent ways [3]:
$$\mathrm{chrF}_\beta = (1+\beta^2)\,\frac{\mathrm{Prec}\cdot\mathrm{Rec}}{\beta^2\cdot\mathrm{Prec} + \mathrm{Rec}} \qquad (9)$$
We set β = 2 to weight recall more heavily, as missing Nepali morphological suffixes are more costly than
spurious ones.
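A scoring sketch with the sacrebleu implementation is shown below; word_order=2 selects the "++" variant and beta=2 matches the recall weighting above, while the example strings are illustrative only.

```python
# chrF++ with sacrebleu: character n-grams plus word bigrams, recall-weighted (beta = 2).
from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2, beta=2)
hypotheses = ["maile khana khaye"]        # model output (illustrative)
references = [["maile khaana khaaye"]]    # reference stream; spelling differs, characters overlap
print(chrf.corpus_score(hypotheses, references).score)
```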
3.4.3 BERTScore
BERTScore [24] evaluates semantic similarity via contextual embeddings, capturing meaning beyond surface
token matching. This makes it particularly valuable for Romanized Nepali, where orthographic variants of
the same word should receive high similarity scores even when exact string matching fails. For reference r
and candidate c with unit-normalized contextual embeddings $\{x_i\}$ and $\{y_j\}$, the recall component is:

$$R_{\mathrm{BERT}} = \frac{1}{|r|}\sum_{x_i \in r}\max_{y_j \in c}\cos(x_i, y_j) \qquad (10)$$

where $\cos(x_i, y_j) = x_i^{\top} y_j / (\lVert x_i \rVert\,\lVert y_j \rVert)$ is the cosine similarity between embedding pairs. The final F1 score combines precision and recall analogously.
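The corresponding scoring call with the bert-score package is sketched below; the exact multilingual backbone is an assumption (the paper reports only a multilingual BERT scorer, whose limited Romanized Nepali exposure is noted in the Limitations section).

```python
# BERTScore: embedding-level similarity, tolerant of orthographic variants.
from bert_score import score

candidates = ["maile khana khaye"]
references = ["maile khaana khaaye"]
P, R, F1 = score(candidates, references, model_type="bert-base-multilingual-cased")
print(F1.mean().item())    # stays high despite the surface-form mismatch, unlike exact n-gram metrics
```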
3.4.4 ROUGE-L
ROUGE-L [26] assesses structural alignment via the Longest Common Subsequence (LCS), capturing fluency and word-order preservation without requiring contiguous n-gram matches:

$$F_{\mathrm{LCS}} = \frac{(1+\beta^2)\,P_{\mathrm{LCS}}\,R_{\mathrm{LCS}}}{\beta^2 P_{\mathrm{LCS}} + R_{\mathrm{LCS}}} \qquad (11)$$

ROUGE-1 (unigram overlap) and ROUGE-2 (bigram overlap) are reported alongside ROUGE-L, together constituting three of the seven measurement dimensions and providing a complete picture of structural alignment at multiple granularities.
3.4.5 BLEU
BLEU [21] is the standard 4-gram precision metric included for compatibility with prior NLP literature:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right), \qquad w_n = \frac{1}{4} \qquad (12)$$

where BP is the brevity penalty and p_n is the n-gram precision for order n. Given the orthographic variance of Romanized Nepali, BLEU is expected to underestimate true generation quality [22] because it relies on exact surface-form matches. It is therefore interpreted alongside chrF++ and BERTScore [23, 24], which are more robust to spelling variation.
4 RESULTS
4.1 Zero-Shot Baseline Performance
Table 3 reports baseline performance. Qwen3-8B and Mistral-7B achieve comparable perplexity (27.89 and 27.53 respectively), both substantially outperforming Llama-3.1-8B (PPL = 52.79), whose higher perplexity suggests weaker innate cross-lingual transfer [34]. Qwen3-8B leads on semantic alignment with the highest BERTScore (0.5631) and chrF++ (13.10), consistent with its broader multilingual pre-training [8]. Mistral-7B records the lowest BERTScore (0.2327) and chrF++ (4.95), indicating it models token-level distributions competently but fails to produce semantically coherent output without adaptation.

Model | PPL↓ | BERT↑ | BLEU↑ | chrF++↑ | R-1↑ | R-2↑ | R-L↑
Llama-3.1-8B | 52.79 | 0.4224 | 0.0143 | 8.15 | 0.0732 | 0.0311 | 0.0681
Qwen3-8B | 27.89 | 0.5631 | 0.0099 | 13.10 | 0.0570 | 0.0195 | 0.0513
Mistral-7B | 27.53 | 0.2327 | 0.0105 | 4.95 | 0.0309 | 0.0134 | 0.0281

Table 3. Zero-shot baseline performance on the 1k test set. ↓ = lower is better; ↑ = higher is better.
4.2 Fine-Tuned Performance
Following three epochs of QLoRA + rsLoRA SFT with Golden Point recovery, all models converge to perplexity 2.81–3.02 (Table 4). Qwen3-8B achieves the highest chrF++ (27.47) and ROUGE scores. Llama-3.1-8B records the highest BERTScore (0.7511) [24]. Mistral-7B achieves the lowest perplexity (2.812) but trails on surface metrics.

Model | PPL↓ | BERT↑ | BLEU↑ | chrF++↑ | R-1↑ | R-2↑ | R-L↑
Llama-3.1-8B | 3.024 | 0.7511 | 0.0498 | 26.97 | 0.2756 | 0.1002 | 0.2359
Qwen3-8B | 2.946 | 0.7505 | 0.0550 | 27.47 | 0.2915 | 0.1162 | 0.2511
Mistral-7B | 2.812 | 0.7339 | 0.0404 | 23.95 | 0.2460 | 0.0842 | 0.2144

Table 4. Fine-tuned performance after QLoRA + rsLoRA SFT with Golden Point recovery.
4.3 Fine-Tuning Gain Analysis
Table 5 reports the absolute fine-tuning gain (∆ = fine-tuned − zero-shot) for each model across five metrics spanning seven measurement dimensions. Llama-3.1-8B records the largest PPL reduction (−49.77) despite its weakest zero-shot baseline, confirming the adaptation headroom hypothesis [1, 18]. Mistral-7B-v0.1 achieves the largest BERTScore gain (+0.5012) and chrF++ gain (+19.00) from its lowest zero-shot semantic baseline, a direct consequence of its near-zero semantic starting point amplifying absolute gain magnitude. Qwen3-8B leads on all three structural alignment metrics (ROUGE-1: +0.2345, ROUGE-2: +0.0967, ROUGE-L: +0.1998) and BLEU (+0.0451), owing to its stronger multilingual pre-training coverage [8].
Model | ∆PPL↓ | ∆BERT↑ | ∆BLEU↑ | ∆chrF++↑ | ∆R-1↑ | ∆R-2↑ | ∆R-L↑
Llama-3.1-8B | −49.77 | +0.3287 | +0.0355 | +18.82 | +0.2024 | +0.0691 | +0.1678
Qwen3-8B | −24.95 | +0.1874 | +0.0451 | +14.37 | +0.2345 | +0.0967 | +0.1998
Mistral-7B | −24.72 | +0.5012 | +0.0299 | +19.00 | +0.2151 | +0.0708 | +0.1863

Table 5. Absolute fine-tuning gain (∆ = fine-tuned − zero-shot) across five metrics spanning seven measurement dimensions. ↓ = lower is better; ↑ = higher is better.
4.4 Training and Validation Loss Analysis
Training and validation loss curves for all three models over 3 epochs (3,375 steps) are presented individually below. Checkpoints were saved every 100 steps and the best checkpoint was recovered using load_best_model_at_end based on minimum validation loss [1].
Figure 2. Training and validation loss for Llama-3.1-8B over 3,375 steps. The green dotted line marks the best checkpoint at step 2,200 (minimum validation loss = 1.1585).
Llama-3.1-8B exhibits the highest initial training loss (≈ 2.10) among all three models, consistent with its Tiktoken tokenizer over-fragmenting Romanized Nepali into low-frequency subword units [15]. Validation loss decreases steadily through Epochs 1 and 2, reaching a global minimum of 1.1585 at step 2,200 (end of Epoch 2). At the Epoch 3 boundary (step 2,200–2,300) a sharp spike in validation loss to ≈ 1.28 is observed, caused by the learning rate schedule resetting. The Golden Point recovery mechanism correctly identifies step 2,200 as the optimal checkpoint, preventing the Epoch 3 degradation from affecting the final reported metrics.
Figure 3. Training and validation loss for Mistral-7B-v0.1 over 3,375 steps. The green dotted line marks the best checkpoint at step 2,200 (minimum validation loss = 1.0930).
Mistral-7B-v0.1 begins with the lowest initial training loss (≈ 1.90) of the three models, reflecting its SentencePiece tokenizer producing more coherent subword units for Latin-script input than Tiktoken. The validation curve is the smoothest of all three architectures, declining consistently across both Epochs 1 and 2 with minimal oscillation. The global validation minimum of 1.0930 is achieved at step 2,200, the lowest among all three models. An epoch-boundary spike to ≈ 1.17 is visible at step 2,300 before Epoch 3 stabilizes, again confirming that Golden Point recovery at step 2,200 yields the best generalization state.
Figure 4. Training and validation loss for Qwen3-8B over 3,375 steps. The green dotted line marks the best checkpoint at step 2,200 (minimum validation loss = 1.1313).
Qwen3-8B starts with an intermediate initial training loss (≈ 2.20) and converges steadily through Epochs 1 and 2. The validation loss reaches its global minimum of 1.1313 at step 2,200. Notably, Qwen3-8B’s Epoch 3 validation loss plateaus at ≈ 1.22 rather than recovering toward the Epoch 2 minimum, suggesting the adapter reached a capacity ceiling earlier than the other two models, consistent with Qwen3-8B’s stronger multilingual pre-training baseline reducing available adaptation headroom [8]. The Epoch 3 spike is present but less pronounced than in Llama-3.1-8B.
Cross-Model Summary. All three models achieve their global validation minimum at step 2,200, empirically confirming that Golden Point checkpoint recovery was necessary and effective across all architectures. Mistral-7B-v0.1 achieves the lowest
absolute validation loss (1.0930), followed by Qwen3-8B (1.1313) and Llama-3.1-8B (1.1585). However, as shown in Tables 4 and 5, lower validation loss does not directly predict superior downstream metric performance: Llama-3.1-8B achieves the highest BERTScore (0.7511) despite its highest validation loss, supporting the adaptation headroom hypothesis discussed in Section 5.2.
4.5 Trainable Parameters and Training Cost

Property | Llama-3.1-8B | Qwen3-8B | Mistral-7B-v0.1
Model Scale: Total Parameters | 8,114,147,328 | 8,278,029,312 | 7,325,618,176
Model Scale: Trainable Parameters | 83,886,080 | 87,293,952 | 83,886,080
Model Scale: Trainable % | 1.03% | 1.05% | 1.15%
Training Cost: Wall-Clock Time | 8h 50m 51s | 8h 26m 01s | 9h 09m 31s
Training Cost: Total GPU Time | 26h 26m 23s (sum of all three runs) | |

Table 6. Trainable parameter counts and wall-clock training time from Unsloth [30] runtime logs. All models: 3 epochs / 3,375 steps, rank r = 32 [18], dual NVIDIA Tesla T4.
4.6 Qualitative Case Study
Full outputs for all 10 Golden Questions are in Appendix A (Tables 9–14).
Zero-Shot Failure Signatures. Three distinct architecture-specific failures emerge [3]. Llama-3.1-8B produced [no response] on 6 of 10 instructions, indicating Tiktoken over-fragmented Romanized Nepali into low-frequency subword units [15]. Mistral-7B-v0.1 produced [empty line] on 5 of 10 and responded off-task in English on the remaining 5. Qwen3-8B was the only model to produce non-empty output on all 10, yet exclusively in English or Devanagari, reflecting misdirected cross-lingual transfer [34].
Post-SFT Resolution. Following fine-tuning [1], all three models produced Romanized Nepali on all 10 instructions.
Llama-3.1-8B and Qwen3-8B generated fluent, semantically complete responses. Mistral-7B-v0.1 retained two residual semantic errors, consistent with its lower chrF++ (23.95) and ROUGE-L (0.2144) [23, 26].
Failure Taxonomy Summary. For Orthographic Robustness, all fine-tuned models correctly resolved phonetically variant inputs. For Morphological Integrity, Llama-3.1-8B and Qwen3-8B correctly generated Nepali verb endings and postpositions, while Mistral occasionally dropped them. For Latent Romanization Drift, all fine-tuned models maintained Romanized Nepali without drifting into English syntax [34].
5 DISCUSSION
5.1 Base Model Structural Collapse
All three base models were functionally unable to respond in Romanized Nepali at zero-shot (Table 7), each exhibiting a distinct failure mode rooted in its architecture and tokenizer design [3, 15]. These failures confirm that Romanized Nepali falls outside the confident generation distribution of all three pre-training regimes, consistent with the “linguistic shadow” framing of Bender et al. [5].

Model | Failure Mode | Technical Description
Llama-3.1-8B | Semantic Void | Early EOS Triggering [6]: Tiktoken over-fragments Romanized Nepali; the model immediately predicts End-of-Sequence, yielding null output on 6 of 10 instructions.
Mistral-7B-v0.1 | Newline Loop | Distributional Collapse [7]: SentencePiece produces plausible fragments but the model fails to initiate a response, defaulting to continuous \n generation.
Qwen3-8B | Script Drift | Latent Script Bias [8]: The model correctly identifies semantic intent but defaults to dominant pre-training scripts rather than the Romanized format.

Table 7. Taxonomy of structural failures in zero-shot base models.
5.2 The Adaptation Headroom Hypothesis
Llama-3.1-8B presents a compelling inversion of the expected performance ordering: despite recording the weakest zero-shot perplexity (PPL 52.79, alongside BERTScore 0.4224 and chrF++ 8.15), it achieves the largest absolute PPL reduction (∆ = −49.77), the highest post-SFT BERTScore (0.7511), and the second-largest chrF++ gain (∆ = +18.82) following fine-tuning. This pattern is not incidental but structural. We term this the adaptation headroom hypothesis: models with weaker innate cross-lingual representations possess greater representational plasticity for script-domain adaptation when exposed to high-quality PEFT data. The intuition is straightforward: a model that has formed no confident prior over Romanized Nepali phoneme sequences has more parameter space available for the adapter to reshape, whereas a model with a strong multilingual prior (such as Qwen3-8B) approaches a representational ceiling sooner, as evidenced by its Epoch 3 validation loss plateauing at ≈ 1.22 rather than recovering toward its Epoch 2 minimum [8].
Formally, for a model m with zero-shot performance floor $M_m^{\mathrm{zero\text{-}shot}}$ and post-SFT ceiling $M_m^{\mathrm{fine\text{-}tuned}}$, the adaptation headroom $H_m$ on metric M is:

$$H_m(M) = M_m^{\mathrm{fine\text{-}tuned}} - M_m^{\mathrm{zero\text{-}shot}} \qquad (13)$$

Under this definition, Llama-3.1-8B maximises H_m(PPL) by a factor of 2.0× over both Qwen3-8B (∆ = −24.95) and Mistral-7B-v0.1 (∆ = −24.72), confirming that the weakest pre-trained baseline produces the steepest adaptation trajectory under identical QLoRA + rsLoRA conditions [1, 2, 18].
This result carries a non-obvious practical implication: a strong zero-shot multilingual baseline does not guarantee the highest fine-tuning return. When the deployment scenario involves a high-quality curated corpus and iterative fine-tuning rounds, adaptation headroom is a more actionable model selection criterion than zero-shot performance. Practitioners who select models solely on zero-shot benchmarks risk systematically overlooking architectures with the highest long-term fine-tuning yield.
5.3 Limitations
Several limitations should be acknowledged. First, the Indic-Transliteration pipeline approximates but does not fully reproduce the orthographic diversity of naturally occurring user-generated Romanized Nepali, potentially underestimating real-world phonetic noise. Second, evaluation is on a single instruction-following dataset [27]; performance on social media text, customer service conversations, or domain-specific content is not assessed. Third, BERTScore uses a multilingual BERT model with limited Romanized Nepali exposure, potentially underestimating semantic similarity for high-quality outputs that diverge in surface form [24]. Fourth, all metrics reflect quantized (4-bit NF4) inference, which introduces a slight precision-performance trade-off whose magnitude on the reported values has not been independently quantified [1].
5.4 Broader Impact and Applications
Romanized Nepali is the de facto language of Nepali social media, messaging platforms, and youth discourse. Current NLP infrastructure, being Devanagari-centric, is effectively inaccessible to this register, creating barriers for content moderation, mental health chatbots, educational tools, and diaspora communication services for the approximately 3 million Nepali speakers abroad [4]. By demonstrating that parameter-efficient adaptation of comparable-sized models achieves BERTScore ≈ 0.75 with under 90M trainable parameters and roughly 26.5 GPU-hours on dual T4 hardware, this work establishes a practically replicable path for low-resource NLP practitioners and Nepali developer communities to build Romanized Nepali AI tools without access to large compute clusters [1, 18].
5.5 Overall Model Assessment
Drawing on the fine-tuned performance (Table 4), absolute gains across five metrics spanning seven measurement dimensions (Table 5), training cost (Table 6), and the qualitative case study (Section 4.6), we consolidate a dimension-wise post-SFT ranking across all three architectures in Table 8. Zero-shot baselines are reported separately in Table 3 and are not repeated here.

Dimension | Llama-3.1-8B | Qwen3-8B | Mistral-7B-v0.1 | Best
Post-SFT Quality (five metrics, seven dimensions) | | | |
Post-SFT PPL↓ | 3rd (3.024) | 2nd (2.946) | 1st (2.812) | Mistral
Post-SFT BERTScore↑ | 1st (0.7511) | 2nd (0.7505) | 3rd (0.7339) | Llama
Post-SFT chrF++↑ | 2nd (26.97) | 1st (27.47) | 3rd (23.95) | Qwen3-8B
Post-SFT ROUGE-1↑ | 2nd (0.2756) | 1st (0.2915) | 3rd (0.2460) | Qwen3-8B
Post-SFT ROUGE-2↑ | 2nd (0.1002) | 1st (0.1162) | 3rd (0.0842) | Qwen3-8B
Post-SFT ROUGE-L↑ | 2nd (0.2359) | 1st (0.2511) | 3rd (0.2144) | Qwen3-8B
Post-SFT BLEU↑ | 2nd (0.0498) | 1st (0.0550) | 3rd (0.0404) | Qwen3-8B
Adaptation & Cost | | | |
Adaptation Gain (∆PPL)↓ | 1st (−49.77) | 2nd (−24.95) | 3rd (−24.72) | Llama
Training Speed | 2nd (8h 51m) | 1st (8h 26m) | 3rd (9h 10m) | Qwen3-8B
Qualitative | | | |
Residual Errors (10 Golden Qs) | 2nd (2 content) | 1st (0 errors) | 3rd (2 semantic) | Qwen3-8B
Dimension wins | Llama: 2 | Qwen: 7 | Mistral: 1 |

Table 8. Dimension-wise post-SFT ranking across ten evaluation criteria.
Recommended Architecture: Qwen3-8B. Qwen3-8B is the recommended architecture for Romanized Nepali deployment across the widest range of practical settings. Post-SFT, it leads every structural alignment and fluency metric: chrF++ (27.47), ROUGE-1 (0.2915), ROUGE-2 (0.1162), ROUGE-L (0.2511), and BLEU (0.0550) [8]. Its qualitative case study produced zero residual errors against two content errors in Llama-3.1-8B and two semantic errors in Mistral-7B-v0.1, demonstrating superior factual accuracy and topic binding. It also converges fastest at 8h 26m on dual NVIDIA Tesla T4 hardware, making it the most resource-efficient choice among the three architectures (Table 6).
Scenario Exception: Iterative Fine-Tuning. Llama-3.1-8B is the preferred architecture when the development pipeline involves iterative data collection and repeated fine-tuning rounds. Despite its weakest zero-shot baseline (PPL 52.79), it achieves the largest absolute PPL reduction (∆PPL = −49.77) and the highest post-SFT BERTScore (0.7511), confirming the adaptation headroom hypothesis: a weaker pre-trained multilingual baseline translates into greater representational plasticity under PEFT [1, 18]. Practitioners building toward a production system over multiple training iterations on a growing Romanized Nepali corpus should consider Llama-3.1-8B the more tractable long-term investment.
Fluency-Only Deployment. Mistral-7B-v0.1 is recommended only for applications where grammatical fluency is the primary requirement and factual precision is secondary. It achieves the lowest post-SFT perplexity (2.812) and the largest BERTScore gain (∆BERTScore = +0.5012) from its near-zero semantic zero-shot baseline, as well as the smoothest validation loss curve across all three architectures (minimum L_eval = 1.0930 at step 2,200). However, it produced the two most severe qualitative failures post-SFT: a semantic hallucination on the Dashain festival task and a single-token non-answer on the categorical reasoning task [7]. Its lower total parameter count (7.3B vs. 8.1–8.3B) offers a marginal inference speed advantage on memory-constrained hardware where perplexity is the dominant deployment criterion.
6 CONCLUSION
This study presents the first systematic benchmarking of comparable-sized open-weight LLMs on Romanized Nepali, addressing a persistent and consequential gap in low-resource NLP for South Asian informal digital communication. We evaluated Llama-3.1-8B [6], Mistral-7B-v0.1 [7], and Qwen3-8B [8] across zero-shot and fine-tuned settings under a rigorous framework spanning five metrics across seven measurement dimensions (PPL, BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU), supplemented by an in-depth qualitative case study across 10 Golden Questions spanning factual, technical, grammatical, and creative generation tasks.
Four principal findings emerge from this work. First, all three base models are functionally unable to respond in Romanized Nepali at zero-shot, each exhibiting a distinct architecture-specific failure mode: early EOS triggering in Llama-3.1-8B, distributional collapse in Mistral-7B-v0.1, and latent script bias in Qwen3-8B. These failures collectively confirm that Romanized Nepali lies outside the confident generation distribution of all comparable-sized model pre-training regimes, consistent with the “linguistic shadow” characterisation of under-resourced transliterated scripts [3, 5].
Second, QLoRA [1] combined with rsLoRA [2] at rank r = 32 fully resolves all three failure modes within 3 epochs on 9,000 samples, training only ≈ 1% of each model’s parameters on dual NVIDIA Tesla T4 GPUs in under 27 total GPU-hours. Post-SFT, all three architectures achieve BERTScore ≈ 0.75 and chrF++ > 23, demonstrating that parameter-efficient fine-tuning is both necessary and sufficient for script-domain adaptation at this scale.
Third, the adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline (PPL 52.79), achieves the largest absolute PPL reduction (∆ = −49.77) and the highest post-SFT BERTScore (0.7511), establishing that a weak pre-trained multilingual baseline does not preclude strong fine-tuning returns and may in fact enable greater representational plasticity under PEFT [18]. Separately, Mistral-7B-v0.1 achieves the largest absolute BERTScore gain (∆ = +0.5012) and chrF++ gain (∆ = +19.00) from its near-zero semantic zero-shot baseline, a direct consequence of its lowest semantic starting point amplifying absolute gain magnitude [1].
Fourth, the dimension-wise assessment across ten evaluation criteria (Table 8) identifies Qwen3-8B as the overall recommended architecture for Romanized Nepali deployment, winning seven of ten dimensions. It leads all five structural and fluency metrics post-SFT (chrF++ 27.47, ROUGE-1 0.2915, ROUGE-2 0.1162, ROUGE-L 0.2511, BLEU 0.0550), converges fastest at 8h 26m, and produces zero residual errors across all 10 Golden Questions [8]. Llama-3.1-8B remains the preferred choice for iterative development pipelines where adaptation headroom is the primary selection criterion, while Mistral-7B-v0.1 is suited to fluency-only applications where its lowest post-SFT perplexity (2.812) is the dominant deployment criterion.
These results establish a reproducible, hardware-accessible baseline for Romanized Nepali generative NLP, replicable by practitioners without access to large compute clusters. Future work will pursue: (i) Direct Preference Optimization [35] to align fine-tuned models with colloquial Nepali slang and code-switching patterns prevalent in real-world social media; (ii) formal subword fertility analysis to quantify the per-architecture tokenization burden on Romanized Nepali and guide vocabulary extension strategies [15]; (iii) downstream task evaluation on sentiment analysis, machine translation, and hate speech detection; and (iv) expanded contrastive fine-tuning corpora to address the lexical semantic grounding gaps identified in categorical reasoning tasks.
REFERENCES
[1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
[2] D. Kalajdzievski, “A rank stabilization scaling factor for fine-tuning with LoRA,” arXiv preprint arXiv:2312.03732, 2023.
[3] T. B. Shahi and C. Sitaula, “Natural language processing for Nepali text: A review,” Artificial Intelligence Review, vol. 55, no. 5, pp. 3401–3429, 2022.
[4] National Statistics Office (formerly Central Bureau of Statistics), National Population and Housing Census 2021, Government of Nepal, Kathmandu, 2021.
[5] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?,” in Proc. ACM Conference on Fairness, Accountability, and Transparency (FAccT), pp. 610–623, 2021.
[6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[7] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, “Mistral 7B,” arXiv preprint arXiv:2310.06825, 2023.
[8] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008, 2017.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
[11] S. Bam and T. B. Shahi, “Named Entity Recognition for Nepali Text Using Support Vector Machines,” in Proc. International Conference on Communication and Information Technology (ICCIT), 2014.
[12] P. Koirala and N. Niraula, “NPVec1: Word embeddings for Nepali — construction and evaluation,” in Proc. 6th Workshop on Representation Learning for NLP (RepL4NLP), pp. 174–184, 2021.
[13] S. Timilsina, M. Gautam, and B. Bhattarai, “NepBERTa: Nepali language model trained in a large corpus,” in Proc. 2nd Conf. Asia-Pacific Chapter of ACL (AACL-IJCNLP), pp. 273–284, 2022.
[14] S. Pudasaini, A. Dangol, and S. Shakya, “NepaliGPT: A generative language model for the Nepali language,” arXiv preprint arXiv:2506.16399, 2025.
[15] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych, “How good is your tokenizer? On the monolingual performance of multilingual language models,” in Proc. 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3118–3135, 2021.
[16] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan, “Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP,” arXiv preprint arXiv:2112.10508, 2021.
[17] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 66–71, 2018.
[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022.
[19] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations (ICLR), 2022.
[20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford Alpaca: An instruction-following LLaMA model,” GitHub repository, Stanford University, 2023. https://github.com/tatsu-lab/stanford_alpaca
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318, 2002.
[22] M. Post, “A call for clarity in reporting BLEU scores,” in Proc. Third Conference on Machine Translation (WMT), pp. 186–191, 2018.
[23] M. Popović, “chrF: Character n-gram F-score for automatic MT evaluation,” in Proc. Tenth Workshop on Statistical Machine Translation (WMT), pp. 392–395, 2015.
[24] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in International Conference on Learning Representations (ICLR), 2020.
[25] D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. (draft), Stanford University, 2024. https://web.stanford.edu/~jurafsky/slp3/
[26] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. ACL Workshop on Text Summarization Branches Out, pp. 74–81, 2004.
[27] S. Kafley, “Alpaca Nepali SFT,” Hugging Face Datasets, 2024. https://huggingface.co/datasets/Saugatkafley/alpaca-nepali-sft
[28] Google LLC, “Google Translate,” Google, 2024.
[29] AI4Bharat, “IndicTransliteration: Transliteration library for Indic scripts,” GitHub repository, 2024.
[30] Unsloth AI, “Unsloth: 2x faster, 70% less memory LLM finetuning,” GitHub repository, 2024.
[31] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec, “TRL: Transformers reinforcement learning,” GitHub repository, Hugging Face, 2020. https://github.com/huggingface/trl
[32] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 30318–30332, 2022.
[33] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations (ICLR), 2017.
[34] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual BERT?,” in Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4996–5001, 2019.
[35] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 53728–53741, 2023.

A QUALITATIVE GOLDEN QUESTIONS ANALYSIS

This appendix presents the complete model outputs for all 10 Golden Questions evaluated under zero-shot (base) and fine-tuned (post-SFT) conditions for each of the three architectures. Outputs are reproduced verbatim with no post-processing or correction. [no response] denotes an empty string with zero tokens generated; [empty line] denotes continuous blank newlines with no semantic content. The six tables (Tables 9–14) provide the raw evidence base for the qualitative analysis in Section 4.6, covering the three failure-taxonomy dimensions (Orthographic Robustness, Morphological Integrity, Latent Romanization Drift) and the two annotated residual semantic errors in the Mistral-7B-v0.1 fine-tuned outputs.
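For readers who wish to reproduce these tables, the sketch below outlines the prompting loop assumed here: each Golden Question is issued as an instruction prompt, the raw continuation is recorded verbatim, and a zero-token or blank continuation is logged as [no response]. The model identifier, the Alpaca-style prompt template, and the greedy decoding settings are illustrative assumptions rather than a confirmed specification of the study's exact configuration.

```python
# Minimal sketch of the assumed Golden Questions protocol. Model ID, prompt
# template, and decoding settings are illustrative assumptions; fine-tuned
# runs would additionally load the trained QLoRA adapter on top of the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ask(instruction: str, context: str = "", max_new_tokens: int = 256) -> str:
    # Alpaca-style template (assumed), mirroring the Instruction/Input/Output layout of Tables 9-14.
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    continuation = output[0][inputs["input_ids"].shape[1]:]  # keep only newly generated tokens
    text = tokenizer.decode(continuation, skip_special_tokens=True)
    return text if text.strip() else "[no response]"         # empty continuation -> [no response]

print(ask("Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos."))
```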
A.1 Llama-3.1-8B: Base Model (Zero-Shot)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: [no response]
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: [no response]
Instruction: Cache memory bhaneko k ho? | Input: — | Output: [no response]
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: Diyeko vakya lai shuddha banayera lekhnuho. Ma bholi ghar janchu thiyo.
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: [no response]
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: [no response]
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: 1. Diyeko list ma kun chai phalful hoina? 2. Syau 3. Suntala 4. Alu 5. Aap
Instruction: Internet bhane ko k ho? | Input: — | Output: [no response]
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: Chalakh shyal gulo ekta jinish

Table 9. Llama-3.1-8B base model responses (zero-shot). [no response] on 6 of 10 instructions.

A.2 Llama-3.1-8B: Fine-Tuned (Post-SFT)
Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: kampyutara eka yantra ho jasale janakari prashodhana garcha ra samasyaharu samadhana gardacha
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: nepala uttara purva esiyama avasthita eka sano rashtra ho
Instruction: Cache memory bhaneko k ho? | Input: — | Output: kyasa memori eka choto-avadhi bhandarana pranali ho jasale druta rupama prayoga hune data va nirdeshanaharu pahuncha garna sakshama cha
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: ma bholi ghara janchu thien
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: operetimga sistama kampyutara pranaliko adhikarika ho ra prayogakartaharusanga kampyutaralai niyantrana garna anumati dincha| yasale pranali srotaharu vyavasthapana gardacha, prashodhana ra pradarshana pradana gardacha, ra prayogakarta anuprayogaharulai pratyaksha pahuncha pradana gardacha|
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: dashain bhaneko 10 dinako avadhiko vishesha samayavada ho juna hindu mahinama pratyeka purnimako lagi ayojana garincha| yo navikaraniya chandramako dashaka varshika samskaralai manauncha ra dherai purvadharaharu ra paramparagata rupama nayan luga ra samanaharu kinna va dhuvani garna prayoga garincha| yasale parivara, sathiharu ra sathiharusanga bandhana banauncha ra sadhaimbhari ramro kura garnuhos
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: 4| aap
Instruction: Internet bhane ko k ho? | Input: — | Output: intaraneta (anuchcheda: i) eka vishvavyapi kampyutina netavarka ho jasale janakari sajha garna ra adana pradana garna sakcha| yo dherai kampyutara ra anya upakarana va yantraharu milera baneko cha juna samsarako sabai kunama phailieko cha|
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: eka samayako kura ho, tyahan eka chito ra baliyo ghoda thiyo jasale sadhaim aphno khela khelna mana parauncha| hareka dina usale aphno gharabata bahira niskera shaharaka sadakaharu khojcha, tiniharulai jumauna ra tiniharuko varipari daudana| eka dina, chito ghodale shaharako cheuma euta sano khetama thokkiyo| usale yasalai dhilo garyo ra najika pugda, uhanle euta sano syala bhettanubhayo, jo kehi khana khojdai thiyo| syala ra chito ghoda duvai khushisatha banchna thale, sangai khana phela parne ra sahasika karya garne| tiniharuka lagi, yo eka adbhuta yatra thiyo . . .
Table 10. Llama-3.1-8B fine-tuned responses (post-SFT). All 10 instructions answered in Romanized Nepali.

A.3 Qwen3-8B: Base Model (Zero-Shot)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: [Devanagari script]: manche haru le aphno kampyutar ko samasya hal garna prayaas gardachhan . . . (off-script; definition request misread as troubleshooting)
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: Sure! Here’s how you can complete this sentence in Nepali: “Nepal ma raatko sabai chhato janaa.” This translates to “In Nepal, it gets very cold at night.”
Instruction: Cache memory bhaneko k ho? | Input: — | Output: Cache memory in computer systems refers to high-speed data storage mechanisms designed to reduce the average time it takes for active data to be requested by serving frequently accessed information from faster (but smaller) sources than main memory. This helps improve system performance and efficiency because accessing cache typically requires less power compared to reading directly from primary or secondary storage devices like hard drives or SSDs.
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: The given sequence appears to be increasing by one each time: Start at 1; Add 1 to get the next number (e.g., 1 + 1 = 2); Continue this process indefinitely. So the next few numbers in the series would be: 5, 6, 7, . . .
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: Here’s my attempt at completing this task: “Ma ko ghar ma janayau.” I hope I did it correctly!
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: An operating system (OS) plays several crucial roles in managing computer hardware and software resources. Key functions: (1) Resource Management — manages CPU time, memory, disk, network; (2) Process Control — scheduling algorithms for task execution; (3) Memory Protection & Sharing — secure access control; (4) Device Drivers Interface — intermediary between apps and hardware; (5) Security Measures — authentication protocols and encryption.
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: Dashain is one of the most important festivals in Nepal and it falls on the first day of Kartik month according to Nepali calendar which usually occurs between October–November every year. It’s celebrated for 15 days by worshipping Goddess Lakshmi (the goddess of wealth) as well as other deities like Ganesha & Shiva etc., through various rituals such as offering prayers at home or temple premises followed by feasting with family members over traditional dishes prepared during this time period including mutton curry served alongside roti breads.
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: To determine which item does not belong to fruits (phal): Syau — appears to be a misspelling. Suntala translates to “sunflower” in English; sunflowers are plants, not fruits. Alu means potato in Nepali; potatoes are tubers classified as vegetables, not fruits, because they do not develop from flowering parts. Aap is mango, which is a fruit.
Instruction: Internet bhane ko k ho? | Input: — | Output: The internet refers to a global network of interconnected computers and servers that communicate using standardized protocols such as TCP/IP. It allows for the sharing of information across devices worldwide through various services like email, websites, social media platforms, online shopping sites, video streaming services etc., making it possible for people from different parts of the world to connect instantly regardless of geographical boundaries or time zones.
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: Chalakh shyal ko chakkar maanu paryo jasai taraa raatko taapmaa nahi hotaa. Tahaaniyatai sabdaachaarachhaako moolkhojchheka bhanne chaahau.

Table 11. Qwen3-8B base model responses (zero-shot). All responses in English or Devanagari — no Romanized Nepali generated.

A.4 Qwen3-8B: Fine-Tuned (Post-SFT)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: kampyutara eka yantra ho jasale janakari prashodhana garcha ra prayogakartalai janakari utpanna gardacha| yasama prosesara, memori, inaputa autaputa upakaranaharu samavesha hunchan|
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: nepala dakshina esiyama avasthita eka puratana desha ho| yo himali kshetraharu ra vishvako sabaibhanda aglo pahada sagaramatha (8,848 mitara) samga jivita cha| yasale dherai prakaraka paridrrishyaharu pradana gardacha, jasama ghanse maidanaharu, tatiya ghumauro bagaimchaharu, shanta tala ra charko nadiharu chan| nepalale vibhinna prakaraka samskrriti ra bhasha pani prastuta gardacha, juna 125 mukhya bhashaharu milera baneko cha|
Instruction: Cache memory bhaneko k ho? | Input: — | Output: kyashe memori eka prakarako chito pahuncha garna dijaina garieko samana ra data ko lagi avasthita huna sakcha| yasale dherai prayoga garine programina nirdeshanaharu ra detako lagi thula matrama kyacheda va “bhandarana” gardacha juna ramro-pradarshana vikalpa hunu parne sambhavana cha|
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: maile bholi ghara gaen
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: operiuna sistamako mukhya karya prayogakartaharulai hardaveyara srotaharu pahuncha garna ra pranalilai vyavasthapana garna anumati dina ho| yasale memori vyavasthapanama maddata gardacha, pranali suru garnecha ra banda garnecha, phailaharu padhne ra lekhne, prakriyaharu chalaune, netavarka jadana sthapita garne, ra thapa|
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: dashaina bharatiya kathana kalendarako sabaibhanda lamo ra sabaibhanda mahattvapurna avadhiko rupama manincha| yo pratyeka varshako dashambara dekhi nembarama samarpita cha, juna pancha dinaharu milera baneko huncha| yasa avadhile bharkharai parivartanakari gareka visheshataharu punarsthapana gardacha, jasale janmadina va svatantratako lagi dherai purano pratikatmaka mulyaharu pheri jagauncha|
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: syau
Instruction: Internet bhane ko k ho? | Input: — | Output: intaraneta eka vishvavyapi samchara pranali ho jasale manisaharulai janakari adana ra diincha| yo dherai kampyutara ra anya upakaranaharu milera baneko cha juna vibhinna netavarkaharuma jodieko huncha, jasale prayogakartaharulai sajilaisanga janakari pahuncha garna sakshama banauncha|
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: eka pataka tyahan eka chalako shyla thiyo jasale aphno varaparako samudayama dherai asuvidha ra nirashako bhavana lyaeko thiyo| manisaharu uniharulai darae ra tiniharule kehi garna sakdainan| eka dina, euti yuvati namaki ketile shylabata tadha gaera suruma pharkie| uni ashvasta thii, kinabhane unale socheki thiin ki yo nishkriya hunecha va usale banaeko kunai kurale unako lagi kura garne chaina| tara, chalaile unako avaja sunayo ra unale unako barema sodheki thiin| keti sadhaimbhari chalaharuko barema sikna chahanthin| unale chalaharulai bachchaharusangako . . .

Table 12. Qwen3-8B fine-tuned responses (post-SFT). All 10 instructions answered correctly in Romanized Nepali.

A.5 Mistral-7B-v0.1: Base Model (Zero-Shot)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: The instruction in this example is written in Nepali and asks for a computer to be built. The input provided is a list of components that need to be included in the build. The response should provide a detailed explanation of how to assemble the computer using the provided components. This could include step-by-step instructions on how to install each component, such as the motherboard, CPU, RAM, graphics card, power supply, hard drive, and case. It should also explain how to connect all the components together and ensure they are properly installed and functioning correctly.
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: [empty line]
Instruction: Cache memory bhaneko k ho? | Input: — | Output: [empty line]
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: The input “Ma bholi ghar janchu thiyo” means “I went to my mother’s house.” The response “Ma bholi ghar janchu thiyo” is a complete sentence that correctly expresses the same meaning as the input. It uses the correct verb form (thiyo) and subject-verb agreement (ma).
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: [empty line]
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: [empty line]
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: The instruction asks for a list of items in order from smallest to largest. The input provides four options, and the correct response is to list them in ascending order.
Instruction: Internet bhane ko k ho? | Input: — | Output: [empty line]
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: Chalakh Shyal is a popular dish in Nepal that is made with rice flour and yogurt. It is typically served as a side dish or snack and is often enjoyed during festivals and special occasions. The dish has a unique texture and flavor that is both savory and slightly sour, making it a favorite among many people in Nepal.

Table 13. Mistral-7B-v0.1 base model responses (zero-shot). [empty line] on 5 of 10 instructions; remaining 5 in English, off-task.

A.6 Mistral-7B-v0.1: Fine-Tuned (Post-SFT)

Instruction: Computer bhaneko k ho? Yaslai nepali ma paribhashit garnuhos. | Input: — | Output: kampyutara bhaneko kampyutina pranaliko eka prakara ho jasale manisaharulai janakari ra suchanako vishleshana garna sakshama gardacha
Instruction: Nepal ko barema thodai bhannus ta. | Input: — | Output: nepala eka samrriddha ra vividha desha ho jasale vishvavyapi rupama manava ra janavaraharuko ghara banauncha| yo yasa prakhyata kshetrako ghara ho, ra yasako sabaibhanda najikaka shaharaharu madhye eka ho| yo vishvavyapi rupama manava ra janavaraharule baneko cha, ra yo vishvavyapi rupama manava-jastai cha|
Instruction: Cache memory bhaneko k ho? | Input: — | Output: ache memori eka prakarako samana ho jasale deta va janakari bhandarana garna prayoga gardacha|
Instruction: Yo numerical pattern pura garnuhos. | Input: 1, 2, 3, 4, . . . | Output: 5
Instruction: Diyeko vakya lai shuddha banayera lekhnuhos. | Input: Ma bholi ghar janchu thiyo. | Output: ma bholi ghara janchu thien|
Instruction: Operating System ko bhumika k huncha? | Input: — | Output: aparetina sistama prayogakarta anuprayoga ra anuprayogaharuma pahuncha pradana garna dijaina garieko saphtaveyara vikasa pranali ho| yo prayogakartako avashyakataharu pura garna dijaina garieko ho, jastai prayogakartale kuna anuprayogaharu chaliraheka chan, kasari prayogakarta intaraphesa samrachita cha, ra kasari prayogakarta deta vyavasthita cha
Instruction: Dashain ko barema euta choto paragraph lekhnuhos. | Input: — | Output: dashaina bhaneko samayako satha kama garne va arthika rupama kama garna sakine karyaharuko suchi ho| yo samanyataya dherai kamaharu ra karyaharuma vibhajana garera vyavasthita huncha juna pratyeka karyama kama samaya avadhi ra adhikatama gunastara samga pura garnu parcha| dashainaharule samanyataya karyaharulai sana tukraharuma vibhajana gardacha jasalai thapa vyavasthita ra vyavasthita banauna maddata garchan| yasale karyaharulai sana, vyavasthita tukraharuma todna anumati dincha jasale samayasima ra sthiratako lagi anumati dincha|
↪ Semantic error: describes task-management workflow, not a festival.
Instruction: Diyeko list ma kun chai phalful hoina? | Input: 1. Syau, 2. Suntala, 3. Alu, 4. Aap | Output: ho
↪ Semantic error: outputs “yes” only; no item identified.
Instruction: Internet bhane ko k ho? | Input: — | Output: intaraneta bhaneko samsarako vishvavyapi sanjala ho|
Instruction: Euta sano katha lekhnuhos. | Input: Topic: Chalakh Shyal | Output: eka pataka tyahan eka javana keti thiyo jo sadhaim chalachitra herna ra nayan shaharaharu anveshana garna chahanthin| unale aphno parivarako barema sochdai thiin ra unale aphno bhavishyama kehi rojne nirnaya garnuparcha bhanera vichara gardai thiin| unale aphno parivarasanga kura garera ra aphno vicharaharu vyakta garera aphno nirnayako barema sodhe| antatah, unale aphno parivarako lagi aphno bhavishya banauna chodne nirnaya gare|

Table 14. Mistral-7B-v0.1 fine-tuned responses (post-SFT). All 10 instructions answered in Romanized Nepali. Two residual semantic errors are annotated (↪).