Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

Summary

This paper introduces Nemotron-Mini-Hindi 4B, a bilingual small language model (SLM) for Hindi and English, developed to improve performance on low-resource languages. The model is based on Nemotron-Mini 4B and underwent continued pre-training on 400 billion tokens, split equally between Hindi and English data. The Hindi corpus starts from about 100 billion tokens (roughly 40 billion real tokens and 60 billion synthetic tokens produced by machine translation) and is expanded further by transliterating it into Roman script. The model then undergoes supervised fine-tuning (SFT) and preference tuning with Direct Preference Optimization (DPO). Results show state-of-the-art performance among similarly sized models on Hindi benchmarks such as IndicXTREME and IndicNLG, while remaining competitive on English tasks. An ablation study indicates that Hindi pre-training is essential for strong performance and cannot be matched by Hindi alignment alone. The model also demonstrates improved factual accuracy and cross-lingual transfer. Evaluations using IndicQuest and SubjectiveEval, along with human judgment, show that it is preferred over similarly sized models. The work highlights the effectiveness of continued pre-training with synthetic data for adapting multilingual LLMs to low-resource languages.

Adapting Multilingual LLMs to Low-Resource Languages using Continued
Pre-training and Synthetic Corpus
Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul,
Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long
NVIDIA
{ravirajj, kanishks, anushak, rkalani, rapaul, uvaidya,
schauhan, nwartikar, elong}@nvidia.com
Abstract
Multilingual LLMs support a variety of lan-
guages; however, their performance is subopti-
mal for low-resource languages. In this work,
we emphasize the importance of continued pre-
training of multilingual LLMs and the use of
translation-based synthetic pre-training corpora
for improving LLMs in low-resource languages.
We conduct our study in the context of the low-
resource Indic language Hindi. We introduce
Nemotron-Mini-Hindi 4B, a bilingual SLM
supporting both Hindi and English, based on
Nemotron-Mini 4B. The model is trained using
a mix of real and synthetic Hindi + English to-
kens, with continuous pre-training performed
on 400B tokens. We demonstrate that both the
base and instruct models achieve state-of-the-
art results on Hindi benchmarks while remain-
ing competitive on English tasks. Addition-
ally, we observe that the continued pre-training
approach enhances the model’s overall factual
accuracy. We perform an ablation study to high-
light the impact of Hindi pre-training, showing
significant improvements in Hindi chat capa-
bilities and factual accuracy, which cannot be
achieved through Hindi alignment alone.
1 Introduction
The accuracy and utility of large language mod-
els (LLMs) have continuously improved over time.
Both closed and open-source LLMs have demon-
strated strong performance in English and several
other languages. Open models such as Nemotron
(Adler et al., 2024), Gemma (Team et al., 2024),
and Llama (Dubey et al., 2024) are inherently mul-
tilingual. For instance, the Nemotron-4 15B model
was pre-trained on 8 trillion tokens, of which 15%
were multilingual (Parmar et al., 2024). However,
the proportion of multilingual data is limited, which
in turn affects the accuracy of these models on non-
English languages.
Figure 1: Adaptation of multilingual Nemotron-Mini-
4B model (also known as Minitron-4B).
The model’s performance further diminishes as
we move from high-resource to low-resource lan-
guages. In this work, we specifically focus on the
Indic language Hindi as our target low-resource
language. Out of the 8 trillion tokens used to train
the Nemotron-4 models, only 20 billion tokens are
in Hindi. As a result, while the model can under-
stand and generate Hindi content to a reasonable
extent, the usability of such a multilingual LLM
for specific low-resource languages remains ques-
tionable. Frequent hallucinations, meaningless sen-
tences, and mixing of English content often occur
when responding to purely Hindi queries in the
Devanagari script. There is a strong need to adapt
multilingual LLMs to target languages to enhance
their usability.
Recently, in the context of Indic languages, tar-
get language Supervised Fine-Tuning (SFT) has
become a common practice to adapt LLMs to spe-
cific languages (Gala et al., 2024). However, it re-
mains to be studied whether language-specific SFT
tuning improves LLMs’ understanding in regional
contexts. Some studies suggest that SFT can intro-
duce LLMs to new domain knowledge, though it is
typically used to enhance the model’s instruction-
following capability (Mecklenburg et al., 2024).
SFT on translated English instruction tuning data
is widely used to develop regional LLMs for Indic
languages. While this may improve instruction-
following in the target language, it may not
enhance LLMs’ understanding of regional contexts
(Balachandran, 2023). Another approach to updat-
ing LLM knowledge is continued pre-training, but
the limited availability of tokens for low-resource
languages makes this both infeasible and prone to
overfitting.
arXiv:2410.14815v2 [cs.CL] 21 Apr 2025
Model 	Layers 	Hidden Size 	Att. Heads 	Query Groups 	MLP Hidden 	Parameters
Nemotron 4B 	32 	3072 	24 	8 	9216 	4.19B
Table 1: Architecture details of Nemotron-Mini-4B model.
In this work, we focus on a continued pre-
training approach using a mix of real and synthetic
corpora. We demonstrate that a robust base model
can be adapted to the target language with a small
continued pre-training corpus. This approach is
particularly relevant for low-resource languages,
where the amount of training data is limited. The
synthetic pre-training dataset is curated by trans-
lating high-quality generic English corpora into
the target language. To further expand the corpus
and support Roman script queries in the target lan-
guage, the text is transliterated into Roman script
and used for pre-training. The base model is then
aligned using supervised fine-tuning (SFT), fol-
lowed by preference tuning with Direct Preference
Optimization (DPO). We observe that the contin-
ued pre-training approach is particularly useful for
reducing hallucinations, improving regional knowl-
edge of LLMs, and enhancing response capabilities
in the target language. The high-level process is
outlined in Figure 1.
Based on this approach, we present Nemotron-
Mini-Hindi-4B-Base (footnote 1) and Nemotron-Mini-Hindi-
4B-Instruct (footnotes 2 and 3), state-of-the-art Small Language
Models (SLMs) for the Hindi language. These
SLMs support Hindi, English, and Hinglish.
The Hindi models are based on the multilingual
Nemotron-Mini-4B (also known as Minitron-4B),
adapted with continued pre-training on 400 billion
Hindi and English tokens. The data blend used
equal proportions of both languages. The instruct
version of the model was developed using SFT
and DPO techniques. The model outperforms all
similarly sized models on various IndicXTREME,
IndicNLG benchmark tasks and popular translated
English benchmarks such as MMLU, Hellaswag,
ARC-C, and ARC-E (Gala et al., 2024). We also
perform LLM-based evaluations using the bench-
mark datasets IndicQuest (Rohera et al., 2024) and
in-house SubjectiveEval, with GPT-4 serving as the
judge LLM. This is the first study to present and
evaluate bilingual language models of this nature.
We provide a thorough study of the models in both
languages.
1 https://huggingface.co/nvidia/Nemotron-4-Mini-Hindi-4B-Base
2 https://huggingface.co/nvidia/Nemotron-4-Mini-Hindi-4B-Instruct
3 https://build.nvidia.com/nvidia/nemotron-4-mini-hindi-4b-instruct
Additionally, we perform an extensive ablation
to analyze the impact of Hindi SFT and DPO on
both the base multilingual Nemotron-Mini-4B and
the Hindi-specific Nemotron-Mini-Hindi-4B. Our
results show that while Hindi alignment improves
performance, Hindi pre-training is essential for
achieving strong results on Hindi benchmarks. No-
tably, Nemotron-Mini-Hindi 4B leads to signifi-
cant gains in factual accuracy, not just for Hindi
but also for English, showcasing effective cross-
lingual transfer.
2 Related Work
In this section, we review various approaches for
adapting LLMs to different languages. Several
efforts have focused on adapting LLaMA models to
Indic languages. A common method involves ex-
tending the vocabulary, followed by SFT or PEFT
(LoRA) using translated and available SFT corpora
in Indic languages. Examples of such work include
OpenHathi, Airavata (Gala et al., 2024), Tamil-
LLaMA (Balachandran, 2023), Navarasa4, Ambari,
MalayaLLM, and Marathi-Gemma (Joshi, 2022).
Notably, some of these efforts employ bilingual
next-word prediction, alternating between English
and the target language in the pre-training corpus.
Airavata also introduced an evaluation framework5
for Indic LLMs, which we leverage to evaluate
Nemotron-Mini-Hindi 4B and other multilingual
models.
Apart from Indic languages, similar efforts have
been made for other languages, including Chinese
LLaMA (Cui et al., 2023), LLaMATurk (Tora-
man, 2024), FinGPT (Luukkonen et al., 2023), and
RedWhale (Vo et al., 2024) for Chinese, Turkish,
Finnish, and Korean, respectively. These LLMs
use one or more techniques such as tokenizer ex-
tension, secondary pretraining, and supervised fine-
tuning. The key distinction of our work lies in its
emphasis on developing bilingual LLMs, whereas
the aforementioned efforts concentrate on creating
monolingual LLMs.
4 https://huggingface.co/Telugu-LLM-Labs/Indic-gemma-7b-finetuned-sft-Navarasa-2.0
5 https://github.com/AI4Bharat/IndicInstruct
Cahyawijaya et al. (2024) show that large lan-
guage models can learn low-resource languages
effectively using in-context learning and few-shot
examples, improving performance through cross-
lingual contexts without extensive tuning. Gur-
gurov et al. (2024) enhance multilingual LLMs
for low-resource languages by using adapters with
data from ConceptNet, boosting performance in
sentiment analysis and named entity recognition.
3 Methodology
In this section, we describe our methodology for
adapting multilingual LLMs to target languages to
improve performance in those languages. Specifi-
cally, we build a bilingual SLM that supports both
Hindi and English. We conduct our adaptation ex-
periments using the multilingual Nemotron-Mini-
4B model (also known as Minitron-4B). The model
undergoes continuous pre-training with an equal
mixture of Hindi and English data, consisting of
200B tokens per language. The original Nemotron-
4B model was primarily trained on English tokens
and had seen only 20B Hindi tokens. Given the
limited amount of Hindi data, adapting an exist-
ing multilingual model rather than training from
scratch is an effective strategy, allowing us to lever-
age the knowledge learned from the pre-trained
model. Additionally, as Nemotron-4B employs a
large 256k tokenizer, we did not need to extend the
tokenizer. The fertility ratio for Hindi text is 1.7,
which is better than that of its Llama (2.64) and
Gemma (1.98) counterparts.
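Fertility here refers to the average number of tokens a tokenizer produces per word, so a lower value means Hindi text is encoded more compactly. The following is a minimal sketch of that computation, assuming whitespace-delimited words; `toy_tokenize` is a hypothetical stand-in and not the Nemotron tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words found in input texts")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: splits every word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


if __name__ == "__main__":
    sample = ["नमस्ते दुनिया", "यह एक परीक्षण है"]
    print(f"fertility: {fertility(sample, toy_tokenize):.2f}")
```

To compare real tokenizers, one would pass each tokenizer's own encode function as `tokenize` and evaluate all of them on the same held-out Hindi text.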
3.1 Synthetic Data Curation
One of the key aspects of our work is the creation
of a synthetic Hindi pre-training dataset. This syn-
thetic data is generated using machine translation
and transliteration. We first select high-quality En-
glish data sources and translate them into Hindi
using a custom document translation pipeline. This
pipeline preserves the document structure, includ-
ing elements like bullet points and tables, and em-
ploys the IndicTrans2 model (Gala et al.) for sentence
translation. However, since the translated
data may contain noise, we use an n-gram language
model to filter out low-quality samples. This model,
trained on MuRIL-tokenized (Khanuja et al., 2021)
real Hindi data, applies perplexity scores to identify
and exclude noisy translations. Around 2% of the
documents were discarded post-filtering.
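The perplexity-based filtering step can be illustrated with a small example. This is only a sketch under simplifying assumptions: it uses an add-one smoothed bigram model over whitespace tokens instead of MuRIL tokenization, and the threshold value is hypothetical, since the text does not state the n-gram order or the cutoff used.

```python
import math
from collections import Counter
from typing import List


class BigramLM:
    """Add-one smoothed bigram language model over whitespace tokens."""

    def __init__(self, corpus: List[str]):
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        vocab = {"<s>", "</s>"}
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens[:-1], tokens[1:]))
        # One extra slot reserves probability mass for unseen tokens.
        self.vocab_size = len(vocab) + 1

    def perplexity(self, sentence: str) -> float:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / (len(tokens) - 1))


def filter_by_perplexity(lm: BigramLM, docs: List[str], threshold: float) -> List[str]:
    """Keep documents whose perplexity under the reference LM is at most threshold."""
    return [doc for doc in docs if lm.perplexity(doc) <= threshold]


if __name__ == "__main__":
    # Stand-in for the "real" reference corpus; the paper uses real Hindi data.
    lm = BigramLM(["the cat sat", "the dog sat", "the cat ran"])
    candidates = ["the cat sat", "zzz qqq xxx"]
    print(filter_by_perplexity(lm, candidates, threshold=5.0))
```

The design idea is that a language model trained only on real text assigns high perplexity to translations that do not resemble natural text, so a cutoff on perplexity removes the noisiest documents.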
The translated Hindi data comprises approxi-
mately 60 billion tokens. We then combine this
synthetic data with around 40 billion real tokens
(web-scraped data) to create a dataset totaling 100
billion Hindi tokens. Additionally, this entire Hindi
text is transliterated into Roman script, expanding
the total dataset to 220 billion tokens. The translit-
erated tokens are included to enable the model
to support Hinglish queries. This Hindi data is
further combined with 200 billion English tokens
for continued pre-training. Including the English
dataset helps prevent catastrophic forgetting of En-
glish capabilities and contributes to training sta-
bility. Fuzzy deduplication is performed on the
entire text using NeMo-Curator6 to eliminate simi-
lar documents. The real Hindi data sources include
internal web-based datasets and Sangraha Corpus
(Khan et al., 2024). The English dataset is a subset
of the pre-training corpus used for the Nemotron-
15B model. All the datasets used in this work are
commercially friendly.
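To illustrate what fuzzy deduplication does, the sketch below drops documents that are near-identical to one already kept. It is a simplification and does not use the NeMo-Curator API: it computes exact Jaccard similarity over character shingles, which is only practical for small collections, and the shingle size and the 0.8 threshold are assumed values.

```python
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Set of overlapping k-character shingles from whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs: List[str], threshold: float = 0.8, k: int = 5) -> List[str]:
    """Keep a document only if it is below the similarity threshold to every kept one."""
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for doc in docs:
        current = shingles(doc, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog!",
        "An entirely different sentence about Hindi corpora.",
    ]
    for doc in fuzzy_dedup(corpus):
        print(doc)
```

The pairwise comparison here is quadratic in the number of documents, which is why large-scale pipelines approximate the same similarity with hashing-based signatures.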
3.2 Continued Pre-training
The Nemotron-Mini-4B base model is used for
continuous pre-training, and its architecture details
are presented in Table 1. The Nemotron-Mini-4B
model is derived from the Nemotron-15B model
using compression techniques such as pruning and
distillation, consisting of 2.6B trainable parameters
(Muralidharan et al., 2024). Re-training is per-
formed using a standard causal modeling objective.
The dataset consists of 400B tokens, with an equal
mix of Hindi and English. During batch sampling,
greater weight is given to real data compared to
synthetic data. We use the same optimizer settings
and data split as (Parmar et al., 2024), with a cosine
learning rate decay schedule from 2e-4 to 4.5e-7.
This model is referred to as Nemotron-Mini-Hindi-
4B, a base model where Hindi is the primary lan-
guage. The re-training was performed using the
Megatron-LM library (Shoeybi et al., 2020) and
128 Nvidia A100 GPUs.
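The 4.19B total in Table 1 and the 2.6B figure quoted above can be related with a back-of-the-envelope parameter count. The sketch below relies on several assumptions that are not stated in the text: a head dimension of 128, a vocabulary of exactly 256,000 entries, a two-matrix (ungated) MLP, untied input and output embeddings, and no biases or normalization weights.

```python
from typing import Tuple


def count_parameters(
    layers: int,
    hidden: int,
    heads: int,
    query_groups: int,
    mlp_hidden: int,
    vocab_size: int = 256_000,   # assumption: "256k tokenizer"
    head_dim: int = 128,         # assumption: not given in Table 1
    tied_embeddings: bool = False,  # assumption
) -> Tuple[int, int]:
    """Return (non_embedding, embedding) parameter counts, ignoring biases and norms."""
    q_proj = hidden * heads * head_dim
    kv_proj = 2 * hidden * query_groups * head_dim  # grouped-query keys and values
    out_proj = heads * head_dim * hidden
    attention = q_proj + kv_proj + out_proj
    mlp = 2 * hidden * mlp_hidden  # up and down projections, no gate
    non_embedding = layers * (attention + mlp)
    embedding = vocab_size * hidden * (1 if tied_embeddings else 2)
    return non_embedding, embedding


if __name__ == "__main__":
    # Values taken from Table 1.
    non_emb, emb = count_parameters(
        layers=32, hidden=3072, heads=24, query_groups=8, mlp_hidden=9216
    )
    print(f"transformer blocks: {non_emb / 1e9:.2f}B")        # 2.62B
    print(f"embeddings:         {emb / 1e9:.2f}B")            # 1.57B
    print(f"total:              {(non_emb + emb) / 1e9:.2f}B")  # 4.19B
```

Under these assumptions the transformer blocks account for roughly 2.6B parameters and the embedding matrices for the remaining 1.6B, which is one way to reconcile the two numbers.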
6 https://github.com/NVIDIA/NeMo-Curator


Base models 	Metric 	Nemotron-Mini-Hindi-4B 	Nemotron-Mini-4B 	Sarvam-1 2B 	Gemma 2-2B 	Openhathi 	Llama-3.1 8B 	Gemma 2-9B
IndicSentiment 	F1 - NLU 	84.31 	72.47 	96.36 	91.90 	72.89 	92.06 	94.90
IndicCopa 	F1 - NLU 	81.86 	62.50 	51.63 	58.65 	68.69 	61.87 	72.58
IndicXNLI 	F1 - NLU 	49.67 	40.39 	36.08 	16.67 	16.67 	16.67 	16.79
IndicXParaphrase 	F1 - NLU 	37.09 	16.27 	80.99 	26.60 	71.72 	72.75 	71.38
Indic QA (With Context) 1 shot 	F1 - NLG 	18.32 	15.10 	35.81 	33.37 	20.69 	35.92 	46.27
Indic Headline 1 shot 	BLEURT - NLG 	0.50 	0.46 	0.36 	0.27 	0.47 	0.38 	0.27
IndicWikiBio 1 shot 	BLEURT - NLG 	0.62 	0.59 	0.53 	0.60 	0.52 	0.60 	0.63
MMLU 	Acc - NLU 	49.89 	38.20 	45.65 	35.05 	32.27 	44.84 	55.08
BoolQ 	Acc - NLU 	71.71 	70.79 	56.08 	66.00 	58.56 	61.00 	61.00
ARC Easy 	Acc - NLU 	78.81 	58.25 	76.85 	52.31 	44.28 	67.05 	85.69
Arc Challenge 	Acc - NLU 	65.02 	47.87 	59.04 	40.78 	32.68 	54.10 	76.02
Hella Swag 	Acc - NLU 	31.66 	25.31 	37.13 	27.50 	25.59 	33.50 	42.40
Table 2: Performance metrics for various base models across different Hindi tasks. The results are zero-shot unless
otherwise specified.
Instruct models 	Metric 	Nemotron-Mini-Hindi-4B 	Nemotron-Mini-4B 	Airavata 	Navarasa 2B 	Gemma-2 2B 	Navarasa 7B 	Llama-3.1 8B 	Gemma-2 9B
IndicSentiment 	F1 - NLU 	97.62 	90.01 	95.81 	93.62 	94.32 	95.99 	98.59 	99.09
IndicCopa 	F1 - NLU 	80.1 	66.01 	63.75 	38.83 	27.64 	62.59 	59.08 	89.89
IndicXNLI 	F1 - NLU 	53.77 	39.25 	73.26 	16.67 	17.33 	38.19 	31.27 	39.71
IndicXParaphrase 	F1 - NLU 	67.93 	83.74 	76.53 	43.82 	43.06 	44.58 	77.72 	61.38
Indic QA (With Context) 1 shot 	F1 - NLG 	37.51 	42.56 	37.69 	3.3 	62.95 	19.09 	40.03 	59.83
Indic Headline 1 shot 	BLEURT - NLG 	0.44 	0.18 	0.38 	0.24 	0.39 	0.3 	0.26 	0.25
IndicWikiBio 1 shot 	BLEURT - NLG 	0.6 	0.49 	0.43 	0.3 	0.49 	0.45 	0.42 	0.24
MMLU 	Acc - NLU 	50.5 	38.66 	34.96 	23.1 	39.39 	40 	45.85 	57.35
BoolQ 	Acc - NLU 	67.86 	60.00 	64.5 	60.31 	70 	78.1 	80 	84
ARC Easy 	Acc - NLU 	79.97 	60.14 	54 	38.8 	59.76 	61.24 	71.55 	91.16
Arc Challenge 	Acc - NLU 	65.53 	49.83 	35.92 	31.66 	48.55 	48.29 	59.64 	81.23
Hella Swag 	Acc - NLU 	39.9 	39.69 	25.37 	25.3 	34.7 	30.8 	35.5 	54.6
IndicQuest (En) 	Score (1-5) 	4.01 	3.94 	3.75 	3.78 	4.1 	4.07 	4.2 	4.4
IndicQuest (Hi) 	Score (1-5) 	4.15 	2.72 	3.1 	3.18 	3.58 	3.6 	4.02 	4.23
SubjectiveEval (Hi) 	Score (1-5) 	4.35 	1.64 	2.24 	1.75 	3.66 	2.97 	3.98 	4.5
Table 3: Performance metrics for various instruct models across different Hindi tasks. The results are zero-shot
unless otherwise specified.
Task 	Nemotron-Mini-Hindi-4B-Base 	Nemotron-Mini-4B-Base 	Gemma-2 2b
MMLU (5) 	56.37 	58.60 	51.3
arc_challenge (25) 	46.08 	50.90 	55.4
hellaswag (10) 	74.64 	75.00 	73
truthfulqa_mc2 (0) 	41.05 	42.72 	-
winogrande (5) 	70.09 	74.00 	70.9
xlsum_english (3) 	29.71 	29.62 	-
Table 4: Performance of base models on English Bench-
marks
Model Setting 	SubjectiveEval 	IndicQuest (Hi) 	IndicQuest (En)
Nemotron-Mini-4B-Base
SFT (En) + DPO (En) 	1.92 	2.66 	3.89
SFT (En) + DPO (En + Hi) 	1.88 	2.80 	3.87
SFT (En + Hi) + DPO (En) 	2.73 	3.20 	3.86
SFT (En + Hi) + DPO (En + Hi) 	2.51 	3.14 	3.88
Nemotron-Mini-Hindi-4B-Base
SFT (En) + DPO (En) 	3.81 	4.12 	4.02
SFT (En) + DPO (En + Hi) 	4.3 	4.10 	4.03
SFT (En + Hi) + DPO (En) 	4.28 	4.06 	4.02
SFT (En + Hi) + DPO (En + Hi) 	4.25 	4.13 	4.04
Table 5: Ablation study of post-training configurations
analyzing the impact of Hindi pretraining on Subjec-
tiveEval (Chat capability) and IndicQuest (Factual accu-
racy) tasks.
3.3 Model Alignment
The first alignment stage is Supervised Fine-Tuning
(SFT). We use a general SFT corpus with approx-
imately 200k examples, comprising various tasks
as outlined in (Adler et al., 2024). The model is
trained for one epoch with a global batch size of
1024 and a learning rate in the range of [5e-6, 9e-7],
using cosine annealing. Due to the lack of a high-
quality Hindi SFT corpus, we leverage English-
only data for SFT. We also experimented with trans-
lated English data (filtered using back-translation-
based methods) for SFT, but did not observe any im-
provements with this addition. We found that using
the English-only SFT corpus enhances instruction-
following capabilities in Hindi, highlighting the
cross-lingual transferability of these skills. For
the ablation study, we use approximately 70k high-quality
Hindi examples selected from a larger pool
of 200k translated samples. These Hindi instances
are derived by translating the original English data,
with noisy or low-quality translations filtered out
using a back-translation-based filtering approach.
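The cosine-annealed learning rate described for the SFT stage can be sketched as follows. The step count is derived from the figures above (about 200k examples, global batch size 1024, one epoch); the absence of a warmup phase and the exact shape of the schedule are assumptions.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int, lr_max: float, lr_min: float) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # ~200k SFT examples at a global batch size of 1024, trained for one epoch.
    total_steps = math.ceil(200_000 / 1024)
    for step in (0, total_steps // 2, total_steps):
        lr = cosine_annealed_lr(step, total_steps, lr_max=5e-6, lr_min=9e-7)
        print(f"step {step:>3}: lr = {lr:.3e}")
```

The same function covers the other stages by swapping in their ranges, for example 2e-4 to 4.5e-7 for continued pre-training or 9e-6 to 9e-7 for DPO.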
After the SFT stage, the model undergoes a
preference-tuning phase, where it learns from
triplets consisting of a prompt, a preferred response,
and a rejected response. In this stage, we apply the
Direct Preference Optimization (DPO) (Rafailov
et al., 2024) algorithm, which trains the policy net-
work to maximize the reward difference between
the preferred and rejected responses. We train the
model for one epoch with a global batch size of
512 and a learning rate in the range of [9e-6, 9e-
7], utilizing cosine annealing. For the DPO stage,
we use approximately 200k English samples and
60k synthetic Hindi samples. The synthetic Hindi
samples were created by translating the English
samples and then filtered using back-translation
methods. We observe that incorporating synthetic
Hindi samples during this stage improves the over-
all performance of the model. The aligned model
is referred to as Nemotron-Mini-Hindi-4B-Instruct.
Both the SFT and DPO stages are carried out using
Nemo Aligner (Shen et al., 2024) and 64 Nvidia
A100 GPUs.
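For reference, the DPO objective (Rafailov et al., 2024) for a single preference triplet can be sketched as below. Inputs are the summed token log-probabilities of the preferred and rejected responses under the policy and under a frozen reference model. The value `beta = 0.1` is an assumption, since the text above does not state it, and this is an illustration rather than the NeMo-Aligner implementation.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not reported in the text
) -> float:
    """Per-example DPO loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the implicit rewards cancel out.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability mass toward the preferred response.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference model the loss is log 2, and it decreases as the policy raises the likelihood of the preferred response relative to the rejected one.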
Figure 2: Comparison of different instruct models on
various parameters using SubjectiveEval.
Figure 3: Comparison of different instruct models on
various parameters using IndicQuest-Hi.
3.4 Evaluation Datasets
We evaluate Nemotron-Mini-Hindi-4B and other
multilingual LLMs using both native Hindi benchmarks
and translated English benchmarks. The
native benchmarks include tasks from IndicX-
TREME, IndicNLG, and IndicQuest, while the
translated English benchmarks include popular
datasets like MMLU and Hellaswag. Addition-
ally, we curate an open-ended QnA dataset termed
SubjectiveEval to assess the model’s generation ca-
pabilities in the Hindi language. Human evaluation
is also conducted using the translated MT-Bench
dataset.
Figure 4: Comparison of different instruct models on
Factuality score of IndicQuest. The ground truth an-
swers from IndicQuest are provided as a reference to
GPT4 for better scoring. The Nemotron-Mini-Hindi-
4B provides comparable scores for Hindi and English
whereas other models provide better factuality for En-
glish.
• IndicXTREME: The benchmark consists of
different Natural Language Understanding
(NLU) tasks in Indic languages (Doddapaneni
et al., 2023). We consider different tasks like
IndicSentiment, IndicCopa, IndicXNLI, and
IndicXParaphrase.
• IndicNLG: The IndicNLG benchmark (Ku-
mar et al., 2022) consists of various tasks for
evaluating the generation capabilities of the
model. We consider IndicHeadline, IndicWik-
iBio, and IndicQA covering text summariza-
tion and question-answering tasks.
• IndicQuest: IndicQuest (Rohera et al.,
2024) is a gold-standard fact-based question-
answering benchmark designed to evaluate
multilingual language models’ ability to cap-
ture regional knowledge across various Indic
languages. It focuses on factual questions re-
lated to India in domains such as Literature,
History, Geography, Politics, and Economics.
The dataset is available in English as well as
several Indic languages, including Hindi, allowing
for language-specific evaluations. For
LLM-as-a-judge evaluation, the ground truth
facts are passed to the evaluator LLM as a
reference.
• SubjectiveEval: This in-house Hindi evalu-
ation dataset features open-ended questions
across various Indian domains, including His-
tory, Geography, Agriculture, Food, Culture,
Religion, Science and Technology, Mathemat-
ics, and Thinking Ability. It offers broader
coverage compared to the fact-based ques-
tions in IndicQuest. It assesses a model’s
understanding, generative capabilities, coher-
ence, and insightfulness. Questions include
‘what’, ‘how’, and ‘why’ types, varying from
brief one-word answers to detailed explana-
tions. The dataset also tests analytical and
problem-solving skills with hypothetical sce-
narios. Model responses are evaluated using
an LLM as a judge.
• Translated English Benchmarks: We use
translated versions of popular benchmarks for
exhaustive evaluation of our models. The
benchmarks include MMLU, Hella Swag,
BoolQ, Arc-Easy, and Arc-Challenge.
• Human Evaluation: For human evaluation,
we utilized a translated version of the multi-
turn MT-Bench dataset (Zheng et al., 2023).
The prompts were first translated into Hindi
using the Google Translate API and then man-
ually filtered to remove problematic prompts
or those relying on English-specific semantics.
During evaluation, human judges conducted
A/B testing, where they were presented with
randomized, pair-wise model responses for
comparison.
Figure 5: Results of human evaluation on translated MT-Bench. A win indicates that the Nemotron-Mini-Hindi-4B model is
preferred.
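The randomized pairwise protocol used for human evaluation can be sketched as follows. This is a hypothetical illustration: the function names, the "tie" option, and the example judgments are assumptions and do not come from the paper.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Tuple


def present_pair(ours: str, other: str, rng: random.Random) -> Tuple[str, str, bool]:
    """Randomize which response appears as A; returns (a, b, ours_is_a)."""
    ours_is_a = rng.random() < 0.5
    return (ours, other, True) if ours_is_a else (other, ours, False)


def resolve(choice: str, ours_is_a: bool) -> str:
    """Map a judge's choice ('A', 'B', or 'tie') to win/loss/tie for our model."""
    if choice == "tie":
        return "tie"
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected choice: {choice!r}")
    return "win" if (choice == "A") == ours_is_a else "loss"


def summarize(outcomes: Iterable[str]) -> Dict[str, float]:
    """Fraction of wins, ties, and losses over all judged pairs."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to summarize")
    return {label: counts[label] / total for label in ("win", "tie", "loss")}


if __name__ == "__main__":
    rng = random.Random(0)
    a, b, ours_is_a = present_pair("response from our model", "response from baseline", rng)
    # A judge sees only `a` and `b`; here we pretend the judge picked "A".
    print(resolve("A", ours_is_a))
    print(summarize(["win", "win", "tie", "loss"]))
```

Hiding which model produced which response, and randomizing the order, keeps position and brand preferences from leaking into the judgments.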
4 Results and Discussion
The results for the base models are shown in Table 2
and Table 4. The Nemotron-Mini-Hindi-
4B Base delivers state-of-the-art performance on
nearly all benchmarks compared to similarly sized
models. Additionally, on the Hindi benchmarks in
Table 2, it outperforms the larger Llama-3.1-8B on
8 of 12 tasks and Gemma-2-9B on 4 of 12. Hindi-specific
continued pre-training significantly enhances the model’s
performance on Hindi tasks compared to the base
Nemotron-Mini-4B model. There is some degra-
dation on English benchmarks, though the results
remain competitive. This underscores the impor-
tance of dual-language continued pre-training.
We observe similar results with the instruct
model on IndicXTREME, IndicNLG, and trans-
lated English benchmarks. The results are pre-
sented in Table 3. The instruct model is also eval-
uated using LLM-as-a-judge on IndicQuest and
SubjectiveEval. On these benchmarks, we see im-
provements in both English and Hindi compared to
the Nemotron-Mini-4B-Instruct model. On the Hindi
evaluations (IndicQuest-Hi and SubjectiveEval), the model
outperforms all baseline models except for Gemma-
2-9B. Notably, we observe improvements in the
model’s factuality and language consistency. These
results are shown in Figures 2, 3, and 4. Further-
more, during human evaluations, responses from
Nemotron-Mini-Hindi-4B were consistently pre-
ferred over those from other models, as shown in
Figure 5.
Table 5 shows an ablation study analyzing the
impact of Hindi SFT and DPO on Nemotron-Mini-
4B Base and the Hindi-pretrained Nemotron-Mini-
Hindi-4B Base. While Hindi alignment improves
the base Nemotron-Mini-4B, its performance re-
mains significantly lower than the Hindi-pretrained
model across all metrics. Notably, Nemotron-Mini-
Hindi-4B with English alignment alone outperforms
Nemotron-Mini-4B even with Hindi alignment,
highlighting the strong benefits of Hindi
pretraining. The best SubjectiveEval score (4.3) is
achieved with English SFT and (English + Hindi) DPO
on the Hindi-pretrained model, with IndicQuest scores
within 0.03 of the best setting, suggesting that
mixed-language alignment can enhance generalization.
Overall, Hindi pretraining leads to substantial improvements
not only in Hindi understanding but
also in factual accuracy and English performance,
demonstrating effective cross-lingual transfer.
5 Conclusion
We present Nemotron-Mini-Hindi-4B-Base and
Nemotron-Mini-Hindi-4B-Instruct, state-of-the-art
SLMs primarily designed for the Hindi language.
These models have been continuously pre-trained
and aligned using a combination of Hindi and En-
glish data. The Hindi corpus includes both real and
synthetic data, with the synthetic data generated
through translation. The models outperform simi-
larly sized models on various Hindi benchmarks,
as assessed through reference-based and LLM-as-a-
judge evaluations. They also perform competitively
on English benchmarks. We emphasize the impor-
tance of pre-training to reduce hallucinations and
enhance the factuality of the models.
Limitations
The model was trained on internet data that in-
cludes toxic language and biases, which means it
might reproduce these biases and generate toxic
responses, particularly if prompted with harmful
content. It may also produce inaccurate, incom-
plete, or irrelevant information, potentially leading
to socially undesirable outputs. The problem could
be worsened if the suggested prompt template is
not used.
To mitigate these issues to some extent, we have
implemented safety alignment during the DPO
stage to guide the model away from responding
to toxic or harmful content. Additionally, we con-
duct safety evaluations using benchmarks such as
Aegis7 (Ghosh et al., 2024), Garak8 (Derczynski
et al., 2024), and Human Content red-teaming, and
our findings indicate that the model’s responses
remain within permissible limits.
Acknowledgements
This work would not have been possible without
contributions from many people at NVIDIA. To
mention a few: Asif Ahamed, Ayush Dattagupta,
Umair Ahmed, Yoshi Suhara, Ameya Mahabaleshwarkar,
Zijia Chen, Varun Singh, Vibhu Jawa,
Saurav Muralidharan, Sharath Turuvekere Sreenivas,
Marcin Chochowski, Rohit Watve, Oluwatobi
Olabiyi, Mostofa Patwary, and Oleksii Kuchaiev.
⁷ https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0
⁸ https://github.com/leondz/garak
References
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh,
Pallab Bhattacharya, Annika Brundyn, Jared Casper,
Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al.
2024. Nemotron-4 340b technical report. arXiv
preprint arXiv:2406.11704.
Abhinand Balachandran. 2023. Tamil-llama: A new
tamil language model based on llama 2. arXiv
preprint arXiv:2311.05845.
Samuel Cahyawijaya, Holy Lovenia, and Pascale Fung.
2024. Llms are few-shot in-context low-resource
language learners. In Proceedings of the 2024 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies (Volume 1: Long Papers), pages
405–433.
Yiming Cui, Ziqing Yang, and Xin Yao. 2023. Efficient
and effective text encoding for chinese llama and
alpaca. arXiv preprint arXiv:2304.08177.
Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho
Majumdar, and Nanna Inie. 2024. garak: A framework
for security probing large language models.
arXiv preprint arXiv:2406.11036.
Sumanth Doddapaneni, Rahul Aralikatte, Gowtham
Ramesh, Shreya Goyal, Mitesh M Khapra, Anoop
Kunchukuttan, and Pratyush Kumar. 2023. Towards
leaving no indic language behind: Building monolingual
corpora, benchmark and models for indic
languages. In Proceedings of the 61st Annual Meeting
of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 12402–12426.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, et al. 2024. The llama 3 herd of models. arXiv
preprint arXiv:2407.21783.
Jay Gala, Pranjal A Chitale, AK Raghavan, Varun
Gumma, Sumanth Doddapaneni, Janki Atul Nawale,
Anupama Sujatha, Ratish Puduppully, Vivek Raghavan,
Pratyush Kumar, et al. Indictrans2: Towards
high-quality and accessible machine translation models
for all 22 scheduled indian languages. Transactions
on Machine Learning Research.
Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain,
Mohammed Safi Ur Rahman Khan, Diptesh Kanojia,
Ratish Puduppully, Mitesh M Khapra, Raj Dabre,
Rudra Murthy, Anoop Kunchukuttan, et al. 2024.
Airavata: Introducing hindi instruction-tuned llm.
arXiv preprint arXiv:2401.15006.
Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and
Christopher Parisien. 2024. Aegis: Online adaptive
ai content safety moderation with ensemble of llm
experts. arXiv preprint arXiv:2404.05993.
Daniil Gurgurov, Mareike Hartmann, and Simon Ostermann.
2024. Adapting multilingual llms to
low-resource languages with knowledge graphs via
adapters. In Proceedings of the 1st Workshop on
Knowledge Graphs and Large Language Models
(KaLLM 2024), pages 63–74.
Raviraj Joshi. 2022. L3cube-mahanlp: Marathi natural
language processing datasets, models, and library.
arXiv preprint arXiv:2205.14728.
Mohammed Safi Ur Rahman Khan, Priyam Mehta,
Ananth Sankar, Umashankar Kumaravelan, Sumanth
Doddapaneni, Sparsh Jain, Anoop Kunchukuttan,
Pratyush Kumar, Raj Dabre, Mitesh M Khapra, et al.
2024. Indicllmsuite: A blueprint for creating pre-training
and fine-tuning datasets for indian languages.
arXiv preprint arXiv:2403.06350.
Simran Khanuja, Diksha Bansal, Sarvesh Mehtani,
Savya Khosla, Atreyee Dey, Balaji Gopalan,
Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja
Nagipogu, Shachi Dave, et al. 2021. Muril: Multilingual
representations for indian languages. arXiv
preprint arXiv:2103.10730.
Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh
Mishra, Raj Dabre, Ratish Puduppully, Anoop
Kunchukuttan, Mitesh M Khapra, and Pratyush
Kumar. 2022. Indicnlg benchmark: Multilingual
datasets for diverse nlg tasks in indic languages.
In Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing, pages
5363–5394.
Risto Luukkonen, Ville Komulainen, Jouni Luoma,
Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari,
Filip Ginter, Veronika Laippala, Niklas Muennighoff,
Aleksandra Piktus, et al. 2023. Fingpt: Large generative
models for a small language. In Proceedings
of the 2023 Conference on Empirical Methods in
Natural Language Processing, pages 2710–2726.
Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel
Holstein, Leonardo Nunes, Sara Malvar, Bruno Silva,
Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy
Yannam, et al. 2024. Injecting new knowledge into
large language models via supervised fine-tuning.
arXiv preprint arXiv:2404.00213.
Saurav Muralidharan, Sharath Turuvekere Sreenivas,
Raviraj Joshi, Marcin Chochowski, Mostofa Patwary,
Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz,
and Pavlo Molchanov. 2024. Compact language models
via pruning and knowledge distillation. arXiv
preprint arXiv:2407.14679.
Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings,
Mostofa Patwary, Sandeep Subramanian, Dan Su,
Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala,
Ayush Dattagupta, et al. 2024. Nemotron-4 15b technical
report. arXiv preprint arXiv:2402.16819.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher
D Manning, Stefano Ermon, and Chelsea Finn.
2024. Direct preference optimization: Your language
model is secretly a reward model. Advances in Neural
Information Processing Systems, 36.
Pritika Rohera, Chaitrali Ginimav, Akanksha Salunke,
Gayatri Sawant, and Raviraj Joshi. 2024. L3cube-indicquest:
A benchmark question answering dataset
for evaluating knowledge of llms in indic context.
arXiv preprint arXiv:2409.08706.
Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi
Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy
Zhang, Sahil Jain, Ali Taghibakhshi, et al. 2024.
Nemo-aligner: Scalable toolkit for efficient model
alignment. arXiv preprint arXiv:2405.01481.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri,
Patrick LeGresley, Jared Casper, and Bryan Catanzaro.
2020. Megatron-lm: Training multi-billion
parameter language models using model parallelism.
Preprint, arXiv:1909.08053.
Gemma Team, Morgane Riviere, Shreya Pathak,
Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju,
Léonard Hussenot, Thomas Mesnard, Bobak
Shahriari, Alexandre Ramé, et al. 2024. Gemma 2:
Improving open language models at a practical size.
arXiv preprint arXiv:2408.00118.
Cagri Toraman. 2024. Llamaturk: Adapting open-source
generative large language models for low-resource
language. arXiv preprint arXiv:2405.07745.
Anh-Dung Vo, Minseong Jung, Wonbeen Lee, and Daewoo
Choi. 2024. Redwhale: An adapted korean
llm through efficient continual pretraining. arXiv
preprint arXiv:2408.11294.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023.
Judging llm-as-a-judge with mt-bench and chatbot
arena. Advances in Neural Information Processing
Systems, 36:46595–46623.
