Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe
Mahounan Pericles Adjovi1, Roald Eiselen2, Prasenjit Mitra1
1Carnegie Mellon University Africa, Kigali, Rwanda
2Centre for Text Technology, North-West University, Potchefstroom, South Africa
madjovi@andrew.cmu.edu, Roald.Eiselen@nwu.ac.za, prasenjm@andrew.cmu.edu

Abstract

Large language models (LLMs) are trained on data contributed by low-resource language communities, including curated datasets such as MasakhaNER and MAFAND-MT, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types (creative writing, functional text, structured knowledge, dialogue, topic-switching probes, and constrained generation) across two commercial LLMs, GPT-4o Mini and Gemini 2.5 Flash. Generated outputs are evaluated on linguistic accuracy, lexical diversity, domain coverage, and code-switching rates through automatic assessment metrics. Our findings reveal that elicitation strategy significantly affects output quality and that optimal strategies differ by language: Hausa benefits from volume-maximizing tasks such as functional text and dialogue, while Fongbe requires constraint-heavy prompts that enforce monolingual output. GPT-4o Mini extracts 6–41× more usable target-language words per API call than Gemini, though Gemini achieves higher language purity for Fongbe on constrained tasks. We provide a practical framework for low-resource language communities to maximize usable data extraction from LLMs and release all generated corpora and code.
Keywords: low-resource languages, African NLP, data extraction, large language models, Hausa, Fongbe, resource creation

1. Introduction

Natural Language Processing (NLP) technologies remain largely inaccessible to speakers of most African languages due to severe data scarcity (Joshi et al., 2020). Languages such as Fongbe, a national language of Benin, and Hausa, widely spoken across West Africa, suffer from limited digital text resources despite having millions of speakers. Meanwhile, large language models (LLMs) have been trained on web-scale data that includes contributions from these language communities, including curated academic datasets such as MasakhaNER 2.0 (Adelani et al., 2022b) and MAFAND-MT (Adelani et al., 2022a). The linguistic knowledge absorbed from these sources resides within commercial LLMs, yet it flows back to these communities only through paid API access. A natural question arises: can we systematically extract usable language data from these models to create new resources for the very communities whose data helped build them?

This question has both practical and ethical significance. On the practical side, low-resource language communities face a critical bootstrapping problem: building NLP systems requires data, but data collection is expensive and slow. If LLMs can serve as an efficient source of text data across diverse domains, this could accelerate resource creation for languages where text expansion remains a critical priority.
On the ethical side, the relationship between LLM training data and community benefit is asymmetric: language communities contribute data that increases the commercial value of LLMs, yet receive limited benefit in return. Developing systematic methods for extracting linguistic knowledge from LLMs represents a practical step toward rebalancing this relationship.

Extracting usable language data from LLMs is non-trivial for several reasons. First, LLM generation quality varies dramatically across low-resource languages, with substantial performance gaps documented even among African languages with millions of speakers (Robinson et al., 2023; Hendy et al., 2023). Second, generated text frequently exhibits code-switching with colonial languages (English for Hausa, French for Fongbe), reducing its utility as monolingual training data (Orife et al., 2020). Third, for tonal languages like Fongbe, LLMs frequently produce missing or incorrect diacritics, which are obligatory in standard orthography and distinguish lexical meaning (Lefebvre and Brousseau, 2002). Fourth, it is unclear which prompting strategies maximize both the quantity and quality of extractable data.

Previous work has examined LLMs for low-resource language tasks primarily through machine translation (Robinson et al., 2023; Hendy et al., 2023) or data augmentation for specific downstream tasks (Whitehouse et al., 2023). The Fikira dataset (Adelani et al., 2024) demonstrated that instruction-tuned models can generate reasoning data for African languages, but did not compare across elicitation task types.

arXiv:2604.12477v1 [cs.CL] 14 Apr 2026
To our knowledge, no study has systematically explored elicitation strategies to assess which tasks may yield the most usable data per API call for low-resource African languages. This work presents an early exploratory investigation into this question, with the goal of identifying promising directions rather than drawing definitive conclusions.

1.1. Research Question

We address the following central research question: Which types of elicitation tasks maximize the quantity and quality of usable text data that can be extracted from large language models for Hausa and Fongbe?

This is operationalized through three sub-questions:

1. How does the linguistic quality of LLM-generated text vary across elicitation task types for Hausa and Fongbe?
2. Which elicitation strategies produce the greatest lexical diversity and domain coverage per API call?
3. Do optimal elicitation strategies differ between languages with different levels of LLM support?

1.2. Summary of Contributions

• A systematic taxonomy of LLM elicitation strategies for low-resource language data extraction, evaluated across six task types for two typologically distinct West African languages (Section 3).
• An empirical comparison of two commercial LLMs revealing that GPT-4o Mini generates 6–41 times more usable target-language text than Gemini 2.5 Flash, with language-specific optimal strategies (Section 4).
• A practical framework and released corpora enabling low-resource language communities to replicate our methodology (Section 5).

2. Related Work

2.1. African Language NLP Resources

Research on African language technology has accelerated significantly since 2019. The Masakhane project established a participatory approach to machine translation across more than 30 African languages (Nekoto et al., 2020).
Subsequent efforts produced standardized benchmarks: MasakhaNER 2.0 for NER across 20 languages (Adelani et al., 2022b), MasakhaPOS for part-of-speech tagging (Dione et al., 2023), and MAFAND-MT for news-domain machine translation (Adelani et al., 2022a).

Despite these advances, resource availability remains severely unbalanced. Under the taxonomy of Joshi et al. (2020), Hausa falls in the mid-tiers (3–4) given its international media presence, while Fongbe falls closer to the lowest tiers (0–1). Continental surveys confirm that most African languages lack sufficient corpora (Hedderich et al., 2021). Our work proposes LLM-based data extraction as a scalable complement to manual corpus construction.

2.2. LLMs for Low-Resource Languages

Robinson et al. (2023) showed that ChatGPT degrades significantly for low-resource African languages. Hendy et al. (2023) found systematic quality drops for languages with limited web presence. AfriDoc-MT (Alabi et al., 2025) evaluated document-level translation for African languages including Hausa. African-centric models such as AfriBERTa (Ogueji et al., 2021) and AfroXLMR (Alabi et al., 2022) outperform multilingual baselines but focus on comprehension rather than generation.

A critical gap persists: none of these studies investigate which types of prompts maximize extractable language data. Our work reframes the question from “how well do LLMs translate into language X?” to “which prompting strategies extract the most usable data from LLMs for language X?”
2.3. Data Augmentation and Synthetic Data

Data augmentation encompasses Easy Data Augmentation (EDA) (Wei and Zou, 2019), back-translation (Sennrich et al., 2016), and LLM-based generation (Schick and Schütze, 2021). Whitehouse et al. (2023) found mixed results for low-resource LLM augmentation. Dai and Adel (2020) showed augmentation effectiveness depends on method and dataset size. The Fikira dataset (Adelani et al., 2024) generated reasoning data for African languages but did not compare elicitation strategies. These studies evaluate augmentation for specific downstream tasks rather than investigating which strategies maximize general corpus utility.

3. Methodology

We design a controlled experiment comparing six elicitation task types across two LLMs and two languages. All prompts, scripts, and evaluation code are released publicly (see Appendix A). Full prompt structures and examples are provided in Appendix D.

3.1. Elicitation Task Taxonomy

Table 1 summarizes six task types, each probing a different dimension of LLM linguistic knowledge (see Appendix D for full prompt details).

Task Type                                      Rationale                                        N
Creative Writing (poems, stories,              Tests deep cultural and linguistic knowledge;    25
  folktales, songs, proverbs)                    generates domain-diverse narrative text
Functional Text (letters, instructions,        Tests practical domain coverage; generates       25
  news, recipes, announcements)                  text useful for downstream NLP
Structured Knowledge (definitions, grammar     Tests metalinguistic knowledge; produces         25
  examples, vocabulary lists, translations)      high-density lexical output
Dialogue (conversations, interviews,           Tests colloquial register and spoken-form        25
  negotiations, family discussions)              generation
Topic Switching (domestic→sports, narrative    Tests language maintenance robustness under      25
  shifts, knowledge switches)                    topic changes
Constrained Gen. (vocab-constrained,           Tests ability to stay in target language         25
  no-code-switching, technical monolingual)      under explicit constraints

Table 1: Elicitation task taxonomy: 6 types × 25 prompts = 150 per language.

Creative Writing prompts request poems, folktales, stories, songs, and proverbs about culturally relevant themes. Functional Text prompts request letters, instructions, news articles, recipes, and announcements. Structured Knowledge prompts request definitions, cultural explanations, grammar examples, vocabulary lists, and translations. Dialogue prompts request conversations in varied social contexts (market, clinic, family, interview, negotiation). Topic Switching prompts begin on a familiar topic and switch to a domain typically discussed in colonial languages, requiring continuation in the target language. Constrained Generation prompts impose vocabulary constraints, no-code-switching rules, length requirements, and structural formats.

3.2. Languages

Hausa (ISO 639-3: hau) is an Afroasiatic language spoken by approximately 80–100 million people across Nigeria, Niger, and neighboring countries. It features grammatical gender, rich morphology, and complex tense-aspect-mood marking (Newman, 2000). It has a standardized Latin orthography, substantial web presence, and is included in XLM-RoBERTa (Conneau et al., 2020). The colonial contact language is English.

Fongbe (ISO 639-3: fon) is a Niger-Congo Gbe language spoken by approximately 2 million people in Benin. It features serial verb constructions and a three-tone system with obligatory diacritic marking (Lefebvre and Brousseau, 2002).
Tone distinguishes meaning: kó (high) = “harvest,” kò (low) = “build,” kô (falling) = “neck.” Fongbe has minimal
web presence and is absent from XLM-RoBERTa.
The colonial contact language is French.
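Because tone marks are obligatory in Fongbe orthography, their presence in generated text can be checked mechanically; the diacritic analysis described later (Section 3.4) amounts to counting combining marks against base letters. A minimal sketch of such a ratio, offered as an illustration rather than the paper's released implementation:

```python
import unicodedata

def tonal_vowel_ratio(text: str) -> float:
    """Ratio of combining diacritics to alphabetic base characters.

    NFD normalization separates base letters from combining marks
    (e.g. "kó" -> "k", "o", U+0301), so tone marks are counted
    uniformly whether the input is composed or decomposed.
    """
    nfd = unicodedata.normalize("NFD", text)
    letters = sum(1 for ch in nfd if ch.isalpha())
    marks = sum(1 for ch in nfd if unicodedata.combining(ch))
    return marks / letters if letters else 0.0
```

On text where every syllable carries a tone mark, the ratio approaches 0.5 under this counting; unmarked ASCII text scores 0.0, which is the failure mode the Fongbe diacritic analysis is designed to catch.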
3.3. Models and Prompt Design
We evaluate GPT-4o Mini (OpenAI) and Gemini 2.5 Flash (Google), both accessed with temperature 0.7, top-p 0.95, max output 1,024 tokens, and a system prompt requiring target-language output. Each prompt template contains placeholders ({language}, {language_culture}, {colonial_language}) substituted at generation time.
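For concreteness, the substitution step might look as follows; the template text here is a hypothetical example, not one of the paper's released prompts, though the three placeholder names are the ones documented above:

```python
# Hypothetical prompt template using the three documented placeholders.
TEMPLATE = (
    "Write a short folktale in {language}, drawing on {language_culture} "
    "themes. Use only {language}; do not switch into {colonial_language}."
)

# Values are filled in at generation time, one combination per API call.
prompt = TEMPLATE.format(
    language="Fongbe",
    language_culture="Fon",
    colonial_language="French",
)
```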
All prompts are written in English. This choice reflects a deliberate experimental design decision: English-language prompting provides a controlled, reproducible interface that does not require annotators to be proficient in Hausa or Fongbe, and enables direct comparability across languages. We acknowledge that prompting directly in the target language may yield different results and we consider this a promising direction for future work (Section 6). Preliminary evidence from related work suggests that target-language prompting can improve output quality for well-resourced languages, though its effects for very low-resource languages like Fongbe remain untested.
Each prompt is sent once per model per language: 150 × 2 × 2 = 600 API calls. Outputs are saved as JSON with resumability support.
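A resumable generation loop of this shape can be sketched as follows. The `call_model` callable stands in for the actual OpenAI or Gemini client calls, and the single-file JSON layout is our illustrative assumption, not the released scripts:

```python
import json
from pathlib import Path

def run_elicitation(prompts, model_name, language, call_model,
                    out_path="outputs.json"):
    """Send each prompt once, persisting after every call so an
    interrupted run can resume without repeating paid API requests."""
    path = Path(out_path)
    results = json.loads(path.read_text()) if path.exists() else {}
    for i, prompt in enumerate(prompts):
        key = f"{model_name}/{language}/{i}"
        if key in results:  # already generated on a previous run: skip
            continue
        results[key] = call_model(prompt)
        # Write after every call; a crash loses at most one response.
        path.write_text(json.dumps(results, ensure_ascii=False, indent=2))
    return results
```

Keying each record by model, language, and prompt index mirrors the 150 × 2 × 2 accounting above, so a full run produces 600 entries regardless of how many times the script is restarted.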
3.4. Evaluation Framework
We evaluate outputs using: Output Validity (minimum 20 tokens); Lexical Diversity (TTR, hapax ratio, vocabulary size); Language Fidelity via GlotLID (Kargaran et al., 2023), a fastText classifier covering 2,000+ languages including hau_Latn and fon_Latn, applied at document and sentence
levels; Diacritic Analysis for Fongbe (tonal
vowel ratio); Repetition Detection (4-gram and sentence repetition); and Reference Overlap (character trigram cosine similarity against MasakhaNER 2.0 training text for Hausa).

4. Results

We report results from 600 API calls (150 prompts × 2 models × 2 languages).

4.1. Output Validity

Table 2 reports the percentage of outputs exceeding the 20-token minimum and the average word count per condition.

              Gemini           GPT-4o Mini
Task          Fon     Hau      Fon      Hau
Creative      28/18   76/27    100/90   100/104
Functional    36/17   68/20    100/153  100/205
Structured    40/18   56/20    100/92   100/114
Dialogue      80/23   52/20    100/145  100/190
Topic Switch   4/15   88/73    100/157  100/190
Constrained   20/17   92/34    100/78   100/125
Overall       35/18   72/32    100/119  100/155

Table 2: Output validity (% valid / avg. words) by task and model.

GPT-4o Mini achieves 100% validity across all 12 conditions, producing outputs averaging 119 words for Fongbe and 155 for Hausa. Gemini generates much shorter responses: 18 words on average for Fongbe (35% valid) and 32 for Hausa (72% valid). The disparity is most extreme for Fongbe topic-switching, where Gemini produces valid output for only 4% of prompts.

4.2. Language Fidelity

Table 3 reports document-level target language detection using GlotLID. Hausa outputs are reliably identified: 89% for Gemini and 100% for GPT-4o Mini. Fongbe shows greater variation. Gemini achieves its highest Fongbe fidelity on constrained generation (100%) and topic switching (96%), while GPT-4o Mini scores highest on constrained generation (88%) but poorly on topic switching (32%).
              Gemini        GPT-4o Mini
Task          Fon    Hau    Fon    Hau
Creative       60     92     48    100
Functional     68     96     40    100
Structured     40     72     52    100
Dialogue       12     76     60    100
Topic Switch   96    100     32    100
Constrained   100    100     88    100
Overall        63     89     53    100

Table 3: Document-level target language detection (%) by GlotLID, per task and model.

When Fongbe outputs are misidentified, GlotLID most frequently labels them as English (32 cases), Yoruba (24), or French (23), suggesting code-switching contamination or generation in related Gbe languages.

At the sentence level, constrained generation achieves the lowest code-switching rates across both models (0.01–0.09). GPT-4o Mini shows consistently low code-switching for Hausa (0.01–0.16) but elevated rates for Fongbe (0.25–0.66), indicating frequent interspersion of French or English sentences within otherwise Fongbe text.

4.3. Lexical Diversity

Table 4 reports TTR and vocabulary size.

                  Gemini      GPT-4o Mini
Task              Fon   Hau   Fon   Hau
Type-Token Ratio
Creative          .93   .92   .58   .67
Functional        .88   .92   .48   .60
Structured        .96   .95   .60   .71
Dialogue          .88   .92   .46   .58
Topic Switch      .89   .81   .48   .63
Constrained       .89   .82   .54   .67
Avg. Vocabulary Size
Creative           16    24    50    68
Functional         15    19    74   117
Structured         18    19    48    73
Dialogue           20    18    66   108
Topic Switch       13    53    70   117
Constrained        15    27    32    74

Table 4: Lexical diversity measured by Type-Token Ratio (TTR) and average vocabulary size per condition.

Gemini’s higher TTR (0.81–0.96 vs. 0.46–0.71) is largely an artifact of output length. In absolute terms, GPT-4o Mini yields 13,895 unique Hausa and 8,478 Fongbe word tokens across all outputs, versus Gemini’s 3,977 and 2,427, a 3.5× advantage.
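The length sensitivity of TTR is easy to demonstrate: as a sample grows, repeated high-frequency words drag the ratio down, so very short outputs score high almost automatically. A small illustration on synthetic token lists (not the paper's data):

```python
def ttr(tokens):
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

# A short sample repeats nothing; a long sample drawn from the same
# five-word vocabulary repeats constantly, so its TTR collapses even
# though the underlying lexical richness is identical.
vocab = ["ruwa", "gida", "kasuwa", "yara", "abinci"]
short_sample = vocab[:3]   # 3 tokens, 3 types
long_sample = vocab * 10   # 50 tokens, still only 5 types
```

Here `ttr(short_sample)` is 1.0 while `ttr(long_sample)` is 0.1, which is why the absolute vocabulary counts above are the more meaningful comparison between models of such different output lengths.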
4.4. Fongbe Diacritic Analysis

GPT-4o Mini produces diacritics in 96–100% of Fongbe outputs, with diacritic-to-alphabetic ratios of 0.24–0.37. Gemini is less reliable: only 36% of dialogue outputs contain diacritics (ratio 0.02), versus 100% for constrained generation (ratio 0.31). Explicit constraints help Gemini activate Fongbe orthographic knowledge that unconstrained tasks fail to elicit.

4.5. Extraction Efficiency

Table 5 presents usable words per API call (from outputs that are both valid and detected as the target language). Figure 1 (Appendix B) visualizes these differences.

Model     Task          Fon    Hau
Gemini    Creative      0.0   20.7
          Functional    1.7   13.5
          Structured    0.9    5.8
          Dialogue      0.0    7.0
          Topic Switch  0.0   70.6
          Constrained   5.7   32.7
          Per call      1.4   25.0
GPT-4o    Creative     37.5  103.7
          Functional   55.0  204.8
          Structured   49.4  114.5
          Dialogue     80.8  190.0
          Topic Switch 50.6  190.1
          Constrained  69.6  124.6
          Per call     57.2  154.6

Table 5: Usable target-language words per API call. GPT-4o Mini is 6× more efficient for Hausa, 41× for Fongbe.

GPT-4o Mini extracts 154.6 usable Hausa words per call (6× Gemini) and 57.2 Fongbe words (41× Gemini). The most efficient strategies differ by language: for Hausa, functional text and dialogue maximize extraction (190–205 words/call); for Fongbe, dialogue (80.8) and constrained generation (69.6) are most productive. Gemini’s Fongbe extraction is near zero for most tasks.

Repetition rates are low across all conditions (<0.06). Reference corpus overlap shows GPT-4o Mini’s Hausa outputs have higher character trigram similarity to MasakhaNER 2.0 text (cosine 0.10 vs. 0.07 for Gemini). Crucially, both values are well below 0.15, a conservative threshold above which near-verbatim reproduction would become plausible.
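The reference-overlap measure can be sketched as character trigram count profiles compared by cosine similarity. This is the standard formulation of the metric; the paper's exact tokenization and preprocessing may differ:

```python
import math
from collections import Counter

def char_trigrams(text):
    """Counts of overlapping character trigrams, whitespace-normalized."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def trigram_cosine(a, b):
    """Cosine similarity between two character trigram count vectors.

    1.0 means identical profiles; near-verbatim reuse of a reference
    corpus would push this value high, while merely fluent text in the
    same language yields small but nonzero similarity.
    """
    va, vb = char_trigrams(a), char_trigrams(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Under this formulation, cosine values around 0.10 reflect shared character statistics of a language rather than copied passages, consistent with the interpretation below.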
The elevated similarity for GPT-4o Mini most likely reflects that this model generates more natural Hausa text whose statistical profile resembles existing Hausa corpora, an indicator of generation quality rather than memorization. All cosine values by task type are visualized in Figure 4 (Appendix B).

5. Discussion

5.1. Optimal Elicitation Strategies by Language

Our results confirm that optimal strategies differ substantially between languages (RQ3).

For Hausa, functional text and dialogue yield the most usable words (190–205 per call with GPT-4o Mini), while constrained generation and topic switching achieve the highest language fidelity (100% for both models). Hausa is sufficiently represented in LLM training data to sustain generation across diverse task types.

For Fongbe, constrained generation emerges as the most reliable strategy: highest language fidelity (100% Gemini, 88% GPT-4o Mini), best diacritic ratios, and lowest code-switching. Communities working with extremely low-resource languages should prioritize constrained generation prompts that explicitly require monolingual output and specify target-language vocabulary.

The Hausa–Fongbe gap is consistently large: GPT-4o Mini achieves 100% vs. 53% language fidelity, produces 2.7× more usable words per call, and exhibits 4–10× lower code-switching rates. This disparity likely reflects training data representation rather than inherent linguistic difficulty.

5.2. Training Data as a Confounding Factor

The performance gap between GPT-4o Mini and Gemini 2.5 Flash and between Hausa and Fongbe most plausibly reflects differences in training data composition rather than architectural differences per se.
Robinson et al. (2023) demonstrated that ChatGPT performance degrades sharply for languages underrepresented in web-crawled pretraining data, and Hendy et al. (2023) showed systematic quality drops correlate with web presence rather than linguistic complexity. Hausa has a substantial international media presence (BBC Hausa, VOA Hausa), whereas Fongbe has a minimal digital footprint. If Gemini’s training data includes proportionally less Hausa and Fongbe text than GPT-4o Mini’s, this would fully explain the extraction efficiency gap without invoking any architectural cause. Unfortunately, neither OpenAI nor Google discloses the language-level composition of their training data, making this hypothesis untestable with current information. Future work using open-weight models with documented training corpora (e.g., BLOOM, Llama variants) could help disentangle data from architecture effects.

5.3. Implications for Resource Creation

Our findings yield practical recommendations. First, model selection matters more than task selection: switching from Gemini to GPT-4o Mini increases Fongbe efficiency by 41×, whereas task variation within GPT-4o Mini yields only a 2× difference. Second, explicit constraints improve fidelity: constrained generation consistently achieves the highest language purity and diacritic accuracy. Third, post-hoc filtering is essential: even the best Fongbe condition produces 12% non-target outputs; GlotLID filtering can remove contaminated text. Fourth, cost-efficiency is compelling: GPT-4o Mini extracted 23,192 usable Hausa words and 8,574 Fongbe words for under $0.10, scalable to substantial corpora for under $10.
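The post-hoc filtering step can be sketched as below. The classifier is passed in as a callable so the sketch stays self-contained; in practice it would be a thin wrapper around GlotLID's fastText model, and the 0.5 confidence threshold is our assumption rather than a value from the paper:

```python
def filter_target_language(texts, predict, target="fon_Latn", min_conf=0.5):
    """Keep only texts the classifier assigns to the target language.

    `predict(text)` must return a (label, confidence) pair, e.g. a
    wrapper around GlotLID's fastText predict call that strips the
    "__label__" prefix from the top label.
    """
    kept = []
    for text in texts:
        label, conf = predict(text)
        if label == target and conf >= min_conf:
            kept.append(text)
    return kept
```

Because even the best Fongbe condition produced 12% non-target outputs, a filter of this shape is the last step before any generated text is counted as usable corpus data.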
6. Conclusion and Future Work

We presented an exploratory evaluation of six LLM elicitation strategies for extracting usable text data for Hausa and Fongbe. While the scale of this study is limited, our initial findings suggest three trends worth investigating further. First, GPT-4o Mini produces substantially more usable text than Gemini 2.5 Flash, yielding 6× more Hausa words and 41× more Fongbe words per API call. Second, elicitation strategies appear to be language-dependent: Hausa benefits from volume-maximizing tasks (functional text, dialogue), while Fongbe appears to require constraint-heavy prompts (constrained generation). Third, the Hausa–Fongbe performance gap is consistent across conditions, suggesting that LLM-based extraction may currently be more viable for moderately resourced languages. These findings are preliminary and will require validation at larger scale and across additional languages and models.

Future work will include additional LLMs (Claude Sonnet, open-source and African-language-focused models), human evaluation with native speakers, downstream utility testing on MasakhaNER 2.0 and MasakhaPOS, target-language prompting experiments, larger prompt samples for statistical robustness, and extension to additional African languages.

7. Limitations

This study has several limitations. First, we evaluate only two commercial LLMs; the performance gap we observe may not generalize to open-source or African-language-focused models.
Second, our evaluation relies entirely on automatic metrics; human evaluation by native speakers is essential, particularly for Fongbe, where GlotLID misidentifies 47% of GPT-4o Mini outputs despite many likely containing valid Fongbe with code-switching: misidentification does not necessarily imply low linguistic quality, but may reflect code-switching that the classifier penalizes. Third, we do not evaluate downstream task utility: whether extracted corpora improve NER or POS tagging performance remains to be tested. Fourth, with 25 prompts per task type, sample sizes are modest; while directional patterns are consistent across conditions, larger experiments would enable more robust statistical comparisons. Fifth, all prompts are written in English; target-language prompting may yield different results. Sixth, our memorization assessment relies on reference overlap as an indirect proxy, though the uniformly low cosine similarity values (<0.12) suggest verbatim reproduction is unlikely. Finally, our methodology relies on commercial APIs, introducing cost barriers and reproducibility concerns; future work should investigate open-source alternatives.

8. Ethics Statement

Data provenance and community benefit. We acknowledge that LLMs were trained on data contributed by language communities, often without explicit consent. Our work aims to redirect encoded knowledge back to these communities. All generated data will be released under CC-BY-4.0.

Quality and potential harms. LLM-generated text may contain errors, inaccuracies, or hallucinated content. We document the synthetic nature of all corpora and recommend native speaker validation before production use.

Commercial API usage. Our methodology relies on commercial APIs, introducing cost barriers. Future work should investigate open-source alternatives.
9. Data and Code Availability

All generated corpora, prompts, generation scripts, and evaluation code will be made publicly available upon acceptance under a CC-BY-4.0 license. The evaluation pipeline, including the GlotLID-based language fidelity assessment, is provided for full reproducibility.

10. Acknowledgements

The authors thank the reviewers for their constructive feedback. We acknowledge the Masakhane community for their foundational contributions to African NLP resources, particularly the MasakhaNER 2.0 and MasakhaPOS datasets used in our evaluation. This publication was developed as part of the Center for Inclusive Digital Transformation of Africa (CIDTA) and the Afretec Network, which is managed by Carnegie Mellon University Africa and receives financial support from the Mastercard Foundation. The views expressed in this document are solely those of the authors and do not necessarily reflect those of Carnegie Mellon University or the Mastercard Foundation.

11. Bibliographical References

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, et al. 2022a. A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

David Adelani, Shamsuddeen Muhammad, et al. 2024. Fikira: Multilingual reasoning dataset for African languages. Masakhane Project Technical Report.
2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Jesujoba Alabi, David Adelani, Marius Mosbach, and Dietrich Klakow. 2022. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics.

Jesujoba Oluwadara Alabi, Israel Abebe Azime, et al. 2025. AFRIDOC-MT: Document-level MT corpus for African languages. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451. Association for Computational Linguistics.

Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition. In Proceedings of the 28th International Conference on Computational Linguistics.

Cheikh M Bamba Dione, David Adelani, et al. 2023. MasakhaPOS: Part-of-speech tagging for typologically diverse African languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Michael Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2021. A survey on recent
approaches for natural language processing in low-resource scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, et al. 2023. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293. Association for Computational Linguistics.

Amir Hossein Kargaran, Ayyoob Imani, François Yvon, and Hinrich Schütze. 2023. GlotLID: Language identification for low-resource languages. https://huggingface.co/cis-lmu/glotlid. Version 1.0.

Claire Lefebvre and Anne-Marie Brousseau. 2002. A Grammar of Fongbe. Mouton de Gruyter, Berlin.

Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, et al. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160. Association for Computational Linguistics.

Paul Newman. 2000. The Hausa Language: An Encyclopedic Reference Grammar. Yale University Press.

Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual
Representation Learning. Association for Computational Linguistics.

Iroro Orife, Julia Kreutzer, Bonaventure Dossou, Chris Emezue, et al. 2020. Masakhane – machine translation for Africa. arXiv preprint arXiv:2003.11529.

Nathaniel Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. 2023. ChatGPT MT: Competitive for high- (but not low-) resource languages. In Proceedings of the Eighth Conference on Machine Translation. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Chenxi Whitehouse et al. 2023. LLM-powered data augmentation for enhanced cross-lingual performance. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Appendix A. Code and Data Repository

All prompts, generation scripts, evaluation code, and generated corpora are publicly available
at: https://github.com/Pericles001/mining_llm_low_resource_languages_fon_hau/tree/main

The repository is organised as follows:

• prompts/: JSON files containing all 150 prompts per language, organised by task type
• src/: Core modules for generation (generator.py), evaluation (evaluator.py), and language detection (language_detector.py)
• scripts/: CLI entry points for generation, evaluation, and analysis
• outputs/: Raw LLM outputs organised by model, language, and task type
• results/: Aggregated evaluation results, figures, and LaTeX tables

Appendix B. Supplementary Figures

Figure 1: Extraction efficiency: usable target-language words per API call, by model, language, and task type. GPT-4o Mini dominates across all conditions; the gap is most extreme for Fongbe.

Figure 2: Document-level target language detection heatmap (GlotLID). Green = high fidelity; red = low. Hausa is uniformly high for GPT-4o Mini; Fongbe fidelity depends strongly on task type.

Figure 3: Fongbe diacritic analysis by task type. Left: diacritic-to-alphabetic ratio; Right: proportion of outputs containing any diacritics. Constrained generation reliably elicits diacritics from both models.

Figure 4: Character trigram cosine similarity between generated Hausa text and MasakhaNER 2.0 training text, used as a proxy for potential memorization. All values are well below 0.15 (dashed line), suggesting outputs represent novel generation rather than training data reproduction.

Figure 5: Sentence-level code-switching rates by model, language, and task type. Constrained generation consistently achieves the lowest code-switching. Fongbe shows much higher rates than Hausa
across all tasks.

Appendix C. Full Evaluation Summary

Table 6 reports all evaluation metrics across all 24 conditions (2 models × 2 languages × 6 task types). Quality is a composite score averaging language confidence and inverse code-switching rate.

Model    Lang  Task          Valid%  Words  TTR    Hapax  Vocab  CS     LangConf  Quality
Gemini   fon   constrained   0.20    17.1   0.891  0.802  14.7   0.087  0.998     0.927
Gemini   fon   creative      0.28    17.6   0.932  0.876  16.4   0.387  0.782     0.791
Gemini   fon   dialogue      0.80    22.8   0.880  0.787  20.2   0.913  0.744     0.555
Gemini   fon   functional    0.36    16.7   0.883  0.795  14.7   0.420  0.929     0.816
Gemini   fon   structured    0.40    18.5   0.955  0.915  17.7   0.660  0.616     0.645
Gemini   fon   topic switch  0.04    15.0   0.892  0.800  13.4   0.093  0.995     0.941
Gemini   hau   constrained   0.92    34.0   0.822  0.704  27.2   0.027  1.000     0.921
Gemini   hau   creative      0.76    26.8   0.918  0.854  23.5   0.227  0.912     0.867
Gemini   hau   dialogue      0.52    19.9   0.920  0.859  18.2   0.490  0.873     0.817
Gemini   hau   functional    0.68    20.1   0.923  0.853  18.5   0.127  0.995     0.914
Gemini   hau   structured    0.56    20.1   0.946  0.904  18.9   0.367  0.856     0.779
Gemini   hau   topic switch  0.88    72.8   0.812  0.707  52.7   0.027  1.000     0.919
GPT-4o   fon   constrained   1.00    77.6   0.544  0.375  31.7   0.246  0.937     0.869
GPT-4o   fon   creative      1.00    90.0   0.581  0.405  50.0   0.640  0.703     0.688
GPT-4o   fon   dialogue      1.00    144.5  0.458  0.275  65.5   0.598  0.868     0.736
GPT-4o   fon   functional    1.00    153.0  0.479  0.325  73.5   0.627  0.870     0.675
GPT-4o   fon   structured    1.00    91.8   0.597  0.486  48.1   0.498  0.862     0.731
GPT-4o   fon   topic switch  1.00    156.8  0.477  0.321  70.4   0.661  0.828     0.646
GPT-4o   hau   constrained   1.00    124.6  0.667  0.520  73.6   0.012  1.000     0.898
GPT-4o   hau   creative      1.00    103.7  0.674  0.512  67.5   0.066  1.000     0.887
GPT-4o   hau   dialogue      1.00    190.0  0.578  0.408  107.6  0.144  1.000     0.871
GPT-4o   hau   functional    1.00    204.8  0.602  0.448  117.4  0.146  1.000     0.874
GPT-4o   hau   structured    1.00    114.5  0.708  0.603  73.1   0.163  0.930     0.889
GPT-4o   hau   topic switch  1.00    190.1  0.628  0.489  116.6  0.028  1.000     0.891
Table 6: Full evaluation summary across all 24 conditions. CS = code-switching rate; LangConf = GlotLID language confidence score; Quality = composite score.
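The diversity columns in Table 6 (Words, Vocab, TTR, Hapax) can be approximated from a token list. A minimal sketch, assuming TTR is the type/token ratio and Hapax is the share of types occurring exactly once; the released evaluator.py may define these metrics differently:

```python
from collections import Counter

def diversity_stats(tokens):
    """Approximate the per-response diversity metrics in Table 6.

    Assumes TTR = types / tokens and Hapax = once-occurring types /
    types; the paper's exact definitions may differ.
    """
    counts = Counter(t.lower() for t in tokens)
    n_types = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "words": len(tokens),
        "vocab": n_types,
        "ttr": n_types / len(tokens),
        "hapax": hapax / n_types,
    }
```

Averaging these per-response values over the 25 prompts of a task type would yield one row of Table 6.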
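The memorization proxy behind Figure 4 can be computed with no external libraries. A sketch of character-trigram cosine similarity under our reading of the caption; the released evaluation code may preprocess or normalize text differently:

```python
# Sketch: character-trigram cosine similarity between two texts,
# as a proxy for overlap with training data (cf. Figure 4).
from collections import Counter
import math

def trigram_counts(text):
    text = " ".join(text.lower().split())  # collapse whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_sim(a, b):
    ca, cb = trigram_counts(a), trigram_counts(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Values near 1.0 would indicate near-verbatim reproduction; the paper reports all values well below 0.15.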
Appendix D. Prompt Taxonomy Details

This appendix documents the structure and rationale of all 150 prompts per language (6 task types × 25 prompts). Each task type is divided into subtasks to ensure domain coverage. All prompts use three placeholders: {language}, {language_culture}, and {colonial_language}, substituted at generation time.
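The placeholder mechanics can be sketched with plain str.format. The dictionary keys and the language pairings below (French as the colonial contact language for Fongbe, English for Hausa) are illustrative assumptions, not values taken from the released prompts/ files:

```python
# Sketch of placeholder substitution at generation time. The FILLERS
# strings are illustrative assumptions, not the paper's exact values.
FILLERS = {
    "fon": {"language": "Fongbe", "language_culture": "Fon culture",
            "colonial_language": "French"},
    "hau": {"language": "Hausa", "language_culture": "Hausa culture",
            "colonial_language": "English"},
}

def render(template, lang_code):
    # str.format ignores unused keyword arguments, so every template
    # can be rendered from the same per-language dictionary.
    return template.format(**FILLERS[lang_code])
```

Because unused keys are ignored, a template mentioning only {language} and {colonial_language} renders cleanly from the same dictionary as one that also uses {language_culture}.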
A. Constrained Generation (cg_01–cg_25)
Subtasks: vocabulary-constrained (cg_01–05), no-code-switching (cg_06–10), length-constrained (cg_11–15), technical-monolingual (cg_16–20), structure-constrained (cg_21–25).
Design rationale: Constrained generation prompts impose explicit linguistic constraints to prevent code-switching and test the model's ability to generate monolingual output. Vocabulary-constrained prompts seed the output with target-language words, reducing the risk of the model falling back to colonial language vocabulary for unknown concepts. Technical-monolingual prompts specifically target domains (computing, electricity, banking) where Fongbe and Hausa lack standard terminology, forcing the model to paraphrase rather than borrow.
Representative templates:

• cg_01: “Write a short paragraph in {language} using ALL of the following words: {word_list_1}. Do not use any {colonial_language} words.”

• cg_06: “Write a story in {language} about a day at the market. You must write ONLY in {language}. If you do not know a word in {language}, describe the concept using other {language} words instead of switching to {colonial_language}.”
Word lists for vocabulary-constrained prompts are
provided in the released data.
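A post-hoc check for the vocabulary-constrained subtask can be sketched as follows. The required and banned word lists are illustrative, and the paper's evaluator may use token-level language identification rather than a simple stoplist:

```python
# Sketch: verify a vocabulary-constrained output (cg_01-05 style).
# All `required` words must appear; `banned` words (e.g. known
# colonial-language vocabulary) must not. Lists are assumptions.
import string

def satisfies_constraints(text, required, banned):
    tokens = {t.strip(string.punctuation).lower() for t in text.split()}
    return (all(w.lower() in tokens for w in required)
            and not any(w.lower() in tokens for w in banned))
```

Outputs failing the check could be discarded or regenerated before entering the corpus.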
B. Creative Writing (cw_01–cw_25)
Subtasks: poem (cw_01–05), folktale (cw_06–10), story (cw_11–15), song (cw_16–20), proverb (cw_21–25).
Design rationale: Creative writing prompts test
deep cultural and linguistic knowledge by eliciting
culturally rooted content (folktales, proverbs) that
requires the model to draw on language-specific
cultural knowledge, not just translation of English
concepts. Folktales and proverbs are particularly
valuable as they are community-specific and cannot
easily be produced by back-translation.
Representative templates:
• cw_06: “Write a traditional folktale in {language}
about a clever tortoise who outsmarts a lion. The
story should be 5–10 sentences long.”
C. Dialogue (dl_01–dl_25)
Subtasks: conversation (dl_01–05), professional (dl_06–10), family (dl_11–15), interview (dl_16–20), negotiation (dl_21–25).
Design rationale: Dialogue prompts elicit colloquial register and spoken-form text, which is underrepresented in formal corpora. The negotiation and professional subtasks target domains with specialized vocabulary (medical, agricultural, financial), which helps expand domain coverage of the resulting corpus.
D. Functional Text (ft_01–ft_25)
Subtasks: letter (ft_01–05), instructions (ft_06–10),
news (ft_11–15), recipe (ft_16–20), announcement
(ft_21–25).
Design rationale: Functional text prompts target practical domains that are immediately useful for downstream NLP tasks (e.g., news classification, instruction following). These genres are typically well-represented in NLP benchmarks but under-resourced for African languages.
E. Structured Knowledge (sk_01–sk_25)
Subtasks: definition (sk_01–05), cultural explanation (sk_06–10), grammar examples (sk_11–15), vocabulary list (sk_16–20), translation (sk_21–25).
Design rationale: Structured knowledge prompts elicit the model's metalinguistic knowledge, producing high-density lexical output (vocabulary lists, grammar examples) that is directly usable for dictionary construction and grammar documentation.
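As an example of how such output could feed dictionary construction, vocabulary-list responses can be scraped into term/gloss pairs. The "term - gloss" line format is an assumption about output shape, not a format the paper specifies:

```python
# Sketch: convert a generated vocabulary list into dictionary entries.
# Assumes one "term - gloss" pair per line; real LLM output would need
# more robust parsing and native-speaker validation, as the paper
# recommends for all synthetic data.
def parse_vocab_list(raw):
    entries = {}
    for line in raw.splitlines():
        if " - " in line:
            term, gloss = line.split(" - ", 1)
            if term.strip() and gloss.strip():
                entries[term.strip()] = gloss.strip()
    return entries
```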
F. Topic Switching (ts_01–ts_25)
Subtasks: domestic-to-sports and related binary switches (ts_01–10), narrative shift (ts_11–15), multi-topic (ts_16–20), knowledge switch (ts_21–25).
Design rationale: Topic-switching prompts stress-test language maintenance by requiring the model to continue in the target language after transitioning to a domain (technology, politics, science) that is more commonly discussed in the colonial contact language. This probes whether language fidelity holds under topic-induced pressure to code-switch.
Representative templates:

• ts_25: “In {language}, describe a funeral ceremony. Then, in the same response and still in {language}, explain what artificial intelligence is.”
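The sentence-level code-switching rate that these probes are scored on (Figure 5) can be sketched as below, with detect standing in for a per-sentence language identifier such as GlotLID. The sentence splitter and the all-or-nothing per-sentence decision are our assumptions; the released code may differ:

```python
import re

def code_switching_rate(text, target, detect):
    """Share of sentences whose detected language is not the target.

    `detect` is a stand-in for a per-sentence language identifier
    (e.g. GlotLID); this definition is one reading of Figure 5.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    switched = sum(1 for s in sentences if detect(s) != target)
    return switched / len(sentences)
```

Under this definition, a topic-switching response that holds the target language throughout scores 0.0, while one that drifts into the colonial language for the technical half scores close to 0.5.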