The Token Tax: Systematic Bias in Multilingual Tokenization

Summary

This study examines how tokenization inefficiency creates systemic disadvantages for morphologically complex, low-resource languages in large language models (LLMs). Using the AfriMMLU benchmark, which includes 9,000 multiple-choice questions across 16 African languages and five subjects, the authors show that higher token fertility (tokens per word) consistently correlates with lower model accuracy. Regression analyses across 10 models report slopes ranging from -0.08 to -0.18, meaning each additional token per word is associated with an 8–18 percentage point drop in accuracy. Reasoning models such as DeepSeek and o1 outperform non-reasoning models by 8–12 points on African languages, narrowing the gap to English without eliminating it. Economically, token inflation raises training cost quadratically and inference cost and latency linearly: a language with double the token count of English incurs four times the training cost and time, and double the inference cost and latency. These findings highlight the "token tax" faced by speakers of low-resource languages and call for morphologically aware tokenization, fair pricing, and expanded multilingual benchmarks to promote equitable NLP.

The Token Tax: Systematic Bias in Multilingual
Tokenization
Jessica M. Lundin
Institute for Disease Modeling
Gates Foundation
Ada Zhang
University of San Francisco
Nihal Karim
University of San Francisco
Hamza Louzan
University of San Francisco
Victor Wei
University of San Francisco
David Adelani
McGill University
Cody Carroll
University of San Francisco
Abstract
Tokenization inefficiency imposes structural disadvantages on morphologically
complex, low-resource languages, inflating compute resources and depressing accuracy.
We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA
items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably
predicts accuracy: higher fertility consistently predicts lower accuracy across
all models and subjects. We further find that reasoning models (e.g., DeepSeek,
o1) consistently outperform non-reasoning peers across high- and low-resource
languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior
generations. Finally, translating token inflation to economics, a doubling in tokens
results in quadrupled training cost and time, underscoring the “token tax” faced by
many languages. These results motivate morphologically aware tokenization, fair
pricing, and multilingual benchmarks for equitable natural language processing
(NLP).
1 Introduction
Prior work decisively establishes tokenization as a source of computational and economic inequality
in NLP systems, with quantified impacts ranging from larger token counts to sizable BLEU point
performance degradation for morphologically complex low-resource languages Petrov et al. [2023a].
The mathematical reality of O(n^2) attention scaling Keles et al. [2022] combined with large fertility
values creates compound disadvantages that no amount of engineering optimization can fully
overcome within current architectures Sreedhar et al. [2023]. These technical disparities translate
directly into economic exclusion through tokenization taxes, prohibitive training and inference cost
measured in dollars and tons of CO2, and systematic underrepresentation in model capabilities that
affects billions of speakers worldwide.
Training a small-to-medium model can easily cost $1M (about 1 month), and a large frontier model
$100M (~3 months), when the corpus consists primarily of English tokens. If we instead train on a
language with 2× or 5× more tokens for the same content, the transformer’s quadratic O(n^2) compute
scaling means costs do not grow linearly. The result is a nonlinear 4× or 25× increase in energy
consumption, dollar cost, training time, and CO2 emissions relative to English. In this example, the
small-to-medium model would cost $4-25M (4 months to about 2 years) and the frontier model
$400M-2.5B (1 to about 6 years).
Our contributions are as follows.
Preprint. Under review.
arXiv:2509.05486v1 [cs.CL] 5 Sep 2025

• We extend prior fertility and accuracy analysis to 10 models and 16 languages, confirming
fertility as a reliable predictor of MCQA accuracy.
• We conduct the first large-scale comparison of tokenization effects for reasoning vs. non-
reasoning LLMs on AfriMMLU, showing that reasoning capabilities substantially reduce
but do not eliminate tokenization bias.
• We release public datasets containing: i. model results from AfriMMLU benchmark
including reasoning models, ii. MMLU token metrics.
2 Methods
The overall architecture of our methodology is as follows.
for each language ℓ in corpus do
    for each model m do
        Calculate number of tokens using model m’s tokenizer
        Calculate fertility scores for language ℓ
        Run MCQA inference to obtain accuracy for language ℓ
    end for
end for
Fit linear regressions of accuracy on fertility for each model-subject pair, and summarize slopes
and explained variance.
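The final regression step of the procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout, the function names `fit_line` and `regress_by_group`, and the toy numbers are all assumptions, and the earlier steps (tokenization, fertility, MCQA inference) are taken as already computed.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r2), where r2 is the share of variance explained.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r2


def regress_by_group(records):
    """Fit one accuracy-on-fertility regression per (model, subject) pair."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["model"], rec["subject"])].append(rec)
    return {
        key: fit_line([r["fertility"] for r in rows], [r["accuracy"] for r in rows])
        for key, rows in groups.items()
    }


# Toy records (made-up values, one row per language) for a hypothetical model.
toy_records = [
    {"model": "model-a", "subject": "geography", "language": "L1", "fertility": 1.0, "accuracy": 0.80},
    {"model": "model-a", "subject": "geography", "language": "L2", "fertility": 2.0, "accuracy": 0.70},
    {"model": "model-a", "subject": "geography", "language": "L3", "fertility": 3.0, "accuracy": 0.60},
]
results = regress_by_group(toy_records)
intercept, slope, r2 = results[("model-a", "geography")]
print(f"intercept={intercept:.3f} slope={slope:.3f} r2={r2:.3f}")
```

Each fit uses one point per language, so the slope reads as the change in accuracy (as a fraction) per additional token per word.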
We apply the methodology to AfriMMLU Adelani et al. [2025], which covers 5 subjects (elementary
mathematics, global facts, high school geography, high school microeconomics, and international
law) translated into 16 African languages, with a total of 9,000 MCQA records.
For the mixed-effects analysis, model selection across random-effects structures was conducted using AIC.
The AIC-selected model favored a random-effects structure that included both a random intercept and a
random slope, suggesting that fertility’s impact on accuracy is language-dependent.
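One way to write down such a structure is sketched below. It is an assumption rather than the authors' exact specification, since the grouping factor and any further covariates are not spelled out in the text: a linear mixed model with a language-specific random intercept and a language-specific random slope on fertility.

```latex
\mathrm{acc}_{ij} = (\beta_0 + b_{0j}) + (\beta_1 + b_{1j})\, F_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here j indexes languages, i indexes observations within a language (for instance model-subject pairs), F is fertility, and a nonzero variance for b_{1j} is what would make the fertility effect language-dependent.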
3 Results and Discussion
3.1 Model Inference on MMLU
Figure 1 shows MMLU accuracy of the African languages relative to English and French. Consistent
with prior work, African languages show large performance gaps relative to high-resource languages.
On average, African languages trail English by 25 accuracy points, with French typically falling
between the two.
Encouragingly, reasoning-oriented models (DeepSeek and o1) substantially reduce this disparity.
Across subjects, these models outperform non-reasoning peers by 8-12 points on African languages,
while maintaining strong performance in English. In Global Facts, the most challenging subject, the
gap between English and African languages narrows from 25 points in baseline models to 12-14
points under reasoning models. These results suggest that improved reasoning capabilities directly
benefit low-resource settings.
3.2 Fertility and Accuracy
Figure 2 shows the fertility and accuracy of Llama 3.1 405B model across 5 subjects. See Appendix
for Figure 3 and Table 1 of regression results including uncertainty analysis. Across all 10 models
and five subjects, higher fertility is consistently associated with lower accuracy. Linear regressions
quantify this relationship: slopes range from −0.08 to −0.18 (with accuracy expressed as a fraction),
meaning each additional token per word is associated with an 8-18 percentage point drop in accuracy,
depending on subject and model.
Table 1 reports model- and subject-specific regressions. Several effects are large and statistically
significant, such as Llama-3.1-405B on Microeconomics (slope = −0.185, p = 0.002) and Qwen-2.5-
32B on Geography (slope = −0.155, p = 0.006). Fertility explains 20-50% of variance in accuracy
across these regressions, underscoring its importance as a predictor.
Taken together, these findings show that tokenization bias is not incidental but systematically erodes
model performance in proportion to fertility.
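To make the slope interpretation concrete, the short sketch below plugs one row of Table 1 into the fitted line. The intercept and slope are the reported values for Llama3.1 405B on High School Microeconomics; the fertility values and the helper name `predicted_accuracy` are illustrative assumptions, not measurements from the paper.

```python
def predicted_accuracy(intercept, slope, fertility):
    """Fitted accuracy (as a fraction) at a given fertility, from a linear fit."""
    return intercept + slope * fertility


# Coefficients reported in Table 1 for Llama3.1 405B, High School Microeconomics.
INTERCEPT, SLOPE = 0.953, -0.185

# Fertility values below are illustrative, not measurements from the paper.
for f in (1.0, 2.0, 3.0):
    fitted = predicted_accuracy(INTERCEPT, SLOPE, f)
    print(f"fertility={f:.1f} -> fitted accuracy={fitted:.3f}")

# Change in accuracy per extra token per word, in percentage points.
points_per_token = 100 * SLOPE
```

Under this fit, moving from 2 to 3 tokens per word lowers the fitted accuracy by 18.5 percentage points.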
(a) Accuracy Aggregation (English) (b) Accuracy Aggregation (French)
Figure 1: Accuracy in English (upper panel) and 17* combined African languages (middle panel)
across the 5 MMLU subjects with the performance gap (bottom panel). *There are 17 languages
including Amharic. In the ME analysis we did not include Amharic because it shares less similarity
with the remaining languages.
Figure 2: Fertility vs accuracy trade-offs for Llama 3.1 405B model across five experimental
conditions.
4 Economic Consequences of Token Inflation
Beyond accuracy, token inflation directly increases computational cost. Because transformer training
scales quadratically in sequence length, a 2× increase in fertility produces a 4× increase in training
time and cost. Table 2 shows that training Llama-3.1-405B costs $105M in English but $420M in a
language with double fertility.
Inference costs are similarly inflated. As shown in Table 3, generating 1M English-equivalent tokens
with GPT-4o costs $5-20, while the same content in a 2× fertility language costs $10-40. Latency
also doubles: A prompt + completion that requires 2 seconds in English typically takes 4 seconds in
higher-fertility languages.
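The inference figures above follow from linear scaling in token count, which can be sketched as follows. The function name `scale_inference` is hypothetical, and the example assumes the same per-token price and per-token decoding speed in both languages.

```python
def scale_inference(price_in, price_out, latency_s, token_ratio):
    """Scale per-content price and latency linearly with the token ratio.

    price_in / price_out: USD per 1M English-equivalent input / output tokens.
    token_ratio: tokens in the target language divided by tokens in English
    for the same content (2.0 in the example above).
    """
    return price_in * token_ratio, price_out * token_ratio, latency_s * token_ratio


# GPT-4o row of Table 3 and the 2-second latency example from the text.
cost_in, cost_out, latency = scale_inference(5.0, 20.0, 2.0, token_ratio=2.0)
print(cost_in, cost_out, latency)
```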
These disparities demonstrate how tokenization bias manifests as a “token tax” paid disproportionately
by speakers of morphologically complex, low-resource languages.
5 Conclusion
This study demonstrates that tokenization inefficiency imposes systematic disadvantages on low-
resource, morphologically complex languages. Across 10 large language models and 16 African
languages in AfriMMLU, we find that fertility (tokens per word) strongly predicts model accuracy,
with higher fertility consistently associated with poorer performance. Regression analyses show
effect sizes as large as −0.18, explaining up to half the variance in accuracy.
Encouragingly, reasoning models DeepSeek and o1 substantially narrow accuracy gaps, improving
African language performance by 8-12 points on average and cutting the English-African disparity
nearly in half. Nevertheless, large differences remain, underscoring that better reasoning alone does
not eliminate inequities rooted in tokenization.
We further show that token inflation has severe economic implications. Doubling fertility leads to
4× increases in training cost and time and roughly 2× increases in inference cost and latency, turning
linguistic diversity into a computational liability. These disparities make clear that tokenization bias
is not a minor technical artifact but a systemic barrier to equitable NLP.
Moving forward, addressing this barrier will require interventions at multiple levels: technical
(morphologically aware tokenization, efficient attention mechanisms), economic (pricing structures
that do not penalize high-fertility languages), and benchmarking (expansion of multilingual evaluation
datasets like AfriMMLU). Only by aligning progress across these fronts can NLP avoid a future
where billions of speakers are excluded from the benefits of language technology.
References
David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Oluwadara
Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee,
Chiamaka Ijeoma Chukwuneke, Happy Buzaaba, Blessing Kudzaishe Sibanda, Godson Koffi Kalipe,
Jonathan Mukiibi, Salomon Kabongo Kabenamualu, Foutse Yuehgoh, Mmasibidi Setaka,
Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Salomey Osei, Shamsuddeen Hassan Muhammad,
Sokhar Samb, Tadesse Kebede Guge, Tombekai Vangoni Sherman, and Pontus Stenetorp.
IrokoBench: A new benchmark for African languages in the age of large language models. In Luis
Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations
of the Americas Chapter of the Association for Computational Linguistics: Human Language
Technologies (Volume 1: Long Papers), pages 2732–2757, Albuquerque, New Mexico, April
2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6.
doi: 10.18653/v1/2025.naacl-long.139. URL https://aclanthology.org/2025.naacl-long.139/.
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes
Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber,
Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff,
Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, and Nicolas Flores-Herr. Tokenizer choice for
LLM training: Negligible or crucial? In Kevin Duh, Helena Gomez, and Steven Bethard, editors,
Findings of the Association for Computational Linguistics: NAACL 2024, pages 3907–3924,
Mexico City, Mexico, June 2024. Association for Computational Linguistics.
doi: 10.18653/v1/2024.findings-naacl.247. URL https://aclanthology.org/2024.findings-naacl.247/.
Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. On the computational
complexity of self-attention, 2022. URL https://arxiv.org/abs/2209.04881.
Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers
introduce unfairness between languages. In Thirty-seventh Conference on Neural Information
Processing Systems, 2023a. URL https://openreview.net/forum?id=78yDLKi95p.
Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. Language model tokenizers
introduce unfairness between languages, 2023b. URL https://arxiv.org/abs/2305.15425.
Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, and Junjie Hu. Local byte fusion for neural
machine translation, 2023. URL https://arxiv.org/abs/2205.11490.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
URL https://arxiv.org/abs/1706.03762.
A Regression Results by Subject
Table 1 and Figure 3 show results of accuracy-on-fertility regressions for the 10 models over 5
subjects.
B Fertility and Parity
Fertility measures the average number of tokens required to represent a word in a corpus:
$$
F = \frac{T}{W}
$$
for T and W the token and word counts. Higher F inflates sequence length, which hampers a model’s
ability to learn long-range dependencies and raises compute costs [Ali et al., 2024].
Parity [Petrov et al., 2023b] compares the lengths of token sequences in one language with the
counterpart language translation:
$$
\text{Parity} = \frac{|t(s_A)|}{|t(s_B)|}
$$
where t(s) is the token sequence the tokenizer produces for sentence s, and s_A and s_B are
translations of the same content in languages A and B.
Parity scores close to 1 indicate consistency in tokenizing across different languages. Scores above
1 indicate less efficiency in language A relative to language B, which was taken as English for all
parity calculations in this paper.
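Both metrics reduce to simple ratios once a tokenizer is available. The sketch below is illustrative only: `toy_tokenize` is a hypothetical stand-in for a real model tokenizer, words are counted by whitespace splitting (an assumption; the paper does not state its word segmentation), and the example strings are not real translations of each other.

```python
def fertility(text, tokenize):
    """F = T / W: tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def parity(text_a, text_b, tokenize):
    """|t(s_A)| / |t(s_B)| for two texts that express the same content."""
    tokens_b = tokenize(text_b)
    if not tokens_b:
        raise ValueError("reference text produced no tokens")
    return len(tokenize(text_a)) / len(tokens_b)


def toy_tokenize(text, max_len=4):
    """Hypothetical tokenizer: cuts each word into pieces of at most max_len characters."""
    return [w[i:i + max_len] for w in text.split() for i in range(0, len(w), max_len)]


# Toy strings, not real translations of each other.
print(fertility("tokenization is unfair", toy_tokenize))
print(parity("tokenization is unfair", "it is bad", toy_tokenize))
```

To measure a real model, `tokenize` would be replaced by a call into that model's own tokenizer, with English as the reference text in `parity`.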
C Inference
C.1 Training and Inference Cost Comparison: English vs. Language X
This is a thought exercise in training and inference costs for LLMs applied to English and Language
X. The analysis assumes the same model architecture and tokenizer across languages, with cost
differences due to tokenization inefficiencies and quadratic O(n^2) training scaling Vaswani et al.
[2023] of transformer models.
We assume Language X has a fixed 2× increase in tokens across tokenizers (although there are
variations not included here). We assume English has 1 000 000 tokens (baseline) and Language X
approximately 2 000 000 tokens for equivalent content. There is a 2^2 = 4× increase in training time
and cost.
Table 2 reports the resulting cost estimates, using published petaFLOP-day figures and assuming a
compute cost of $240 per petaFLOP-day.
In addition to cost, token inflation impacts time. Transformer models scale with O(n^2) in sequence
length. With a 2× token increase, Language X requires 2^2 = 4× more compute. This means
training that takes 90 days for English would take ∼360 days for Language X on the same hardware.
For inference time, decoding scales approximately linearly with token count. A prompt+completion
that takes 2 seconds in English may take about 4 seconds in Language X.
These multipliers apply whether the additional tokens appear in the input (prompt) or output
(completion), and they exacerbate cost disparities for low-resource languages.
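The training-side arithmetic of this thought exercise can be reproduced with the sketch below. It takes the quadratic-scaling assumption and the $240 per petaFLOP-day price from the text and the petaFLOP-day figures from Table 2; the function names are hypothetical.

```python
USD_PER_PETAFLOP_DAY = 240  # compute price assumed in the text


def training_cost_usd(petaflop_days, token_ratio=1.0):
    """Training cost under the quadratic-scaling assumption used in this appendix."""
    return petaflop_days * USD_PER_PETAFLOP_DAY * token_ratio ** 2


def training_days(baseline_days, token_ratio):
    """Wall-clock training time on the same hardware, under the same assumption."""
    return baseline_days * token_ratio ** 2


# PetaFLOP-day figures from Table 2.
for name, pf_days in [("LLaMA 2", 21_000), ("LLaMA 3", 100_000), ("LLaMA 3.1", 440_000)]:
    english = training_cost_usd(pf_days)
    language_x = training_cost_usd(pf_days, token_ratio=2.0)
    print(f"{name}: English ${english / 1e6:.1f}M, Language X ${language_x / 1e6:.1f}M")

print(training_days(90, 2.0))
```

The unrounded products differ slightly from Table 2, which reports rounded dollar amounts.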
C.2 Prompt
You must only reply with ’Final Answer: X’ where X is A, B, C, or D.
Do NOT add explanations, reasoning, or extra text.
Question:
Table 1: Fertility vs. accuracy by model and subject. Results from linear models regressing accuracy
on translation fertility across 16 languages for each model-subject combination. Negative slopes
indicate that higher fertility (more diverse translations) correlates with lower accuracy. Bold p-values
indicate statistical significance (p < 0.05). Asterisks (*) indicate results that remain significant after
Benjamini-Hochberg FDR correction (FDR < 0.05). Bold ρ values indicate correlations with large
effect sizes (|ρ| ≥ 0.50). R^2 values show the proportion of variance in accuracy explained by fertility
and bold text indicates large effect sizes by Cohen’s convention (R^2 ≥ 0.25). Regressions for o1 are
not included because OpenAI has not released details on the tokenizer for this model.
Subject 	Model 	Intercept Slope Std. Error t-value P-value ρ 	R2 Adj. R2
Elementary Math 	Sonnet3.5 	0.652 -0.018 0.029 	-0.609 0.552 -0.155 0.024 -0.041
Aya23 35B 	0.480 -0.079 0.019 	-4.058 0.001* -0.723 0.523 0.492
DeepSeek R1 	1.045 -0.066 0.045 	-1.475 0.161 -0.356 0.127 0.068
DeepSeek V3 	0.884 -0.044 0.048 	-0.922 0.371 -0.232 0.054 -0.009
Gemini 1.5 Pro 0.907 -0.045 0.042 	-1.078 0.298 -0.268 0.072 0.010
Llama3.1 405B 0.836 -0.054 0.045 	-1.195 0.251 -0.295 0.087 0.026
Phi4 	0.773 -0.125 0.034 	-3.641 0.002* -0.685 0.469 0.434
GPT-4o 	1.002 -0.089 0.057 	-1.571 0.137 -0.376 0.141 0.084
Pixtral 12B 	0.417 -0.024 0.014 	-1.717 0.106 -0.405 0.164 0.109
Qwen2.5 32B 	0.857 -0.113 0.037 	-3.012 0.009* -0.614 0.377 0.335
Global Facts 	Sonnet3.5 	0.508 -0.011 0.027 	-0.390 0.702 -0.100 0.011 -0.056
Aya23 35B 	0.335 -0.044 0.023 	-1.930 0.073 -0.446 0.199 0.146
DeepSeek R1 	0.574 -0.002 0.038 	-0.061 0.952 -0.016 0.000 -0.066
DeepSeek V3 	0.619 -0.045 0.034 	-1.308 0.211 -0.320 0.102 0.042
Gemini 1.5 Pro 0.585 -0.011 0.051 	-0.222 0.827 -0.057 0.003 -0.063
Llama3.1 405B 0.685 -0.084 0.033 	-2.564 0.022 -0.552 0.305 0.258
Phi4 	0.408 -0.015 0.017 	-0.845 0.411 -0.213 0.045 -0.018
GPT-4o 	0.638 -0.063 0.054 	-1.169 0.261 -0.289 0.084 0.022
Pixtral 12B 	0.428 -0.038 0.018 	-2.068 0.056 -0.471 0.222 0.170
Qwen2.5 32B 	0.505 -0.052 0.024 	-2.171 0.046 -0.489 0.239 0.188
High School Geography 	Sonnet3.5 	0.781 -0.080 0.045 	-1.779 0.096 -0.417 0.174 0.119
Aya23 35B 	0.475 -0.097 0.038 	-2.512 0.024 -0.544 0.296 0.249
DeepSeek R1 	0.847 -0.082 0.056 	-1.466 0.163 -0.354 0.125 0.067
DeepSeek V3 	0.843 -0.124 0.053 	-2.331 0.034 -0.516 0.266 0.217
Gemini 1.5 Pro 0.750 -0.065 0.070 	-0.937 0.363 -0.235 0.055 -0.008
Llama3.1 405B 0.822 -0.131 0.055 	-2.359 0.032 -0.520 0.271 0.222
Phi4 	0.808 -0.162 0.048 	-3.343 0.004* -0.653 0.427 0.389
GPT-4o 	0.952 -0.151 0.068 	-2.211 0.043 -0.496 0.246 0.195
Pixtral 12B 	0.688 -0.121 0.035 	-3.414 0.004* -0.661 0.437 0.400
Qwen2.5 32B 	0.755 -0.155 0.049 	-3.190 0.006* -0.636 0.404 0.365
High School Microeconomics Sonnet3.5 	0.750 -0.096 0.042 	-2.307 0.036 -0.512 0.262 0.213
Aya23 35B 	0.549 -0.105 0.038 	-2.775 0.014 -0.582 0.339 0.295
DeepSeek R1 	0.888 -0.088 0.074 	-1.194 0.251 -0.295 0.087 0.026
DeepSeek V3 	0.906 -0.157 0.049 	-3.179 0.006* -0.634 0.403 0.363
Gemini 1.5 Pro 0.883 -0.129 0.067 	-1.920 0.074 -0.444 0.197 0.144
Llama3.1 405B 0.953 -0.185 0.050 	-3.691 0.002* -0.690 0.476 0.441
Phi4 	0.858 -0.184 0.053 	-3.479 0.003* -0.668 0.447 0.410
GPT-4o 	0.942 -0.150 0.084 	-1.779 0.096 -0.417 0.174 0.119
Pixtral 12B 	0.622 -0.105 0.033 	-3.179 0.006* -0.635 0.403 0.363
Qwen2.5 32B 	0.779 -0.154 0.048 	-3.196 0.006* -0.636 0.405 0.365
International Law 	Sonnet3.5 	0.645 -0.040 0.028 	-1.426 0.174 -0.346 0.119 0.061
Aya23 35B 	0.578 -0.101 0.042 	-2.403 0.030 -0.527 0.278 0.230
DeepSeek R1 	0.813 -0.043 0.043 	-1.010 0.329 -0.252 0.064 0.001
DeepSeek V3 	0.771 -0.073 0.045 	-1.617 0.127 -0.385 0.148 0.092
Gemini 1.5 Pro 0.796 -0.039 0.052 	-0.758 0.460 -0.192 0.037 -0.027
Llama3.1 405B 0.876 -0.096 0.038 	-2.548 0.022 -0.550 0.302 0.256
Phi4 	0.804 -0.101 0.041 	-2.452 0.027 -0.535 0.286 0.238
GPT-4o 	0.889 -0.085 0.072 	-1.175 0.258 -0.290 0.084 0.023
Pixtral 12B 	0.686 -0.095 0.033 	-2.859 0.012* -0.594 0.353 0.310
Qwen2.5 32B 	0.787 -0.092 0.040 	-2.297 0.036 -0.510 0.260 0.211
Table 2: Training compute and cost estimates for LLaMA models (USD).
Model             PetaFLOP-days  English ($)  Language X ($, 4×)
LLaMA 2 (69B)     21 000         5 M          20 M
LLaMA 3 (70B)     100 000        24 M         96 M
LLaMA 3.1 (405B)  440 000        105 M        420 M
Table 3: Inference cost per 1M English-equivalent tokens (USD), shown as input / output. Asterisks (*)
mark reasoning models.
Provider   Model (type)      English ($)   Language X ($, ∼2×)
OpenAI     GPT-4o            5 / 20        10 / 40
OpenAI     o4-mini*          4 / 16        8 / 32
Google     Gemini 2.5 Flash  0.30 / 2.50   0.60 / 5.00
Google     Gemini 2.5 Pro*   1.25 / 10     2.50 / 20
Anthropic  Claude 4 Sonnet   3 / 15        6 / 30
Anthropic  Claude 4 Opus*    15 / 75       30 / 150
C.2 Prompt (continued)
<question text>
Choices:
A. <option 1>
B. <option 2>
C. <option 3>
D. <option 4>
Your response must be strictly formatted as:
Final Answer: X
Figure 3: Fertility vs. accuracy trade-offs for the 10 models across five MMLU subjects including
correlation coefficient.