The Token Tax: Systematic Bias in Multilingual Tokenization
Summary
This study examines how tokenization inefficiency creates systemic disadvantages for morphologically complex, low-resource languages in large language models (LLMs). Using the AfriMMLU benchmark, which includes 9,000 multiple-choice questions across 16 African languages and five subjects, the authors show that higher token fertility (tokens per word) consistently correlates with lower model accuracy. Regression analyses across 10 models reveal slopes ranging from -0.08 to -0.18, meaning each additional token per word reduces accuracy by 8–18 percentage points. Reasoning models like DeepSeek and o1 outperform non-reasoning models, narrowing accuracy gaps by 8–12 points but not eliminating them. Economically, token inflation leads to quadratic increases in training and inference costs. A language with double the token count of English incurs four times the training cost and time, and double the inference latency. These findings highlight the "token tax" faced by speakers of low-resource languages and call for morphologically aware tokenization, fair pricing, and expanded multilingual benchmarks to promote equitable NLP.
PDF viewer
Chunks(12)
Chunk 0 · 1,996 chars
The Token Tax: Systematic Bias in Multilingual Tokenization Jessica M. Lundin Institute for Disease Modeling Gates Foundation Ada Zhang University of San Francisco Nihal Karim University of San Francisco Hamza Louzan University of San Francisco Victor Wei University of San Francisco David Adelani McGill University Cody Carroll University of San Francisco Abstract Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accu- racy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reli- ably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation to economics, a doubling in tokens results in quadrupled training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP). 1 Introduction Prior work decisively establishes tokenization as a source of computational and economic inequality in NLP systems, with quantified impacts ranging from larger token counts to sizable BLEU point performance degradation for morphologically complex low-resource languages Petrov et al. [2023a]. The mathematical reality of O(n2) attention scaling Keles et al. [2022] combined with large fertil- ity values creates compound disadvantages that no amount of engineering optimization can fully overcome within current architectures Sreedhar et al. [2023]. These technical disparities translate directly into economic exclusion through tokenization
Chunk 1 · 1,997 chars
(n2) attention scaling Keles et al. [2022] combined with large fertil- ity values creates compound disadvantages that no amount of engineering optimization can fully overcome within current architectures Sreedhar et al. [2023]. These technical disparities translate directly into economic exclusion through tokenization taxes, prohibitive training and inference cost measured in dollars and tons of CO2, and systematic underrepresentation in model capabilities that affects billions of speakers worldwide. A reasonable cost to train a small-medium model or a large frontier model is easily $1 M (1 month) to $100 M ( ˜3 months) with primarily English tokens. If we instead train on a language with 2× or 5× more tokens for the same content, the transformer’s quadratic O(n2) compute scaling means costs do not grow linearly. The result is a nonlinear 4× or 25× increase in energy consumption, dollar cost, training time, and CO2 emissions relative to English. In this example the cost is $4-25 M (4 months-2 years) and $400M-2.5B (1-6 years). Our contributions are as follows. Preprint. Under review. arXiv:2509.05486v1 [cs.CL] 5 Sep 2025 -- 1 of 8 -- • We extend prior fertility and accuracy analysis to 10 models and 16 languages, confirming fertility as a reliable predictor of MCQA accuracy. • We conduct the first large-scale comparison of tokenization effects for reasoning vs. non- reasoning LLMs on AfriMMLU, showing that reasoning capabilities substantially reduce but do not eliminate tokenization bias. • We release public datasets containing: i. model results from AfriMMLU benchmark including reasoning models, ii. MMLU token metrics. 2 Methods The overall architecture of our methodology is as follows. 1: for each language ℓ in corpus do 2: for each model m do 3: Calculate number of tokens using model m’s tokenizer 4: Calculate fertility scores for language ℓ 5: Run MCQA inference to obtain accuracy for language ℓ 6: end for 7: end for 8: Fit linear regressions of accuracy on
Chunk 2 · 1,991 chars
ecture of our methodology is as follows. 1: for each language ℓ in corpus do 2: for each model m do 3: Calculate number of tokens using model m’s tokenizer 4: Calculate fertility scores for language ℓ 5: Run MCQA inference to obtain accuracy for language ℓ 6: end for 7: end for 8: Fit linear regressions of accuracy on fertility for each model-subject pair, and summarize slopes and explained variance. We apply the methodology to AfriMMLU Adelani et al. [2025], which covers 5 subjects (elementary mathematics, global facts, high school geography, high school macroeconomics, and international law) into 16 African languages, with a total of 9,000 of MCQA records. For mixed effect, the model selection was conducted across random effect structures using AIC. The AIC-selected model favored a random effects structure which included both intercept and slope, suggesting that fertility’s impact on accuracy is language-dependent. 3 Results and Discussion 3.1 Model Inference on MMLU Figure 1 shows MMLU accuracy of the African languages relative to English and French. Consistent with prior work, African languages show large performance gaps relative to high-resource languages. On average, African languages trail English by 25 accuracy points, with French typically falling between the two. Encouragingly, reasoning-oriented models-DeepSeek and o1-substantially reduce this disparity. Across subjects, these models outperform non-reasoning peers by 8-12 points on African languages, while maintaining strong performance in English. In Global Facts, the most challenging subject, the gap between English and African languages narrows from 25 points in baseline models to 12-14 points under reasoning models. These results suggest that improved reasoning capabilities directly benefit low-resource settings. 3.2 Fertility and Accuracy Figure 2 shows the fertility and accuracy of Llama 3.1 405B model across 5 subjects. See Appendix for Figure 3 and Table 1 of regression results including
Chunk 3 · 1,994 chars
o 12-14 points under reasoning models. These results suggest that improved reasoning capabilities directly benefit low-resource settings. 3.2 Fertility and Accuracy Figure 2 shows the fertility and accuracy of Llama 3.1 405B model across 5 subjects. See Appendix for Figure 3 and Table 1 of regression results including uncertainty analysis. Across all 10 models and five subjects, higher fertility is consistently associated with lower accuracy. Linear regressions quantify this relationship: slopes range from −0.08 to −0.18, meaning each additional token per word reduces accuracy by 8 − 18 percentage points, depending on subject and model. Table 1 reports model- and subject-specific regressions. Several effects are large and statistically significant, such as Llama-3.1-405B on Microeconomics (slope = −0.185, p = 0.002) and Qwen-2.5- 32B on Geography (slope = −0.155, p = 0.006). Fertility explains 20-50% of variance in accuracy across these regressions, underscoring its importance as a predictor. Taken together, these findings show that tokenization bias is not incidental but systematically erodes model performance in proportion to fertility. 2 -- 2 of 8 -- (a) Accuracy Aggregation (English) (b) Accuracy Aggregation (French) Figure 1: Accuracy in English (upper panel) and 17* combined African languages (middle panel) across the 5 MMLU subjects with the performance gap (bottom panel). *There are 17 languages including Amharic. In the ME analysis we did not include Amharic because it shares less similarity with the remaining languages. Figure 2: Fertility vs accuracy trade-offs for Llama 3.1 405B model across five experimental conditions. 4 Economic Consequences of Token Inflation Beyond accuracy, token inflation directly increases computational cost. Because transformer training scales quadratically in sequence length, a 2× increase in fertility produces a 4× increase in training time and cost. Table 2 shows that training Llama-3.1-405B costs $105M in English but
Chunk 4 · 1,992 chars
onsequences of Token Inflation Beyond accuracy, token inflation directly increases computational cost. Because transformer training scales quadratically in sequence length, a 2× increase in fertility produces a 4× increase in training time and cost. Table 2 shows that training Llama-3.1-405B costs $105M in English but $420M in a language with double fertility. Inference costs are similarly inflated. As shown in Table 3, generating 1M English-equivalent tokens with GPT-4o costs $5-20, while the same content in a 2× fertility language costs $10-40. Latency also doubles: A prompt + completion that requires 2 seconds in English typically takes 4 seconds in higher-fertility languages. 3 -- 3 of 8 -- These disparities demonstrate how tokenization bias manifests as a “token tax” paid disproportionately by speakers of morphologically complex, low-resource languages. 5 Conclusion This study demonstrates that tokenization inefficiency imposes systematic disadvantages on low- resource, morphologically complex languages. Across 10 large language models and 16 African languages in AfriMMLU, we find that fertility (tokens per word) strongly predicts model accuracy, with higher fertility consistently associated with poorer performance. Regression analyses show effect sizes as large as −0.18, explaining up to half the variance in accuracy. Encouragingly, reasoning models DeepSeek and o1 substantially narrow accuracy gaps, improving African language performance by 8-12 points on average and cutting the English-African disparity nearly in half. Nevertheless, large differences remain, underscoring that better reasoning alone does not eliminate inequities rooted in tokenization. We further show that token inflation has severe economic implications. Doubling fertility leads to 4× increases in training cost and inference latency, turning linguistic diversity into a computational liability. These disparities make clear that tokenization bias is not a minor technical artifact but
Chunk 5 · 1,981 chars
in tokenization. We further show that token inflation has severe economic implications. Doubling fertility leads to 4× increases in training cost and inference latency, turning linguistic diversity into a computational liability. These disparities make clear that tokenization bias is not a minor technical artifact but a systemic barrier to equitable NLP. Moving forward, addressing this barrier will require interventions at multiple levels: technical (morphologically aware tokenization, efficient attention mechanisms), economic (pricing structures that do not penalize high-fertility languages), and benchmarking (expansion of multilingual evaluation datasets like AfriMMLU). Only by aligning progress across these fronts can NLP avoid a future where billions of speakers are excluded from the benefits of language technology. References David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Oluwadara Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chia- maka Ijeoma Chukwuneke, Happy Buzaaba, Blessing Kudzaishe Sibanda, Godson Koffi Kalipe, Jonathan Mukiibi, Salomon Kabongo Kabenamualu, Foutse Yuehgoh, Mmasibidi Setaka, Lol- wethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Salomey Osei, Shamsuddeen Hassan Muham- mad, Sokhar Samb, Tadesse Kebede Guge, Tombekai Vangoni Sherman, and Pontus Stenetorp. IrokoBench: A new benchmark for African languages in the age of large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages 2732–2757, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/ 2025.naacl-long.139. URL https://aclanthology.org/2025.naacl-long.139/. Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering,
Chunk 6 · 1,995 chars
ogies (Volume 1: Long Papers) , pages 2732–2757, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/ 2025.naacl-long.139. URL https://aclanthology.org/2025.naacl-long.139/. Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, and Nicolas Flores-Herr. Tokenizer choice for LLM training: Negligible or crucial? In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024 , pages 3907–3924, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/ 2024.findings-naacl.247. URL https://aclanthology.org/2024.findings-naacl.247/. Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. On the computational complexity of self-attention, 2022. URL https://arxiv.org/abs/2209.04881. Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=78yDLKi95p. Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages, 2023b. URL https://arxiv.org/abs/2305.15425. 4 -- 4 of 8 -- Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, and Junjie Hu. Local byte fusion for neural machine translation, 2023. URL https://arxiv.org/abs/2205.11490. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/ 1706.03762. A Regression Results by Subject Table 1 and Figure 3
Chunk 7 · 1,999 chars
ral machine translation, 2023. URL https://arxiv.org/abs/2205.11490. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/ 1706.03762. A Regression Results by Subject Table 1 and Figure 3 show results of accuracy-on-fertility regressions for the 10 models over 5 subjects. B Fertility and Parity Fertility measures the average number of tokens required to represent a word in a corpus: F = T W for T and W the token and word counts. Higher F inflates sequence length, affecting a model’s ability to learn long-range dependencies and compute costs [Ali et al., 2024]. Parity [Petrov et al., 2023b] compares the lengths of token sequences in one language with the counterpart language translation: Parity = |t(sA)| |t(sB )| Parity scores close to 1 indicate consistency in tokenizing across different languages. Scores above 1 indicate less efficiency in language A relative to language B, which was taken as English for all parity calculations in this paper. C Inference C.1 Training and Inference Cost Comparison: English vs. Language X This is a thought exercise in training and inference costs for LLMs applied to English and Language X. The analysis assumes the same model architecture and tokenizer across languages, with cost differences due to tokenization inefficiencies and quadratic O(n2) training scaling Vaswani et al. [2023] of transformer models. We assume Language X has a fixed 2× increase in tokens across tokenizers (although there are variations not included here). We assume English has 1 000 000 tokens (baseline) and Language X: approximately 2 000 000 tokens for equivalent content. There is a 22 = 4x increase in training time and cost. Using published petaFLOP-day figures and assuming a compute cost of $240 per petaFLOP-day: In addition to cost, token inflation impacts time. Transformer models scale with O(n2) in sequence length. With a 2×
Chunk 8 · 1,996 chars
: approximately 2 000 000 tokens for equivalent content. There is a 22 = 4x increase in training time and cost. Using published petaFLOP-day figures and assuming a compute cost of $240 per petaFLOP-day: In addition to cost, token inflation impacts time. Transformer models scale with O(n2) in sequence length. With a 2× token increase, Language X requires (22) = 4× more compute. This means training that takes 90 days for English would take ∼360 days for Language X on the same hardware. For inference time, decoding scales approximately linearly with token count. A prompt+completion that takes 2 seconds in English may take about 4 seconds in Language X. These multipliers apply whether the additional tokens appear in the input (prompt) or output (comple- tion), and they exacerbate cost disparities for low-resource languages. C.2 Prompt You must only reply with ’Final Answer: X’ where X is A, B, C, or D. Do NOT add explanations, reasoning, or extra text. Question: 5 -- 5 of 8 -- Table 1: Fertility vs Accuracy by Model and Subject. Results from linear models regressing accuracy on translation fertility across 16 languages for each model-subject combination. Negative slopes indicate that higher fertility (more diverse translations) correlates with lower accuracy. Bold p-values indicate statistical significance (p < 0.05). Asterisks (*) indicate results that remain significant after Benjamini-Hochberg FDR correction (FDR < 0.05). Bold ρ values indicate correlations with large effect sizes (|ρ| ≥ 0.50). R2 values show the proportion of variance in accuracy explained by fertility and bold text indicates large effect sizes by Cohen’s convention (R2 ≥ 0.25). Regressions for o1 are not included because Openai has not released details on the tokenizer for this model. Subject Model Intercept Slope Std. Error t-value P-value ρ R2 Adj. R2 Elementary Math Sonnet3.5 0.652 -0.018 0.029 -0.609 0.552 -0.155 0.024 -0.041 Aya23 35B 0.480 -0.079 0.019 -4.058 0.001* -0.723 0.523
Chunk 9 · 1,993 chars
egressions for o1 are not included because Openai has not released details on the tokenizer for this model. Subject Model Intercept Slope Std. Error t-value P-value ρ R2 Adj. R2 Elementary Math Sonnet3.5 0.652 -0.018 0.029 -0.609 0.552 -0.155 0.024 -0.041 Aya23 35B 0.480 -0.079 0.019 -4.058 0.001* -0.723 0.523 0.492 DeepSeek R1 1.045 -0.066 0.045 -1.475 0.161 -0.356 0.127 0.068 DeepSeek V3 0.884 -0.044 0.048 -0.922 0.371 -0.232 0.054 -0.009 Gemini 1.5 Pro 0.907 -0.045 0.042 -1.078 0.298 -0.268 0.072 0.010 Llama3.1 405B 0.836 -0.054 0.045 -1.195 0.251 -0.295 0.087 0.026 Phi4 0.773 -0.125 0.034 -3.641 0.002* -0.685 0.469 0.434 GPT-4o 1.002 -0.089 0.057 -1.571 0.137 -0.376 0.141 0.084 Pixtral 12B 0.417 -0.024 0.014 -1.717 0.106 -0.405 0.164 0.109 Qwen2.5 32B 0.857 -0.113 0.037 -3.012 0.009* -0.614 0.377 0.335 Global Facts Sonnet3.5 0.508 -0.011 0.027 -0.390 0.702 -0.100 0.011 -0.056 Aya23 35B 0.335 -0.044 0.023 -1.930 0.073 -0.446 0.199 0.146 DeepSeek R1 0.574 -0.002 0.038 -0.061 0.952 -0.016 0.000 -0.066 DeepSeek V3 0.619 -0.045 0.034 -1.308 0.211 -0.320 0.102 0.042 Gemini 1.5 Pro 0.585 -0.011 0.051 -0.222 0.827 -0.057 0.003 -0.063 Llama3.1 405B 0.685 -0.084 0.033 -2.564 0.022 -0.552 0.305 0.258 Phi4 0.408 -0.015 0.017 -0.845 0.411 -0.213 0.045 -0.018 GPT-4o 0.638 -0.063 0.054 -1.169 0.261 -0.289 0.084 0.022 Pixtral 12B 0.428 -0.038 0.018 -2.068 0.056 -0.471 0.222 0.170 Qwen2.5 32B 0.505 -0.052 0.024 -2.171 0.046 -0.489 0.239 0.188 High School Geography Sonnet3.5 0.781 -0.080 0.045 -1.779 0.096 -0.417 0.174 0.119 Aya23 35B 0.475 -0.097 0.038 -2.512 0.024 -0.544 0.296 0.249 DeepSeek R1 0.847 -0.082 0.056 -1.466 0.163 -0.354 0.125 0.067 DeepSeek V3 0.843 -0.124 0.053 -2.331 0.034 -0.516 0.266 0.217 Gemini 1.5 Pro 0.750 -0.065 0.070 -0.937 0.363 -0.235 0.055 -0.008 Llama3.1 405B 0.822 -0.131 0.055 -2.359 0.032 -0.520 0.271 0.222 Phi4 0.808 -0.162 0.048 -3.343 0.004* -0.653 0.427 0.389 GPT-4o 0.952 -0.151 0.068
Chunk 10 · 1,998 chars
6 -1.466 0.163 -0.354 0.125 0.067 DeepSeek V3 0.843 -0.124 0.053 -2.331 0.034 -0.516 0.266 0.217 Gemini 1.5 Pro 0.750 -0.065 0.070 -0.937 0.363 -0.235 0.055 -0.008 Llama3.1 405B 0.822 -0.131 0.055 -2.359 0.032 -0.520 0.271 0.222 Phi4 0.808 -0.162 0.048 -3.343 0.004* -0.653 0.427 0.389 GPT-4o 0.952 -0.151 0.068 -2.211 0.043 -0.496 0.246 0.195 Pixtral 12B 0.688 -0.121 0.035 -3.414 0.004* -0.661 0.437 0.400 Qwen2.5 32B 0.755 -0.155 0.049 -3.190 0.006* -0.636 0.404 0.365 High School Microeconomics Sonnet3.5 0.750 -0.096 0.042 -2.307 0.036 -0.512 0.262 0.213 Aya23 35B 0.549 -0.105 0.038 -2.775 0.014 -0.582 0.339 0.295 DeepSeek R1 0.888 -0.088 0.074 -1.194 0.251 -0.295 0.087 0.026 DeepSeek V3 0.906 -0.157 0.049 -3.179 0.006* -0.634 0.403 0.363 Gemini 1.5 Pro 0.883 -0.129 0.067 -1.920 0.074 -0.444 0.197 0.144 Llama3.1 405B 0.953 -0.185 0.050 -3.691 0.002* -0.690 0.476 0.441 Phi4 0.858 -0.184 0.053 -3.479 0.003* -0.668 0.447 0.410 GPT-4o 0.942 -0.150 0.084 -1.779 0.096 -0.417 0.174 0.119 Pixtral 12B 0.622 -0.105 0.033 -3.179 0.006* -0.635 0.403 0.363 Qwen2.5 32B 0.779 -0.154 0.048 -3.196 0.006* -0.636 0.405 0.365 International Law Sonnet3.5 0.645 -0.040 0.028 -1.426 0.174 -0.346 0.119 0.061 Aya23 35B 0.578 -0.101 0.042 -2.403 0.030 -0.527 0.278 0.230 DeepSeek R1 0.813 -0.043 0.043 -1.010 0.329 -0.252 0.064 0.001 DeepSeek V3 0.771 -0.073 0.045 -1.617 0.127 -0.385 0.148 0.092 Gemini 1.5 Pro 0.796 -0.039 0.052 -0.758 0.460 -0.192 0.037 -0.027 Llama3.1 405B 0.876 -0.096 0.038 -2.548 0.022 -0.550 0.302 0.256 Phi4 0.804 -0.101 0.041 -2.452 0.027 -0.535 0.286 0.238 GPT-4o 0.889 -0.085 0.072 -1.175 0.258 -0.290 0.084 0.023 Pixtral 12B 0.686 -0.095 0.033 -2.859 0.012* -0.594 0.353 0.310 Qwen2.5 32B 0.787 -0.092 0.040 -2.297 0.036 -0.510 0.260 0.211 6 -- 6 of 8 -- Table 2: Training compute and cost estimates for LLaMA models (USD). Model etaFLOP-days English $ Language X ($4×) LLaMA 2 (69B) 21 000 5 M 20 M LLaMA 3 (70B) 100 000 24
Chunk 11 · 1,070 chars
3 Pixtral 12B 0.686 -0.095 0.033 -2.859 0.012* -0.594 0.353 0.310 Qwen2.5 32B 0.787 -0.092 0.040 -2.297 0.036 -0.510 0.260 0.211 6 -- 6 of 8 -- Table 2: Training compute and cost estimates for LLaMA models (USD). Model etaFLOP-days English $ Language X ($4×) LLaMA 2 (69B) 21 000 5 M 20 M LLaMA 3 (70B) 100 000 24 M 96 M LLaMA 3.1 (405B) 440 000 105 M 420 M Table 3: Inference cost per 1M English-equivalent tokens (USD) including *reasoning models. The costs are shown for input/output. Provider Model (type) English $ Language X ∼2× OpenAI GPT-4o 5 / 20 10 / 40 OpenAI o4-mini* 4 / 16 8 / 32 Google Gemini 2.5 Flash 0.30 / 2.50 0.60 / 5.00 Google Gemini 2.5 Pro* 1.25 / 10 2.50 / 20 Anthropic Claude 4 Sonnet 3 / 15 6 / 30 Anthropic Claude 4 Opus* 15 / 75 30 / 150 <question text> Choices: A. <option 1> B. <option 2> C. <option 3> D. <option 4> Your response must be strictly formatted as: Final Answer: X 7 -- 7 of 8 -- Figure 3: Fertility vs. accuracy trade-offs for the 10 models across five MMLU subjects including correlation coefficient. 8 -- 8 of 8 --