Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages
Summary
This paper introduces the Universal Metalinguistic Framework (UMF), a typologically-informed candidate reranking system designed to improve translation quality for low-resource languages using large language models (LLMs) without requiring parallel training data or model retraining. The framework addresses systematic errors caused by typological bias in LLMs, which are trained predominantly on high-resource languages like English and thus struggle with structurally divergent languages. UMF consists of two components: a structured language profile across 16 typological dimensions and a computational engine that applies semantic constraints and typological scoring during candidate evaluation. Evaluation across nine language pairs shows that intervention rates correlate strongly with typological distance from English. The framework achieves intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. UMF operates as a model-agnostic, black-box compatible system, making it practical for under-resourced languages. The study highlights the importance of typological knowledge in correcting structural and lexical errors, while also identifying areas for improvement, such as refining linguistic representations and calibrating intervention confidence.
Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages

Nipuna Abeykoon∗, Ashen Weerathunga∗, Pubudu Wijesinghe∗, Parameswari Krishnamurthy
ZWAG AI Ltd, Dubai AI Campus, DIFC, UAE.
∗These authors contributed equally to this work

Abstract

Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences, each exhibiting different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.

Keywords: Machine Translation, Low-Resource Languages, Linguistic Typology, Large Language Models, Candidate Reranking, Morphological Complexity

1 Introduction

Modern LLMs treat language as surface-level data, not as a cognitive system. As a result, their performance is optimized for languages whose grammatical structures, cultural assumptions, and reasoning patterns
dominate training corpora. Languages with different structural properties are frequently distorted, flattened, or misrepresented.

Large language models learn statistical regularities from datasets that are overwhelmingly English-centric. When applied to structurally distinct languages, these models often preserve semantic intent but fail to maintain structural correctness. This leads to systematic errors in tense, agency, evidentiality, politeness, spatial relations, and relational meaning. Outputs may appear fluent yet remain cognitively incongruent for native speakers.

We term these failures interpretation errors: outputs that preserve surface semantic content but violate obligatory grammatical, morphosyntactic, or discourse constraints required for correct interpretation by native speakers. Unlike stylistic variations or acceptable paraphrases, interpretation errors produce translations that are grammatically malformed, pragmatically inappropriate, or cognitively dissonant in the target language, even when the intended meaning remains recoverable.

arXiv:2602.01162v1 [cs.CL] 1 Feb 2026

This is not merely a matter of data scarcity. Human languages exhibit extraordinary structural diversity: word order patterns range from Subject-Verb-Object (English) to Subject-Object-Verb (Hindi, Japanese) to Verb-Subject-Object (Arabic); morphological systems span from analytic (Chinese) to agglutinative (Turkish, Swahili) to polysynthetic (Inuktitut). Case marking, agreement patterns, honorifics, and dozens of other grammatical dimensions vary independently across languages [9]. Models trained predominantly on English must somehow produce
structurally valid outputs for languages exhibiting fundamentally different typological patterns. This is a challenge that data volume alone cannot solve. The result is systematic structural errors: incorrect word order, missing morphological markers, inappropriate lexical choices, and violations of target-language grammatical constraints. This inequality stems from fundamental biases in how these models are trained and evaluated.

The root cause is typological bias: models trained predominantly on English encode structural patterns of Subject-Verb-Object word order, analytic morphology, and minimal case marking. When translating to typologically divergent languages, these biases manifest as systematic errors across multiple dimensions.

Structural errors arise from grammatical mismatches. An English-to-Sinhala translation of the sentence "The children play in the garden" may preserve SVO order instead of restructuring to Sinhala's SOV pattern, producing දරුවන් ෙසල්ලම් උද්යානය rather than the correct ළමයි උද්යානෙය් ෙසල්ලම් කරනවා. The model may also omit obligatory case markers (Sinhala requires the locative suffix ෙ-් (-ē) on "garden" to indicate location) or fail to produce appropriate morphological complexity. Recent analyses confirm this pattern: language models achieve significantly higher scores on fusional languages but underperform on agglutinative languages due to richer morphology and greater token sparsity [3].

Lexical errors stem from training distribution biases. In our experiments with frontier LLMs, translating "play" yielded වාදනය (vādanaya, "play a musical instrument") rather than ෙසල්ලම් (sellam, "play/have fun") because the former
appears more frequently in English-Sinhala training data despite being contextually inappropriate for "children playing in a garden."

Existing remedies have significant limitations. Fine-tuning on parallel corpora requires data that does not exist for most language pairs [8], while prompt engineering lacks the precision to enforce complex grammatical constraints [6, 7]. Prior work has incorporated typological insights in limited or auxiliary forms; however, no existing system operationalizes typology as a universal, structured decision layer that directly governs inference-time candidate selection across languages without retraining or parallel data. This motivates our approach: leverage explicit typological knowledge to guide translation under black-box generation.

The Universal Metalinguistic Framework (UMF) addresses this challenge by quantifying typological divergence between languages and using it to guide translation without retraining. The framework targets both error types through complementary mechanisms. For lexical errors, a semantic constraint layer applies context-aware adjustments during generation to resolve word-sense ambiguities. For structural errors, a typological reranking layer evaluates candidates against explicit grammatical requirements of the target language.

UMF represents languages through expert-curated profiles capturing 16 typological dimensions based on the World Atlas of Language Structures [9] and internal linguistic typology research. Divergence scores quantify how much source and target languages differ in each dimension. These scores are weighted by linguistic importance and normalized to produce a directive
vector that guides candidate evaluation. A tunable mixing parameter α balances the model's probability with the typological compliance score.

Our approach is grounded in the principle that typological divergence predicts where LLMs will fail. Languages with divergent word order, rich case systems, and agglutinative morphology receive higher intervention rates, with the reranker prioritizing candidates that satisfy target-language constraints. This dual-layer approach, combining typological scoring with semantic disambiguation, yields improvements that correlate with typological divergence and remain consistent across different base models.

Scope and Non-Claims. To clarify the boundaries of this work: UMF does not generate translations; it selects among candidates produced by an existing model. UMF does not replace automatic evaluation metrics; it operates as a complementary decision layer for candidate selection. UMF does not claim universal coverage: current language profiles cover a subset of typological phenomena and require expansion. What UMF does provide is a structured, interpretable mechanism for enforcing typological constraints at inference time, operating as a metalinguistic decision layer rather than a learned quality predictor.

2 Background and Related Work

2.1 Typology in Metalinguistic and MT Frameworks: Prior Approaches

Linguistic typology has been incorporated into machine translation and metalinguistic frameworks in various forms, yet these approaches have treated typological knowledge as auxiliary, analytical, or partial rather than as a foundational decision-governing component. Theoretical models such as
Universal Grammar (UG) were built on the assumption of deep structural uniformity across languages, reducing variation to a small set of abstract principles or binary parameters. Large-scale typological evidence has since demonstrated that linguistic diversity is far richer and less discretizable than UG predicts, with many languages exhibiting structural configurations that fall outside proposed universals [15, 16]. Interlingua-based systems such as ATLAS pursued language-independent representations, yet empirical evaluations showed that no interlingua could remain neutral across languages without reintroducing language-specific transfer rules, effectively collapsing back into pairwise or family-specific solutions [17, 18]. Rule-based MT systems encoded typological structure explicitly but required exhaustive manual specification of grammatical rules, making them incomplete, non-scalable, and biased toward well-documented Indo-European language structures [19].

Data-driven paradigms, including statistical and neural machine translation, improved surface fluency and scalability but introduced a different class of limitations. By learning exclusively from observed corpora, these models implicitly privilege high-resource languages and frequent constructions, systematically underrepresenting rare, optional, or typologically marked phenomena such as rich morphology, evidentiality, honorific systems, or non-configurational syntax [20, 21]. Neural MT models, in particular, optimize for probabilistic fluency rather than structural faithfulness, often omitting or flattening linguistic features that lack clear cross-lingual alignment or sufficient training
signal [22]. While typological information has been incorporated in prior frameworks, including divergence indices for transfer-based MT [14] and typological feature embeddings, it has remained auxiliary, partial, or analytical rather than serving as a structured, decision-governing component of translation systems [23]. As a result, modern MT systems remain effective pattern translators rather than linguistically grounded interpreters, with typological knowledge functioning as metadata or side-channel features rather than as a first-class evaluation mechanism that directly governs inference-time selection.

2.2 Typological Bias in LLMs

Linguistic typology classifies languages along structural dimensions that vary independently. Morphological typology distinguishes analytic, agglutinative, fusional, and polysynthetic systems. Analytic languages (e.g., Mandarin) use minimal inflection and rely on word order and particles. Agglutinative languages (e.g., Turkish, Japanese, Swahili, Tamil) construct words by concatenating discrete morphemes, each encoding a single grammatical meaning. A noun may carry suffixes for case, number, and possession in sequence. Fusional languages (e.g., Spanish, Russian) compress multiple grammatical categories into single portmanteau morphemes. Word order typology captures the positioning of subject (S), verb (V), and object (O), with SVO, SOV, and VSO accounting for the vast majority of languages [9].

Large language models disproportionately favor analytic and fusional languages. Arnett and Bergen [3] found a performance gap between agglutinative and fusional languages, with fusional languages
such as English achieving lower perplexities, attributed partly to tokenization and dataset-size disparities. Brinkmann et al. [2] observe that leading multilingual models remain primarily English models, with over 90% of training tokens in English. The authors hypothesize that while LLMs learn some language-invariant abstractions, these abstractions may still be biased toward English grammar and semantics. Such observations motivate our efforts to explicitly encode typological properties into the translation process.

2.3 Low-Resource Machine Translation

Transfer learning exploits related high-resource languages to bootstrap low-resource translation. Boujkian et al. [8] show that linguistic similarity enables efficient adaptation, but this method requires parallel data and yields diminishing returns as typological distance increases. Fine-tuning on in-domain parallel corpora remains standard but is unavailable for thousands of languages lacking digital corpora. Krishnamurthy [14] addresses typological divergence in Telugu-Tamil transfer-based MT through a Divergence Index (DI) that quantifies linguistic differences across five levels, namely surface, shallow, intermediate, deep, and deeper. The DI enabled targeted improvements in transfer grammar rules, increasing fluency from 63% to 87%. While our approach differs architecturally (reranking vs. transfer rules), both frameworks ground translation improvement in explicit typological divergence measurement. Krishnamurthy's work demonstrates that quantifying linguistic distance enables systematic error correction, a principle central to our UMF scoring mechanism.

2.4 Candidate Reranking
and Self-Evaluation

Reranking methods select among multiple candidates rather than relying on the model's top output. Traditional N-best list reranking incorporated linguistic features to re-score candidates. Recent LLM-as-a-judge approaches use models to evaluate their own outputs. Franceschelli and Musolesi [10] propose Creative Beam Search, using diverse beam search to generate varied candidates and self-evaluation to select outputs, addressing positional bias through balanced position calibration. While effective for subjective criteria, this approach lacks grounding in explicit linguistic constraints and cannot systematically correct structural errors such as word order violations or missing morphological markers.

2.5 Quality Estimation

Automatic metrics like BLEU [11] measure n-gram overlap but are insensitive to structural correctness. Neural metrics like BERTScore and COMET [12, 13] achieve stronger correlation with human judgments but may not reliably detect typological errors for underrepresented languages. Critically, all metrics predict quality but do not identify or correct errors.

2.6 Controllable Generation

Prompt engineering provides high-level instructions but lacks the precision to enforce complex grammatical constraints. Constrained decoding methods enforce hard lexical constraints but cannot express structural requirements like "apply SOV ordering." Logit-level interventions (PPLM, FUDGE) modify token probabilities during generation but require access to model internals and cannot reason about sentence-level structural properties.

Our approach differs in three key aspects: we operate on complete candidates
(black-box compatible), ground evaluation in explicit typological profiles (not learned correlations), and target both lexical and structural errors. Crucially, UMF does not learn correctness from data; it encodes grammatical obligation from linguistic knowledge. This distinction is fundamental: learned quality estimators predict what outputs humans prefer, while UMF enforces what outputs the target language requires.

3 The UMF Framework and Computational Engine

3.1 Overview

The Universal Metalinguistic Framework consists of two main components:

Universal Metalinguistic Framework (UMF): A structured representation of language typology encoding 16 dimensions derived from linguistic research. Each language is described by a profile specifying its typological characteristics. Divergence scores quantify differences between source and target languages along each dimension.

Computational Engine: Implements candidate generation, scoring, and reranking. The engine operates in two phases: (1) a Semantic Constraint Layer applies context-aware lexical adjustments during generation, and (2) Typological Scoring evaluates candidates for structural compliance with the target language.

The framework is model-agnostic and operates on any LLM capable of producing multiple translation candidates. It requires no fine-tuning or parallel training data, only a typological profile for the target language.

3.2 Language Profile Structure

A language profile is a structured representation capturing typological properties. Each profile consists of 16 typological dimensions representing structural characteristics. Each dimension includes:

• Value: The language's specific
property for that dimension (e.g., word order = SOV).

• Weight: Linguistic importance of the dimension, derived from typological research indicating how strongly the dimension influences grammatical structure.

• Markers: Observable surface features used for evaluation (e.g., case suffixes, verb endings).

Language profiles are expert-curated JSON structures encoding 16 typological dimensions derived from the World Atlas of Language Structures [9] and linguistic typology research. Each dimension captures a structural property with cross-linguistic variation. Table 1 lists all dimensions with brief descriptions.

[Figure 1 diagram: source text → LLM generator (GPT-5.2 / mT5) → N candidates (beam search); source profile (English) and target profile (Sinhala) → directive engine (16D vector); Layer 1 semantic constraints + Layer 2 typological scoring → combined score α · P_model + (1 − α) · UMF → best translation ළමයි ෙසල්ලම් කරනවා]

Figure 1: UMF Framework Architecture. The source text is processed through an LLM to generate N candidates. The Universal Metalinguistic Framework (right, orange-dashed) computes a 16-dimensional directive vector from source and target language profiles. The Computational Engine (center, green-dashed) applies dual-layer evaluation: semantic constraints for lexical disambiguation and typological scoring for structural compliance.

[Figure 2 diagram: 1. Input Analysis (spaCy) → 2. Profile Loading (EN+TGT) → 3. Divergence Calc. (16 dims) → 4. Directive Generation (L2 norm) → 5. Candidate Generation (N=32) → 6. Dual-Layer Scoring (Sem+Typo) → 7. Final Selection (argmax)]

Figure 2: UMF Translation Pipeline. The
seven-stage process transforms source text into typologically-compliant translations through linguistic analysis, divergence quantification, multi-candidate generation, and dual-layer evaluation.

Table 1: The 16 Typological Dimensions

Dimension | Description | Example Contrast
Word order | Canonical constituent ordering | SVO (English) vs. SOV (Sinhala)
Case marking | Grammatical case inventory and marking | 3 cases (English) vs. 8 cases (Sinhala)
Morphology | Morphological type and complexity | Analytic (English) vs. Agglutinative (Sinhala)
Agreement | Subject-verb and noun-adjective patterns | Minimal (English) vs. Rich (Sinhala)
TAM | Tense-aspect-mood marking system | Moderate (English) vs. Rich (Sinhala)
Classifiers | Noun classifier presence and types | Absent (English, Sinhala)
Honorifics | Grammaticalized politeness distinctions | Absent (English) vs. Present (Sinhala)
Evidentiality | Grammatical marking of information source | Absent (English, Sinhala)
Serial verbs | Serial verb construction patterns | Absent (English) vs. Limited (Sinhala)
Definiteness | Article and definiteness marking | Articles (English) vs. Demonstratives (Sinhala)
Animacy | Grammaticalized animacy distinctions | Not relevant (English) vs. Relevant (Sinhala)
Information structure | Topic and focus marking | Unmarked (English) vs. Marked (Sinhala)
Negation | Negation strategy and position | Particle (English) vs. Suffix+particle (Sinhala)
Pro-drop | Pronoun omission patterns | No (English) vs. Yes (Sinhala)
Relative clauses | Relative clause position and formation | Postnominal (English) vs. Prenominal (Sinhala)
Copula | Copula presence and behavior | Explicit (English) vs. Often omitted (Sinhala)
Profiles encode both categorical properties (e.g., word order: "SVO" vs. "SOV") and numeric scales (e.g., case_richness: 0.1 for English, 0.9 for Sinhala). Numeric values represent expert assessments of feature complexity or productivity on a 0-1 scale, informed by typological databases and grammatical descriptions. For instance, English receives case_richness = 0.1 due to minimal morphological case marking (3 cases, mostly syntactic), while Sinhala receives 0.9 due to its rich system of 8 morphologically marked cases. Profiles are created through linguistic analysis of grammars and typological databases. For languages with limited documentation, profiles are constructed by linguistic experts or adapted from closely related languages.

3.3 Typological Divergence Calculation

Divergence scores quantify how much source and target languages differ in each dimension. The calculation method depends on the dimension type:

Categorical dimensions (e.g., word order): Discrete comparison with predetermined divergence values. For word order, we distinguish three levels:

• Identical order (SVO → SVO): divergence = 0.0
• Verb position change (SVO → VSO): divergence = 0.6
• Major swap (SVO → SOV): divergence = 1.0

Major swaps invert the relative positions of subject and object around the verb, requiring complete sentence restructuring.
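The three-level categorical comparison can be sketched directly from the rules above; the function and variable names are ours, and the divergence constants are the ones stated in the text.

```python
# Sketch of the categorical word-order divergence rule: 0.0 for identical
# order, 0.6 for a verb-position change, 1.0 for a major subject/object swap.

# Pairs where subject and object invert around the verb.
MAJOR_SWAPS = {("SVO", "SOV"), ("SOV", "SVO")}

def word_order_divergence(src_order: str, tgt_order: str) -> float:
    if src_order == tgt_order:
        return 0.0   # identical order, e.g., SVO -> SVO
    if (src_order, tgt_order) in MAJOR_SWAPS:
        return 1.0   # major swap requiring full restructuring, e.g., SVO -> SOV
    return 0.6       # verb position change, e.g., SVO -> VSO

print(word_order_divergence("SVO", "SOV"))  # 1.0 (English -> Sinhala)
```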
Numeric dimensions (e.g., case marking, morphology): Absolute difference between
source and target values:
divergence[case_marking] = |src.case_richness − tgt.case_richness| (1)
divergence[morphology] = |src.complexity − tgt.complexity| (2)
Set-based dimensions (e.g., agreement): Jaccard distance over feature sets:
divergence[agreement] = 1 − |src_features ∩ tgt_features| / |src_features ∪ tgt_features| (3)
Composite dimensions (e.g., TAM): Weighted average of sub-components:
divergence[tam] = 0.4 × |∆tense| + 0.4 × |∆aspect| + 0.2 × |∆mood| (4)
The result is a 16-dimensional divergence vector quantifying structural distance between the
language pair. Table 2 shows the divergence vector for English → Sinhala.
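The numeric, set-based, and composite rules (Eqs. 1-4) can be sketched as follows, using the English/Sinhala values that appear in Table 2. Function names are ours; the TAM sub-weights (0.4/0.4/0.2) are the ones stated in Eq. 4.

```python
# Sketch of the per-dimension divergence rules for numeric, set-based, and
# composite dimensions, checked against the English -> Sinhala values.

def numeric_divergence(src_val: float, tgt_val: float) -> float:
    # Eqs. (1)-(2): absolute difference on a 0-1 scale.
    return abs(src_val - tgt_val)

def jaccard_divergence(src_features: set, tgt_features: set) -> float:
    # Eq. (3): one minus the Jaccard similarity of the feature sets.
    inter = len(src_features & tgt_features)
    union = len(src_features | tgt_features)
    return 1.0 - inter / union

def tam_divergence(src, tgt) -> float:
    # Eq. (4): weighted average of tense/aspect/mood deltas.
    return (0.4 * abs(src[0] - tgt[0])      # tense
            + 0.4 * abs(src[1] - tgt[1])    # aspect
            + 0.2 * abs(src[2] - tgt[2]))   # mood

case = numeric_divergence(0.1, 0.9)                    # ~0.8, matches Table 2
agree = jaccard_divergence({"person", "number"},
                           {"person", "number", "gender", "animacy"})  # 0.5
tam = tam_divergence((0.6, 0.5, 0.4), (0.7, 0.6, 0.5))  # ~0.1
```

Stacking these per-dimension values (plus the categorical ones) yields the 16-dimensional divergence vector.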
3.4 Directive Vector Construction
The divergence vector is transformed into a directive vector through linguistic weighting and normalization. Not all grammatical features are equally salient for translation quality. Highly visible
features like word order receive greater weight, while subtle features like copula behavior receive
less.
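The weighting-and-normalization step can be sketched on a toy three-dimension subset. The weights (word order 1.2, case marking 1.0, copula 0.5) and divergences (1.0, 0.8, 0.4) are the values stated in this section and in Table 2; the function name is ours.

```python
# Sketch of directive-vector construction: multiply each divergence by its
# linguistic weight, then L2-normalize to unit length (toy 3-dim subset).

import math

def build_directive(divergence, weights):
    weighted = [d * w for d, w in zip(divergence, weights)]
    norm = math.sqrt(sum(x * x for x in weighted))  # L2 norm
    return [x / norm for x in weighted]

divergence = [1.0, 0.8, 0.4]   # word order, case marking, copula (Table 2)
weights    = [1.2, 1.0, 0.5]   # stated linguistic weights
directive  = build_directive(divergence, weights)

# The directive is unit-length, and highly weighted, highly divergent
# dimensions dominate it.
assert abs(sum(d * d for d in directive) - 1.0) < 1e-9
assert directive[0] > directive[1] > directive[2]
```

Over the full 16 dimensions, the same computation produces the English → Sinhala directive vector reported below.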
Linguistic weights are assigned based on two criteria: (1) perceptual salience, which measures how noticeable the error is to a native speaker, and (2) translation impact, which measures how much the error affects comprehension and naturalness. Word order receives the highest weight (1.2), as errors are immediately apparent and disrupt sentence processing. Case marking and information structure receive a weight of 1.0, as they critically affect grammatical correctness and naturalness. Features like copula presence receive a lower weight (0.5), as errors are subtle and rarely impair comprehension.

Table 2: Divergence Vector for English → Sinhala

Dimension | English | Sinhala | Divergence
Word order | SVO | SOV | 1.0
Case marking | 0.1 | 0.9 | 0.8
Morphology | 0.2 | 0.8 | 0.6
Agreement | {person, number} | {person, number, gender, animacy} | 0.5
TAM | 0.6/0.5/0.4 | 0.7/0.6/0.5 | 0.1
Classifiers | False | False | 0.0
Honorifics | 0.0 | 0.6 | 0.6
Evidentiality | False | False | 0.0
Serial verbs | 0.0 | 0.3 | 0.3
Definiteness | Articles | Demonstratives | 0.3
Animacy | False | True | 0.4
Info structure | False/False | True/True | 0.8
Negation | Particle | Suffix+particle | 0.4
Pro-drop | False | True | 0.5
Relative clauses | Postnominal | Prenominal | 0.4
Copula | Explicit | Often omitted | 0.4

Figure 3: Divergence vector visualization for English → Sinhala. Word order (SVO→SOV), case marking, and information structure show maximum divergence (0.8–1.0), while classifiers and evidentiality show zero divergence (both languages lack these features).

The weighted divergence vector is L2-normalized to produce a unit-length directive vector:

weighted[i] = divergence[i] × weight[i] (5)
directive = weighted / ||weighted||₂ (6)

For English → Sinhala, the resulting directive vector is:

directive = [0.614, 0.409, 0.246, 0.154, 0.036, 0.000, 0.276, 0.000, 0.107, 0.123, 0.123, 0.409, 0.164, 0.179, 0.123, 0.102]

This vector encodes the relative importance of each dimension for this language pair. Word order (0.614), case marking (0.409), and information structure (0.409) dominate, while TAM (0.036) and inactive dimensions (classifiers, evidentiality) contribute minimally.

3.5 Semantic Constraint Layer

The semantic constraint layer addresses lexical ambiguities during candidate generation. Many words have multiple senses that cannot be distinguished by translation
models relying solely on distributional patterns in training data. For instance, English "play" translates to different Sinhala words depending on context: ෙසල්ලම් (sellam, recreational play) vs. වාදනය (vādanaya, playing a musical instrument).

The layer operates in three steps:

1. Ambiguity Detection: Identify polysemous source words with multiple target-language translations. A lexicon maps source words to target senses with contextual indicators.

2. Context Analysis: Analyze surrounding words to determine the appropriate sense. For "play," the presence of "children" and "garden" indicates a recreational context rather than a musical performance.

3. Token Adjustment: During generation, apply boosts to tokens corresponding to the correct sense and penalties to incorrect senses. This guides the model toward contextually appropriate lexical choices without requiring explicit token constraints.

In our experiments with GPT-5.2 translating "The children play in the garden," the baseline model produced වාදනය (play instrument) as the top candidate. Semantic constraints boosted ෙසල්ලම් (play/fun) tokens and penalized වාදනය tokens, moving the correct sense to higher-ranked candidates for subsequent typological evaluation.

3.6 Typological Scoring

Each translation candidate is evaluated for compliance across active typological dimensions: those with directive values exceeding an activation threshold of 0.1. For English → Sinhala, 14 of 16 dimensions are active (classifiers and evidentiality are inactive with a directive value of 0.0). Scoring functions are dimension-specific and operate on observable surface features derived from the language profile:
Scoring functions are dimension-specific and operate on observable surface features derived from the language profile:
Word order: Checks for verb-final markers (Sinhala verbal suffixes like -වා (-vā), -යි (-yi), -ති (-ti)) at sentence end, returning higher scores for SOV-compliant candidates.
Case marking: Counts case suffix occurrences (e.g., accusative -ව (-va), dative -ට (-ṭa), locative -ේ (-ē)) and compares to expected density based on sentence length.
Morphology: Measures average word length, expecting longer words for agglutinative languages due to suffix concatenation.
Agreement: Detects verb agreement markers and plural markers, scoring the presence of expected morphology.
TAM: Checks for tense/aspect/mood markers in verb endings, comparing against profile-specified marker inventories.
Honorifics: Matches pronouns and verb forms against formal/informal markers, comparing with source sentence formality cues.
Semantic appropriateness: Verifies that contextually appropriate word senses were selected (from the semantic constraint layer).
Additional dimensions (serial verbs, definiteness, animacy, information structure, negation, pro-drop, relative clauses, copula) follow a similar profile-driven scoring logic. Each scorer returns a value in [0, 1] representing compliance with target language patterns. The final UMF score is a weighted average:

UMF_score = Σ_i (directive[i] × dimension_score[i]) / Σ_i directive[i]    (7)

where the sum ranges over active dimensions. This formula ensures that dimensions with higher divergence (and thus higher directive values) contribute proportionally more to the final score.
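The two layers above can be sketched together in a few lines. This is a simplified illustration, not the paper's implementation: the sense lexicon, directive dictionary, and function names are stand-ins for the 16-dimension profiles, and the boost/penalty defaults mirror the +1.0/-0.5 values used in the experiments.

```python
# Layer 1 (Section 3.5): nudge token logits toward the contextually
# supported sense of each ambiguous source word.
def apply_sense_adjustments(token_logits, sense_lexicon, context_words,
                            boost=1.0, penalty=-0.5):
    adjusted = dict(token_logits)
    context = set(context_words)
    for senses in sense_lexicon.values():
        # The sense whose contextual indicators best match the context wins.
        supported = max(senses, key=lambda s: len(set(s["indicators"]) & context))
        for sense in senses:
            delta = boost if sense is supported else penalty
            for token in sense["tokens"]:
                if token in adjusted:
                    adjusted[token] += delta
    return adjusted

# Layer 2 (Eq. 7): divergence-weighted average over active dimensions,
# i.e., dimensions whose directive value exceeds the activation threshold.
def umf_score(dimension_scores, directives, threshold=0.1):
    active = [d for d, w in directives.items() if w > threshold]
    total_weight = sum(directives[d] for d in active)
    if total_weight == 0:
        return 0.0
    weighted = sum(directives[d] * dimension_scores[d] for d in active)
    return weighted / total_weight
```

For the "play" example, boosting the recreational-sense tokens and penalizing the instrumental-sense tokens reorders the affected candidates before typological scoring.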
3.7 Candidate Reranking
The framework combines model confidence with typological compliance to select the final translation. For each candidate c, we compute:

final_score(c) = α × model_score(c) + (1 − α) × UMF_score(c)    (8)

The mixing parameter α ∈ [0, 1] balances trust in the model's learned preferences versus explicit grammatical requirements. Lower α (e.g., 0.3) prioritizes typological compliance, appropriate for high-divergence language pairs where model biases are strong. Higher α (e.g., 0.7) preserves more of the model's ranking, suitable for low-divergence pairs or when model quality is high.
The candidate with the highest final score is selected as the system output. In case of ties, the model's original ranking is used as a tiebreaker.

4 Experimental Setup
4.1 Language Selection
We evaluate the framework across nine language pairs representing diverse typological profiles. Table 3 presents the target languages with their typological characteristics and expected divergence from English.
Language selection covers the full spectrum of typological distance from English, enabling evaluation of the framework's sensitivity to structural divergence. Figure 4 visualizes this spectrum. High-divergence languages (Sinhala, Tamil, Hindi) differ maximally from English in word order (SOV vs. SVO), morphological complexity (agglutinative/fusional vs. analytic), and case systems (8 cases vs. 3). Moderate-divergence languages (Arabic, Swahili) show partial structural differences.

Figure 4: Typological
distance spectrum of evaluated languages from English. Languages are grouped into three clusters based on their combined word order and morphological divergence. High-divergence languages (red) require the most structural transformation during translation.

Arabic exhibits VSO order and fusional morphology, while Swahili maintains SVO order but employs agglutinative morphology with noun class systems. Low-divergence languages (French, Thai, Chinese) share English's SVO order and analytic tendencies, with Chinese representing minimal divergence as both languages are SVO and isolating. Japanese, though typologically distant (SOV, agglutinative), is included as a high-resource control to assess whether the framework inappropriately intervenes when baseline model quality is already high.

Table 3: Target languages and typological properties
Language | Family | Word Order | Morphology | Case System | Divergence Level
Sinhala | Indo-Aryan | SOV | Agglutinative | 8 cases | High
Tamil | Dravidian | SOV | Agglutinative | 8 cases | High
Hindi | Indo-Aryan | SOV | Fusional | 8 cases | High
Arabic | Semitic | VSO | Fusional | 3 cases | Moderate
Swahili | Bantu | SVO | Agglutinative | Noun classes | Moderate
Japanese | Japonic | SOV | Agglutinative | Postpositions | High
French | Romance | SVO | Fusional | Minimal | Low
Thai | Tai-Kadai | SVO | Isolating | None | Low
Chinese | Sinitic | SVO | Isolating | None | Minimal

4.2 Test Dataset
The evaluation dataset comprises 341 English sentences designed by language specialists to cover diverse morphological and syntactic phenomena systematically. The dataset is organized into two categories: Morphological phenomena (189 sentences), covering tense-aspect-mood, case markers, agreement, and derivational processes; and
Syntactic phenomena (152 sentences), covering word order, pro-drop, relative clauses, coordination, and complex sentence structures. Sentences were custom-created to capture essential typological properties that distinguish languages, ensuring representation of features where cross-linguistic divergence is most pronounced.
Grammatical phenomena covered include morphological features (number marking, case inflection, verb conjugation for person/number/tense/aspect/mood), syntactic structures (constituent order variations, passivization, negation strategies, subordination, agreement patterns), and lexical-semantic distinctions (polysemous verbs requiring contextual disambiguation, multi-word expressions, honorific distinctions, classifier usage). The dataset is designed to elicit systematic typological errors from English-centric models rather than covering general translation phenomena. Sentences include constructions known to trigger structural biases: locative constructions requiring case marking, desiderative verbs triggering infinitival versus gerundival complements, and contexts where lexical ambiguity interacts with typological constraints.

4.3 Translation Models
We evaluate the framework with two translation systems representing different architectural paradigms: GPT-5.2 (OpenAI) and mT5 (Multilingual T5). GPT-5.2 is a state-of-the-art large language model accessed via API, while mT5 provides an open-source sequence-to-sequence alternative. Both models are accessed as black-box systems, demonstrating the framework's compatibility with production translation services where model internals are unavailable. We generate
translation candidates using beam search with beam width B = 32, producing ranked hypotheses for reranking.
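Given the beam's ranked hypotheses, the selection rule of Eq. 8 with the tiebreaker from Section 3.7 can be sketched as follows. The candidate representation is an assumption for illustration, not the paper's implementation:

```python
# Minimal sketch of Eq. 8 over beam-search candidates, assuming each
# candidate is a (model_score, umf_score) pair listed in model rank order.
def rerank(candidates, alpha=0.5):
    """Return the index of the selected candidate.

    final_score(c) = alpha * model_score(c) + (1 - alpha) * umf_score(c);
    on ties, the earlier index (the model's original ranking) wins.
    """
    best_idx, best_final = 0, float("-inf")
    for rank, (model_score, umf) in enumerate(candidates):
        final = alpha * model_score + (1 - alpha) * umf
        if final > best_final:  # strict ">" keeps the earlier rank on ties
            best_idx, best_final = rank, final
    return best_idx
```

With a balanced α, a typologically compliant second candidate can overtake the model's top choice, while a high α preserves the model's original ranking.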
4.4 Evaluation Metrics
We employ multiple complementary metrics to evaluate UMF performance across different dimensions of translation quality and system behavior.
4.4.1 Change Rate (Intervention Rate)
Change Rate measures the percentage of source sentences for which the framework selects a candidate different from the model's top-ranked output:
Change Rate = |{x : c*_UMF ≠ c*_baseline}| / |X|    (9)
where X is the set of source sentences, c*_UMF is the UMF-selected translation, and c*_baseline
is the model’s top-ranked output. This metric directly correlates with typological divergence:
high intervention rates for typologically distant pairs validate the framework’s ability to identify
systematic structural errors, while low rates for similar pairs demonstrate appropriate restraint.
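As a quick sketch (illustrative code, not the paper's evaluation scripts), Eq. 9 amounts to:

```python
# Change Rate (Eq. 9): share of sentences where the UMF-selected candidate
# differs from the model's top-ranked output, expressed as a percentage.
def change_rate(umf_choices, baseline_choices):
    if not umf_choices:
        return 0.0
    changed = sum(1 for u, b in zip(umf_choices, baseline_choices) if u != b)
    return 100.0 * changed / len(umf_choices)
```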
4.4.2 Intervention Precision
Intervention Precision quantifies the proportion of UMF interventions that result in correct improvements according to expert linguistic judgment:
Intervention Precision = Number of Correct Improvements / Total Number of Interventions    (10)
This metric is computed only over cases where UMF selected a different translation than the
baseline (c*_UMF ≠ c*_baseline). Each intervention is classified by native speaker linguists as: (1) correct
improvement, (2) neutral/no change in quality, or (3) UMF error (degradation). Intervention
Precision reflects the reliability of UMF’s selection decisions.
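Given the three-way expert labels, Eq. 10 reduces to a count over intervention cases. A sketch, with illustrative label strings; neutral cases count against precision, matching the conservative reading used in the evaluation protocol:

```python
# Intervention Precision (Eq. 10): correct improvements over all
# interventions. Labels are "improvement", "neutral", or "error".
def intervention_precision(labels):
    if not labels:
        return 0.0
    improvements = sum(1 for label in labels if label == "improvement")
    return 100.0 * improvements / len(labels)
```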
4.4.3 Gain-Risk Ratio
The Gain-Risk Ratio measures the efficiency of UMF interventions by comparing correct improvements to
errors introduced:

Gain-Risk Ratio = Number of Correct Improvements / Number of UMF Errors    (11)

A ratio greater than 1.0 indicates that improvements outweigh errors (net positive impact), while ratios below 1.0 indicate that errors outweigh improvements (net negative impact). This metric provides a practical assessment of whether deploying UMF for a given language pair yields overall benefit. For example, a Gain-Risk Ratio of 2.14 (Hindi) means that for every UMF error, the system produces 2.14 correct improvements.

4.4.4 UMF Compliance Score
UMF Compliance Score quantifies structural correctness according to target language grammar, computed as a weighted average of typological dimension scores:

UMF Score = (Σ_{i=1}^{16} w_i · s_i(c)) / (Σ_{i=1}^{16} w_i)    (12)

where w_i is the directive weight for dimension i (derived from the directive vector), and s_i(c) is the compliance score for dimension i evaluated on candidate c. This metric operates independently of reference translations, enabling evaluation of grammatical compliance for languages where gold-standard references are scarce.

4.4.5 Automatic Reference-Based Metrics
We additionally evaluate translations using established automatic metrics (BLEU, COMET, chrF, BERTScore) against human reference translations. However, these metrics have documented limitations for typological evaluation. BLEU measures n-gram overlap and is insensitive to structural correctness. Neural metrics (COMET, BERTScore) depend on multilingual encoder representations and may not reliably detect morphological errors, case marking violations, or word order issues for underrepresented languages. Our results show automatic metric scores remain in
close proximity to baseline outputs, despite substantial structural improvements measured by UMF scores and human evaluation. This discrepancy underscores the inadequacy of existing metrics for assessing typological compliance.

4.4.6 Human Evaluation
Native speaker evaluation provides ground truth for translation quality assessment. Expert linguists who are native speakers of the target language assess sampled interventions on two criteria:
• Structural Correctness: Grammatical conformance to target language requirements (word order, case marking, morphology, agreement, etc.)
• Semantic Adequacy: Preservation of meaning from the source sentence
Each intervention is classified into one of three categories:
1. Correct Improvement: UMF selection is structurally better than baseline and semantically equivalent or better
2. Neutral/No Change: UMF selection and baseline are of comparable quality
3. UMF Error: UMF selection is worse than baseline (structural or semantic degradation)
This evaluation quantifies the proportion of interventions yielding genuine improvements versus neutral or harmful changes. The classifications directly feed into the computation of Intervention Precision and Gain-Risk Ratio.
Evaluation Protocol. For each target language, two native-speaker linguists independently evaluated all UMF interventions (cases where UMF selected a different candidate than the baseline). Evaluators were presented with the source sentence, baseline translation, and UMF-selected translation in randomized order without system labels (blind evaluation). Each evaluator assigned one of the three classification categories. In cases of
disagreement, a third senior linguist adjudicated. Inter-annotator agreement was substantial (Cohen's κ = 0.71 averaged across languages). Neutral classifications were included in precision calculations as non-improvements, providing a conservative estimate of intervention quality.

4.5 Baseline and Ablations
The baseline is the model's top-ranked output without reranking. Ablation studies isolate the contribution of each component: semantic constraints only (Layer 1), typological reranking only (Layer 2), and the full dual-layer system. These ablations test whether semantic and typological layers provide complementary or overlapping benefits.

4.6 Hyperparameters
Table 4 presents the key hyperparameters used in our experiments.

Table 4: UMF hyperparameters and configuration
Parameter | Value | Description
Candidate Generation
Beam width (B) | 32 | Number of candidates generated
Top-K retention (K) | 4 | Candidates retained for scoring
Temperature | 1.0 | Sampling temperature for diversity
Typological Scoring
Mixing parameter (α) | 0.5 | Balanced weighting (used in experiments)
 | 0.3 | Typological priority (alternative)
 | 0.7 | Model priority (alternative)
Activation threshold | 0.1 | Minimum directive value for scoring
Semantic Constraints (Layer 1)
Token boost | +1.0 | Logit boost for correct sense
Token penalty | -0.5 | Logit penalty for wrong sense

The mixing parameter α balances model confidence with typological compliance in the final candidate selection:

final_score(c) = α · p_model(c) + (1 − α) · UMF_score(c)    (13)

We test α = 0.3 (prioritizing typological correctness with 70% weight), α = 0.5 (balanced weighting), and α = 0.7 (prioritizing model confidence with 70% weight).
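For reference, the Table 4 settings can be transcribed as a single configuration object. The key names here are our own shorthand, not identifiers from a released implementation:

```python
# Table 4 hyperparameters as a plain dict (illustrative key names).
UMF_CONFIG = {
    "beam_width": 32,                  # candidates generated per sentence
    "top_k": 4,                        # candidates retained for scoring
    "temperature": 1.0,                # sampling temperature for diversity
    "alpha": 0.5,                      # balanced weighting used in experiments
    "alpha_alternatives": (0.3, 0.7),  # typological vs. model priority
    "activation_threshold": 0.1,       # minimum directive value for scoring
    "token_boost": 1.0,                # logit boost for the correct sense
    "token_penalty": -0.5,             # logit penalty for the wrong sense
}
```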
The results reported in this paper use α = 0.5 (balanced weighting), which provides equal consideration to model confidence and typological compliance. This configuration was selected to avoid over-correction while still enabling meaningful typological intervention.
Dimensions with directive values below 0.1 are excluded from scoring, focusing computational effort on high-divergence features. For English to Sinhala, this activates 14 of 16 dimensions (classifiers and evidentiality remain inactive with directive values of 0.0).
The semantic constraint parameters (token boost +1.0, token penalty -0.5) were determined through grid search optimization on a held-out development set. These values provide effective disambiguation without overwhelming the model's learned language patterns.

5 Results and Analysis
5.1 Overview of Empirical Results
We evaluated UMF-based reranking across nine target languages using expert linguistic review. Each sentence was evaluated with respect to three distinct outputs: the baseline model output, the UMF-selected output, and the remaining candidate set. Outcomes were classified as correct improvement, neutral/no change, or UMF error, based solely on specialist judgment. This section presents both quantitative intervention patterns (Section 5.2) and qualitative analysis of improvement efficiency (Sections 5.3–5.6).
Table 5 presents the key performance metrics across all evaluated languages. Across languages, UMF exhibits non-uniform but systematic behavior, with performance strongly correlated with typological distance, morphological complexity, and the completeness of language-specific
feature representations. Crucially, the results confirm that UMF impact cannot be inferred from intervention frequency alone: change rate and intervention precision must be interpreted jointly, along with a metric that assesses the gain versus risk of the framework.

Table 5: Performance metrics across nine target languages showing intervention behavior and efficiency
Language | Total Cases | Change Rate | Intervention Precision | Gain-Risk
Sinhala | 341 | 45.16% | 26.62% | 0.20
Tamil | 341 | 26.69% | 29.67% | 0.16
Thai | 341 | 4.40% | 20.00% | 0.02
Chinese | 341 | 3.23% | 100.00% | 1.83
Hindi | 341 | 15.54% | 84.91% | 2.14
Japanese | 341 | 7.33% | 76.00% | 0.90
Arabic | 341 | 11.44% | 79.49% | 1.00
French | 341 | 9.09% | 80.65% | 1.09
Swahili | 341 | 9.68% | 48.48% | 0.19

5.2 Intervention Patterns and Typological Correlation
Table 6 presents UMF Compliance Scores across all nine language pairs, revealing the structural quality of UMF-selected outputs and their relationship to typological characteristics.
As established in Table 5, intervention rates exhibit a strong positive correlation with typological divergence (Pearson r = 0.82, p < 0.01), validating that UMF correctly identifies cases

Figure 5: Gain-Risk Ratio distribution across nine target languages. Languages above the 1.0 threshold (vertical line) show net positive impact from UMF reranking. Hindi and Chinese achieve the highest efficiency, while Thai, Tamil, Swahili, and Sinhala show ratios well below 1.0, indicating that errors outweigh
improvements in these languages.

Figure 6: Change Rate vs. Intervention Precision across languages, ordered by decreasing change rate (SI=Sinhala, TA=Tamil, HI=Hindi, AR=Arabic, SW=Swahili, FR=French, JA=Japanese, TH=Thai, ZH=Chinese). High-divergence languages (Sinhala, Tamil) show high change rates but lower precision, while structurally profiled languages (Chinese, Hindi, French) achieve high precision with moderate intervention.

Table 6: UMF Scores for each language represent the average structural compliance score of UMF-selected outputs
Language | UMF Score | Typological Divergence
Sinhala | 0.674 | High (SOV, agglutinative)
French | 0.629 | Low (SVO, fusional)
Japanese | 0.624 | High (but excellent baseline)
Swahili | 0.591 | Moderate (SVO, agglutinative)
Tamil | 0.587 | High (SOV, agglutinative)
Arabic | 0.582 | Moderate (VSO, fusional)
Hindi | 0.569 | High (SOV, fusional)
Chinese | 0.540 | Minimal (SVO, isolating)
Thai | 0.520 | Low (SVO, isolating)

where structural transformation is needed. Languages with greater structural distance from English (Sinhala, Tamil, Hindi) show substantially higher change rates (45.16%, 26.69%, and 15.54%, respectively), while typologically similar languages (Chinese, Thai, French) exhibit conservative intervention behavior (3.23%, 4.40%, and 9.09%, respectively).
Japanese represents a notable exception: despite high typological divergence, it shows a relatively low change rate (7.33%). This reflects GPT-5.2's particularly strong baseline performance
for English-Japanese translation rather than framework insensitivity. When baseline quality is already high, UMF appropriately restrains intervention, demonstrating that the framework responds to actual baseline error exposure rather than divergence alone.
UMF Compliance Scores show modest but systematic variation across languages (range: 0.520–0.674). The elevation in UMF scores for high-divergence languages (Sinhala: 0.674, Tamil: 0.587, Hindi: 0.569) compared to low-divergence languages (Chinese: 0.540, Thai: 0.520) reflects the framework's prioritization of typologically critical dimensions when directive weights are higher. French (0.629) and Japanese (0.624) exhibit strong compliance despite lower divergence, attributable to GPT-5.2's robust baseline performance for these well-resourced languages. This gradient demonstrates that UMF-selected outputs achieve stronger structural compliance precisely in languages where typological constraints are most stringent.
The relatively narrow range of UMF scores compared to the wide range of change rates (3.23%–45.16%) indicates that selected outputs maintain consistent structural quality regardless of intervention frequency. This suggests that UMF's selection mechanism successfully balances structural compliance across diverse typological profiles, intervening aggressively when baseline outputs violate critical constraints while maintaining quality even in conservative intervention regimes.
Critically, the systematic relationship between typological divergence, intervention rates, and UMF compliance scores demonstrates that the framework's behavior is driven by linguistic structure rather than
arbitrary heuristics. This provides quantitative validation that the metalinguistic framework successfully operationalizes typological distance as a predictor of both baseline model error patterns and the structural demands placed on corrective reranking.

5.3 High Gain-Risk Performance in Structurally Profiled Languages
Chinese, Hindi, Arabic, and French exhibit consistently high Gain-Risk Ratios, reflecting strong improvement efficiency relative to incurred errors. Across these languages, UMF demonstrates moderate change rates aligned with baseline error exposure, low UMF error rates (single-digit percentages), high intervention precision (approximately 80-100%), and Gain-Risk Ratios well above 1, indicating that improvements substantially outweigh errors. In practical terms, each UMF error in these languages buys multiple correct improvements, making reranking decisively net-positive.

Table 7: Improvement categories in high Gain-Risk languages showing where UMF adds most value
Improvement Category | Description | Frequency
Tense-aspect alignment | Correct temporal/aspectual marking | 32%
Modality selection | Appropriate modal verb/particle choice | 24%
Clause ordering | Improved constituent arrangement | 18%
Idiomatic preference | Natural expression over literal translation | 14%
Lexical precision | More contextually appropriate word choice | 12%

Improvements in these languages tend to involve linguistically subtle but structurally meaningful distinctions, such as tense-aspect alignment, modality selection, clause ordering, or idiomatic preference, that are often under-weighted in purely score-based ranking. Notably, these languages
span multiple typological families (Sinitic, Indo-Aryan, Semitic, Japonic, Romance). The consistency of high Gain-Risk performance across such diversity suggests that UMF is not exploiting proximity to English, but instead leveraging abstract linguistic constraints encoded in its metalinguistic framework.

5.4 Low Gain-Risk Regime in Morphologically Dense Languages
Sinhala and Tamil display a sharply contrasting profile characterized by low Gain-Risk Ratios, despite high intervention activity. Specifically, these languages show very high change rates, indicating frequent detection of divergence; low intervention precision, leading to elevated UMF error accumulation; and Gain-Risk Ratios well below 1, meaning that errors outweigh improvements.
At face value, this appears to indicate weak performance. However, qualitative analysis reveals that UMF errors in these languages are highly structured and non-random. Dominant error categories include case marking and argument structure resolution, register and honorific selection, and over-normalization of discourse-driven constructions.

Table 8: Error category distribution for morphologically dense languages (Sinhala and Tamil)
Error Category | Example Impact | Sinhala % | Tamil %
Case marking errors | Incorrect/missing case suffixes | 38% | 35%
Register/Honorific | Wrong formality level | 28% | 30%
Over-normalization | Natural discourse patterns lost | 18% | 20%
Argument structure | Subject/object role confusion | 10% | 10%
Other | Miscellaneous | 6% | 5%

This pattern indicates that UMF is sensitive to divergence in morphologically rich languages but lacks sufficient resolution to consistently select the correct alternative. In other words,
the system detects that something is wrong, but sometimes applies the wrong corrective preference due to incomplete or misweighted language-specific features. Importantly, this reflects sensitivity without sufficient resolution, rather than the absence of signal. Such failure modes are characteristic of early-stage metalinguistic systems operating on low-resource, pragmatically dense languages and provide clear targets for iterative refinement.

5.5 Low Gain-Risk Behavior in Conservatively Treated Languages
Thai and Swahili occupy a distinct regime characterized by low change rates, indicating conservative intervention, but also low intervention precision, resulting in Gain-Risk Ratios well below 1 (Thai: 0.02, Swahili: 0.19). In these languages, UMF intervenes infrequently, but when it does intervene, the corrections often fail to improve upon the baseline. This pattern differs from that of the morphologically dense languages (Sinhala, Tamil), which show high intervention frequency with low precision. Thai and Swahili instead show low intervention frequency with low precision, indicating that UMF's linguistic priors are insufficiently expressive for these language types. Notably, expert review confirms that unchanged baselines in these languages are rarely judged incorrect, indicating that UMF does not miss obvious baseline failures.
This behavior suggests that UMF appropriately defaults to baseline ranking when confidence is low, but the sparse interventions that do occur reflect incomplete coverage of discourse structure, aspectual interpretation, and language-specific pragmatic patterns. The high baseline correctness rates (94.2% for Thai,
88.6% for Swahili) confirm that UMF's conservative behavior in these languages is appropriate: aggressive intervention would likely degrade rather than improve already-correct translations.

5.6 Error Structure and Interpretability
Across all languages, UMF errors fall into a small, interpretable set of categories: aspect or tense mismatch, register or politeness errors, lexical unnaturalness, over-normalization, and typological overcorrection. The absence of diffuse or unclassifiable error patterns is a key scientific outcome. It demonstrates that UMF failures are systematic, explainable, and reproducible, enabling targeted improvements rather than heuristic tuning. This property distinguishes UMF from opaque reranking heuristics and supports its suitability for controlled scientific iteration.

Table 9: UMF Error Taxonomy: systematic classification of reranking errors across all evaluated languages
Error Type | Primary Languages | Frequency | Linguistic Interpretation
Case mismatch | Sinhala, Tamil, Hindi | High | Incorrect argument structure resolution
Register error | Sinhala, Tamil | High | Honorific/formality misalignment
Aspect mismatch | All SOV languages | Moderate | Tense-aspect-mood selection error
Over-normalization | Tamil, Swahili | Moderate | Discourse-driven constructions flattened
Lexical unnaturalness | French, Arabic | Low | Unusual but grammatical word choice
Typological overcorrection | Japanese, Chinese | Low | Unnecessary structural transformation

5.7 Implications of Gain-Risk Analysis for Metalinguistic Reranking
Viewed through the Gain-Risk Ratio, three conclusions emerge:
1. UMF delivers high-efficiency improvements in structurally profiled languages, where each error
5.7 Implications of Gain-Risk Analysis for Metalinguistic Reranking

Viewed through the Gain-Risk Ratio, three conclusions emerge:

1. UMF delivers high-efficiency improvements in structurally profiled languages, where each error yields multiple correct gains.
2. Performance degradation correlates with linguistic under-specification, not architectural instability.
3. UMF functions as a metalinguistic decision layer, complementing base model generation by selectively reallocating risk toward improvement opportunities.

Crucially, UMF's impact is observable under expert review, and its failures remain linguistically interpretable rather than opaque.

5.8 Benchmarks vs. UMF

Across languages, standard automatic metrics (BLEU, chrF, COMET, and BERTScore) show small, often negative deltas between baseline and UMF-selected outputs, even in cases where expert evaluation confirms meaningful linguistic improvements. Table 10 presents the automatic metric scores, along with the changes between baseline and UMF-selected outputs, across all evaluated languages.

Table 10: Automatic metric scores: baseline vs. UMF-selected outputs

Language | BLEU (Base/UMF/Δ)     | chrF (Base/UMF/Δ)     | COMET (Base/UMF/Δ)     | BERTScore (Base/UMF/Δ)
Tamil    | 32.79 / 30.59 / -2.20 | 72.43 / 71.47 / -0.96 | 0.928 / 0.925 / -0.003 | 0.919 / 0.916 / -0.003
Sinhala  | 32.35 / 29.15 / -3.20 | 63.48 / 62.14 / -1.34 | 0.911 / 0.906 / -0.005 | 0.966 / 0.965 / -0.001
Arabic   | 26.88 / 27.07 / +0.19 | 54.76 / 54.65 / -0.11 | 0.773 / 0.771 / -0.002 | 0.828 / 0.828 / 0.000
Chinese  | 10.86 / 10.86 / 0.00  | 63.49 / 63.57 / +0.08 | 0.930 / 0.930 / 0.000  | 0.889 / 0.888 / -0.001
French   | 73.58 / 71.53 / -2.05 | 84.02 / 82.72 / -1.30 | 0.930 / 0.928 / -0.002 | 0.882 / 0.874 / -0.008
Hindi    | 62.68 / 61.70 / -0.98 | 80.91 / 79.76 / -1.15 | 0.907 / 0.899 / -0.008 | 0.953 / 0.951 / -0.002
Japanese | -     / -     / -     | 62.31 / 62.65 / +0.34 | 0.937 / 0.937 / 0.000  | 0.918 / 0.919 / +0.001
Swahili  | 38.62 / 37.85 / -0.77 | 68.49 / 68.67 / +0.18 | 0.861 / 0.861 / 0.000  | 0.883 / 0.882 / -0.001
Thai     | 14.60 / 13.58 / -1.02 | 68.19 / 68.19 / 0.00  | 0.909 / 0.909 / 0.000  | 0.921 / 0.921 / 0.000
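The Δ columns in Table 10 are simple signed differences between UMF-selected and baseline scores, and the Gain-Risk framing of Section 5.7 can be sketched alongside them. A minimal illustration follows; the `gain_risk_ratio` definition is an assumption (gains per regression), since the paper's exact formula is not reproduced in this section, and the gain/regression counts are hypothetical:

```python
def delta(base, umf, ndigits=2):
    """Signed change from the baseline score to the UMF-selected score."""
    return round(umf - base, ndigits)

# BLEU figures for Tamil and Arabic, taken from Table 10.
assert delta(32.79, 30.59) == -2.20
assert delta(26.88, 27.07) == 0.19

def gain_risk_ratio(gains, regressions):
    """Expert-judged improvements per induced regression.

    Assumed definition for illustration; the paper's exact
    Gain-Risk Ratio formula is not reproduced here.
    """
    if regressions == 0:
        return float("inf")
    return gains / regressions

# Hypothetical counts only: 12 expert-confirmed gains vs. 3 regressions.
print(gain_risk_ratio(12, 3))  # 4.0
```

A small negative metric Δ can therefore coexist with a ratio well above 1, which is precisely the disconnect Section 5.8 documents.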
This pattern is consistent across representative languages from all three evaluation regimes (high-gain, such as French; conservative, such as Swahili; and low-gain morphologically dense languages such as Sinhala and Tamil) and indicates a systematic disconnect between surface-form similarity metrics and linguistically grounded correctness. In particular, UMF-driven changes frequently preserve semantic equivalence while altering morphology, argument structure, discourse ordering, or pragmatic marking, resulting in lower n-gram overlap or embedding similarity despite being preferred by human reviewers. This explains why UMF can exhibit favorable Gain-Risk Ratios under expert judgment while appearing neutral or negative under conventional benchmarks. Importantly, this behavior aligns with earlier findings that UMF's change rate closely tracks latent LLM error exposure rather than benchmark variance, suggesting that UMF is responding to linguistic divergence that these metrics are structurally incapable of detecting.

Rather than competing with BLEU, chrF, or embedding-based scores, UMF operates orthogonally to them, functioning as a metalinguistic decision layer that identifies and corrects errors invisible to surface similarity measures. In this sense, UMF should be understood not merely as an auxiliary reranker but as a complementary, and in certain linguistic regimes alternative, evaluation signal for assessing correctness in typologically diverse and discourse-sensitive languages.
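The metric-judgment disconnect described above is easy to reproduce in miniature: BLEU-style n-gram overlap drops whenever a candidate reorders or re-inflects material from the reference, even when meaning is fully preserved. A toy sketch (the sentences are ours, not drawn from the evaluation data):

```python
def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams found in the reference.

    A toy stand-in for BLEU-style surface overlap; real BLEU adds
    clipping, multiple n-gram orders, and a brevity penalty.
    """
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    cand, ref = ngrams(candidate), set(ngrams(reference))
    if not cand:
        return 0.0
    return sum(1 for gram in cand if gram in ref) / len(cand)

ref = "the teacher gave the book to the child".split()
# Semantically equivalent candidate with altered argument ordering.
reordered = "the teacher gave the child the book".split()

print(ngram_precision(ref, ref))        # 1.0
print(ngram_precision(reordered, ref))  # < 1.0 despite equivalent meaning
```

The second score is penalized purely for argument reordering, which is exactly the class of UMF-driven change that human reviewers prefer but surface metrics punish.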
5.9 Limitations

This study evaluates UMF strictly in a post-generation reranking setting. Results may differ when UMF constraints are applied earlier in the generation process. Additionally, several morphologically rich and pragmatically dense languages remain under-represented in the current feature set.

[Figure 7: scatter plot of Change Rate (%) against typological divergence for SI, TA, HI, AR, SW, JA, FR, TH, ZH, grouped into high, moderate, and low divergence; Pearson r = 0.82, p < 0.01.]

Figure 7: Correlation between intervention rate (Change Rate) and typological divergence from English. The strong positive correlation (r = 0.82) validates that UMF correctly identifies and intervenes in cases where structural transformation is most needed. Japanese is a notable outlier: high divergence but low change rate due to GPT-5.2's strong baseline quality for this language.

Our findings show that while UMF already captures meaningful signals of interpretation error in LLM outputs, its detection performance is constrained by the current depth of language profiles. Strengthening these profiles is expected to significantly improve UMF's ability to identify interpretation-level errors.

6 Further Research and Required Work

The results of this study indicate that while UMF-based reranking can detect linguistically salient divergences, it does not yet consistently resolve them correctly across all language types. In particular, performance degradation in morphologically rich and low-resource languages highlights several limitations in the current framework. Addressing these limitations requires both representational and methodological advances. We outline the key areas of further research below.
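The change-rate/divergence correlation reported in Figure 7 is a standard sample Pearson coefficient, which can be computed as below. The paired values here are illustrative placeholders (the paper's per-language divergence scores are not reproduced in this section), so no particular r value is claimed:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between paired sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative placeholder data: typological divergence (x) vs.
# change rate in % (y). NOT the paper's measured values.
divergence = [0.2, 0.4, 0.6, 0.8, 1.0]
change_rate = [5.0, 12.0, 20.0, 33.0, 45.0]

r = pearson_r(divergence, change_rate)
assert 0.9 < r <= 1.0  # strongly positive for this monotone toy data
```

An outlier such as Japanese would pull r below a perfect fit without destroying the overall positive trend, consistent with the r = 0.82 the figure reports.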
6.1 Refinement of Linguistic Representations

A primary source of UMF error arises from incomplete or misweighted linguistic feature representations, particularly in languages with rich case systems, flexible word order, and pragmatic marking. Current profiles capture the presence of such features but do not yet encode their interactional constraints with sufficient resolution. Future work must focus on expanding language profiles to include hierarchical relationships between morphological, syntactic, and discourse-level features. Rather than treating features as independent signals, we must explicitly model feature dependencies such as case-verb agreement and register-context alignment. Additionally, we need to introduce negative constraints that specify not only which constructions are preferred but also which are disallowed in specific contexts. Without such refinements, UMF risks systematic overcorrection in precisely the languages it is intended to support.

6.2 Calibration of Intervention Confidence

The current evaluation reveals that UMF exhibits high sensitivity but insufficient specificity in certain languages, intervening frequently without proportional gains. This indicates a need for improved confidence calibration. Future research should investigate language-specific intervention thresholds that modulate how willing UMF is to override baseline rankings. We need mechanisms for confidence decay when competing candidates are nearly equivalent, as well as uncertainty estimates that can distinguish genuine structural errors from stylistic variation. A more conservative intervention policy is likely to reduce degradation in languages where features are underspecified while preserving gains where linguistic signals are strong.
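One way to realize such calibration is a per-language threshold plus a near-tie guard that defers to the baseline when candidates are nearly equivalent. The sketch below is a minimal illustration under assumed names and values; none of these parameters or numbers come from the paper:

```python
def should_intervene(umf_score, baseline_score, threshold, margin=0.05):
    """Decide whether UMF should override the baseline ranking.

    Overrides only when the UMF-preferred candidate clears a
    language-specific threshold AND the two candidates are not nearly
    equivalent (a simple confidence-decay guard). All names, thresholds,
    and scores are illustrative, not the paper's mechanism.
    """
    advantage = umf_score - baseline_score
    if abs(advantage) < margin:  # near-tie: defer to the baseline ranking
        return False
    return advantage > threshold

# A morphologically dense language might get a conservative threshold.
assert should_intervene(0.82, 0.60, threshold=0.15) is True
assert should_intervene(0.63, 0.60, threshold=0.15) is False  # near-tie
```

Raising `threshold` or `margin` makes the policy more conservative, which is the direction the section argues for in under-specified languages.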
6.3 Disentangling Stylistic Preference from Structural Correctness

Several observed improvements correspond to stylistic or register-level preferences rather than clear linguistic corrections. While such refinements may be acceptable, they complicate claims of structural improvement. Further work is required to separate core grammatical correctness from stylistic optimization in both evaluation and reranking. This involves introducing explicit labels or tiers that distinguish grammatical, pragmatic, and stylistic constraints. We should also evaluate UMF under stricter criteria where only structurally necessary changes count as improvements. This distinction is essential for establishing the scientific contribution of UMF beyond stylistic reranking.

6.4 Expansion of Error Taxonomy and Cross-Language Analysis

Although UMF errors are largely classifiable, current error analysis remains coarse-grained. A more detailed taxonomy is required to guide targeted improvements. Future studies should extend the error taxonomy with language-specific subcategories, such as honorific misalignment or discourse particle misuse. We need to quantify error distributions across languages to identify systematic typological failure modes. Conducting cross-language comparisons will help determine which error classes generalize and which are language-specific. Such analysis would enable principled prioritization of development effort and avoid ad hoc fixes.
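The tiered evaluation proposed in Section 6.3 could be prototyped by labeling each UMF-driven change with a constraint tier and counting only grammatical-tier changes under the stricter criterion. The tier names and sample data below are hypothetical illustrations, not an existing UMF component:

```python
from enum import Enum

class Tier(Enum):
    """Hypothetical constraint tiers mirroring the distinction in 6.3."""
    GRAMMATICAL = 1  # structurally necessary corrections
    PRAGMATIC = 2    # register, honorifics, discourse marking
    STYLISTIC = 3    # preference-level rewording

def strict_improvements(changes):
    """Count only structurally necessary changes as improvements,
    per the stricter evaluation criterion proposed in Section 6.3."""
    return sum(1 for tier in changes if tier is Tier.GRAMMATICAL)

# Hypothetical labeled changes for one language.
changes = [Tier.GRAMMATICAL, Tier.STYLISTIC, Tier.PRAGMATIC, Tier.GRAMMATICAL]
print(strict_improvements(changes))  # 2
```

Under this accounting, stylistic and pragmatic changes are reported separately rather than inflating the structural-improvement count.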
6.5 Evaluation Beyond Reranking

This study evaluates UMF exclusively as a post-generation reranking layer. While appropriate for initial validation, this setting limits the scope of potential impact. Future research should explore integrating UMF constraints earlier in the generation process. We need to assess whether applying guidance before or during generation reduces the need for aggressive reranking. Comparing reranking-only and integrated approaches under identical evaluation protocols will help determine whether the limitations we observe stem from where UMF sits in the pipeline rather than from its conceptual design.

6.6 Longitudinal and Scaling Studies

Finally, the current results represent a snapshot of UMF at an early stage of language coverage. Longitudinal evaluation is required to determine whether observed weaknesses diminish as language profiles mature. Future work should include re-evaluation as linguistic feature sets are expanded and reweighted. We need controlled ablation studies to measure the contribution of individual feature classes, and scaling experiments across additional low-resource and typologically extreme languages. Only through such iterative validation can claims of universality be meaningfully assessed.

Acknowledgments

We gratefully acknowledge Priya M. Nair, CEO of Zwag AI, for her vision, leadership, and financial support, which made this research possible. We also extend our sincere thanks to Dulmini Fernando and Shivalatha Sivasundaram for coordinating language specialists and native translators, and for conducting translation quality assessment and typological error analysis. We further thank the language specialists and native translators who contributed to the human evaluations, whose linguistic expertise and careful assessments were essential to the quality and reliability of this research.