Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages
Summary
This paper introduces the Universal Metalinguistic Framework (UMF), a typologically-informed candidate reranking system designed to improve translation quality for low-resource languages using large language models (LLMs) without requiring parallel training data or model retraining. The framework addresses systematic errors caused by typological bias in LLMs, which are trained predominantly on high-resource languages like English and thus struggle with structurally divergent languages. UMF consists of two components: a structured language profile across 16 typological dimensions and a computational engine that applies semantic constraints and typological scoring during candidate evaluation. Evaluation across nine language pairs shows that intervention rates correlate strongly with typological distance from English. The framework achieves intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. UMF operates as a model-agnostic, black-box compatible system, making it practical for under-resourced languages. The study highlights the importance of typological knowledge in correcting structural and lexical errors, while also identifying areas for improvement, such as refining linguistic representations and calibrating intervention confidence.
Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages

Nipuna Abeykoon∗, Ashen Weerathunga∗, Pubudu Wijesinghe∗, Parameswari Krishnamurthy
ZWAG AI Ltd, Dubai AI Campus, DIFC, UAE.
∗These authors contributed equally to this work

Abstract

Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences, each exhibiting different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.

Keywords: Machine Translation, Low-Resource Languages, Linguistic Typology, Large Language Models, Candidate Reranking, Morphological Complexity

1 Introduction

Modern LLMs treat language as surface-level data, not as a cognitive system. As a result, their performance is optimized for languages whose grammatical structures, cultural assumptions, and reasoning patterns
dominate training corpora. Languages with different structural properties are frequently distorted, flattened, or misrepresented.

Large language models learn statistical regularities from datasets that are overwhelmingly English-centric. When applied to structurally distinct languages, these models often preserve semantic intent but fail to maintain structural correctness. This leads to systematic errors in tense, agency, evidentiality, politeness, spatial relations, and relational meaning. Outputs may appear fluent yet remain cognitively incongruent for native speakers.

We term these failures interpretation errors: outputs that preserve surface semantic content but violate obligatory grammatical, morphosyntactic, or discourse constraints required for correct interpretation by native speakers. Unlike stylistic variations or acceptable paraphrases, interpretation errors produce translations that are grammatically malformed, pragmatically inappropriate, or cognitively dissonant in the target language, even when the intended meaning remains recoverable.

arXiv:2602.01162v1 [cs.CL] 1 Feb 2026

This is not merely a matter of data scarcity. Human languages exhibit extraordinary structural diversity: word order patterns range from Subject-Verb-Object (English) to Subject-Object-Verb (Hindi, Japanese) to Verb-Subject-Object (Arabic); morphological systems span from analytic (Chinese) to agglutinative (Turkish, Swahili) to polysynthetic (Inuktitut). Case marking, agreement patterns, honorifics, and dozens of other grammatical dimensions vary independently across languages [9]. Models trained predominantly on English must somehow produce
structurally valid outputs for languages exhibiting fundamentally different typological patterns. This is a challenge that data volume alone cannot solve. The result is systematic structural errors: incorrect word order, missing morphological markers, inappropriate lexical choices, and violations of target-language grammatical constraints. This inequality stems from fundamental biases in how these models are trained and evaluated.

The root cause is typological bias: models trained predominantly on English encode structural patterns of Subject-Verb-Object word order, analytic morphology, and minimal case marking. When translating to typologically divergent languages, these biases manifest as systematic errors across multiple dimensions.

Structural errors arise from grammatical mismatches. An English-to-Sinhala translation of the sentence "The children play in the garden" may preserve SVO order instead of restructuring to Sinhala's SOV pattern, producing දරුවන් ෙසල්ලම් උද්යානය rather than the correct ළමයි උද්යානෙය් ෙසල්ලම් කරනවා. The model may also omit obligatory case markers (Sinhala requires the locative suffix ෙ-් (-ē) on "garden" to indicate location) or fail to produce appropriate morphological complexity. Recent analyses confirm this pattern: language models achieve significantly higher scores on fusional languages but underperform on agglutinative languages due to richer morphology and greater token sparsity [3].

Lexical errors stem from training distribution biases. In our experiments with frontier LLMs, translating "play" yielded වාදනය (vādanaya, "play a musical instrument") rather than ෙසල්ලම් (sellam, "play/have fun") because the former
appears more frequently in English-Sinhala training data despite being contextually inappropriate for "children playing in a garden."

Existing remedies have significant limitations. Fine-tuning on parallel corpora requires data that does not exist for most language pairs [8], while prompt engineering lacks the precision to enforce complex grammatical constraints [6, 7]. Prior work has incorporated typological insights in limited or auxiliary forms; however, no existing system operationalizes typology as a universal, structured decision layer that directly governs inference-time candidate selection across languages without retraining or parallel data. This motivates our approach: leverage explicit typological knowledge to guide translation under black-box generation.

The Universal Metalinguistic Framework (UMF) addresses this challenge by quantifying typological divergence between languages and using it to guide translation without retraining. The framework targets both error types through complementary mechanisms. For lexical errors, a semantic constraint layer applies context-aware adjustments during generation to resolve word-sense ambiguities. For structural errors, a typological reranking layer evaluates candidates against explicit grammatical requirements of the target language.

UMF represents languages through expert-curated profiles capturing 16 typological dimensions based on the World Atlas of Language Structures [9] and internal linguistic typology research. Divergence scores quantify how much source and target languages differ in each dimension. These scores are weighted by linguistic importance and normalized to produce a directive
vector that guides candidate evaluation. A tunable mixing parameter α balances the model's probability with the typological compliance score.

Our approach is grounded in the principle that typological divergence predicts where LLMs will fail. Languages with divergent word order, rich case systems, and agglutinative morphology receive higher intervention rates, with the reranker prioritizing candidates that satisfy target-language constraints. This dual-layer approach, combining typological scoring with semantic disambiguation, yields improvements that correlate with typological divergence and remain consistent across different base models.

Scope and Non-Claims. To clarify the boundaries of this work: UMF does not generate translations; it selects among candidates produced by an existing model. UMF does not replace automatic evaluation metrics; it operates as a complementary decision layer for candidate selection. UMF does not claim universal coverage: current language profiles cover a subset of typological phenomena and require expansion. What UMF does provide is a structured, interpretable mechanism for enforcing typological constraints at inference time, operating as a metalinguistic decision layer rather than a learned quality predictor.

2 Background and Related Work

2.1 Typology in Metalinguistic and MT Frameworks: Prior Approaches

Linguistic typology has been incorporated into machine translation and metalinguistic frameworks in various forms, yet these approaches have treated typological knowledge as auxiliary, analytical, or partial rather than as a foundational decision-governing component. Theoretical models such as
Universal Grammar (UG) were built on the assumption of deep structural uniformity across languages, reducing variation to a small set of abstract principles or binary parameters. Large-scale typological evidence has since demonstrated that linguistic diversity is far richer and less discretizable than UG predicts, with many languages exhibiting structural configurations that fall outside proposed universals [15, 16]. Interlingua-based systems such as ATLAS pursued language-independent representations, yet empirical evaluations showed that no interlingua could remain neutral across languages without reintroducing language-specific transfer rules, effectively collapsing back into pairwise or family-specific solutions [17, 18]. Rule-based MT systems encoded typological structure explicitly but required exhaustive manual specification of grammatical rules, making them incomplete, non-scalable, and biased toward well-documented Indo-European language structures [19].

Data-driven paradigms, including statistical and neural machine translation, improved surface fluency and scalability but introduced a different class of limitations. By learning exclusively from observed corpora, these models implicitly privilege high-resource languages and frequent constructions, systematically underrepresenting rare, optional, or typologically marked phenomena such as rich morphology, evidentiality, honorific systems, or non-configurational syntax [20, 21]. Neural MT models, in particular, optimize for probabilistic fluency rather than structural faithfulness, often omitting or flattening linguistic features that lack clear cross-lingual alignment or sufficient training
signal [22]. While typological information has been incorporated in prior frameworks, including divergence indices for transfer-based MT [14] and typological feature embeddings, it has remained auxiliary, partial, or analytical rather than serving as a structured, decision-governing component of translation systems [23]. As a result, modern MT systems remain effective pattern translators rather than linguistically grounded interpreters, with typological knowledge functioning as metadata or side-channel features rather than as a first-class evaluation mechanism that directly governs inference-time selection.

2.2 Typological Bias in LLMs

Linguistic typology classifies languages along structural dimensions that vary independently. Morphological typology distinguishes analytic, agglutinative, fusional, and polysynthetic systems. Analytic languages (e.g., Mandarin) use minimal inflection and rely on word order and particles. Agglutinative languages (e.g., Turkish, Japanese, Swahili, Tamil) construct words by concatenating discrete morphemes, each encoding a single grammatical meaning. A noun may carry suffixes for case, number, and possession in sequence. Fusional languages (e.g., Spanish, Russian) compress multiple grammatical categories into single portmanteau morphemes. Word order typology captures the positioning of subject (S), verb (V), and object (O), with SVO, SOV, and VSO accounting for the vast majority of languages [9].

Large language models disproportionately favor analytic and fusional languages. Arnett and Bergen [3] found a performance gap between agglutinative and fusional languages, with fusional languages
such as English achieving lower perplexities, attributed partly to tokenization and dataset-size disparities. Brinkmann et al. [2] observe that leading multilingual models remain primarily English models, with over 90% of training tokens in English. The authors hypothesize that while LLMs learn some language-invariant abstractions, these abstractions may still be biased toward English grammar and semantics. Such observations motivate our efforts to explicitly encode typological properties into the translation process.

2.3 Low-Resource Machine Translation

Transfer learning exploits related high-resource languages to bootstrap low-resource translation. Boujkian et al. [8] show that linguistic similarity enables efficient adaptation, but this method requires parallel data and yields diminishing returns as typological distance increases. Fine-tuning on in-domain parallel corpora remains standard but is unavailable for thousands of languages lacking digital corpora. Krishnamurthy [14] addresses typological divergence in Telugu-Tamil transfer-based MT through a Divergence Index (DI) that quantifies linguistic differences across five levels, namely surface, shallow, intermediate, deep, and deeper. The DI enabled targeted improvements in transfer grammar rules, increasing fluency from 63% to 87%. While our approach differs architecturally (reranking vs. transfer rules), both frameworks ground translation improvement in explicit typological divergence measurement. Krishnamurthy's work demonstrates that quantifying linguistic distance enables systematic error correction, a principle central to our UMF scoring mechanism.

2.4 Candidate Reranking
and Self-Evaluation

Reranking methods select among multiple candidates rather than relying on the model's top output. Traditional N-best list reranking incorporated linguistic features to re-score candidates. Recent LLM-as-a-judge approaches use models to evaluate their own outputs. Franceschelli and Musolesi [10] propose Creative Beam Search, using diverse beam search to generate varied candidates and self-evaluation to select outputs, addressing positional bias through balanced position calibration. While effective for subjective criteria, this approach lacks grounding in explicit linguistic constraints and cannot systematically correct structural errors such as word order violations or missing morphological markers.

2.5 Quality Estimation

Automatic metrics like BLEU [11] measure n-gram overlap but are insensitive to structural correctness. Neural metrics like BERTScore and COMET [12, 13] achieve stronger correlation with human judgments but may not reliably detect typological errors for underrepresented languages. Critically, all metrics predict quality but do not identify or correct errors.

2.6 Controllable Generation

Prompt engineering provides high-level instructions but lacks the precision to enforce complex grammatical constraints. Constrained decoding methods enforce hard lexical constraints but cannot express structural requirements like "apply SOV ordering." Logit-level interventions (PPLM, FUDGE) modify token probabilities during generation but require access to model internals and cannot reason about sentence-level structural properties.

Our approach differs in three key aspects: we operate on complete candidates
(black-box compatible), ground evaluation in explicit typological profiles (not learned correlations), and target both lexical and structural errors. Crucially, UMF does not learn correctness from data; it encodes grammatical obligation from linguistic knowledge. This distinction is fundamental: learned quality estimators predict what outputs humans prefer, while UMF enforces what outputs the target language requires.

3 The UMF Framework and Computational Engine

3.1 Overview

The Universal Metalinguistic Framework consists of two main components:

Universal Metalinguistic Framework (UMF): A structured representation of language typology encoding 16 dimensions derived from linguistic research. Each language is described by a profile specifying its typological characteristics. Divergence scores quantify differences between source and target languages along each dimension.

Computational Engine: Implements candidate generation, scoring, and reranking. The engine operates in two phases: (1) a Semantic Constraint Layer applies context-aware lexical adjustments during generation, and (2) Typological Scoring evaluates candidates for structural compliance with the target language.

The framework is model-agnostic and operates on any LLM capable of producing multiple translation candidates. It requires no fine-tuning or parallel training data, only a typological profile for the target language.

3.2 Language Profile Structure

A language profile is a structured representation capturing typological properties. Each profile consists of 16 typological dimensions representing structural characteristics. Each dimension includes:

• Value: The language's specific
property for that dimension (e.g., word order = SOV).

• Weight: Linguistic importance of the dimension, derived from typological research indicating how strongly the dimension influences grammatical structure.

• Markers: Observable surface features used for evaluation (e.g., case suffixes, verb endings).

Language profiles are expert-curated JSON structures encoding 16 typological dimensions derived from the World Atlas of Language Structures [9] and linguistic typology research. Each dimension captures a structural property with cross-linguistic variation. Table 1 lists all dimensions with brief descriptions.

[Figure 1 diagram: source text → LLM generator (GPT-5.2 / mT5) → N candidates (beam search); source profile (English) and target profile (Sinhala) → directive engine (16D vector); Layer 1 semantic constraints + Layer 2 typological scoring → combined score α · P_model + (1 − α) · UMF → best translation ළමයි ෙසල්ලම් කරනවා]

Figure 1: UMF Framework Architecture. The source text is processed through an LLM to generate N candidates. The Universal Metalinguistic Framework (right, orange-dashed) computes a 16-dimensional directive vector from source and target language profiles. The Computational Engine (center, green-dashed) applies dual-layer evaluation: semantic constraints for lexical disambiguation and typological scoring for structural compliance.

[Figure 2 diagram: 1. Input Analysis (spaCy) → 2. Profile Loading (EN+TGT) → 3. Divergence Calc. (16 dims) → 4. Directive Generation (L2 norm) → 5. Candidate Generation (N=32) → 6. Dual-Layer Scoring (Sem+Typo) → 7. Final Selection (argmax)]

Figure 2: UMF Translation Pipeline. The
seven-stage process transforms source text into typologically-compliant translations through linguistic analysis, divergence quantification, multi-candidate generation, and dual-layer evaluation.

Table 1: The 16 Typological Dimensions

Dimension | Description | Example Contrast
Word order | Canonical constituent ordering | SVO (English) vs. SOV (Sinhala)
Case marking | Grammatical case inventory and marking | 3 cases (English) vs. 8 cases (Sinhala)
Morphology | Morphological type and complexity | Analytic (English) vs. Agglutinative (Sinhala)
Agreement | Subject-verb and noun-adjective patterns | Minimal (English) vs. Rich (Sinhala)
TAM | Tense-aspect-mood marking system | Moderate (English) vs. Rich (Sinhala)
Classifiers | Noun classifier presence and types | Absent (English, Sinhala)
Honorifics | Grammaticalized politeness distinctions | Absent (English) vs. Present (Sinhala)
Evidentiality | Grammatical marking of information source | Absent (English, Sinhala)
Serial verbs | Serial verb construction patterns | Absent (English) vs. Limited (Sinhala)
Definiteness | Article and definiteness marking | Articles (English) vs. Demonstratives (Sinhala)
Animacy | Grammaticalized animacy distinctions | Not relevant (English) vs. Relevant (Sinhala)
Information structure | Topic and focus marking | Unmarked (English) vs. Marked (Sinhala)
Negation | Negation strategy and position | Particle (English) vs. Suffix+particle (Sinhala)
Pro-drop | Pronoun omission patterns | No (English) vs. Yes (Sinhala)
Relative clauses | Relative clause position and formation | Postnominal (English) vs. Prenominal (Sinhala)
Copula | Copula presence and behavior | Explicit (English) vs. Often omitted (Sinhala)
Profiles encode both categorical properties (e.g., word order: "SVO" vs. "SOV") and numeric scales (e.g., case_richness: 0.1 for English, 0.9 for Sinhala). Numeric values represent expert assessments of feature complexity or productivity on a 0-1 scale, informed by typological databases and grammatical descriptions. For instance, English receives case_richness = 0.1 due to minimal morphological case marking (3 cases, mostly syntactic), while Sinhala receives 0.9 due to its rich system of 8 morphologically marked cases. Profiles are created through linguistic analysis of grammars and typological databases. For languages with limited documentation, profiles are constructed by linguistic experts or adapted from closely related languages.

3.3 Typological Divergence Calculation

Divergence scores quantify how much source and target languages differ in each dimension. The calculation method depends on the dimension type:

Categorical dimensions (e.g., word order): Discrete comparison with predetermined divergence values. For word order, we distinguish three levels:

• Identical order (SVO → SVO): divergence = 0.0
• Verb position change (SVO → VSO): divergence = 0.6
• Major swap (SVO → SOV): divergence = 1.0

Major swaps invert the relative positions of subject and object around the verb, requiring complete sentence restructuring.
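The three-level categorical comparison can be sketched directly from the rules above; the function and variable names are ours, and the divergence constants are the ones stated in the text.

```python
# Sketch of the categorical word-order divergence rule: 0.0 for identical
# order, 0.6 for a verb-position change, 1.0 for a major subject/object swap.

# Pairs where subject and object invert around the verb.
MAJOR_SWAPS = {("SVO", "SOV"), ("SOV", "SVO")}

def word_order_divergence(src_order: str, tgt_order: str) -> float:
    if src_order == tgt_order:
        return 0.0   # identical order, e.g., SVO -> SVO
    if (src_order, tgt_order) in MAJOR_SWAPS:
        return 1.0   # major swap requiring full restructuring, e.g., SVO -> SOV
    return 0.6       # verb position change, e.g., SVO -> VSO

print(word_order_divergence("SVO", "SOV"))  # 1.0 (English -> Sinhala)
```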
Numeric dimensions (e.g., case marking, morphology): Absolute difference between
source and target values:
divergence[case_marking] = |src.case_richness − tgt.case_richness| (1)
divergence[morphology] = |src.complexity − tgt.complexity| (2)
Set-based dimensions (e.g., agreement): Jaccard distance over feature sets:
divergence[agreement] = 1 − |src_features ∩ tgt_features| / |src_features ∪ tgt_features| (3)
Composite dimensions (e.g., TAM): Weighted average of sub-components:
divergence[tam] = 0.4 × |∆tense| + 0.4 × |∆aspect| + 0.2 × |∆mood| (4)
The result is a 16-dimensional divergence vector quantifying structural distance between the
language pair. Table 2 shows the divergence vector for English → Sinhala.
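The numeric, set-based, and composite rules (Eqs. 1-4) can be sketched as follows, using the English/Sinhala values that appear in Table 2. Function names are ours; the TAM sub-weights (0.4/0.4/0.2) are the ones stated in Eq. 4.

```python
# Sketch of the per-dimension divergence rules for numeric, set-based, and
# composite dimensions, checked against the English -> Sinhala values.

def numeric_divergence(src_val: float, tgt_val: float) -> float:
    # Eqs. (1)-(2): absolute difference on a 0-1 scale.
    return abs(src_val - tgt_val)

def jaccard_divergence(src_features: set, tgt_features: set) -> float:
    # Eq. (3): one minus the Jaccard similarity of the feature sets.
    inter = len(src_features & tgt_features)
    union = len(src_features | tgt_features)
    return 1.0 - inter / union

def tam_divergence(src, tgt) -> float:
    # Eq. (4): weighted average of tense/aspect/mood deltas.
    return (0.4 * abs(src[0] - tgt[0])      # tense
            + 0.4 * abs(src[1] - tgt[1])    # aspect
            + 0.2 * abs(src[2] - tgt[2]))   # mood

case = numeric_divergence(0.1, 0.9)                    # ~0.8, matches Table 2
agree = jaccard_divergence({"person", "number"},
                           {"person", "number", "gender", "animacy"})  # 0.5
tam = tam_divergence((0.6, 0.5, 0.4), (0.7, 0.6, 0.5))  # ~0.1
```

Stacking these per-dimension values (plus the categorical ones) yields the 16-dimensional divergence vector.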
3.4 Directive Vector Construction
The divergence vector is transformed into a directive vector through linguistic weighting and normalization. Not all grammatical features are equally salient for translation quality. Highly visible
features like word order receive greater weight, while subtle features like copula behavior receive
less.
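The weighting-and-normalization step can be sketched on a toy three-dimension subset. The weights (word order 1.2, case marking 1.0, copula 0.5) and divergences (1.0, 0.8, 0.4) are the values stated in this section and in Table 2; the function name is ours.

```python
# Sketch of directive-vector construction: multiply each divergence by its
# linguistic weight, then L2-normalize to unit length (toy 3-dim subset).

import math

def build_directive(divergence, weights):
    weighted = [d * w for d, w in zip(divergence, weights)]
    norm = math.sqrt(sum(x * x for x in weighted))  # L2 norm
    return [x / norm for x in weighted]

divergence = [1.0, 0.8, 0.4]   # word order, case marking, copula (Table 2)
weights    = [1.2, 1.0, 0.5]   # stated linguistic weights
directive  = build_directive(divergence, weights)

# The directive is unit-length, and highly weighted, highly divergent
# dimensions dominate it.
assert abs(sum(d * d for d in directive) - 1.0) < 1e-9
assert directive[0] > directive[1] > directive[2]
```

Over the full 16 dimensions, the same computation produces the English → Sinhala directive vector reported below.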
Linguistic weights are assigned based on two criteria: (1) perceptual salience, which measures how noticeable the error is to a native speaker, and (2) translation impact, which measures how much the error affects comprehension and naturalness. Word order receives the highest weight (1.2), as errors are immediately apparent and disrupt sentence processing. Case marking and information structure receive a weight of 1.0, as they critically affect grammatical correctness and naturalness. Features like copula presence receive a lower weight (0.5), as errors are subtle and rarely impair comprehension.

Table 2: Divergence Vector for English → Sinhala

Dimension | English | Sinhala | Divergence
Word order | SVO | SOV | 1.0
Case marking | 0.1 | 0.9 | 0.8
Morphology | 0.2 | 0.8 | 0.6
Agreement | {person, number} | {person, number, gender, animacy} | 0.5
TAM | 0.6/0.5/0.4 | 0.7/0.6/0.5 | 0.1
Classifiers | False | False | 0.0
Honorifics | 0.0 | 0.6 | 0.6
Evidentiality | False | False | 0.0
Serial verbs | 0.0 | 0.3 | 0.3
Definiteness | Articles | Demonstratives | 0.3
Animacy | False | True | 0.4
Info structure | False/False | True/True | 0.8
Negation | Particle | Suffix+particle | 0.4
Pro-drop | False | True | 0.5
Relative clauses | Postnominal | Prenominal | 0.4
Copula | Explicit | Often omitted | 0.4

Figure 3: Divergence vector visualization for English → Sinhala. Word order (SVO→SOV), case marking, and information structure show maximum divergence (0.8–1.0), while classifiers and evidentiality show zero divergence (both languages lack these features).

The weighted divergence vector is L2-normalized to produce a unit-length directive vector:

weighted[i] = divergence[i] × weight[i] (5)
directive = weighted / ||weighted||₂ (6)

For English → Sinhala, the resulting directive vector is:

directive = [0.614, 0.409, 0.246, 0.154, 0.036, 0.000, 0.276, 0.000, 0.107, 0.123, 0.123, 0.409, 0.164, 0.179, 0.123, 0.102]

This vector encodes the relative importance of each dimension for this language pair. Word order (0.614), case marking (0.409), and information structure (0.409) dominate, while TAM (0.036) and inactive dimensions (classifiers, evidentiality) contribute minimally.

3.5 Semantic Constraint Layer

The semantic constraint layer addresses lexical ambiguities during candidate generation. Many words have multiple senses that cannot be distinguished by translation
models relying solely on distributional patterns in training data. For instance, English "play" translates to different Sinhala words depending on context: ෙසල්ලම් (sellam, recreational play) vs. වාදනය (vādanaya, playing a musical instrument).

The layer operates in three steps:

1. Ambiguity Detection: Identify polysemous source words with multiple target-language translations. A lexicon maps source words to target senses with contextual indicators.

2. Context Analysis: Analyze surrounding words to determine the appropriate sense. For "play," the presence of "children" and "garden" indicates a recreational context rather than a musical performance.

3. Token Adjustment: During generation, apply boosts to tokens corresponding to the correct sense and penalties to incorrect senses. This guides the model toward contextually appropriate lexical choices without requiring explicit token constraints.

In our experiments with GPT-5.2 translating "The children play in the garden," the baseline model produced වාදනය (play instrument) as the top candidate. Semantic constraints boosted ෙසල්ලම් (play/fun) tokens and penalized වාදනය tokens, moving the correct sense to higher-ranked candidates for subsequent typological evaluation.

3.6 Typological Scoring

Each translation candidate is evaluated for compliance across active typological dimensions: those with directive values exceeding an activation threshold of 0.1. For English → Sinhala, 14 of 16 dimensions are active (classifiers and evidentiality are inactive with a directive value of 0.0). Scoring functions are dimension-specific and operate on observable surface features derived from the language profile:
Scoring functions are dimension-specific and operate on observable surface features derived from the language profile:
Word order: Checks for verb-final markers (Sinhala verbal suffixes like -වා (-vā), -යි (-yi), -ති (-ti)) at sentence end, returning higher scores for SOV-compliant candidates.
Case marking: Counts case suffix occurrences (e.g., accusative -ව (-va), dative -ට (-ṭa), locative -ේ (-ē)) and compares to expected density based on sentence length.
Morphology: Measures average word length, expecting longer words for agglutinative languages due to suffix concatenation.
Agreement: Detects verb agreement markers and plural markers, scoring the presence of expected morphology.
TAM: Checks for tense/aspect/mood markers in verb endings, comparing against profile-specified marker inventories.
Honorifics: Matches pronouns and verb forms against formal/informal markers, comparing with source sentence formality cues.
Semantic appropriateness: Verifies that contextually appropriate word senses were selected (from the semantic constraint layer).
Additional dimensions (serial verbs, definiteness, animacy, information structure, negation, pro-drop, relative clauses, copula) follow a similar profile-driven scoring logic. Each scorer returns a value in [0, 1] representing compliance with target language patterns. The final UMF score is a weighted average:

UMF_score = Σ_i (directive[i] × dimension_score[i]) / Σ_i directive[i]    (7)

where the sum ranges over active dimensions. This formula ensures that dimensions with higher divergence (and thus higher directive values) contribute proportionally more to the final score.
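The two layers above can be sketched together in a few lines. This is a simplified illustration, not the paper's implementation: the sense lexicon, directive dictionary, and function names are stand-ins for the 16-dimension profiles, and the boost/penalty defaults mirror the +1.0/-0.5 values used in the experiments.

```python
# Layer 1 (Section 3.5): nudge token logits toward the contextually
# supported sense of each ambiguous source word.
def apply_sense_adjustments(token_logits, sense_lexicon, context_words,
                            boost=1.0, penalty=-0.5):
    adjusted = dict(token_logits)
    context = set(context_words)
    for senses in sense_lexicon.values():
        # The sense whose contextual indicators best match the context wins.
        supported = max(senses, key=lambda s: len(set(s["indicators"]) & context))
        for sense in senses:
            delta = boost if sense is supported else penalty
            for token in sense["tokens"]:
                if token in adjusted:
                    adjusted[token] += delta
    return adjusted

# Layer 2 (Eq. 7): divergence-weighted average over active dimensions,
# i.e., dimensions whose directive value exceeds the activation threshold.
def umf_score(dimension_scores, directives, threshold=0.1):
    active = [d for d, w in directives.items() if w > threshold]
    total_weight = sum(directives[d] for d in active)
    if total_weight == 0:
        return 0.0
    weighted = sum(directives[d] * dimension_scores[d] for d in active)
    return weighted / total_weight
```

For the "play" example, boosting the recreational-sense tokens and penalizing the instrumental-sense tokens reorders the affected candidates before typological scoring.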
3.7 Candidate Reranking
The framework combines model confidence with typological compliance to select the final translation. For each candidate c, we compute:

final_score(c) = α × model_score(c) + (1 − α) × UMF_score(c)    (8)

The mixing parameter α ∈ [0, 1] balances trust in the model's learned preferences versus explicit grammatical requirements. Lower α (e.g., 0.3) prioritizes typological compliance, appropriate for high-divergence language pairs where model biases are strong. Higher α (e.g., 0.7) preserves more of the model's ranking, suitable for low-divergence pairs or when model quality is high.
The candidate with the highest final score is selected as the system output. In case of ties, the model's original ranking is used as a tiebreaker.

4 Experimental Setup
4.1 Language Selection
We evaluate the framework across nine language pairs representing diverse typological profiles. Table 3 presents the target languages with their typological characteristics and expected divergence from English.
Language selection covers the full spectrum of typological distance from English, enabling evaluation of the framework's sensitivity to structural divergence. Figure 4 visualizes this spectrum. High-divergence languages (Sinhala, Tamil, Hindi) differ maximally from English in word order (SOV vs. SVO), morphological complexity (agglutinative/fusional vs. analytic), and case systems (8 cases vs. 3). Moderate-divergence languages (Arabic, Swahili) show partial structural differences.

Figure 4: Typological
distance spectrum of evaluated languages from English. Languages are grouped into three clusters based on their combined word order and morphological divergence. High-divergence languages (red) require the most structural transformation during translation.

Arabic exhibits VSO order and fusional morphology, while Swahili maintains SVO order but employs agglutinative morphology with noun class systems. Low-divergence languages (French, Thai, Chinese) share English's SVO order and analytic tendencies, with Chinese representing minimal divergence as both languages are SVO and isolating. Japanese, though typologically distant (SOV, agglutinative), is included as a high-resource control to assess whether the framework inappropriately intervenes when baseline model quality is already high.

Table 3: Target languages and typological properties
Language | Family | Word Order | Morphology | Case System | Divergence Level
Sinhala | Indo-Aryan | SOV | Agglutinative | 8 cases | High
Tamil | Dravidian | SOV | Agglutinative | 8 cases | High
Hindi | Indo-Aryan | SOV | Fusional | 8 cases | High
Arabic | Semitic | VSO | Fusional | 3 cases | Moderate
Swahili | Bantu | SVO | Agglutinative | Noun classes | Moderate
Japanese | Japonic | SOV | Agglutinative | Postpositions | High
French | Romance | SVO | Fusional | Minimal | Low
Thai | Tai-Kadai | SVO | Isolating | None | Low
Chinese | Sinitic | SVO | Isolating | None | Minimal

4.2 Test Dataset
The evaluation dataset comprises 341 English sentences designed by language specialists to cover diverse morphological and syntactic phenomena systematically. The dataset is organized into two categories: Morphological phenomena (189 sentences), covering tense-aspect-mood, case markers, agreement, and derivational processes; and
Syntactic phenomena (152 sentences), covering word order, pro-drop, relative clauses, coordination, and complex sentence structures. Sentences were custom-created to capture essential typological properties that distinguish languages, ensuring representation of features where cross-linguistic divergence is most pronounced.
Grammatical phenomena covered include morphological features (number marking, case inflection, verb conjugation for person/number/tense/aspect/mood), syntactic structures (constituent order variations, passivization, negation strategies, subordination, agreement patterns), and lexical-semantic distinctions (polysemous verbs requiring contextual disambiguation, multi-word expressions, honorific distinctions, classifier usage). The dataset is designed to elicit systematic typological errors from English-centric models rather than covering general translation phenomena. Sentences include constructions known to trigger structural biases: locative constructions requiring case marking, desiderative verbs triggering infinitival versus gerundival complements, and contexts where lexical ambiguity interacts with typological constraints.

4.3 Translation Models
We evaluate the framework with two translation systems representing different architectural paradigms: GPT-5.2 (OpenAI) and mT5 (Multilingual T5). GPT-5.2 is a state-of-the-art large language model accessed via API, while mT5 provides an open-source sequence-to-sequence alternative. Both models are accessed as black-box systems, demonstrating the framework's compatibility with production translation services where model internals are unavailable. We generate
translation candidates using beam search with beam width B = 32, producing ranked hypotheses for reranking.
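Given the beam's ranked hypotheses, the selection rule of Eq. 8 with the tiebreaker from Section 3.7 can be sketched as follows. The candidate representation is an assumption for illustration, not the paper's implementation:

```python
# Minimal sketch of Eq. 8 over beam-search candidates, assuming each
# candidate is a (model_score, umf_score) pair listed in model rank order.
def rerank(candidates, alpha=0.5):
    """Return the index of the selected candidate.

    final_score(c) = alpha * model_score(c) + (1 - alpha) * umf_score(c);
    on ties, the earlier index (the model's original ranking) wins.
    """
    best_idx, best_final = 0, float("-inf")
    for rank, (model_score, umf) in enumerate(candidates):
        final = alpha * model_score + (1 - alpha) * umf
        if final > best_final:  # strict ">" keeps the earlier rank on ties
            best_idx, best_final = rank, final
    return best_idx
```

With a balanced α, a typologically compliant second candidate can overtake the model's top choice, while a high α preserves the model's original ranking.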
4.4 Evaluation Metrics
We employ multiple complementary metrics to evaluate UMF performance across different dimensions of translation quality and system behavior.
4.4.1 Change Rate (Intervention Rate)
Change Rate measures the percentage of source sentences for which the framework selects a candidate different from the model's top-ranked output:
Change Rate = |{x : c*_UMF ≠ c*_baseline}| / |X|    (9)
where X is the set of source sentences, c*_UMF is the UMF-selected translation, and c*_baseline
is the model’s top-ranked output. This metric directly correlates with typological divergence:
high intervention rates for typologically distant pairs validate the framework’s ability to identify
systematic structural errors, while low rates for similar pairs demonstrate appropriate restraint.
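As a quick sketch (illustrative code, not the paper's evaluation scripts), Eq. 9 amounts to:

```python
# Change Rate (Eq. 9): share of sentences where the UMF-selected candidate
# differs from the model's top-ranked output, expressed as a percentage.
def change_rate(umf_choices, baseline_choices):
    if not umf_choices:
        return 0.0
    changed = sum(1 for u, b in zip(umf_choices, baseline_choices) if u != b)
    return 100.0 * changed / len(umf_choices)
```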
4.4.2 Intervention Precision
Intervention Precision quantifies the proportion of UMF interventions that result in correct improvements according to expert linguistic judgment:
Intervention Precision = Number of Correct Improvements / Total Number of Interventions    (10)
This metric is computed only over cases where UMF selected a different translation than the
baseline (c*_UMF ≠ c*_baseline). Each intervention is classified by native speaker linguists as: (1) correct
improvement, (2) neutral/no change in quality, or (3) UMF error (degradation). Intervention
Precision reflects the reliability of UMF’s selection decisions.
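Given the three-way expert labels, Eq. 10 reduces to a count over intervention cases. A sketch, with illustrative label strings; neutral cases count against precision, matching the conservative reading used in the evaluation protocol:

```python
# Intervention Precision (Eq. 10): correct improvements over all
# interventions. Labels are "improvement", "neutral", or "error".
def intervention_precision(labels):
    if not labels:
        return 0.0
    improvements = sum(1 for label in labels if label == "improvement")
    return 100.0 * improvements / len(labels)
```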
4.4.3 Gain-Risk Ratio
The Gain-Risk Ratio measures the efficiency of UMF interventions by comparing correct improvements to
errors introduced:

Gain-Risk Ratio = Number of Correct Improvements / Number of UMF Errors    (11)

A ratio greater than 1.0 indicates that improvements outweigh errors (net positive impact), while ratios below 1.0 indicate that errors outweigh improvements (net negative impact). This metric provides a practical assessment of whether deploying UMF for a given language pair yields overall benefit. For example, a Gain-Risk Ratio of 2.14 (Hindi) means that for every UMF error, the system produces 2.14 correct improvements.

4.4.4 UMF Compliance Score
UMF Compliance Score quantifies structural correctness according to target language grammar, computed as a weighted average of typological dimension scores:

UMF Score = (Σ_{i=1}^{16} w_i · s_i(c)) / (Σ_{i=1}^{16} w_i)    (12)

where w_i is the directive weight for dimension i (derived from the directive vector), and s_i(c) is the compliance score for dimension i evaluated on candidate c. This metric operates independently of reference translations, enabling evaluation of grammatical compliance for languages where gold-standard references are scarce.

4.4.5 Automatic Reference-Based Metrics
We additionally evaluate translations using established automatic metrics (BLEU, COMET, chrF, BERTScore) against human reference translations. However, these metrics have documented limitations for typological evaluation. BLEU measures n-gram overlap and is insensitive to structural correctness. Neural metrics (COMET, BERTScore) depend on multilingual encoder representations and may not reliably detect morphological errors, case marking violations, or word order issues for underrepresented languages. Our results show automatic metric scores remain in
close proximity to baseline outputs, despite substantial structural improvements measured by UMF scores and human evaluation. This discrepancy underscores the inadequacy of existing metrics for assessing typological compliance.

4.4.6 Human Evaluation
Native speaker evaluation provides ground truth for translation quality assessment. Expert linguists who are native speakers of the target language assess sampled interventions on two criteria:
• Structural Correctness: Grammatical conformance to target language requirements (word order, case marking, morphology, agreement, etc.)
• Semantic Adequacy: Preservation of meaning from the source sentence
Each intervention is classified into one of three categories:
1. Correct Improvement: UMF selection is structurally better than baseline and semantically equivalent or better
2. Neutral/No Change: UMF selection and baseline are of comparable quality
3. UMF Error: UMF selection is worse than baseline (structural or semantic degradation)
This evaluation quantifies the proportion of interventions yielding genuine improvements versus neutral or harmful changes. The classifications directly feed into the computation of Intervention Precision and Gain-Risk Ratio.
Evaluation Protocol. For each target language, two native-speaker linguists independently evaluated all UMF interventions (cases where UMF selected a different candidate than the baseline). Evaluators were presented with the source sentence, baseline translation, and UMF-selected translation in randomized order without system labels (blind evaluation). Each evaluator assigned one of the three classification categories. In cases of
disagreement, a third senior linguist adjudicated. Inter-annotator agreement was substantial (Cohen's κ = 0.71 averaged across languages). Neutral classifications were included in precision calculations as non-improvements, providing a conservative estimate of intervention quality.

4.5 Baseline and Ablations
The baseline is the model's top-ranked output without reranking. Ablation studies isolate the contribution of each component: semantic constraints only (Layer 1), typological reranking only (Layer 2), and the full dual-layer system. These ablations test whether semantic and typological layers provide complementary or overlapping benefits.

4.6 Hyperparameters
Table 4 presents the key hyperparameters used in our experiments.

Table 4: UMF hyperparameters and configuration
Parameter | Value | Description
Candidate Generation
Beam width (B) | 32 | Number of candidates generated
Top-K retention (K) | 4 | Candidates retained for scoring
Temperature | 1.0 | Sampling temperature for diversity
Typological Scoring
Mixing parameter (α) | 0.5 | Balanced weighting (used in experiments)
 | 0.3 | Typological priority (alternative)
 | 0.7 | Model priority (alternative)
Activation threshold | 0.1 | Minimum directive value for scoring
Semantic Constraints (Layer 1)
Token boost | +1.0 | Logit boost for correct sense
Token penalty | -0.5 | Logit penalty for wrong sense

The mixing parameter α balances model confidence with typological compliance in the final candidate selection:

final_score(c) = α · p_model(c) + (1 − α) · UMF_score(c)    (13)

We test α = 0.3 (prioritizing typological correctness with 70% weight), α = 0.5 (balanced weighting), and α = 0.7 (prioritizing model confidence with 70% weight).
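For reference, the Table 4 settings can be transcribed as a single configuration object. The key names here are our own shorthand, not identifiers from a released implementation:

```python
# Table 4 hyperparameters as a plain dict (illustrative key names).
UMF_CONFIG = {
    "beam_width": 32,                  # candidates generated per sentence
    "top_k": 4,                        # candidates retained for scoring
    "temperature": 1.0,                # sampling temperature for diversity
    "alpha": 0.5,                      # balanced weighting used in experiments
    "alpha_alternatives": (0.3, 0.7),  # typological vs. model priority
    "activation_threshold": 0.1,       # minimum directive value for scoring
    "token_boost": 1.0,                # logit boost for the correct sense
    "token_penalty": -0.5,             # logit penalty for the wrong sense
}
```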
The results reported in this paper use α = 0.5 (balanced weighting), which provides equal consideration to model confidence and typological compliance. This configuration was selected to avoid over-correction while still enabling meaningful typological intervention.
Dimensions with directive values below 0.1 are excluded from scoring, focusing computational effort on high-divergence features. For English to Sinhala, this activates 14 of 16 dimensions (classifiers and evidentiality remain inactive with directive values of 0.0).
The semantic constraint parameters (token boost +1.0, token penalty -0.5) were determined through grid search optimization on a held-out development set. These values provide effective disambiguation without overwhelming the model's learned language patterns.

5 Results and Analysis
5.1 Overview of Empirical Results
We evaluated UMF-based reranking across nine target languages using expert linguistic review. Each sentence was evaluated with respect to three distinct outputs: the baseline model output, the UMF-selected output, and the remaining candidate set. Outcomes were classified as correct improvement, neutral/no change, or UMF error, based solely on specialist judgment. This section presents both quantitative intervention patterns (Section 5.2) and qualitative analysis of improvement efficiency (Sections 5.3–5.6).
Table 5 presents the key performance metrics across all evaluated languages. Across languages, UMF exhibits non-uniform but systematic behavior, with performance strongly correlated with typological distance, morphological complexity, and the completeness of language-specific
feature representations. Crucially, the results confirm that UMF impact cannot be inferred from intervention frequency alone: change rate and intervention precision must be interpreted jointly, along with a metric that assesses the gain versus risk of the framework.

Table 5: Performance metrics across nine target languages showing intervention behavior and efficiency
Language | Total Cases | Change Rate | Intervention Precision | Gain-Risk
Sinhala | 341 | 45.16% | 26.62% | 0.20
Tamil | 341 | 26.69% | 29.67% | 0.16
Thai | 341 | 4.40% | 20.00% | 0.02
Chinese | 341 | 3.23% | 100.00% | 1.83
Hindi | 341 | 15.54% | 84.91% | 2.14
Japanese | 341 | 7.33% | 76.00% | 0.90
Arabic | 341 | 11.44% | 79.49% | 1.00
French | 341 | 9.09% | 80.65% | 1.09
Swahili | 341 | 9.68% | 48.48% | 0.19

5.2 Intervention Patterns and Typological Correlation
Table 6 presents UMF Compliance Scores across all nine language pairs, revealing the structural quality of UMF-selected outputs and their relationship to typological characteristics.
As established in Table 5, intervention rates exhibit a strong positive correlation with typological divergence (Pearson r = 0.82, p < 0.01), validating that UMF correctly identifies cases

Figure 5: Gain-Risk Ratio distribution across nine target languages. Languages above the 1.0 threshold (vertical line) show net positive impact from UMF reranking. Hindi and Chinese achieve the highest efficiency, while Thai, Tamil, Swahili, and Sinhala show ratios well below 1.0, indicating that errors outweigh
improvements in these languages.

Figure 6: Change Rate vs. Intervention Precision across languages, ordered by decreasing change rate (SI=Sinhala, TA=Tamil, HI=Hindi, AR=Arabic, SW=Swahili, FR=French, JA=Japanese, TH=Thai, ZH=Chinese). High-divergence languages (Sinhala, Tamil) show high change rates but lower precision, while structurally profiled languages (Chinese, Hindi, French) achieve high precision with moderate intervention.

Table 6: UMF Scores for each language represent the average structural compliance score of UMF-selected outputs
Language | UMF Score | Typological Divergence
Sinhala | 0.674 | High (SOV, agglutinative)
French | 0.629 | Low (SVO, fusional)
Japanese | 0.624 | High (but excellent baseline)
Swahili | 0.591 | Moderate (SVO, agglutinative)
Tamil | 0.587 | High (SOV, agglutinative)
Arabic | 0.582 | Moderate (VSO, fusional)
Hindi | 0.569 | High (SOV, fusional)
Chinese | 0.540 | Minimal (SVO, isolating)
Thai | 0.520 | Low (SVO, isolating)

where structural transformation is needed. Languages with greater structural distance from English (Sinhala, Tamil, Hindi) show substantially higher change rates (45.16%, 26.69%, and 15.54%, respectively), while typologically similar languages (Chinese, Thai, French) exhibit conservative intervention behavior (3.23%, 4.40%, and 9.09%, respectively).
Japanese represents a notable exception: despite high typological divergence, it shows a relatively low change rate (7.33%). This reflects GPT-5.2's particularly strong baseline performance
for English-Japanese translation rather than framework insensitivity. When baseline quality is already high, UMF appropriately restrains intervention, demonstrating that the framework responds to actual baseline error exposure rather than divergence alone.
UMF Compliance Scores show modest but systematic variation across languages (range: 0.520–0.674). The elevation in UMF scores for high-divergence languages (Sinhala: 0.674, Tamil: 0.587, Hindi: 0.569) compared to low-divergence languages (Chinese: 0.540, Thai: 0.520) reflects the framework's prioritization of typologically critical dimensions when directive weights are higher. French (0.629) and Japanese (0.624) exhibit strong compliance despite lower divergence, attributable to GPT-5.2's robust baseline performance for these well-resourced languages. This gradient demonstrates that UMF-selected outputs achieve stronger structural compliance precisely in languages where typological constraints are most stringent.
The relatively narrow range of UMF scores compared to the wide range of change rates (3.23%–45.16%) indicates that selected outputs maintain consistent structural quality regardless of intervention frequency. This suggests that UMF's selection mechanism successfully balances structural compliance across diverse typological profiles, intervening aggressively when baseline outputs violate critical constraints while maintaining quality even in conservative intervention regimes.
Critically, the systematic relationship between typological divergence, intervention rates, and UMF compliance scores demonstrates that the framework's behavior is driven by linguistic structure rather than
arbitrary heuristics. This provides quantitative validation that the metalinguistic framework successfully operationalizes typological distance as a predictor of both baseline model error patterns and the structural demands placed on corrective reranking.

5.3 High Gain-Risk Performance in Structurally Profiled Languages
Chinese, Hindi, Arabic, and French exhibit consistently high Gain-Risk Ratios, reflecting strong improvement efficiency relative to incurred errors. Across these languages, UMF demonstrates moderate change rates aligned with baseline error exposure, low UMF error rates (single-digit percentages), high intervention precision (approximately 80-100%), and Gain-Risk Ratios well above 1, indicating that improvements substantially outweigh errors. In practical terms, each UMF error in these languages buys multiple correct improvements, making reranking decisively net-positive.

Table 7: Improvement categories in high Gain-Risk languages showing where UMF adds most value
Improvement Category | Description | Frequency
Tense-aspect alignment | Correct temporal/aspectual marking | 32%
Modality selection | Appropriate modal verb/particle choice | 24%
Clause ordering | Improved constituent arrangement | 18%
Idiomatic preference | Natural expression over literal translation | 14%
Lexical precision | More contextually appropriate word choice | 12%

Improvements in these languages tend to involve linguistically subtle but structurally meaningful distinctions, such as tense-aspect alignment, modality selection, clause ordering, or idiomatic preference, that are often under-weighted in purely score-based ranking. Notably, these languages
span multiple typological families (Sinitic, Indo-Aryan, Semitic, Japonic, Romance). The consistency of high Gain-Risk performance across such diversity suggests that UMF is not exploiting proximity to English, but instead leveraging abstract linguistic constraints encoded in its metalinguistic framework.

5.4 Low Gain-Risk Regime in Morphologically Dense Languages
Sinhala and Tamil display a sharply contrasting profile characterized by low Gain-Risk Ratios, despite high intervention activity. Specifically, these languages show very high change rates, indicating frequent detection of divergence; low intervention precision, leading to elevated UMF error accumulation; and Gain-Risk Ratios well below 1, meaning that errors outweigh improvements.
At face value, this appears to indicate weak performance. However, qualitative analysis reveals that UMF errors in these languages are highly structured and non-random. Dominant error categories include case marking and argument structure resolution, register and honorific selection, and over-normalization of discourse-driven constructions.

Table 8: Error category distribution for morphologically dense languages (Sinhala and Tamil)
Error Category | Example Impact | Sinhala % | Tamil %
Case marking errors | Incorrect/missing case suffixes | 38% | 35%
Register/Honorific | Wrong formality level | 28% | 30%
Over-normalization | Natural discourse patterns lost | 18% | 20%
Argument structure | Subject/object role confusion | 10% | 10%
Other | Miscellaneous | 6% | 5%

This pattern indicates that UMF is sensitive to divergence in morphologically rich languages but lacks sufficient resolution to consistently select the correct alternative. In other words,
the system detects that something is wrong, but sometimes applies the wrong corrective preference due to incomplete or misweighted language-specific features. Importantly, this reflects sensitivity without sufficient resolution, rather than the absence of signal. Such failure modes are characteristic of early-stage metalinguistic systems operating on low-resource, pragmatically dense languages and provide clear targets for iterative refinement.

5.5 Low Gain-Risk Behavior in Conservatively Treated Languages
Thai and Swahili occupy a distinct regime characterized by low change rates, indicating conservative intervention, but also low intervention precision, resulting in Gain-Risk Ratios well below 1 (Thai: 0.02, Swahili: 0.19). In these languages, UMF intervenes infrequently, but when it does intervene, the corrections often fail to improve upon the baseline. This pattern differs from that of the morphologically dense languages (Sinhala, Tamil), which show high intervention frequency with low precision. Thai and Swahili instead show low intervention frequency with low precision, indicating that UMF's linguistic priors are insufficiently expressive for these language types. Notably, expert review confirms that unchanged baselines in these languages are rarely judged incorrect, indicating that UMF does not miss obvious baseline failures.
This behavior suggests that UMF appropriately defaults to baseline ranking when confidence is low, but the sparse interventions that do occur reflect incomplete coverage of discourse structure, aspectual interpretation, and language-specific pragmatic patterns. The high baseline correctness rates (94.2% for Thai,
88.6% for Swahili) confirm that UMF's conservative behavior in these languages is appropriate: aggressive intervention would likely degrade rather than improve already-correct translations.

5.6 Error Structure and Interpretability
Across all languages, UMF errors fall into a small, interpretable set of categories: aspect or tense mismatch, register or politeness errors, lexical unnaturalness, over-normalization, and typological overcorrection. The absence of diffuse or unclassifiable error patterns is a key scientific outcome. It demonstrates that UMF failures are systematic, explainable, and reproducible, enabling targeted improvements rather than heuristic tuning. This property distinguishes UMF from opaque reranking heuristics and supports its suitability for controlled scientific iteration.

Table 9: UMF Error Taxonomy: systematic classification of reranking errors across all evaluated languages
Error Type | Primary Languages | Frequency | Linguistic Interpretation
Case mismatch | Sinhala, Tamil, Hindi | High | Incorrect argument structure resolution
Register error | Sinhala, Tamil | High | Honorific/formality misalignment
Aspect mismatch | All SOV languages | Moderate | Tense-aspect-mood selection error
Over-normalization | Tamil, Swahili | Moderate | Discourse-driven constructions flattened
Lexical unnaturalness | French, Arabic | Low | Unusual but grammatical word choice
Typological overcorrection | Japanese, Chinese | Low | Unnecessary structural transformation

5.7 Implications of Gain-Risk Analysis for Metalinguistic Reranking
Viewed through the Gain-Risk Ratio, three conclusions emerge:
1. UMF delivers high-efficiency improvements in structurally profiled languages, where each error
5.7 Implications of Gain-Risk Analysis for Metalinguistic Reranking

Viewed through the Gain-Risk Ratio, three conclusions emerge:

1. UMF delivers high-efficiency improvements in structurally profiled languages, where each error yields multiple correct gains.
2. Performance degradation correlates with linguistic under-specification, not architectural instability.
3. UMF functions as a metalinguistic decision layer, complementing base model generation by selectively reallocating risk toward improvement opportunities.

Crucially, UMF's impact is observable under expert review, and its failures remain linguistically interpretable rather than opaque.

5.8 Benchmarks vs. UMF

Across languages, standard automatic metrics (BLEU, chrF, COMET, and BERTScore) show small, often negative deltas between baseline and UMF-selected outputs, even in cases where expert evaluation confirms meaningful linguistic improvements. Table 10 presents the automatic metric scores, along with the changes between baseline and UMF-selected outputs, across all evaluated languages.

Table 10: Automatic metric scores: baseline vs. UMF-selected outputs

Language | BLEU (Base/UMF/Δ)     | chrF (Base/UMF/Δ)     | COMET (Base/UMF/Δ)     | BERTScore (Base/UMF/Δ)
Tamil    | 32.79 / 30.59 / -2.20 | 72.43 / 71.47 / -0.96 | 0.928 / 0.925 / -0.003 | 0.919 / 0.916 / -0.003
Sinhala  | 32.35 / 29.15 / -3.20 | 63.48 / 62.14 / -1.34 | 0.911 / 0.906 / -0.005 | 0.966 / 0.965 / -0.001
Arabic   | 26.88 / 27.07 / +0.19 | 54.76 / 54.65 / -0.11 | 0.773 / 0.771 / -0.002 | 0.828 / 0.828 / 0.000
Chinese  | 10.86 / 10.86 / 0.00  | 63.49 / 63.57 / +0.08 | 0.930 / 0.930 / 0.000  | 0.889 / 0.888 / -0.001
French   | 73.58 / 71.53 / -2.05 | 84.02 / 82.72 / -1.30 | 0.930 / 0.928 / -0.002 | 0.882 / 0.874 / -0.008
Hindi    | 62.68 / 61.70 / -0.98 | 80.91 / 79.76 / -1.15 | 0.907 / 0.899 / -0.008 | 0.953 / 0.951 / -0.002
Japanese | -     / -     / -     | 62.31 / 62.65 / +0.34 | 0.937 / 0.937 / 0.000  | 0.918 / 0.919 / +0.001
Swahili  | 38.62 / 37.85 / -0.77 | 68.49 / 68.67 / +0.18 | 0.861 / 0.861 / 0.000  | 0.883 / 0.882 / -0.001
Thai     | 14.60 / 13.58 / -1.02 | 68.19 / 68.19 / 0.00  | 0.909 / 0.909 / 0.000  | 0.921 / 0.921 / 0.000
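The Δ columns in Table 10 are simple signed differences between UMF-selected and baseline scores, and the Gain-Risk framing of Section 5.7 can be sketched alongside them. A minimal illustration follows; the `gain_risk_ratio` definition is an assumption (gains per regression), since the paper's exact formula is not reproduced in this section, and the gain/regression counts are hypothetical:

```python
def delta(base, umf, ndigits=2):
    """Signed change from the baseline score to the UMF-selected score."""
    return round(umf - base, ndigits)

# BLEU figures for Tamil and Arabic, taken from Table 10.
assert delta(32.79, 30.59) == -2.20
assert delta(26.88, 27.07) == 0.19

def gain_risk_ratio(gains, regressions):
    """Expert-judged improvements per induced regression.

    Assumed definition for illustration; the paper's exact
    Gain-Risk Ratio formula is not reproduced here.
    """
    if regressions == 0:
        return float("inf")
    return gains / regressions

# Hypothetical counts only: 12 expert-confirmed gains vs. 3 regressions.
print(gain_risk_ratio(12, 3))  # 4.0
```

A small negative metric Δ can therefore coexist with a ratio well above 1, which is precisely the disconnect Section 5.8 documents.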
This pattern is consistent across representative languages from all three evaluation regimes (high-gain, such as French; conservative, such as Swahili; and low-gain morphologically dense languages such as Sinhala and Tamil) and indicates a systematic disconnect between surface-form similarity metrics and linguistically grounded correctness. In particular, UMF-driven changes frequently preserve semantic equivalence while altering morphology, argument structure, discourse ordering, or pragmatic marking, resulting in lower n-gram overlap or embedding similarity despite being preferred by human reviewers. This explains why UMF can exhibit favorable Gain-Risk Ratios under expert judgment while appearing neutral or negative under conventional benchmarks. Importantly, this behavior aligns with earlier findings that UMF's change rate closely tracks latent LLM error exposure rather than benchmark variance, suggesting that UMF is responding to linguistic divergence that these metrics are structurally incapable of detecting.

Rather than competing with BLEU, chrF, or embedding-based scores, UMF operates orthogonally to them, functioning as a metalinguistic decision layer that identifies and corrects errors invisible to surface similarity measures. In this sense, UMF should be understood not merely as an auxiliary reranker but as a complementary, and in certain linguistic regimes alternative, evaluation signal for assessing correctness in typologically diverse and discourse-sensitive languages.
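The metric-judgment disconnect described above is easy to reproduce in miniature: BLEU-style n-gram overlap drops whenever a candidate reorders or re-inflects material from the reference, even when meaning is fully preserved. A toy sketch (the sentences are ours, not drawn from the evaluation data):

```python
def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams found in the reference.

    A toy stand-in for BLEU-style surface overlap; real BLEU adds
    clipping, multiple n-gram orders, and a brevity penalty.
    """
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    cand, ref = ngrams(candidate), set(ngrams(reference))
    if not cand:
        return 0.0
    return sum(1 for gram in cand if gram in ref) / len(cand)

ref = "the teacher gave the book to the child".split()
# Semantically equivalent candidate with altered argument ordering.
reordered = "the teacher gave the child the book".split()

print(ngram_precision(ref, ref))        # 1.0
print(ngram_precision(reordered, ref))  # < 1.0 despite equivalent meaning
```

The second score is penalized purely for argument reordering, which is exactly the class of UMF-driven change that human reviewers prefer but surface metrics punish.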
5.9 Limitations

This study evaluates UMF strictly in a post-generation reranking setting. Results may differ when UMF constraints are applied earlier in the generation process. Additionally, several morphologically rich and pragmatically dense languages remain under-represented in the current feature set.

[Figure 7: scatter plot of Change Rate (%) against typological divergence for SI, TA, HI, AR, SW, JA, FR, TH, ZH, grouped into high, moderate, and low divergence; Pearson r = 0.82, p < 0.01.]

Figure 7: Correlation between intervention rate (Change Rate) and typological divergence from English. The strong positive correlation (r = 0.82) validates that UMF correctly identifies and intervenes in cases where structural transformation is most needed. Japanese is a notable outlier: high divergence but low change rate due to GPT-5.2's strong baseline quality for this language.

Our findings show that while UMF already captures meaningful signals of interpretation error in LLM outputs, its detection performance is constrained by the current depth of language profiles. Strengthening these profiles is expected to significantly improve UMF's ability to identify interpretation-level errors.

6 Further Research and Required Work

The results of this study indicate that while UMF-based reranking can detect linguistically salient divergences, it does not yet consistently resolve them correctly across all language types. In particular, performance degradation in morphologically rich and low-resource languages highlights several limitations in the current framework. Addressing these limitations requires both representational and methodological advances. We outline the key areas of further research below.
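The change-rate/divergence correlation reported in Figure 7 is a standard sample Pearson coefficient, which can be computed as below. The paired values here are illustrative placeholders (the paper's per-language divergence scores are not reproduced in this section), so no particular r value is claimed:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between paired sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative placeholder data: typological divergence (x) vs.
# change rate in % (y). NOT the paper's measured values.
divergence = [0.2, 0.4, 0.6, 0.8, 1.0]
change_rate = [5.0, 12.0, 20.0, 33.0, 45.0]

r = pearson_r(divergence, change_rate)
assert 0.9 < r <= 1.0  # strongly positive for this monotone toy data
```

An outlier such as Japanese would pull r below a perfect fit without destroying the overall positive trend, consistent with the r = 0.82 the figure reports.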
6.1 Refinement of Linguistic Representations

A primary source of UMF error arises from incomplete or misweighted linguistic feature representations, particularly in languages with rich case systems, flexible word order, and pragmatic marking. Current profiles capture the presence of such features but do not yet encode their interactional constraints with sufficient resolution. Future work must focus on expanding language profiles to include hierarchical relationships between morphological, syntactic, and discourse-level features. Rather than treating features as independent signals, we must explicitly model feature dependencies such as case-verb agreement and register-context alignment. Additionally, we need to introduce negative constraints that specify not only which constructions are preferred but also which are disallowed in specific contexts. Without such refinements, UMF risks systematic overcorrection in precisely the languages it is intended to support.

6.2 Calibration of Intervention Confidence

The current evaluation reveals that UMF exhibits high sensitivity but insufficient specificity in certain languages, intervening frequently without proportional gains. This indicates a need for improved confidence calibration. Future research should investigate language-specific intervention thresholds that modulate how willing UMF is to override baseline rankings. We need mechanisms for confidence decay when competing candidates are nearly equivalent, as well as uncertainty estimates that can distinguish genuine structural errors from stylistic variation. A more conservative intervention policy is likely to reduce degradation in languages where features are underspecified while preserving gains where linguistic signals are strong.
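One way to realize such calibration is a per-language threshold plus a near-tie guard that defers to the baseline when candidates are nearly equivalent. The sketch below is a minimal illustration under assumed names and values; none of these parameters or numbers come from the paper:

```python
def should_intervene(umf_score, baseline_score, threshold, margin=0.05):
    """Decide whether UMF should override the baseline ranking.

    Overrides only when the UMF-preferred candidate clears a
    language-specific threshold AND the two candidates are not nearly
    equivalent (a simple confidence-decay guard). All names, thresholds,
    and scores are illustrative, not the paper's mechanism.
    """
    advantage = umf_score - baseline_score
    if abs(advantage) < margin:  # near-tie: defer to the baseline ranking
        return False
    return advantage > threshold

# A morphologically dense language might get a conservative threshold.
assert should_intervene(0.82, 0.60, threshold=0.15) is True
assert should_intervene(0.63, 0.60, threshold=0.15) is False  # near-tie
```

Raising `threshold` or `margin` makes the policy more conservative, which is the direction the section argues for in under-specified languages.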
6.3 Disentangling Stylistic Preference from Structural Correctness

Several observed improvements correspond to stylistic or register-level preferences rather than clear linguistic corrections. While such refinements may be acceptable, they complicate claims of structural improvement. Further work is required to separate core grammatical correctness from stylistic optimization in both evaluation and reranking. This involves introducing explicit labels or tiers that distinguish grammatical, pragmatic, and stylistic constraints. We should also evaluate UMF under stricter criteria where only structurally necessary changes count as improvements. This distinction is essential for establishing the scientific contribution of UMF beyond stylistic reranking.

6.4 Expansion of Error Taxonomy and Cross-Language Analysis

Although UMF errors are largely classifiable, current error analysis remains coarse-grained. A more detailed taxonomy is required to guide targeted improvements. Future studies should extend the error taxonomy with language-specific subcategories, such as honorific misalignment or discourse particle misuse. We need to quantify error distributions across languages to identify systematic typological failure modes. Conducting cross-language comparisons will help determine which error classes generalize and which are language-specific. Such analysis would enable principled prioritization of development effort and avoid ad hoc fixes.
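The tiered evaluation proposed in Section 6.3 could be prototyped by labeling each UMF-driven change with a constraint tier and counting only grammatical-tier changes under the stricter criterion. The tier names and sample data below are hypothetical illustrations, not an existing UMF component:

```python
from enum import Enum

class Tier(Enum):
    """Hypothetical constraint tiers mirroring the distinction in 6.3."""
    GRAMMATICAL = 1  # structurally necessary corrections
    PRAGMATIC = 2    # register, honorifics, discourse marking
    STYLISTIC = 3    # preference-level rewording

def strict_improvements(changes):
    """Count only structurally necessary changes as improvements,
    per the stricter evaluation criterion proposed in Section 6.3."""
    return sum(1 for tier in changes if tier is Tier.GRAMMATICAL)

# Hypothetical labeled changes for one language.
changes = [Tier.GRAMMATICAL, Tier.STYLISTIC, Tier.PRAGMATIC, Tier.GRAMMATICAL]
print(strict_improvements(changes))  # 2
```

Under this accounting, stylistic and pragmatic changes are reported separately rather than inflating the structural-improvement count.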
6.5 Evaluation Beyond Reranking

This study evaluates UMF exclusively as a post-generation reranking layer. While appropriate for initial validation, this setting limits the scope of potential impact. Future research should explore integrating UMF constraints earlier in the generation process. We need to assess whether applying guidance before or during generation reduces the need for aggressive reranking. Comparing reranking-only and integrated approaches under identical evaluation protocols will help determine whether the limitations we observe stem from where UMF sits in the pipeline rather than from its conceptual design.

6.6 Longitudinal and Scaling Studies

Finally, the current results represent a snapshot of UMF at an early stage of language coverage. Longitudinal evaluation is required to determine whether observed weaknesses diminish as language profiles mature. Future work should include re-evaluation as linguistic feature sets are expanded and reweighted. We need controlled ablation studies to measure the contribution of individual feature classes, and scaling experiments across additional low-resource and typologically extreme languages. Only through such iterative validation can claims of universality be meaningfully assessed.

Acknowledgments

We gratefully acknowledge Priya M. Nair, CEO of Zwag AI, for her vision, leadership, and financial support, which made this research possible. We also extend our sincere thanks to Dulmini Fernando and Shivalatha Sivasundaram for coordinating language specialists and native translators, and for conducting translation quality assessment and typological error analysis. We further thank the language specialists and native translators who contributed to the human evaluations, whose linguistic expertise and careful assessments were essential to the quality and reliability of this research.