Multilingual Language Models
Summary
This study investigates how multilingual language models (LMs) organize representations for diverse languages. Using the Language Activation Probability Entropy (LAPE) metric and Sparse Autoencoders (SAEs), the researchers analyze language-associated units across different model families and scales. Key findings show that these units are strongly conditioned on orthography: romanizing non-Latin languages leads to near-disjoint representations that do not align with native scripts or English. In contrast, word-order shuffling has limited impact on unit identity. Probing reveals that typological structure becomes more accessible in deeper layers, while causal interventions indicate that generation relies more on units invariant to surface perturbations than on typologically aligned units. The study concludes that multilingual LMs prioritize surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua. These results highlight the persistent role of orthography in shaping internal representations, even in larger models with high semantic competence.
Multilingual Language Models Encode Script Over Linguistic Structure

Aastha A K Verma†,1, Anwoy Chatterjee†,1, Mehak Gupta1, Tanmoy Chakraborty1,2
1 Indian Institute of Technology Delhi, New Delhi, India
2 Indian Institute of Technology Delhi, Abu Dhabi, UAE
aastha.v1411@gmail.com, anwoychatterjee@gmail.com, mehak.gupta.tech@gmail.com, tanchak@iitd.ac.in

Abstract

Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties – abstract language identity or surface-form cues – shape multilingual representations. To do so, we analyze language-associated units across different model families and scales using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.

1 Introduction

Language is an amalgamation of historical accidents, cognitive constraints, and cultural evolution.
It is rarely a monolith; rather, it emerges as a layered outcome of interactions among peoples, geographies, and time (Thomason and Kaufman, 1988; Toscano et al., 2008; Smith and Kirby, 2008; Beckner et al., 2009; Evans and Levinson, 2009; Michaud, 2024). Modern English illustrates this clearly: while it is taxonomically a West Germanic language, sharing core syntactic and phonological structure with German and Dutch, its lexicon is heavily shaped by Romance influence through Latin and French (Baugh and Cable, 2002; Crystal, 2003; Wardhaugh and Fuller, 2014). When a sentence such as “the magnitude of liberty” is processed, Latinate vocabulary is embedded within a Germanic grammatical frame (Reppucci, 2017). This raises a fundamental question for modern auto-regressive language models (LMs): do they internally preserve such linguistic distinctions, or do they abstract away surface variations into a shared, language-agnostic representation?

†These two authors contributed equally to this work.
https://github.com/loadthecode0/multilingual-interpretability

This question becomes especially crucial in multilingual settings. When a model processes typologically distant languages such as English, Hindi, and Chinese, does it rely on distinct internal representations for each language, or does it converge toward a shared interlingual latent space? Insights from bilingual cognition show that shared semantic representations can coexist with segregated surface-form processing (Costa and Sebastián-Gallés, 2014; Marian et al., 2003; Buchweitz et al., 2011; Miozzo et al., 2010). However, within NLP, this distinction remains underexplored in modern auto-regressive multilingual models.
Investigating these models across different parameter scales allows us to determine whether the trade-offs between surface-form processing and linguistic abstraction are mere artifacts of limited capacity or fundamental properties of multilingual architectures.

arXiv:2604.05090v2 [cs.CL] 20 Apr 2026

Recent work has begun to probe this question (Tang et al., 2024; Kojima et al., 2024; Deng et al., 2025; Andrylie et al., 2025). Specifically, Tang et al. (2024) introduced the Language Activation Probability Entropy (LAPE) metric to identify neurons that preferentially activate for specific languages in multilingual LMs. They showed that a relatively small subset of neurons, concentrated primarily in early and late layers, has a strong influence on language selection and can be causally manipulated to steer the output language. Subsequent work extended this approach using Sparse Autoencoders (SAEs) in a method referred to as SAE-LAPE (Andrylie et al., 2025), which decomposes dense activations into sparse latent features and performs selection of language-associated features in the latent space using LAPE. Related intervention-based analyses similarly suggest that language control can be induced by targeting carefully selected units (Gurgurov et al., 2025; Rahmanisa et al., 2025). These studies show that language-associated units exist and can be causally manipulated, but they leave open a key question: what linguistic properties do these language-associated units encode?

In this work, we systematically investigate this question by analyzing language-associated units at two complementary levels: raw model neurons in the MLP sublayers that directly affect generation, and sparse latent features extracted with SAEs for interpretability.
Rather than assuming these units encode abstract language identity, we test their sensitivity to orthography, word order, and deeper linguistic structure. We study these representations across different model families and scales – specifically in Llama-3.2-1B, Llama-3-8B, Gemma-2-2B, and Gemma-2-9B – analyzing languages that span Latin, Cyrillic, Devanagari, Perso-Arabic, and logographic scripts. This diverse selection ensures our observations reflect broad architectural traits rather than scale-specific bottlenecks.

Our analysis is guided by four research questions:
(i) Language vs. script: do language-associated units encode abstract language identity, or are they primarily tied to orthographic form? Furthermore, does semantic competence in a given script guarantee representational alignment? In particular, does romanizing a language (e.g., Hindi or Chinese written in Latin script) activate the same neurons as its native script?
(ii) Robustness to structural perturbation: how stable are these units when word order is disrupted?
(iii) Typological alignment: do language-associated units correlate with known typological properties, such as genealogy, phonology, or syntax, as captured by lang2vec (Littell et al., 2017)?
(iv) Layer-wise organization: how does the accessibility of these properties vary across network depth, and how are they organized in deeper layers?

To answer these questions, we combine sparse feature extraction with a series of controlled experiments. We analyze the behaviour of language-associated units under script romanization, structural perturbations, typological probing, and causal intervention.
Across these analyses, several consistent patterns emerge:
• Language-associated units are largely script-bound: native and romanized variants of non-Latin languages activate almost disjoint sets of language-associated units, whereas shared scripts exhibit significant overlap. Notably, units associated with romanized non-Latin inputs align with neither their native counterparts nor English, indicating fragmented representations within the LMs, even when the models exhibit high semantic competence on the romanized text (cf. Sections 4 and 8).
• Disrupting word order has only a minor effect on unit identity, suggesting reliance on lexical statistics or orthographic cues rather than syntactic structure (cf. Section 5).
• Units in deeper layers show stronger typological alignment, indicating increased representational accessibility with depth (cf. Section 6). Causal interventions further show that functional importance during generation is more closely associated with invariance to surface perturbations than with typological alignment alone (cf. Section 7).

Together, these findings distinguish representational accessibility from functional necessity in multilingual LMs: language-associated units are closely tied to surface form, while deeper linguistic regularities become accessible with depth, and causal importance aligns more with invariance to surface perturbations than with representational alignment alone.

Key Takeaway
Language-associated units primarily encode surface form, and units invariant to surface perturbations play a central role in generation.
2 Related Work

Prior work has shown that multilingual language models do not form a fully language-agnostic interlingua, but instead organize representations in a partially shared space structured by language identity and similarity (Johnson et al., 2017; Pires et al., 2019; Libovický et al., 2020). Neuron-level analyses further demonstrated that language control can be localized to specific internal units. In particular, Tang et al. (2024) introduced the LAPE metric to identify language-selective neurons and showed that manipulating a small subset, often in early and late layers, can steer output language. Subsequent work confirmed that targeted interventions on such units enable controlled language switching (Kojima et al., 2024; Gurgurov et al., 2025; Rahmanisa et al., 2025). While these studies establish the functional relevance of language-associated units, they leave open what linguistic properties these units encode.

In parallel, SAEs have been proposed to decompose dense transformer activations into more interpretable sparse features (Bau et al., 2017; Shi et al., 2025), and have recently been applied to identify language-associated features in multilingual models (Andrylie et al., 2025; Deng et al., 2025). Separately, work on typology and script effects shows that orthography and transliteration can strongly shape multilingual representations and cross-lingual alignment (Littell et al., 2017; Artetxe et al., 2020; Jauhiainen et al., 2019).

Our work connects these threads by moving from identification to interpretation: we test whether language-associated units – both raw neurons and sparse features – encode abstract linguistic structure or are primarily driven by surface-form cues.
In doing so, we contextualize recent literature surrounding the “interlingua” hypothesis, which often highlights semantic alignment and shared grammatical concepts across typologically diverse languages (Wendler et al., 2024; Schut et al., 2025; Brinkmann et al., 2025; Fierro et al., 2025). Our findings complement these works by demonstrating that while semantic alignment is achievable, it does not necessitate the topological collapse of representations into a single manifold. Instead, language-neutral components coexist with a persistent set of script-specific neurons. By identifying script as a primary barrier to global unification, our work reveals that what appears as a unified space is actually deeply fragmented when orthography varies. For a more detailed discussion of prior works, we refer the reader to Appendix B.

3 Analysis Framework

Terminology. We adopt the term unit as a unifying abstraction for the atomic elements of representation. Specifically, a unit refers to either a raw neuron – an individual element of the MLP’s hidden activation vector – or an SAE feature representing a single direction within the latent space of the SAE. Accordingly, we define a language-associated unit as any unit that exhibits high selectivity for a specific target language, as quantified by the LAPE metric.

Identifying Language-Associated Units. Our analysis builds on the LAPE framework (Tang et al., 2024) and its sparse extension SAE-LAPE (Andrylie et al., 2025) to identify language-associated structure in multilingual LMs. For each transformer layer ℓ, we analyze both raw feed-forward (MLP) activations h_ℓ(x) and sparse latent representations obtained via pre-trained SAEs.
Language association is quantified using LAPE: for each neuron or SAE feature f, we estimate its activation probability across languages and compute the entropy of this distribution. Units with low entropy and a dominant language are selected as language-associated, yielding a set N_{ℓ,L} for each layer and language. Details of the LAPE and SAE-LAPE procedures along with the hyperparameters used are provided in Appendix C.

Models and Representations. We conduct experiments across multiple model families and scales, specifically Llama-3.2-1B, Llama-3-8B (Grattafiori et al., 2024), Gemma-2-2B, and Gemma-2-9B (Team et al., 2024). Prior work has applied LAPE and SAE-based analyses to Llama-family and Gemma-family models (Tang et al., 2024; Andrylie et al., 2025; Deng et al., 2025), motivating our choice of architectures and sparse decompositions. Following this line of work, we use open-sourced Top-K SAEs (footnote 1) for the Llama models and JumpReLU SAEs for the Gemma models (Lieberum et al., 2024), focusing on MLP sublayers. For clarity of exposition, we primarily present results for Llama-3.2-1B in the core analysis sections of the main paper; corresponding analyses for Gemma-2-2B are provided in the Appendix, and validations on the larger 8B and 9B architectures are detailed in Section 8 and Appendix H.

Experimental Design. We design a set of targeted experiments to probe what linguistic properties language-associated units encode, including (i) controlled script perturbations via romanization, (ii) robustness tests under word-order shuffling, (iii) typological probing against lang2vec features, and (iv) targeted causal interventions.
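The LAPE selection step described under "Identifying Language-Associated Units" can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy activation probabilities, and the two thresholds (`max_entropy`, `min_prob`) are assumptions made here for the example, and the actual hyperparameters are those reported in Appendix C.

```python
import math

def lape(activation_probs):
    """Entropy of one unit's activation probabilities, normalized across languages."""
    total = sum(activation_probs.values())
    if total == 0:
        return float("inf")  # never-active unit: treated here as non-selective
    entropy = 0.0
    for p in activation_probs.values():
        q = p / total  # normalize so the values form a distribution over languages
        if q > 0:
            entropy -= q * math.log(q)
    return entropy

def select_language_units(probs_by_unit, max_entropy, min_prob):
    """Map each language to the low-entropy units that fire often for it.

    max_entropy and min_prob are hypothetical thresholds for illustration.
    """
    selected = {}
    for unit, probs in probs_by_unit.items():
        if lape(probs) > max_entropy:
            continue  # unit responds too evenly across languages
        for lang, p in probs.items():
            if p >= min_prob:
                selected.setdefault(lang, set()).add(unit)
    return selected

# Toy activation probabilities (made up): u0 is Hindi-selective, u1 is not selective.
probs_by_unit = {
    "u0": {"en": 0.02, "hi": 0.90, "zh": 0.01},
    "u1": {"en": 0.50, "hi": 0.48, "zh": 0.52},
}
units = select_language_units(probs_by_unit, max_entropy=0.5, min_prob=0.5)
```

Under these toy values only `u0` is kept, and only for Hindi; the near-uniform unit `u1` has entropy close to log 3 and is discarded.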
As each experiment involves distinct language sets, perturbations, and evaluation protocols, we describe the detailed setups in the corresponding sections.

1 https://huggingface.co/EleutherAI/sae-Llama-3.2-1B-131k, https://huggingface.co/EleutherAI/sae-llama-3-8b-32x

[Figure 1 panels: (a) For Raw Neurons: Native (n=206), Romanized (No Diacritics) (n=120), Romanized (With Diacritics) (n=244); (b) For SAE Features: Native (n=134), Romanized (No Diacritics) (n=144), Romanized (With Diacritics) (n=265).]

Figure 1: Overlap of language-associated units for Hindi under script variation in Llama-3.2-1B. Euler diagrams show units shared among up to three languages for (a) raw neurons and (b) SAE features. Native, Romanized (with diacritics), and Romanized (without diacritics) inputs activate largely disjoint sets in both representations. Corresponding results for all languages, for both raw neurons and SAE features, and for Gemma-2-2B are shown in Figures 10 and 9 in Appendix D.2.

4 Orthography as a Barrier to Latent Language Abstraction

A central question in multilingual representation learning is whether neurons or features identified as language-associated encode abstract linguistic identity or merely respond to orthographic surface form. To disentangle these factors, we conduct a controlled romanization experiment that isolates script variation while holding lexical content and sentence structure fixed.
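Figure 1 distinguishes romanized text with and without diacritics. As a rough illustration of how an ASCII-only variant might be derived from a diacritic-preserving romanization, the sketch below strips combining marks using Python's standard `unicodedata` module. It is a stand-in chosen for this example, not the paper's transliteration tooling; the function name `strip_diacritics` and the sample string are hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (e.g., macrons) from already-romanized text."""
    # NFD splits a precomposed letter such as "ā" into "a" plus a combining macron.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into canonical precomposed form.
    return unicodedata.normalize("NFC", kept)

with_diacritics = "namaste duniyā"  # illustrative romanized string
ascii_only = strip_diacritics(with_diacritics)
```

Letters that have no canonical decomposition (for example "ø") pass through unchanged, so this approach only approximates a full Latin-to-ASCII transform.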
Experimental Setup. We use sentence-aligned data from the dev split of FLORES+ (footnote 2), an extension of the FLORES-200 dataset (NLLB Team et al., 2024), covering a typologically and orthographically diverse set of languages spanning Abugida, Abjad, Cyrillic, Logographic, and Syllabic scripts. For each non-Latin language, we construct a parallel Romanized corpus using the ICU Transliterator (The Unicode Consortium, 2024). Where applicable, we generate two Romanized variants: one preserving diacritics and one ASCII-only version with diacritics removed. Language-associated units are identified independently for native and Romanized inputs using the LAPE criterion for raw neurons and SAE-LAPE for sparse features, and overlap is quantified using Jaccard similarity. Further experimental details are provided in Appendix D.

2 https://huggingface.co/datasets/openlanguagedata/flores_plus

[Figure 2 plot: Jaccard similarity (0.0 to 0.8) by language (Bengali, Bulgarian, Chinese, Hindi, Japanese, Korean, Marathi, Russian, Spanish, Urdu); series: SAE Features (vs Native), SAE Features (vs English), Raw Neurons (vs Native), Raw Neurons (vs English).]

Figure 2: Jaccard similarity between Romanized and native-script or English language-associated units (raw neurons and SAE features) in Llama-3.2-1B (see Figure 8 for Gemma-2-2B). Romanized inputs exhibit low overlap with their native-script counterparts and near-zero overlap with English in both representations, indicating limited cross-script alignment without convergence to English.

Orthography Acts as a Barrier to Language Identity. If language-associated units encoded abstract linguistic identity, they would remain stable under changes in script. Instead, Figure 1 shows near-complete fragmentation under romanization for Hindi (similar observations are also made for other languages, as shown in Figures 10 and 9).
Across both raw neurons and SAE features, native-script Hindi, Romanized Hindi with diacritics, and its ASCII-only variant activate largely disjoint sets of language-associated units, even when allowing overlap across multiple languages. This fragmentation persists despite identical lexical content, indicating that language association in these models is strongly conditioned on orthographic form rather than abstract language identity.

Takeaway 1
Language-associated units are tightly bound to orthography. Even minimal script changes induce near-disjoint unit sets in both raw neurons and sparse features.

Romanization Induces an Isolated Latent Subspace. Figure 2 examines whether Romanized inputs align with native-script or English representations when considering all language-associated units. Across languages, overlap between Romanized and native-script representations remains consistently low (typically below 0.3) for both raw neurons and SAE features, with higher overlap only for Spanish, which already uses the Latin script. Crucially, overlap with English is near zero in all cases. Together, these results show that Romanization neither recovers native-script representations nor induces convergence toward English. Instead, Romanized text occupies a distinct, script-conditioned subspace that remains isolated even when considering shared language-associated units, effectively forming a third latent configuration that is neither native nor English.

[Figure 3 plot: Jaccard similarity (0.0 to 0.8) by layer (1 to 16); series: SAE Features, Raw Neurons.]

Figure 3: Layer-wise alignment between language-associated units for Native and Romanized inputs in Llama-3.2-1B (see Figure 14 for Gemma-2-2B). The red line denotes average Jaccard similarity for raw neurons, and the blue line for SAE features; shaded regions indicate standard deviation across languages. Raw neurons show a modest mid-layer increase in overlap, while SAE features remain uniformly low across depth. In all cases, alignment remains far from convergence, indicating that representational separation persists beyond input tokenization.

Takeaway 2
Romanization does not lead to Anglicization. Romanized inputs form a distinct, script-conditioned latent subspace, separate from both native-script and English representations.

Limited Intermediate Alignment and Persistent Separation. Figure 3 shows how language-associated units for Native and Romanized inputs align across layers in Llama-3.2-1B. While low overlap in early layers is expected due to disjoint token embeddings, this separation persists well beyond the input stage. Raw neurons exhibit a modest mid-layer increase in overlap, peaking around layer 9, but the alignment remains limited (Jaccard ≈ 0.3) and never approaches convergence. In contrast, SAE features show consistently low and flat overlap across all layers, indicating that sparse language-associated features remain strongly script-conditioned throughout the model. Together, these trends indicate that although dense activations briefly align surface-level statistics, the model ultimately maintains parallel, script-specific subspaces, revealing a limitation in abstraction rather than a trivial consequence of tokenization.
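The layer-wise overlap reported above is a Jaccard similarity between two sets of unit indices per layer. A minimal sketch of that computation is given below; the dictionary layout (layer number mapped to a set of unit indices), the function names, and the toy indices are assumptions made for illustration, and returning 0.0 for two empty sets is a convention chosen here rather than something the paper specifies.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of unit indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

def layerwise_jaccard(native_units, romanized_units):
    """Compute per-layer overlap between native-script and romanized unit sets."""
    layers = sorted(set(native_units) | set(romanized_units))
    return {
        layer: jaccard(native_units.get(layer, ()), romanized_units.get(layer, ()))
        for layer in layers
    }

# Toy unit indices (made up) for two layers.
native = {1: {3, 8, 21}, 2: {5, 13}}
romanized = {1: {8, 40, 55}, 2: {7}}
overlap = layerwise_jaccard(native, romanized)
```

With these toy sets, layer 1 shares one unit out of five in the union and layer 2 shares none, which mirrors the kind of low overlap discussed in the text.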
Implications for Model Capacity. The emergence of disjoint feature sets for native, Romanized, and even minor orthographic variants (e.g., diacritic vs. ASCII) points to a fundamental fragmentation of representational capacity. This aligns with recent observations that orthographic variations, such as the presence of diacritics, cause severe subword fragmentation and representational shifts in modern tokenizers and LMs (Inoue et al., 2026). We refer to this latent phenomenon as capacity fragmentation: the model allocates separate internal features to encode superficially different realizations of the same language. Even highly shared features fail to fully unify these variants, suggesting that many purportedly language-agnostic representations remain implicitly conditioned on script.

Scaling and Semantic Competence. Crucially, this representational fragmentation is not merely an artifact of data sparsity or limited model capacity. As we discuss in Section 8, this topological disjointness persists even in larger architectures (e.g., Llama-3-8B and Gemma-2-9B) that achieve higher semantic competence on romanized inputs.

5 Robustness of Language-Associated Features to Structural Perturbations

Section 4 illustrates that language-associated features are highly sensitive to script, with minor orthographic changes inducing substantial reorganization. We complement this with a perturbation that preserves surface form but disrupts structure by applying controlled word-level shuffling. Unlike romanization, shuffling preserves token identity, frequency, and script while breaking local word order, allowing us to test whether language-associated features depend on syntactic structure or primarily reflect token-level and distributional cues.
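The word-level shuffling perturbation can be sketched as a within-sentence permutation. The example below is an assumption-laden illustration: it splits on whitespace, which does not segment unspaced scripts such as Chinese, Japanese, or Thai, and the text does not state which word segmentation the authors used. The function name and the seeded generator are choices made here for reproducibility.

```python
import random

def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-delimited words randomly permuted."""
    words = sentence.split()
    rng.shuffle(words)  # in-place permutation; word identity and counts are preserved
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the perturbed corpus can be regenerated
original = "the model reads words in a fixed order"
shuffled = shuffle_words(original, rng)
```

Because only the order changes, the multiset of words (and hence word frequency and script) is identical before and after shuffling, which is the property the experiment relies on.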
Setup. For each language, we construct a shuffled version of the evaluation corpus by randomly permuting word order within sentences. Language-associated units are re-identified using the same SAE-LAPE procedure applied in earlier sections. Stability is measured via Jaccard similarity between the unit sets obtained from original and shuffled text. Additional experimental details and analyses are reported in Appendix E.

[Figure 4 plot: Jaccard similarity (0.0 to 1.0) by language (Japanese, Chinese, Thai, Korean, Hindi, Turkish, Bulgarian, Russian, Portuguese, Vietnamese, Italian, German, Spanish, English, French); series: SAE Features, Raw Neurons.]

Figure 4: Jaccard similarity between language-associated units identified from original and word-shuffled text in Llama-3.2-1B (see Figure 23 for Gemma-2-2B). Raw neurons exhibit consistently moderate-to-high overlap across languages, indicating robustness to word-order perturbation. In contrast, SAE features show only selective instability: languages with distinctive scripts (e.g., Chinese, Japanese, Thai) remain highly stable, whereas several Latin-script languages exhibit somewhat reduced overlap, revealing sensitivity of sparse features to local distributional patterns disrupted by shuffling for these languages.

Shuffling Reveals Selective Instability in Sparse Features. Figure 4 shows that many languages retain a substantial fraction of their language-associated units under shuffling, indicating limited dependence on word order. However, this robustness varies across languages and representations.
Languages with distinctive scripts, such as Chinese, Japanese, Thai, Korean, and the Cyrillic-script languages, remain highly stable, with overlap often exceeding 0.7, suggesting dominance of token identity and orthographic cues. In contrast, several Latin-script languages exhibit relative reductions in overlap specifically in SAE features, indicating sensitivity of a subset of sparse features to local distributional or sequence-level statistics disrupted by shuffling. This selective instability is largely absent in raw neurons, which maintain stable overlap across languages, highlighting that dense representations encode language information redundantly, while sparse decompositions expose heterogeneity that is otherwise masked.

Activation Statistics Remain Stable. Although shuffling alters feature identity for some languages, it induces negligible changes in activation entropy or probability. Both language-level means and full distributions remain nearly identical before and after shuffling, indicating that shuffling affects which features are selected rather than overall activation behavior (see Figures 20, 21 and 22 in Appendix E for full distributional analyses and language-level means for both Llama and Gemma).

Implications. In contrast to the fragmentation induced by script changes (Section 4), word-order disruption leaves most language-associated representations intact. The limited instability that does occur is selective, appearing mainly in sparse features for languages that share script and subword statistics, and not in raw neurons.
Robustness at Scale. Furthermore, as detailed in Section 8, this robustness to structural perturbation consistently holds across larger parameter scales (Llama-3-8B and Gemma-2-9B), reinforcing that language-associated units fundamentally prioritize surface form over syntactic structure regardless of model capacity.

Takeaway 3
Language-associated units are largely insensitive to word order, while sparse features expose limited, language-dependent reliance on local distributional cues.

6 Typological Structure Revealed by Probing

Sections 4 and 5 show that language-associated units are strongly shaped by surface form: script changes induce near-complete reorganization, while word-order perturbations leave many units intact. We now ask whether, despite this surface sensitivity, model representations encode deeper linguistic structure in a linearly accessible form. Specifically, we use probing to characterize where typological information is concentrated and when it emerges across model depth.

Setup. We probe both raw MLP activations and SAE-based representations against typological features from lang2vec (Littell et al., 2017). For each layer, linear probes are trained with cross-validation over languages, and performance is summarized using the average of family-wise maximum R² scores. We report results across different neuron subsets induced by romanization and shuffling (e.g., condition-specific vs. overlap sets). Full probing details are provided in Appendix F.

Typological Structure Aligns with Invariance to Script. Figure 5 shows probing results across neuron subsets, in Llama-3.2-1B, induced by romanization (see Figure 15 for SAE features in Llama, and Figures 16 and 17 for Gemma-2-2B). Across both raw neurons and SAE features, a consistent pattern emerges: neurons preserved across native and romanized inputs exhibit the strongest typological alignment.
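The condition-specific and overlap subsets used in the probing setup can be pictured as a simple partition of two unit sets. The sketch below is illustrative only: the function name, the dictionary keys, and the toy indices are assumptions, and the pooled union is included merely as one plausible non-selective reference set.

```python
def partition_units(native, romanized):
    """Split two unit sets into condition-specific, overlap, and pooled subsets."""
    native, romanized = set(native), set(romanized)
    return {
        "native_only": native - romanized,      # selected only for native script
        "overlap": native & romanized,          # preserved across both scripts
        "romanized_only": romanized - native,   # selected only for romanized text
        "pooled": native | romanized,           # union, usable as a reference set
    }

# Toy unit indices (made up).
subsets = partition_units({1, 2, 3, 4}, {3, 4, 5})
```

Each subset would then be probed separately, so that differences in probe performance can be attributed to the kind of unit rather than to the overall pool.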
Overlap subsets dominate across genealogical, syntactic, and phonological families, while script-specific subsets (native-only or romanized-only) encode substantially weaker typological signal. This directly connects to Section 4: the same units that are invariant to orthographic change are those that preferentially encode deeper linguistic structure. Together, these results indicate that typological abstraction is not tied to language-specific or script-specific units, but instead concentrates in representations that are robust to script variation.

[Figure 5: bar chart of average R² (y-axis, 0.0 to 0.6) for Phonology, Syntax, and Genealogy (Family) across the Only Native, Overlap, Only Romanized, and Baseline subsets.]
Figure 5: Average family-wise probing R² scores across neuron subsets induced by romanization in Llama-3.2-1B (raw neurons). Neurons overlapping between native and romanized inputs exhibit the strongest typological alignment, while script-specific subsets encode weaker signal. Baseline denotes probing over the pooled set of all neurons that were selected for either native or romanized inputs (across all layers), serving as a non-selective reference. Corresponding results for Llama-3.2-1B (SAE features) and for Gemma-2-2B using both raw neurons and SAE features are shown in Figures 15, 16, and 17 in Appendix D.3.

Typological Structure Does Not Prefer Order-Invariant Units. In contrast, probing under word-order shuffling reveals a qualitatively different pattern. Figure 6 shows that typological alignment is comparable across normal-only, shuffled-only, and overlap subsets.
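The probing summary statistic (the average of family-wise maximum R² scores) admits more than one reading. The sketch below shows one possible interpretation, stated here as an assumption rather than the authors' procedure: within each typological family, take the best R² over that family's features at each layer, then average those per-layer maxima. The record layout, feature names, and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def familywise_summary(scores):
    """Summarize probe scores per typological family.

    scores: iterable of (layer, family, feature, r2) records.
    Returns {family: mean over layers of the per-layer maximum R²}.
    This aggregation order is an assumption made for illustration.
    """
    best = defaultdict(dict)  # family -> {layer: best R² seen so far}
    for layer, family, _feature, r2 in scores:
        previous = best[family].get(layer, float("-inf"))
        best[family][layer] = max(previous, r2)
    return {family: mean(per_layer.values()) for family, per_layer in best.items()}


# Hypothetical probe results; feature names are placeholders.
scores = [
    (1, "syntax", "syntax_feat_a", 0.25),
    (1, "syntax", "syntax_feat_b", 0.125),
    (2, "syntax", "syntax_feat_a", 0.5),
    (2, "syntax", "syntax_feat_b", 0.75),
    (1, "phonology", "phon_feat_a", 0.125),
    (2, "phonology", "phon_feat_a", 0.375),
]

summary = familywise_summary(scores)
print(summary)
```

Taking the maximum before averaging rewards a family if any one of its features is decodable, which is why a pooled baseline set is a useful reference point when comparing subsets of different sizes.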
This holds for both raw and sparse representations, although overall scores are lower for SAE features. Unlike romanization, invariance to word order does not preferentially select typologically informative units. This observation aligns with Section 5: while shuffling leaves many language-associated units intact, this robustness does not correspond to a privileged locus of linguistic abstraction.

[Figure 6: bar chart of average R² (y-axis, 0.0 to 0.6) for Phonology, Syntax, and Genealogy (Family) across the Only Original, Overlap, Only Shuffled, and Baseline subsets.]
Figure 6: Average family-wise probing R² scores across neuron subsets induced by word-order shuffling in Llama-3.2-1B (raw neurons). Neurons specific to original text, shuffled text, and their overlap exhibit comparable typological alignment, indicating that sensitivity to word order is largely decoupled from typological information. Baseline denotes probing over the pooled set of all neurons selected for either condition, serving as a non-selective reference. Corresponding results for Llama-3.2-1B (SAE features) and for Gemma-2-2B using both raw neurons and SAE features are shown in Figures 24, 25, and 26 in Appendix E.2.

Depth-Dependent Emergence of Linguistic Abstraction. While invariance determines where typological information resides, model depth determines when it becomes accessible. We illustrate this hierarchy using SAE features, where typological trends are most interpretable; raw activations show the same qualitative pattern (Appendix F). Figure 7 shows that genealogical properties are
linearly decodable from early layers, whereas more abstract phonological features emerge only in the deepest layers. This hierarchy suggests that linguistic abstraction is constructed gradually with depth rather than encoded uniformly across the model.

[Figure 7: line plot of average R² (y-axis, 0.2 to 0.8) against layer (x-axis, 1 to 16) for Genealogy (Family), Phonology, and Syntax.]
Figure 7: Average probing R² scores across layers for SAE features in Llama-3.2-1B, grouped by typological family. Genealogical properties are accessible from early layers, while more abstract features such as phonology emerge mainly in deeper layers. Corresponding results for raw neurons and Gemma-2-2B show the same hierarchy (Figures 27, 29).

From Representational Accessibility to Functional Testing. Probing shows that typological information becomes increasingly linearly accessible in deeper layers, particularly in script-invariant representations. However, probing alone does not establish functional necessity. In Section 7, we therefore test whether units identified by their invariance properties play a causal role in generation. Furthermore, we synthesize these structural findings with downstream semantic competence in Section 8.

Takeaway 4
Typological structure emerges with depth and is strongest in script-invariant representations. Abstraction remains distributed across units.

7 Causal Roles of Script- and Structure-Invariant Units

Sections 4–6 show how language-associated units vary with script, word order, and typological structure. We now test whether these distinctions reflect functional necessity during generation by performing targeted causal interventions on neuron sets defined solely by their invariance properties.
Full experimental details, statistical tests, and qualitative analyses are provided in Appendix G.

Setup. All interventions are performed on raw MLP activations. While we focus our main-text exposition on Llama-3.2-1B, we concurrently validate all interventions on both Llama-3-8B and Gemma-2-2B to ensure causal effects hold across different architectures and scales. Neuron sets are defined by invariance to script or word-order perturbations (Sections 4, 5). For romanization-derived sets, we perform cross-language mean replacement; for shuffling-derived sets, we apply simultaneous zero ablation across all layers. Effects are compared against matched random controls using perplexity on FLORES+ dev examples. Statistical significance is assessed via paired t-tests; exact p-values are reported in Appendix Tables 5 and 6.

Script-Invariant Neurons Support Stable Generation Under Perturbation. Using neuron sets derived from the romanization analysis (Section 4), we perform cross-language mean ablations between Hindi and English (Table 1). Overlap neurons, which remain active across native and romanized scripts, exhibit only mild and asymmetric perplexity changes under cross-language replacement (perplexity ratios of 0.95 for English and 1.05 for Hindi); while statistically significant (p < 0.05; Table 6), these effects are small, indicating that these neurons occupy a largely script-invariant subspace. In contrast, only-native neurons show extreme sensitivity: replacing English-only-native activations with Hindi means causes severe degradation (ratio 1.50), while the reverse yields large apparent perplexity improvements (ratio 0.31). Qualitative inspection reveals that the latter corresponds to language switching rather than improved modeling, with generations collapsing into fluent English (Appendix G).
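The two interventions named in the setup, cross-language mean replacement and zero ablation, can be sketched on plain nested lists as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: activations are represented as a list of per-token rows for a single layer, units are column indices, and the function names and toy values are hypothetical. In practice such edits would be applied to MLP activations inside the model during the forward pass.

```python
from statistics import mean


def mean_replace(acts, unit_ids, donor_acts):
    """Overwrite selected units with their mean activation on donor-language inputs.

    acts, donor_acts: lists of per-token activation rows (one layer).
    unit_ids: column indices of the units to replace.
    Returns a new list; the inputs are left unchanged.
    """
    donor_mean = {u: mean(row[u] for row in donor_acts) for u in unit_ids}
    return [[donor_mean.get(j, a) for j, a in enumerate(row)] for row in acts]


def zero_ablate(acts, unit_ids):
    """Set the selected units to zero at every token position."""
    ids = set(unit_ids)
    return [[0.0 if j in ids else a for j, a in enumerate(row)] for row in acts]


# Toy activations: 2 tokens x 3 units per language (hypothetical values).
hindi = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
english = [[0.0, 4.0, 8.0], [2.0, 6.0, 8.0]]

replaced = mean_replace(hindi, [0, 2], english)  # units 0 and 2 take English means
ablated = zero_ablate(hindi, [1])                # unit 1 is silenced
print(replaced)
print(ablated)
```

Mean replacement keeps the selected units active at a plausible magnitude while removing their language-specific variation, whereas zero ablation removes their contribution entirely; the two therefore answer slightly different causal questions.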
Crucially, these effects generalize: in both Llama-3-8B and Gemma-2-2B (Appendix Tables 6 and 8), ablating only-native Hindi neurons causes dramatic perplexity changes (e.g., PPL_target ratio = 7.74 in Llama-3-8B), confirming extreme sensitivity. Together, these results causally validate Section 4, showing that script-specific neurons anchor surface realization and language identity, while script-invariant neurons support stable generation under orthographic perturbation.

Language   Neuron set     PPL_target ratio   PPL_random ratio
English    overlap        0.95               0.99
English    only-native    1.50               0.96
Hindi      overlap        1.05               0.98
Hindi      only-native    0.31               0.97

Table 1: Cross-language mean ablations for romanization-derived neuron sets in Llama-3.2-1B (see Table 8 for Gemma-2-2B). PPL_target ratio denotes perplexity relative to clean runs, and PPL_random ratio reports the same for matched random controls. All target effects are statistically significant (p < 0.05; Table 6). Ratios below 1 reflect language switching rather than improved modeling.

Word-Order-Invariant Neurons Support Core Language Modeling. We next examine neuron sets derived from the shuffling analysis in Section 5 using simultaneous zero ablation (Table 2). Across all languages, overlap neurons – those that remain active under word-order shuffling – cause substantially larger perplexity increases than matched random controls, with all effects statistically significant (p < 0.05; Table 5). In contrast, only-unshuffled neurons produce much weaker effects and often reduce perplexity, indicating that order-sensitive signals are largely redundant for generation.
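The perplexity ratios reported in the tables compare an intervened run against a clean run. A minimal sketch of that quantity is given below, assuming perplexity is the exponential of the mean negative token log-likelihood (natural log) and that the ratio is taken over the same text; how the paper aggregates across examples is not specified in this section, so the helper names and toy log-probabilities are illustrative only.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_ratio(intervened_logprobs, clean_logprobs):
    """Perplexity after an intervention divided by clean-run perplexity.

    Values above 1 mean the intervention made the text harder to predict.
    """
    return perplexity(intervened_logprobs) / perplexity(clean_logprobs)


# Hypothetical per-token log-probabilities for one sentence.
clean = [-1.0, -2.0, -3.0]
target_ablated = [-2.0, -3.0, -4.0]   # after ablating the neuron set of interest
random_ablated = [-1.0, -2.0, -3.0]   # after ablating a matched random set

print(ppl_ratio(target_ablated, clean))
print(ppl_ratio(random_ablated, clean))
```

Reporting the matched random ratio alongside the target ratio separates the effect of removing these particular units from the effect of removing any comparable number of units.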
This causal dissociation mirrors the identification results in Section 5: neurons invariant to structural perturbation are functionally necessary for stable language modeling, while order-sensitive neurons encode auxiliary or brittle patterns. Qualitatively, only overlap-neuron ablations induce systematic failures such as within-word script mixing and abrupt language switching (Appendix Figure 32), further supporting their causal role. These causal dynamics are highly consistent across architectures, with both Llama-3-8B and Gemma-2-2B exhibiting similarly severe degradation and identical qualitative failure modes specifically when shuffling-invariant overlap neurons are ablated (see Appendix Tables 5 and 7).

Language   Neuron set        PPL_target ratio   PPL_random ratio
English    overlap           1.12               0.95
English    only-unshuffled   0.96               1.04
Hindi      overlap           2.79               1.06
Hindi      only-unshuffled   1.08               0.95

Table 2: Zero-ablation results for shuffling-derived neuron sets in Llama-3.2-1B (see Table 7 for Gemma-2-2B). PPL_target ratio reports perplexity relative to clean runs after ablating the specified neuron set, and PPL_random ratio reports the same for matched random controls. All overlap-neuron effects differ significantly from random controls (p < 0.05; Table 5).

Implications for Language Control and Abstraction. Across both romanization- and shuffling-based interventions, causal importance consistently tracks invariance to surface perturbations. Neurons that remain stable under script or word-order variation are more functionally necessary for generation, whereas surface-sensitive neurons primarily anchor realization.
While probing in Section 6 shows that typological structure becomes increasingly decodable with depth, our causal interventions do not isolate a small set of neurons whose manipulation selectively disrupts such structure. Instead, causal effects are associated with invariance properties, suggesting that language control in these models is mediated by robustness to surface variation rather than by a single, localized abstraction module.

Takeaway 5
Causal importance aligns with invariance to surface perturbations. Neurons stable under script or word-order variation are necessary for generation, while probing reflects representational structure rather than direct control.

8 Discussion and Scaling Analysis

Our results show that multilingual models do not converge to a fully abstract interlingua. Instead, representations are organized around surface-form cues, especially script, while deeper layers support abstraction without unifying script-conditioned subspaces.

Robustness at Scale and Semantic Competence. To ensure these findings are not limited by model capacity, we validated our core experiments on larger architectures (Llama-3-8B and Gemma-2-9B). As shown in Appendix H (Figures 34–37), representational fragmentation persists at scale: romanized inputs maintain low overlap with native-script counterparts and fail to converge toward English. Conversely, robustness to word-order shuffling remains consistently high across these larger models (cf. Figure 38, Appendix H). Furthermore, to rule out data sparsity (i.e., the models simply failing to comprehend romanized text), we evaluated translation performance.
As detailed in Appendix H (Table 9), larger models achieve substantial semantic competence on romanized inputs. This confirms that models possess the requisite knowledge, but internally process different scripts through disjoint subspaces as a persistent architectural trait rather than a training deficiency.

Implications for Cross-Lingual Transfer. The strong dependence on orthography suggests that cross-lingual transfer is more fragile than often assumed. Romanized inputs neither recover native-script representations nor align with English, even when considering shared language-associated units. Instead, they occupy distinct latent subspaces, helping explain why transliteration or script normalization alone yields limited gains without explicit adaptation or supervision.

Orthography, Control, and Robustness. Our findings offer an alternative explanation for prior observations that changing the language or script of a prompt can alter model behavior, including safety-related responses (Deng et al., 2024; Yong et al., 2023). If language-associated units are tightly coupled to orthography, script changes may route inputs through different internal subspaces, yielding divergent outputs. This suggests that some language-based control and jailbreak effects may stem from surface-form routing rather than semantic differences.

9 Conclusion

In this study, we show that language-associated units in multilingual LMs are primarily organized around surface-form cues, with script acting as a primary barrier. While typological structure becomes more accessible at deeper layers, our causal analyses reveal that stable generation depends mostly on units invariant to surface perturbations.
By validating these findings across model scales, we confirm that LMs process different scripts through disjoint latent spaces despite high semantic competence, showing that multilingual abstraction remains limited by orthography rather than forming a fully unified interlingua.

Limitations

While we validate our core representational and causal findings on models up to 9B parameters, evaluating these phenomena on massive-scale frontier models remains an important direction for future work, as models at much larger parameter scales may eventually exhibit different trade-offs between surface-form routing and abstraction. In addition, our analysis centers on feed-forward (MLP) activations and their sparse decompositions, and does not examine other architectural components such as attention heads or embedding layers. Finally, while our interventions assess the causal role of identified units at inference time, we do not study the training dynamics through which these representations emerge.

Ethical Considerations

This work analyzes internal representations of multilingual LMs using publicly available pretrained models and established linguistic resources. While we generate and release systematically perturbed (romanized and shuffled) versions of existing evaluation sets, we do not deploy systems in user-facing contexts or evaluate downstream social applications. Our findings highlight how script and surface-form variation can influence internal processing, with potential implications for robustness and safety generalization across languages.
We emphasize that our goal is interpretability and analysis rather than exploitation, and we do not propose methods for bypassing safeguards or inducing harmful behavior. Overall, this work aims to support safer and more transparent multilingual model development by clarifying how language-associated representations are organized internally.

Acknowledgements

Anwoy Chatterjee gratefully acknowledges the support of the Google PhD Fellowship. Tanmoy Chakraborty acknowledges the support of the Anusandhan National Research Foundation (Grant no: DST/INT/USA/NSF-DST/Tanmoy/P-2/2024) and Rajiv Khemani Young Faculty Chair Professorship in Artificial Intelligence. The authors acknowledge the support of the Google GCP Grant.

References

Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, and Alham Fikri Aji. 2025. Sparse autoencoders can capture language-specific concepts across diverse languages. Preprint, arXiv:2507.11230.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3319–3327. IEEE Computer Society.

A.C. Baugh and T. Cable. 2002. A History of the English Language. Prentice Hall.
Clay Beckner, Richard Blythe, Joan Bybee, Morten Christiansen, William Croft, Nick Ellis, John Holland, Jinyun Ke, Diane Larsen-Freeman, and Tom Schoenemann. 2009. Language is a complex adaptive system: Position paper. Language Learning - LANG LEARN, 59:1–26.

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.

Jannik Brinkmann, Chris Wendler, Christian Bartelt, and Aaron Mueller. 2025. Large language models share representations of latent grammatical concepts across typologically diverse languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6131–6150, Albuquerque, New Mexico. Association for Computational Linguistics.

Augusto Buchweitz, Svetlana V Shinkareva, Robert A Mason, Tom M Mitchell, and Marcel Adam Just. 2011. Identifying bilingual semantic neural representations across languages. Brain Lang, 120(3):282–289.

Albert Costa and Núria Sebastián-Gallés. 2014. How does the bilingual experience sculpt the brain? Nat Rev Neurosci, 15(5):336–345.

D. Crystal. 2003. English as a Global Language. Canto (Cambridge University Press). Cambridge University Press.

Boyi Deng, Yu Wan, Baosong Yang, Yidan Zhang, and Fuli Feng. 2025. Unveiling language-specific features in large language models via sparse autoencoders. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4563–4608, Vienna, Austria. Association for Computational Linguistics.
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2024. Multilingual jailbreak challenges in large language models. In The Twelfth International Conference on Learning Representations.

Nicholas Evans and Stephen C. Levinson. 2009. The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32(5):429–448.

Constanza Fierro, Negar Foroutan, Desmond Elliott, and Anders Søgaard. 2025. How do multilingual language models remember facts? In Findings of the Association for Computational Linguistics: ACL 2025, pages 16052–16106, Vienna, Austria. Association for Computational Linguistics.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, et al. 2024. The Llama 3 herd of models. Preprint, arXiv:2407.21783.

Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef Van Genabith, and Simon Ostermann. 2025. Language arithmetics: Towards systematic language neuron identification and manipulation. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics,
pages 2911–2937, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2024. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations.

Go Inoue, Bashar Alhafni, Nizar Habash, and Timothy Baldwin. 2026. Do diacritics matter? Evaluating the impact of Arabic diacritics on tokenization and LLM benchmarks. In Findings of the Association for Computational Linguistics: EACL 2026, pages 426–442, Rabat, Morocco. Association for Computational Linguistics.

Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, and Krister Lindén. 2019. Language and dialect identification of cuneiform texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 89–98, Ann Arbor, Michigan. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational
Linguistics, 5:339–351.

Takeshi Kojima, Itsuki Okimura, Yusuke Iwasawa, Hitomi Yanaka, and Yutaka Matsuo. 2024. On the multilingual ability of decoder-based pre-trained language models: Finding and controlling language-specific neurons. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6919–6971, Mexico City, Mexico. Association for Computational Linguistics.

Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan Firat. 2019. Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1565–1575, Hong Kong, China. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2020. On the language neutrality of pre-trained multilingual representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1663–1674, Online. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? Preprint, arXiv:1911.03310.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300, Miami, Florida, US. Association
for Computational Linguistics.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14, Valencia, Spain. Association for Computational Linguistics.

Meng Lu, Ruochen Zhang, Carsten Eickhoff, and Ellie Pavlick. 2025. Paths not taken: Understanding and mending the multilingual factual recall pipeline. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15077–15107, Suzhou, China. Association for Computational Linguistics.

Viorica Marian, Michael Spivey, and Joy Hirsch. 2003. Shared and separate systems in bilingual language processing: Converging evidence from eyetracking and brain imaging. Brain and Language, 86(1):70–82. Understanding Language.

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2025. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations.

Jérôme Michaud. 2024. A complex systems perspective on language evolution. In The Evolution of Language: Proceedings of the 15th International Conference (EVOLANG XV), The Evolution of Language Conferences, pages 374–382.
Michele Miozzo, Albert Costa, Mireia Hernández, and Brenda Rapp. 2010. Lexical processing in the bilingual brain: Evidence from grammatical/morphological deficits. Aphasiology, 24(2):262–287.

Ibraheem Muhammad Moosa, Mahmud Elahi Akhter, and Ashfia Binte Habib. 2023. Does transliteration help multilingual language modeling? In Findings of the Association for Computational Linguistics: EACL 2023, pages 670–685, Dubrovnik, Croatia. Association for Computational Linguistics.

Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot, and Djamé Seddah. 2021. When being unseen from mBERT is just the beginning: Handling new languages with multilingual language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 448–462, Online. Association for Computational Linguistics.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2024. Scaling neural machine translation to 200 languages. Nature, 630(8018):841–846.
Chunk 32 · 1,998 chars
Scaling neural machine translation to 200 languages. Nature, 630(8018):841–846.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, and Alham Fikri Aji. 2025. Unveiling the influence of amplifying language-specific neurons. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 919–968, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Taraka Rama, Lisa Beinborn, and Steffen Eger. 2020. Probing multilingual BERT for genetic and typological signals. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1214–1228, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Leah Estel Reppucci. 2017. Speaking denglish: Exploring the impact of denglish and anglicisms in german culture and identity.
Alan Saji, Jaavid Aktar Husain, Thanmay Jayakumar, Raj Dabre, Anoop Kunchukuttan, and Ratish Puduppully. 2025. RomanLens: The role of latent Romanization in multilinguality in LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 26410–26429, Vienna, Austria. Association for Computational Linguistics.
Lisa Schut, Yarin Gal, and Sebastian Farquhar. 2025. Do multilingual LLMs think in english? In ICLR 2025 Workshop on Building Trust in Language Models and Applications.
Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, and Xiangnan He. 2025. Route sparse autoencoder to interpret large language models. In Proceedings of
Chunk 33 · 1,996 chars
and Sebastian Farquhar. 2025. Do multilingual LLMs think in english? In ICLR 2025 Workshop on Building Trust in Language Models and Applications.
Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, and Xiangnan He. 2025. Route sparse autoencoder to interpret large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6801–6815, Suzhou, China. Association for Computational Linguistics.
Kenny Smith and Simon Kirby. 2008. Cultural evolution: implications for understanding the human language faculty and its evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1509):3591–3603.
Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Xin Zhao, Furu Wei, and Ji-Rong Wen. 2024. Language-specific neurons: The key to multilingual capabilities in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5701–5715, Bangkok, Thailand. Association for Computational Linguistics.
Gemma Team et al. 2024. Gemma 2: Improving open language models at a practical size. Preprint, arXiv:2408.00118.
The Unicode Consortium. 2024. ICU: International components for unicode. Version 78.1.
Sarah Grey Thomason and Terrence Kaufman. 1988. Language Contact, Creolization, and Genetic Linguistics, 1 edition. University of California Press.
Joseph C. Toscano, Lynn K. Perry, Kathryn L. Mueller, Allison F. Bean, Marcus E. Galle, and Larissa K. Samuelson. 2008. Language as shaped by the brain; the brain as shaped by development. Behavioral and Brain Sciences, 31(5):535–536.
Katharina A. T. T. Trinley, Toshiki Nakai, Tatiana Anikina, and Tanja Baeumel. 2025. What language(s) does aya-23 think in? how multilinguality affects internal language representations. In Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an
Chunk 34 · 1,994 chars
n Sciences, 31(5):535–536.
Katharina A. T. T. Trinley, Toshiki Nakai, Tatiana Anikina, and Tanja Baeumel. 2025. What language(s) does aya-23 think in? how multilinguality affects internal language representations. In Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models, pages 159–171, Varna, Bulgaria. INCOMA Ltd., Shoumen, BULGARIA.
R. Wardhaugh and J.M. Fuller. 2014. An Introduction to Sociolinguistics. Blackwell Textbooks in Linguistics. Wiley.
Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do llamas work in English? on the latent language of multilingual transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15366–15394, Bangkok, Thailand. Association for Computational Linguistics.
Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
Zheng Xin Yong, Cristina Menghini, and Stephen Bach. 2023. Low-resource languages jailbreak GPT-4. In Socially Responsible Language Modelling Research.

Appendix Contents

Below we provide an overview of the appendix. These sections are intended to support the core claims by providing methodological details and extended scaling results.

• Appendix A: Frequently Asked Questions (FAQs). Addresses common questions regarding representational fragmentation, scale, and causal control.
• Appendix B: Extended Related Work. Provides a detailed discussion of prior work on multilingual representations, language-associated units, sparse autoencoders, typology, and script effects.
• Appendix C: Identifying Language-Associated Units with LAPE and SAE-LAPE. Explains the identification frameworks and provides the hyperparameter thresholds used for both neurons and sparse features.
Chunk 35 · 1,993 chars
rk on multilingual representations, language-associated units, sparse autoencoders, typology, and script effects.
• Appendix C: Identifying Language-Associated Units with LAPE and SAE-LAPE. Explains the identification frameworks and provides the hyperparameter thresholds used for both neurons and sparse features.
• Appendix D: Script Perturbation Experiments (Romanization). Describes dataset construction, transliteration procedures, and distributional analyses supporting the script romanization findings.
• Appendix E: Structural Perturbation Experiments (Word Shuffling). Reports supplementary results on shuffled inputs, aggregate overlap analyses, and stability of activation statistics.
• Appendix F: Probing Typological Structure Across Layers. Details the ridge regression probing setup and provides comparative layer-wise informativeness across models.
• Appendix G: Causal Interventions on Invariant Neuron Sets. Provides simultaneous ablation protocols and qualitative failure modes showing script mixing during generation.
• Appendix H: Extended Scaling and Semantic Competence Analysis. Documents the persistence of representational fragmentation at the 8B and 9B scales and provides translation benchmarks ruling out data sparsity as an explanation.

A Frequently Asked Questions (FAQs)

1. Do language-associated units imply the existence of a universal interlingua? No. While language-associated units are clearly identifiable and can influence model behavior, our results show that they are predominantly sensitive to surface-form cues such as script and token distribution.
2. Is the observed script sensitivity simply an artifact of tokenization? Tokenization necessarily introduces distinct input embeddings across scripts, but our analysis goes beyond early-layer effects. We observe that alignment remains low even in intermediate layers, indicating that script sensitivity is not merely a tokenizer artifact but reflects
Chunk 36 · 1,996 chars
ifact of tokenization? Tokenization necessarily introduces distinct input embeddings across scripts, but our analysis goes beyond early-layer effects. We observe that alignment remains low even in intermediate layers, indicating that script sensitivity is not merely a tokenizer artifact but reflects persistent representational fragmentation within the model.
3. Why use 1B and 2B models for the main exposition? We center our primary exposition on Llama-3.2-1B and Gemma-2-2B to enable extensive, computationally intensive representational sweeps and causal interventions across many layers and languages. However, to ensure our findings are not artifacts of limited capacity, we explicitly validate our core experiments on larger models (Llama-3-8B and Gemma-2-9B), confirming these representational properties hold at scale.
4. Do these findings generalize to larger models? Yes. As detailed in our scaling analysis, we validate our findings on Llama-3-8B and Gemma-2-9B. We observe that representational fragmentation under script variation, as well as robustness under structural perturbation, persist at these larger scales. Crucially, this fragmentation remains even though these models exhibit strong semantic translation competence on romanized inputs, demonstrating that script-conditioned subspaces are a persistent architectural trait rather than a symptom of undertraining or data sparsity.
5. Does strong probing performance imply functional importance? No. Probing reveals that typological properties become increasingly linearly accessible in deeper layers, but causal interventions show that functional importance aligns with invariance to surface perturbations. This reinforces the view that linear decodability does not imply causal control.
6. Why analyze both raw neurons and SAE features? Raw neurons directly govern model behavior, while SAE features provide an interpretable decomposition of these activations. Analyzing
Chunk 37 · 1,992 chars
aligns with invariance to surface perturbations. This reinforces the view that linear decodability does not imply causal control.
6. Why analyze both raw neurons and SAE features? Raw neurons directly govern model behavior, while SAE features provide an interpretable decomposition of these activations. Analyzing both allows us to separate functional relevance from interpretability and avoid over-attributing abstract meaning to sparse features alone.
7. What is the main takeaway for interpreting language-associated neurons? Language-associated units exist and matter, but they primarily reflect surface-form processing rather than abstract language identity.

B Extended Related Work

B.1 Language-Associated Units and Multilingual Representations

Understanding how multilingual LMs encode language identity has become a central question in interpretability and cross-lingual modeling. Early multilingual neural machine translation (NMT) systems already suggested that jointly trained models do not form a fully language-agnostic interlingua, but instead organize representations in a partially shared space structured by language identity and similarity (Johnson et al., 2017; Kudugunta et al., 2019). Subsequent analyses showed that encoder representations cluster by genealogical and typological proximity, with high-resource languages occupying more stable regions of the latent space (Pires et al., 2019; Libovický et al., 2020).

More recently, investigation at the neuron level has provided evidence that language identity can be localized to specific internal units. Tang et al. (2024) introduced LAPE to identify neurons that preferentially activate for individual languages in multilingual LMs, showing that a small subset of neurons, often concentrated in early and late layers, exerts disproportionate control over language selection. Contemporary works also showed that targeted interventions on such neurons can reliably steer output language, even without
Chunk 38 · 1,996 chars
ivate for individual languages in multilingual LMs, showing that a small subset of neurons, often concentrated in early and late layers, exerts disproportionate control over language selection. Contemporary works also showed that targeted interventions on such neurons can reliably steer output language, even without modifying input prompts (Kojima et al., 2024; Gurgurov et al., 2025; Rahmanisa et al., 2025). These observations establish that language control is not purely emergent at the output layer but is mediated by identifiable internal mechanisms.

Earlier representational studies, however, caution against interpreting such units as encoding abstract language identity (Wu and Dredze, 2020; Libovický et al., 2019). Analyses of multilingual NMT and representation spaces show substantial mixing across languages, particularly in middle layers, with language separation re-emerging closer to the output where lexical constraints dominate (Kudugunta et al., 2019). This layered organization parallels findings in bilingual cognition, where shared semantic representations coexist with partially segregated lexical and orthographic processing streams (Marian et al., 2003; Costa and Sebastián-Gallés, 2014).

Our work builds on this literature but departs in emphasis. Rather than asking whether language-associated units exist, we ask what linguistic properties they encode. Specifically, we test whether such units reflect abstract language identity or are instead driven by surface-form cues such as script and token distributions, a distinction that remains underexplored in prior neuron-level studies. While our primary exposition focuses on highly capable 1B and 2B parameter auto-regressive models to enable computationally intensive feature sweeps, we explicitly validate our core findings on larger architectures (up to 9B parameters) to ensure our conclusions regarding orthography and abstraction hold at scale.

B.2 Sparse Autoencoders
Chunk 39 · 1,995 chars
position focuses on highly capable 1B and 2B parameter auto-regressive models to enable computationally intensive feature sweeps, we explicitly validate our core findings on larger architectures (up to 9B parameters) to ensure our conclusions regarding orthography and abstraction hold at scale.

B.2 Sparse Autoencoders and Feature-Level Interpretability

SAEs have recently emerged as a promising tool for disentangling dense transformer activations into more interpretable, monosemantic latent features. The central idea – that sparsity can separate overlapping signals into distinct dimensions – has strong precedents in vision, where network dissection methods link individual units to human-interpretable concepts (Bau et al., 2017). In language models, sparse methods have been shown to isolate features corresponding to factual recall, formatting, or syntactic regularities that are difficult to identify in dense representations (Huben et al., 2024; Marks et al., 2025).

Several recent works extend SAEs to large language models at scale. For instance, Shi et al. (2025) recently proposed RouteSAE, which introduces routing mechanisms that propagate sparse features across layers, improving interpretability while maintaining model performance. Open-source SAE frameworks further demonstrate that sparse latents can support causal interventions and analyses in modern transformer models (Lieberum et al., 2024). In multilingual settings, Andrylie et al. (2025) and Deng et al. (2025) show that SAE features can align with semantic concepts across languages, motivating the use of sparse representations for cross-lingual interpretability.

Our work leverages this progress but reframes the goal. We first identify language-associated sparse features and raw model neurons using SAE-LAPE (Andrylie et al., 2025) and LAPE (Tang et al., 2024), respectively. We then systematically analyze their sensitivity to script, word order, and typological structure. Unlike
Chunk 40 · 1,992 chars
ges this progress but reframes the goal. We first identify language-associated sparse features and raw model neurons using SAE-LAPE (Andrylie et al., 2025) and LAPE (Tang et al., 2024), respectively. We then systematically analyze their sensitivity to script, word order, and typological structure. Unlike prior studies that focus primarily on semantic or task-level concepts, we center our analysis on linguistic abstraction, explicitly separating representational alignment (as revealed by probing) from functional necessity (as tested via causal intervention), echoing critiques of probing as a standalone interpretability tool (Hewitt and Liang, 2019; Belinkov, 2022).

B.3 Typology, Script, and Romanization Effects

Linguistic typology has long been used to study cross-lingual similarity and transfer in multilingual models. The URIEL and lang2vec framework provides structured vectors encoding genealogical, geographical, phonological, and syntactic properties for various languages (Littell et al., 2017). Subsequent work shows that typological information becomes increasingly linearly accessible in deeper layers of multilingual transformers, suggesting a gradual emergence of abstraction (Rama et al., 2020).

Orthography and script introduce an additional, often confounding, dimension. Prior work in multilingual language identification shows that script cues dominate early decisions, and that romanized or transliterated text can significantly degrade performance when script information is not explicitly modeled (Jauhiainen et al., 2019). In representation learning, transliteration and script normalization have been shown to alter clustering structure in multilingual embedding spaces, sometimes improving transfer but often creating mismatches between surface form and linguistic identity (Artetxe et al., 2020; Moosa et al., 2023).

Recent interpretability studies suggest that these effects extend to internal model mechanisms. Analyses of
Chunk 41 · 1,982 chars
lter clustering structure in multilingual embedding spaces, sometimes improving transfer but often creating mismatches between surface form and linguistic identity (Artetxe et al., 2020; Moosa et al., 2023).

Recent interpretability studies suggest that these effects extend to internal model mechanisms. Analyses of bilingual and multilingual models show that changing script can reroute activations through different internal pathways, even when lexical content is preserved (Saji et al., 2025; Trinley et al., 2025; Muller et al., 2021; Lu et al., 2025). Our work builds on these observations by systematically comparing native-script and romanized inputs under a unified neuron- and feature-identification framework, revealing that script changes induce near-complete reorganization of language-associated units. Importantly, we show that this fragmentation persists even in deeper layers where typological information is linearly decodable, indicating that abstraction and control are distributed across parallel, script-bound subspaces rather than unified into a single interlingua.

C Identifying Language-Associated Units with LAPE and SAE-LAPE

This appendix summarizes the methods used to identify language-associated units in our analysis.

C.1 LAPE for Raw Neurons

Language Activation Probability Entropy (LAPE) quantifies how selectively an individual neuron responds to different languages. Given a multilingual corpus, for each neuron j at layer ℓ and language k, we compute the activation probability

$$P^{(\ell)}_{j,k} = \mathbb{E}\left[\, \mathbb{I}\big(a^{(\ell)}_{j} > 0\big) \;\middle|\; \text{language } k \,\right],$$

where $a^{(\ell)}_{j}$ denotes the neuron activation and $\mathbb{I}(\cdot)$ is the indicator function. The vector of activation probabilities across languages is ℓ1-normalized to form a distribution, and its entropy is computed as

$$\mathrm{LAPE}^{(\ell)}_{j} = -\sum_{k} P'^{(\ell)}_{j,k} \log P'^{(\ell)}_{j,k}.$$

Low entropy indicates that a neuron activates predominantly for a small subset of languages. Neurons with sufficiently low entropy and a
Chunk 42 · 1,999 chars
ion. The vector of activation probabilities across languages is ℓ1-normalized to form a distribution, and its entropy is computed as

$$\mathrm{LAPE}^{(\ell)}_{j} = -\sum_{k} P'^{(\ell)}_{j,k} \log P'^{(\ell)}_{j,k}.$$

Low entropy indicates that a neuron activates predominantly for a small subset of languages. Neurons with sufficiently low entropy and a dominant language are identified as language-associated.

C.2 SAE-LAPE for Sparse Features

SAE-LAPE extends the LAPE criterion to sparse latent features obtained from Sparse Autoencoders (SAEs). SAEs are trained on feed-forward (MLP) activations to decompose dense representations into a sparse set of latent features. Each SAE feature is treated analogously to a neuron: we compute its activation probability per language based on whether the feature is active for a given token. The same entropy-based criterion is then applied to identify language-associated sparse features.

To ensure robustness, we restrict attention to features that are active for a non-trivial fraction of tokens and examples within at least one language. This enables language association analysis at the level of sparse, interpretable features rather than individual neurons.

C.3 Hyperparameters and Implementation Details

All LAPE and SAE-LAPE analyses share a common entropy-based framework for measuring language selectivity, differing primarily in their filtering criteria and membership assignment rules.

Activation Statistics. For both methods, activation probabilities are computed over a multilingual corpus by aggregating token-level activations within each language. A unit (raw neuron or SAE latent) is considered active for a token if its activation exceeds zero. Activation probabilities are normalized across languages prior to entropy computation.

SAE-LAPE Hyperparameters. SAE-LAPE operates on sparse latent features extracted from Sparse Autoencoders trained on MLP activations. To exclude noisy or overly idiosyncratic features, we apply two pre-selection thresholds:
Chunk 43 · 1,991 chars
zero. Activation probabilities are normalized across languages prior to entropy computation.

SAE-LAPE Hyperparameters. SAE-LAPE operates on sparse latent features extracted from Sparse Autoencoders trained on MLP activations. To exclude noisy or overly idiosyncratic features, we apply two pre-selection thresholds: (i) an example rate of 0.98, requiring a latent to be active in at least 98% of examples within at least one language, and (ii) a high-frequency latent (HFL) rate of 0.1, requiring activation on at least 10% of tokens in that language. Latents failing either criterion have their entropy set to infinity and are excluded from selection.

Language membership for SAE latents is determined using a relative top-k criterion. A latent f is considered present in language l if its activation probability satisfies

$$P(f \mid l) \ge 0.8 \times \max_{l' \in L} P(f \mid l'),$$

where the threshold ratio of 0.8 is fixed across all experiments. This relative criterion allows features to be shared across a small number of languages when desired. Depending on the configuration, we further restrict selection to latents that are either unique to a single language (lang_specific) or shared by an exact number of languages (lang_shared).

A methodological adaptation was required for Gemma models. Because the original SAE-LAPE implementation was designed for cardinally constrained Top-K SAEs (as used for Llama), we introduced an additional filtering step for Gemma's JumpReLU SAEs by restricting the analysis to the top-200 active latents by activation magnitude per token. While this introduces minor variance, the macro-level representational trends remain highly consistent across both SAE architectures.

LAPE Hyperparameters for Raw Neurons. For raw model neurons, which are typically denser and more polysemantic, we adopt a more conservative, percentile-based filtering strategy. We compute the 95th percentile of activation probabilities
Chunk 44 · 1,998 chars
ent across both SAE architectures.

LAPE Hyperparameters for Raw Neurons. For raw model neurons, which are typically denser and more polysemantic, we adopt a more conservative, percentile-based filtering strategy. We compute the 95th percentile of activation probabilities across all neurons and languages, and discard neurons whose activation probability never exceeds this threshold in any language. Among the remaining candidates, we select the lowest-entropy neurons corresponding to the top 1% most language-selective units.

Language assignment for these neurons uses an absolute activation criterion: a neuron is attributed to language l if its activation probability exceeds the same 95th-percentile threshold. This approach emphasizes globally salient, language-skewed neurons rather than fine-grained feature sharing. All models share the same setup.

Parameter | Value
Activation indicator | Latent z > 0
Aggregation level | Token + example
Minimum example rate | 0.98
Minimum HFL rate | 0.10
Top-k threshold ratio | 0.80
Entropy for invalid features | ∞
Llama Top-K SAE: k-value | 32
Gemma JumpReLU SAE: enforced Top-K | 200

Table 3: Hyperparameters used for SAE-LAPE identification of language-associated sparse latent features. Rates are computed per layer over the multilingual corpus.

Parameter | Value
Activation indicator | a > 0
Aggregation level | Token-level
Activation percentile (filter rate) | 95th percentile
Entropy selection fraction | Lowest 1% neurons
Language assignment threshold | 95th percentile (global)
Inactive neuron handling | Discarded

Table 4: Hyperparameters used for LAPE-based identification of language-associated raw neurons. Activation percentiles are computed globally across all neurons and languages.

Outputs. Both methods export the identified units together with their assigned language(s), activation probabilities, and entropy values.

Overall, SAE-LAPE prioritizes consistent, interpretable sparsified features with controlled cross-lingual sharing, while
Chunk 45 · 1,999 chars
her than fine-grained feature sharing. All models share the same setup.

Outputs. Both methods export the identified units together with their assigned language(s), activation probabilities, and entropy values.

Overall, SAE-LAPE prioritizes consistent, interpretable sparsified features with controlled cross-lingual sharing, while LAPE for raw neurons focuses on identifying the most strongly language-skewed units in dense representations. Table 3 and Table 4 summarize the thresholds and hyperparameters used for SAE-LAPE and LAPE respectively.

C.4 Usage in This Work

In this paper, LAPE and SAE-LAPE are used strictly as identification tools for selecting language-associated neurons and sparse features. All subsequent analyses – including romanization, shuffling, probing, and causal interventions – are conducted on these identified units. We do not assume that low entropy alone implies abstract linguistic control or causal importance.

D Script Perturbation Experiments (Romanization)

D.1 Experimental Setup

Datasets. We use the dev split of FLORES+, which provides sentence-aligned multilingual data across typologically diverse languages. For South Asian languages, we additionally consider the Dakshina dataset to assess the effects of context-aware romanization, noting that these corpora are not sentence-aligned and are therefore used only for supplementary analysis.

Language Selection. Our experiments cover Hindi (hi), Marathi (mr), Bengali (bn), Urdu (ur), Russian (ru), Bulgarian (bg), Japanese (ja), Chinese (zh), Korean (ko), English (en), and Spanish (es), spanning multiple writing systems and including both closely related and typologically distant language pairs.

Romanization Procedure and Diacritics. Romanized text is generated using the ICU Transliterator. For applicable languages, we construct both diacritic-preserving and ASCII-only variants by removing diacritics via Unicode normalization, enabling controlled analysis of sub-phonemic orthographic cues.
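The diacritic-removal step described above can be approximated with Unicode normalization alone. The following is a minimal sketch using Python's standard `unicodedata` module; it assumes the input has already been romanized (for example, by ICU), and it is not the authors' pipeline. The function name `strip_diacritics` and the sample strings are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks from already-romanized text.

    Hypothetical helper: decompose to NFD so that base letters and
    combining marks become separate code points, remove the marks
    (general category "Mn"), then recompose to NFC.
    """
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)


if __name__ == "__main__":
    # Illustrative romanized sample with macrons.
    sample = "hindī duniyā"
    print(strip_diacritics(sample))
```

Letters that have no canonical decomposition (such as "ø" or "ł") pass through unchanged, so this approach alone does not guarantee ASCII-only output; a transliteration table or an additional filter would be needed for those cases.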
Chunk 46 · 1,998 chars
uage pairs.

Romanization Procedure and Diacritics. Romanized text is generated using the ICU Transliterator. For applicable languages, we construct both diacritic-preserving and ASCII-only variants by removing diacritics via Unicode normalization, enabling controlled analysis of sub-phonemic orthographic cues. The full pipeline is run three times: (a) on the native-script datasets, (b) on the romanized datasets with diacritics, and (c) on the romanized datasets without diacritics. The resulting neuron sets are then compared.

Metrics. Overlap between language-associated feature sets is quantified using Jaccard similarity. We additionally compute cross-language overlaps to assess whether romanization induces increased sharing with English or other Latin-script languages.

[Figure 8 plot. x-axis: Bengali, Bulgarian, Chinese, Hindi, Japanese, Korean, Marathi, Russian, Spanish, Urdu; y-axis: Jaccard Similarity (ticks 0.0, 0.5); legend: SAE Features (vs Native), SAE Features (vs English), Raw Neurons (vs Native), Raw Neurons (vs English).]

Figure 8: Jaccard similarity between language-associated units identified from Romanized inputs and those from Native-script or English inputs in Gemma-2-2B. Results are shown for both raw neurons and SAE features. Romanized inputs exhibit low overlap with their native-script counterparts and near-zero overlap with English in both representations, indicating limited cross-script alignment without convergence to English.

D.2 Supplementary Romanization Analysis Across Models and Representations

This appendix extends Section 4 by documenting the full set of romanization diagnostics across all evaluated model and representation configurations. While the main text focuses on feature identity and overlap, here we examine (i) aggregate neuron-sharing structure across all languages, and (ii) distributional effects of romanization on activation behavior of language-specific neurons.

Aggregate Neuron Sharing Under Orthographic Variation. We begin by reporting aggregate Venn diagrams
Chunk 47 · 1,993 chars
uses on feature identity and overlap, here we examine (i) aggregate neuron-sharing structure across all languages, and (ii) distributional effects of romanization on activation behavior of language-specific neurons.

Aggregate Neuron Sharing Under Orthographic Variation. We begin by reporting aggregate Venn diagrams computed jointly over all languages, restricted to language-specific neurons identified independently per input condition. For each configuration, we plot Venn diagrams for units shared among at most three languages, comparing native-script inputs, romanized inputs with diacritics, and romanized inputs without diacritics. This representation captures all low-order sharing behavior, ensuring a fair balance between specificity and coverage.

Figures 9 and 10 summarize these results across Gemma and Llama, under both raw MLP and SAE representations. Across all available configurations, aggregate overlap between native and romanized variants remains low. Overlap between the two romanized variants is slightly higher but remains limited, indicating that even minor orthographic perturbations such as diacritic removal induce substantial reassignment of language-specific neurons. Figure 8 shows these trends for Gemma, consistent with the trends for Llama from the main text. These aggregate results confirm that the effects reported per-language in the main text persist at the multilingual level.

Distributional Effects of Romanization. Overlap-based analyses describe neuron reuse, but do not capture how retained neurons behave. We therefore analyze activation statistics under native versus romanized inputs for both (i) the complete sets of neurons active in each condition, and (ii) the subset of neurons overlapping between native and romanized representations.

Figures 11 and 12 report these distributions for Gemma and Llama across raw and SAE representations. Across all available configurations, romanization induces clear distributional
Chunk 48 · 1,979 chars
Figures 11 and 12 report these distributions for Gemma and Llama across raw and SAE representations. Across all available configurations, romanization induces clear distributional shifts in both activation probability and entropy. These shifts are observed both when considering complete neuron sets and when restricting to overlapping neurons, indicating that the effects are not solely driven by changes in neuron identity. Moreover, the shifts are substantially larger than those observed under shuffling baselines, suggesting structured changes in activation dynamics rather than random variance.

Representation-Specific Distributional Trends. For raw MLP representations, romanization consistently shifts activation probability mass toward higher values while reducing entropy, indicating more concentrated and decisive neuron firing. This effect is pronounced for Gemma, whereas for Llama the entropy reduction is comparatively mild, despite similar probability shifts.

For SAE representations, distributional shifts are again substantial, but the directionality is less consistent across configurations. In particular, both entropy and activation probability may increase or decrease depending on the setup. However, the overall magnitude of these shifts is larger for Gemma than for Llama, suggesting that sparse representations in Gemma are more sensitive to orthographic perturbations.

Stability of Mean Activation Statistics. Finally, we report mean activation statistics averaged across languages and neurons.
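The two statistics compared in this analysis, activation probability and entropy, can be sketched as below. This assumes the common LAPE formulation (a unit counts as active when its activation is positive, per-language probabilities are normalized across languages, and the entropy of that distribution is taken); the exact thresholds and normalization used in the paper are not stated in this appendix, so the function names and details here are assumptions.

```python
import numpy as np

def activation_probability(acts):
    """acts: (tokens, units) array -> fraction of tokens with activation > 0."""
    return (np.asarray(acts) > 0).mean(axis=0)

def lape_entropy(prob_by_language, eps=1e-12):
    """prob_by_language: (languages, units) activation probabilities.

    Normalizes each unit's probabilities across languages and returns the
    entropy (natural log) of that distribution, one value per unit.
    """
    p = np.asarray(prob_by_language, dtype=float)
    p = p / (p.sum(axis=0, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=0)

# Toy example: unit 0 fires equally for both languages, unit 1 for only one.
probs = np.array([[0.5, 0.9],
                  [0.5, 0.0]])
entropy = lape_entropy(probs)  # high for unit 0, near zero for unit 1
```

Low entropy marks a unit whose activity is concentrated on few languages, so a shift in the entropy distribution under romanization reflects a change in how selectively units fire.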
Chunk 49 · 1,988 chars
Despite strong neuron-level redistribution and distributional shifts, mean activation values remain largely stable across native and romanized inputs, indicating that romanization reallocates activation mass without substantially altering global magnitude. Figure 13 summarizes these values for the raw activations.

Summary. Together with Section 4, these results show that orthographic variation affects both the allocation and dynamics of language-specific neurons. Degree-3 analyses confirm that low-order sharing remains limited even when allowing pairwise reuse, while distributional statistics reveal structured activation shifts under romanization that are not captured by identity-based overlap alone.
Chunk 50 · 1,992 chars
Figure 9: Aggregate degree-3 Venn diagrams of language-specific neurons under orthographic variation for Gemma-2-2B. Degree-3 denotes the union of neurons shared by up to three languages. Panels correspond to raw MLP and SAE representations under diacritics-preserving and diacritics-removed romanization. [Per-language diagrams for Bengali, Bulgarian, Chinese, Hindi, Japanese, Korean, Marathi, Russian, Spanish, and Urdu, each comparing Native, Romanized (with diacritics), and Romanized (without diacritics) sets.]

D.3 Probing–Romanization Interaction: Typological Alignment of Neuron Subsets

This subsection analyzes how typological structure, as measured by lang2vec probing, distributes across neuron subsets induced by romanization. While earlier sections establish that romanization reorganizes language-specific features, here we ask whether this reorganization correlates with the degree to which neurons encode linguistic typology.
Chunk 51 · 1,972 chars
Setup. For each layer, model, and representation (raw MLP or SAE), neurons are grouped into four subsets based on their activity under native and romanized inputs: three disjoint subsets, namely (i) native-only neurons, (ii) romanized-only neurons, and (iii) overlap neurons active under both conditions, together with (iv) a baseline consisting of all neurons in the layer. For each subset, we compute the average family-wise maximum probing $R^2$ score across neurons for the three typological feature families used in the final analysis: fam, syntax, and phonology. All plots in this section report these averages using the specific_mean metric.

[Figure 10 image: per-language Venn diagrams of Native, Romanized (with diacritics), and Romanized (without diacritics) neuron sets for Llama-3.2-1B.]
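The subset construction in the Setup paragraph can be sketched as follows. This is a minimal illustration, assuming per-unit $R^2$ scores are already available as a (units × features) array for one feature family; the helper names and toy values are hypothetical, and the sketch does not attempt to reproduce the paper's `specific_mean` aggregation.

```python
import numpy as np

def partition_units(native_units, romanized_units, n_units):
    """Split units into native-only, romanized-only, overlap, and baseline."""
    native, roman = set(native_units), set(romanized_units)
    return {
        "native_only": sorted(native - roman),
        "romanized_only": sorted(roman - native),
        "overlap": sorted(native & roman),
        "baseline": list(range(n_units)),  # every unit in the layer
    }

def family_max(r2):
    """r2: (units, features in one family) -> best R^2 per unit."""
    return np.asarray(r2, dtype=float).max(axis=1)

def subset_means(per_unit_score, subsets):
    """Average a per-unit score over each subset (NaN for empty subsets)."""
    score = np.asarray(per_unit_score, dtype=float)
    return {name: float(score[idx].mean()) if idx else float("nan")
            for name, idx in subsets.items()}

# Toy, made-up scores for 6 units and a 2-feature family.
r2 = [[0.1, 0.0], [0.2, 0.3], [0.8, 0.5], [0.2, 0.1], [0.0, 0.0], [0.4, 0.1]]
subsets = partition_units({0, 1, 2}, {2, 3}, n_units=6)
means = subset_means(family_max(r2), subsets)
```

Note that the baseline overlaps the other three subsets by construction, so only the first three groups form a partition of the active units.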
Chunk 52 · 1,991 chars
Figure 10: Aggregate degree-3 Venn diagrams of language-specific neurons under orthographic variation for Llama-3.2-1B. Degree-3 denotes the union of neurons shared by up to three languages. Panels correspond to raw MLP and SAE representations under diacritics-preserving and diacritics-removed romanization.

Consistency Across Models and Representations. Across all model and representation configurations, the qualitative behavior of these curves is remarkably consistent. Baseline probing values are generally lower than those obtained from more selective neuron subsets. An exception arises for Gemma, where neurons active only for native inputs sometimes fall below the baseline.
Chunk 53 · 1,960 chars
In the Gemma raw setting, probing values are comparatively similar across subsets, indicating weaker separation between neuron groups.

Overlap Neurons Encode Stronger Typological Structure. The most robust result is that the overlap subset consistently exhibits substantially higher probing $R^2$ scores than all other subsets. This pattern holds across all models, representations, feature families, and romanization conditions. Neurons that remain active across both native and romanized inputs are therefore not only orthography-invariant, but also more strongly aligned with linguistic typology than neurons that respond selectively to a single script variant.

Model- and Representation-Level Effects. Consistent with prior probing analyses, Gemma achieves higher absolute probing scores than Llama across all neuron subsets. Within Llama, SAE representations exhibit markedly lower $R^2$ values than raw MLP activations, often by a large margin. Crucially, however, the dominance of the overlap subset persists even in these lower-signal regimes, indicating that the relationship between orthographic stability and typological alignment is robust to overall representational strength.
Chunk 54 · 1,996 chars
Figure 11: Activation probability and entropy distributions for language-specific neurons under native vs. romanized inputs (Gemma-2-2B). Top: raw MLP; Bottom: SAE. Panel means (native → romanized): raw MLP: entropy (all) 1.81 → 1.76, activation probability (all) 0.80 → 0.86, entropy (overlapping) 1.63 → 1.50, activation probability (overlapping) 0.82 → 0.86; SAE: entropy (all) 1.66 → 1.99, activation probability (all) 0.30 → 0.22, entropy (overlapping) 2.14 → 1.97, activation probability (overlapping) 0.19 → 0.22.

Preservation of Typological Hierarchy. Across all neuron subsets and configurations, the relative ordering of feature families remains unchanged: fam > syntax > phonology. Romanization-induced partitioning thus modulates the magnitude of typological alignment, but not its hierarchical structure.

Representative Results. Figures 5 and 15–17 show representative results for Llama and Gemma under both raw and SAE representations with diacritics-preserving romanization. Analogous trends are observed for the diacritics-removed setting.

Summary. Together, these results establish a systematic association between orthographic robustness and linguistic abstraction.
Chunk 55 · 1,999 chars
Neurons that are preserved across romanization transformations consistently encode stronger typological structure than neurons that are sensitive to script variation. Romanization thus serves as a diagnostic tool that reveals not only representational fragmentation, but also the locus of stable linguistic abstraction within multilingual models.

Figure 12: Activation probability and entropy distributions for language-specific neurons under native vs. romanized inputs (Llama-3.2-1B). Top: raw MLP; Bottom: SAE. Panel means (native → romanized): raw MLP: entropy (all) 2.10 → 2.10, activation probability (all) 0.84 → 0.93, entropy (overlapping) 1.85 → 1.81, activation probability (overlapping) 0.86 → 0.94; SAE: entropy (all) 0.75 → 1.03, activation probability (all) 0.29 → 0.26, entropy (overlapping) 0.81 → 0.77, activation probability (overlapping) 0.34 → 0.27.
Chunk 56 · 1,998 chars
E Structural Perturbation Experiments (Word Shuffling)

Datasets. Following the original setup, we use a combination of three datasets. (i) XNLI: 1,000 examples from the train split (en, de, fr, hi, es, th, bg, ru, tr, vi). (ii) PAWS-X: 1,000 examples from the train split (en, de, fr, es, ja, ko, zh). (iii) FLORES+: 997 examples from the dev split (15+ languages).

Procedure. For each dataset, we apply the LAPE and SAE-LAPE pipelines twice: (a) on sentences in their natural word order, (b) on sentences where words within each prompt are randomly permuted. All other parameters are held fixed. The resulting neuron sets are compared.

Figure 13: Mean activation statistics across languages for native and romanized inputs, for the raw MLP LAPE-identified features. Top: Gemma-2-2B; Bottom: Llama-3.2-1B. [Panels: Mean Entropy and Mean Activation Probability per language (Bengali, Bulgarian, Chinese, English, Hindi, Japanese, Korean, Marathi, Russian, Spanish, Urdu), Native vs. Romanized.]

Figure 14: Layer-wise alignment between language-associated units for Native and Romanized inputs in Gemma-2-2B. The red line denotes average Jaccard similarity for raw neurons, and the blue line for SAE features; shaded regions indicate standard deviation across languages. Both raw neurons and SAE features show a mid-layer increase in overlap.
Chunk 57 · 1,990 chars
However, in all cases, alignment remains far from convergence, indicating that representational separation persists beyond input tokenization.

E.1 Supplementary Shuffling Analyses Across Models and Representations

This appendix provides additional analyses for the shuffling experiments reported in Section 5. While the main text focuses on language-level stability and aggregate trends, here we document neuron-level overlap structure and distributional behavior across all model and representation configurations.

Aggregate Neuron Overlap Under Shuffling. We first examine neuron overlap between features identified from original and word-shuffled inputs, aggregated across all languages. Figures 18 and 19 show degree-based Venn diagrams for Llama and Gemma, respectively, under raw and SAE representations.

Across all configurations, overlap between original and shuffled feature sets remains high, indicating that shuffling preserves feature identity at the neuron level. This confirms that the stability observed at the language level in the main text also holds when aggregating across neurons.

Figure 15: Average family-wise maximum probing $R^2$ scores across neuron subsets induced by romanization (Llama-3.2-1B, SAE). Overall probing scores are lower, but overlap neurons remain dominant. [Bars: Only Native, Overlap, Only Romanized, Baseline; series: Phonology, Syntax, Genealogy (Family); y-axis: Avg. Mean Probing Score ($R^2$).]
Chunk 58 · 1,989 chars
Figure 16: Average family-wise maximum probing $R^2$ scores across neuron subsets induced by romanization (Gemma-2-2B, raw MLP). Scores are closer across subsets, with native-only neurons occasionally falling below baseline.

Figure 17: Average family-wise maximum probing $R^2$ scores across neuron subsets induced by romanization (Gemma-2-2B, SAE). Overlap neurons continue to show stronger typological alignment despite increased sparsity.

For Gemma SAE, the absolute number of identified neurons is small for certain languages, making low-degree overlap estimates unstable. In this case, we report overlap up to degree 5 rather than degree 3. When restricting attention to settings with sufficient numbers of identified neurons, high overlap is consistently recovered, in line with other configurations.

Distributional Stability of Activation Statistics. Beyond feature identity, we analyze whether shuffling induces shifts in activation behavior. Figures 20 and 21 compare distributions of activation entropy and selection probability for original versus shuffled inputs, aggregated across languages. Across all models and representations, the distributions are nearly overlapping, with only minor shifts in their means. This remains true when restricting the analysis to overlapping features (results omitted for brevity), indicating that neurons preserved under shuffling also maintain stable activation profiles.
Chunk 59 · 1,989 chars
Figure 18: Aggregate degree-based Venn diagrams comparing features from original and shuffled inputs in Llama-3.2-1B. Top: raw MLP; Bottom: SAE. High overlap indicates stability of neuron identity under word-order perturbation. [Per-language Original vs. Shuffled diagrams for Bulgarian, Chinese, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish, and Vietnamese.]

Mean Activation Statistics. Finally, we report mean activation statistics aggregated across languages. As shown in Figure 22, mean entropy and selection probability change only marginally under shuffling, reiterating that syntactic perturbation does not significantly reweight feature activity.

Summary. Together, these supplementary analyses reinforce the robustness conclusions in Section 5. Word-order shuffling preserves both neuron identity and activation statistics across models and representations. Differences observed in low-neuron regimes (e.g., Gemma SAE) are attributable to feature sparsity rather than systematic sensitivity to syntactic structure, further supporting the view that language-associated features primarily reflect token-level and distributional regularities.
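The shuffling comparison from the Procedure paragraph of Appendix E (identify units on natural text, identify them again on word-permuted text, compare the two sets) can be sketched as below. This is an illustration only: whitespace tokenization is an assumption that does not suit unsegmented scripts such as Chinese, Japanese, or Thai, and the unit sets here are made-up stand-ins for the output of the LAPE or SAE-LAPE pipeline.

```python
import random

def shuffle_words(prompt, rng):
    """Randomly permute the whitespace-separated words of one prompt."""
    words = prompt.split()
    rng.shuffle(words)  # in-place permutation
    return " ".join(words)

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of unit indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

rng = random.Random(0)  # fixed seed so the permutation is reproducible
original = "the quick brown fox jumps over the lazy dog"
shuffled = shuffle_words(original, rng)

# Hypothetical unit sets returned by the same pipeline on each input variant.
units_original = {1, 2, 3}
units_shuffled = {2, 3, 4}
stability = jaccard(units_original, units_shuffled)
```

Because shuffling keeps the bag of tokens fixed, a high Jaccard score between the two unit sets is consistent with units that track token-level rather than word-order information.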
Chunk 60 · 1,992 chars
Figure 19: Aggregate degree-based Venn diagrams comparing features from original and shuffled inputs in Gemma-2-2B. Top: raw MLP (degree 3); Bottom: SAE (degree 5). When sufficient neurons are identified, high overlap is preserved under shuffling.

E.2 Probing–Shuffling Interaction: Typological Alignment Under Syntactic Perturbation

This subsection analyzes how sensitivity to word-order shuffling correlates with typological structure, as measured by lang2vec probing. In contrast to romanization, shuffling preserves surface form and token identity while disrupting local syntactic order. We therefore examine how typological alignment distributes across neuron subsets that differ in their stability under shuffling.

Setup. For each layer, model, and representation (raw MLP or SAE), neurons are grouped into four subsets based on their activity under original and shuffled inputs: three disjoint subsets, namely (i) normal-only neurons (active only for original text), (ii) shuffled-only neurons, and (iii) overlap neurons active under both conditions, together with (iv) a baseline consisting of all neurons in the layer. For each subset, we compute the average family-wise maximum probing $R^2$ score across neurons for the three typological feature families used throughout the paper: fam, syntax, and phonology. All plots report mean values aggregated across layers; we use degree3_mean for all configurations, except for Gemma SAE where degree5_mean is used due to low neuron counts in some languages.
Chunk 61 · 1,999 chars
Figure 20: Activation entropy and selection probability distributions for original and shuffled inputs in Llama-3.2-1B. Top: raw MLP; Bottom: SAE. The near-identical distributions indicate minimal distributional shift under shuffling. Panel means (original → shuffled): raw MLP: entropy (all) 2.36 → 2.33, activation probability (all) 0.85 → 0.87, entropy (overlapping) 2.31 → 2.28, activation probability (overlapping) 0.86 → 0.87; SAE: entropy (all) 0.71 → 0.73, activation probability (all) 0.40 → 0.41, entropy (overlapping) 0.69 → 0.66, activation probability (overlapping) 0.40 → 0.41.

Raw Representations Show Uniform Typological Alignment. Figures 6 and 25 show results for raw MLP representations in Llama and Gemma.
Chunk 62 · 1,990 chars
In both models, probing scores are remarkably similar across the normal-only, shuffled-only, and overlap subsets. This indicates that, at the level of distributed raw activations, sensitivity to word-order perturbation is largely decoupled from typological alignment. Neurons that respond selectively to shuffled inputs are no less typologically informative than those that respond to original inputs.

SAE Representations Expose a Structured Hierarchy. A different pattern emerges for SAE representations (Figures 24 and 26). For both Llama and Gemma, we observe a consistent ordering: normal-only ≈ shuffled-only > overlap. That is, neurons selective to a single condition, whether original or shuffled, exhibit stronger typological alignment than neurons that remain active across both. This contrasts sharply with the romanization setting, where overlap neurons were most informative, and suggests that invariance to word-order perturbation does not preferentially select for typologically informative features in sparse representations.
Chunk 63 · 1,996 chars
Figure 21: Activation entropy and selection probability distributions for original and shuffled inputs in Gemma-2-2B. Top: raw MLP; Bottom: SAE. Distributional shifts remain small across representations. Panel means (original → shuffled): raw MLP: entropy (all) 2.06 → 2.07, activation probability (all) 0.80 → 0.82, entropy (overlapping) 2.03 → 1.99, activation probability (overlapping) 0.80 → 0.82; SAE: entropy (all) 1.23 → 1.30, activation probability (all) 0.28 → 0.27, entropy (overlapping) 1.23 → 1.22, activation probability (overlapping) 0.28 → 0.28.

Baseline Effects in Llama. In Llama, baseline probing scores are substantially lower than those of any condition-specific subset, for both raw and SAE representations. This gap is less pronounced in Gemma. The result suggests that in Llama, typological information is concentrated in a relatively small subset of neurons, and is diluted when averaging across the full layer.

Preservation of Typological Hierarchy. Across all models, representations, and neuron subsets, the relative ordering of feature families remains unchanged: fam > syntax > phonology. Thus, while shuffling-sensitive partitioning modulates the strength of typological alignment, it does not alter the underlying hierarchy of linguistic information.

Representative Results. Figure 6, together with Figures 24–26, shows the full set of results for all configurations.
Chunk 64 · 1,995 chars
ve Results. Figure 6, along with Figures 24–26, show the full set of results for all configurations. Summary. Together, these results indicate that robustness to syntactic perturbation is not a reliable indicator of typological abstraction. In raw repre- sentations, typological information is broadly dis- tributed and largely insensitive to shuffling-based partitioning. In contrast, sparse representations reveal that neurons invariant to shuffling are not necessarily those most aligned with linguistic ty- pology, highlighting a clear qualitative difference between orthographic and syntactic perturbations. F Probing Typological Structure Across Layers F.1 Experimental Setup This section describes the probing framework used to relate neuron- and SAE-feature activations to typological properties of languages. Activation Extraction. For each language and layer, we extract mean activations corresponding to either raw model hidden states or SAE latents, depending on the probing condition. Given a model layer ℓ and a selected set of neu- rons or SAE features Nℓ, we collect activations over a multilingual dataset as follows. For each minibatch, we extract the hidden states at layer ℓ (or the corresponding SAE latent activations) and average over both batch and token dimensions. These per-batch means are then aggregated across batches to obtain a single activation vector per lan- guage and layer: x(k) ℓ ∈ R|Nℓ|, where k indexes languages. Activations are collected from the FLO- RES+ dataset using the train split, with batch size 16. Typological Features. Typological targets are loaded from lang2vec features. Each feature set -- 25 of 35 -- Bulgarian Chinese English French German Hindi Italian Japanese Korean Portuguese Russian Spanish Thai Turkish Vietnamese 0.0 0.5 1.0 1.5 2.0 2.5 Mean Entropy Original Shuffled Bulgarian Chinese English French German Hindi Italian Japanese Korean Portuguese Russian Spanish Thai Turkish Vietnamese 0.0 0.2 0.4 0.6 0.8 Mean
Chunk 65 · 1,984 chars
[Figure 22 plots: per-language bar charts (Bulgarian, Chinese, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish, Vietnamese) of Mean Entropy and Mean Activation Probability; series Original and Shuffled.]

Figure 22: Mean activation entropy and selection probability across languages before and after shuffling. Top: Llama-3.2-1B; Bottom: Gemma-2-2B (raw MLP). Mean-level changes are small, consistent with distribution-level stability.

[Figure 23 plot: per-language Jaccard Similarity (0.0 to 1.0); series SAE Features and Raw Neurons.]

Figure 23: Jaccard similarity between language-associated units identified from original and word-shuffled text in Gemma-2-2B. Raw neurons exhibit consistently moderate-to-high overlap across languages, indicating robustness to word-order perturbation. SAE features also show high overlap, revealing robustness of sparse features to local distributional patterns disrupted by shuffling.

corresponds to a matrix $Y \in \mathbb{R}^{L \times F}$, where $L$ is the number of languages and $F$ the number of typological dimensions.

Feature sets include syntactic, phonological, and inventory-based features, as well as genealogical family and geographic coordinates. Prior to probing, feature dimensions with zero variance across the selected languages are removed to ensure well-defined regression targets.
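The zero-variance filtering step described above can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' code: the function name `drop_constant_features` and the toy matrix are hypothetical, and the sketch assumes the feature matrix contains no missing values.

```python
def drop_constant_features(Y):
    """Remove typological dimensions that take a single value across languages.

    Y is a list of rows (one per language), each a list of F feature values.
    Returns the filtered matrix and the indices of the columns that were kept.
    Hypothetical helper: assumes every row has the same length and no missing values.
    """
    n_features = len(Y[0])
    # A column has zero variance exactly when all of its entries are equal.
    kept = [j for j in range(n_features) if len({row[j] for row in Y}) > 1]
    filtered = [[row[j] for j in kept] for row in Y]
    return filtered, kept


# Toy example: 3 languages, 3 feature dimensions; the first column is constant.
Y_toy = [
    [1.0, 0.0, 0.5],
    [1.0, 1.0, 0.5],
    [1.0, 0.0, 0.2],
]
Y_filtered, kept_columns = drop_constant_features(Y_toy)
```

Dropping constant columns up front matters because the coefficient of determination is undefined when the target has no variance, so such dimensions could never yield a usable probe score.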
Chunk 66 · 1,998 chars
ons.

Feature sets include syntactic, phonological, and inventory-based features, as well as genealogical family and geographic coordinates. Prior to probing, feature dimensions with zero variance across the selected languages are removed to ensure well-defined regression targets.

[Figure 24 plot: x-axis Only Normal, Overlap, Only Shuffled, Baseline; y-axis Avg. Mean Probing Score ($R^2$), 0.0 to 0.6; series Phonology, Syntax, Genealogy (Family).]

Figure 24: Average family-wise maximum probing $R^2$ scores across neuron subsets under shuffling (Llama-3.2-1B, SAE). Condition-specific subsets dominate overlap neurons; baseline scores remain lowest.

Regression Setup. Probing is formulated as a set of univariate regression problems. For each neuron or feature $n \in N_\ell$ and each typological dimension $f$, we fit a linear model across languages:

$$y^{(k)}_f = \beta_{n,f}\, x^{(k)}_n + \epsilon^{(k)},$$

where $x^{(k)}_n$ denotes the mean activation of neuron $n$ for language $k$. To stabilize estimation under small sample sizes, we use ridge regression with regularization coefficient $\lambda = 1.0$. Importantly, each neuron is probed independently, i.e., regressions are single-predictor models rather than multivariate probes.

[Figure 25 plot: same axes and series as Figure 24.]

Figure 25: Average family-wise maximum probing $R^2$ scores across neuron subsets under shuffling (Gemma-2-2B, raw MLP). Typological alignment is similar across normal-only, shuffled-only, and overlap subsets.

[Figure 26 plot: same axes and series as Figure 24.]

Figure 26: Average family-wise maximum probing $R^2$ scores across neuron subsets under shuffling (Gemma-2-2B, SAE). Results use degree-5 aggregation due to low neuron counts. As in Llama, condition-specific subsets show stronger typological alignment than overlap neurons.

Cross-Validation and Evaluation. Probe quality
Chunk 67 · 1,995 chars
ure 26: Average family-wise maximum probing $R^2$ scores across neuron subsets under shuffling (Gemma-2-2B, SAE). Results use degree-5 aggregation due to low neuron counts. As in Llama, condition-specific subsets show stronger typological alignment than overlap neurons.

Cross-Validation and Evaluation. Probe quality is assessed using 5-fold cross-validation over languages. In each fold, regression coefficients are estimated on the training languages and evaluated on held-out languages. The coefficient of determination ($R^2$) is computed for each neuron–feature pair on the test split. For numerical stability and efficiency, regression is implemented in closed form and evaluated in blocks over both neuron and feature dimensions.

For each neuron $n$ and feature $f$, the final probe score is obtained by averaging $R^2$ across folds:

$$R^2_{n,f} = \frac{1}{K} \sum_{k=1}^{K} R^{2\,(k)}_{n,f},$$

where $K = 5$ is the number of folds and $k$ here indexes folds rather than languages. Neuron–feature pairs with undefined $R^2$ values (e.g., due to zero variance in the target) are excluded.

[Figure 27 plots: x-axis Layer (1 to 16); y-axis Average Max $R^2$ (0.0 to 1.0); panels LLaMA-3.2-1B (Raw) and LLaMA-3.2-1B (SAE); series Genealogy (Family), Syntax, Phonology.]

Figure 27: Layerwise probing performance in Llama-3.2-1B. Top: Raw MLP activations. Bottom: SAE features. SAE representations are comparatively stronger in early layers, while raw activations dominate in later layers.

F.2 Detailed Layerwise Probing Comparisons

Here we provide a detailed layerwise analysis of probing results for the three typological feature families used in the final experiments: fam, syntax, and phonology. We focus on (i) differences between raw MLP activations and SAE representations, and (ii) cross-model differences between Llama-3.2-1B and Gemma-2-2B. All plots report layerwise averages of maximum $R^2$ scores per feature family.

Raw vs. SAE representations in Llama.
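To make the probing procedure from Section F.1 concrete, the following is a minimal sketch of a single-predictor ridge probe scored by cross-validated $R^2$. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the model has no intercept (mirroring the stated equation), folds are assigned by interleaving language indices, and $R^2$ on a held-out fold is computed against that fold's own target mean.

```python
from statistics import mean


def ridge_beta(x, y, lam=1.0):
    """Closed-form ridge coefficient for one predictor and no intercept (assumption)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def r_squared(y_true, y_pred):
    """Coefficient of determination; None when the target has zero variance."""
    centre = mean(y_true)
    ss_tot = sum((v - centre) ** 2 for v in y_true)
    if ss_tot == 0:
        return None  # undefined R^2, excluded by the caller
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def cv_probe_score(x, y, n_folds=5, lam=1.0):
    """Average held-out R^2 over folds for one neuron-feature pair.

    x[k] is the mean activation of the neuron for language k,
    y[k] is the typological feature value for language k.
    """
    n = len(x)
    scores = []
    for fold in range(n_folds):
        # Interleaved fold assignment (an illustrative choice, not from the paper).
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        beta = ridge_beta([x[i] for i in train_idx], [y[i] for i in train_idx], lam)
        score = r_squared([y[i] for i in test_idx], [beta * x[i] for i in test_idx])
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None


# Toy data: 15 "languages" whose feature value is exactly twice the activation.
activations = [float(i) for i in range(1, 16)]
targets = [2.0 * v for v in activations]
perfect_score = cv_probe_score(activations, targets, lam=0.0)
undefined_score = cv_probe_score(activations, [3.0] * 15)
```

Because each neuron is probed on its own, the per-pair computation reduces to a few sums, which is what makes a blocked closed-form evaluation over many neurons and features practical.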
Chunk 68 · 1,995 chars
final experiments: fam, syntax, and phonology. We focus on (i) differences between raw MLP activations and SAE representations, and (ii) cross-model differences between Llama-3.2-1B and Gemma-2-2B. All plots report layerwise averages of maximum $R^2$ scores per feature family.

Raw vs. SAE representations in Llama. Figure 27 shows the layerwise probing trends for Llama raw and SAE representations, while Figure 28 visualizes their differences directly. In early layers, SAE features are more informative than raw MLP activations for all three feature families, resulting in negative raw–SAE differences. This indicates that SAE training amplifies weak but structured typological signals that are only diffusely present in shallow raw activations. As depth increases, this advantage steadily diminishes and the difference shifts toward positive values, indicating that raw representations become more linearly informative in deeper layers. This transition reflects a shift from early sparse amplification to richer distributed encoding in later layers.

[Figure 28 plot: x-axis Layer (1 to 16); y-axis Difference in Average Max $R^2$ (−0.4 to 0.4); panel LLaMA-3.2-1B (Raw − SAE); series Genealogy (Family), Syntax, Phonology.]

Figure 28: Raw minus SAE probing score differences for Llama-3.2-1B. Negative values in shallow layers indicate higher SAE informativeness, while the gradual shift toward positive values reflects increasing raw dominance with depth.

Raw vs. SAE representations in Gemma. The corresponding Gemma plots are shown in Figures 29 and 30. Unlike Llama, Gemma exhibits a more stable relationship between raw and SAE representations across layers. For fam and syntax, raw activations are consistently more informative than SAE features, yielding positive differences across depth. In contrast, phonology shows consistently negative differences, indicating that Gemma SAEs preferentially preserve phonological structure relative to raw MLP
Chunk 69 · 1,999 chars
entations across layers. For fam and syntax, raw activations are consistently more informative than SAE features, yielding positive differences across depth. In contrast, phonology shows consistently negative differences, indicating that Gemma SAEs preferentially preserve phonological structure relative to raw MLP activations. This feature-specific asymmetry suggests that sparse factorization interacts differently with lower-level sound-related abstractions than with genealogical or syntactic structure.

Cross-Model Comparison: Llama vs. Gemma. Figure 31 presents direct comparisons between Llama and Gemma under matched representational settings. Raw MLP activations exhibit stark cross-model differences in shallow layers for all three feature families, with phonology showing substantially larger gaps than fam or syntax. Moreover, raw cross-model differences decrease sharply with depth, producing a pronounced downward trend across all feature families. This suggests that early typological representations are strongly shaped by architectural and tokenizer-specific factors, while deeper layers converge toward more similar abstractions. In contrast, SAE representations substantially attenuate these differences. Although phonology remains the most discriminative feature family, the overall magnitude and depth-dependence of cross-model differences are reduced, indicating that sparse representations emphasize later-stage, shared abstractions over model-specific surface variation.

[Figure 29 plots: x-axis Layer (1 to 25); y-axis Average Max $R^2$ (0.0 to 1.0); panels Gemma-2-2B (Raw) and Gemma-2-2B (SAE); series Genealogy (Family), Syntax, Phonology.]

Figure 29: Layerwise probing performance in Gemma-2-2B. Top: Raw MLP activations. Bottom: SAE features. Raw representations dominate for fam and syntax, while SAE features retain stronger
Chunk 70 · 1,997 chars
[Figure 29 plot, SAE panel: x-axis Layer; y-axis Average Max $R^2$ (0.0 to 1.0); panel Gemma-2-2B (SAE); series Genealogy (Family), Syntax, Phonology.]

Figure 29: Layerwise probing performance in Gemma-2-2B. Top: Raw MLP activations. Bottom: SAE features. Raw representations dominate for fam and syntax, while SAE features retain stronger phonological signals across layers.

[Figure 30 plot: x-axis Layer (1 to 25); y-axis Difference in Average Max $R^2$ (−0.4 to 0.4); panel Gemma-2-2B (Raw − SAE); series Genealogy (Family), Syntax, Phonology.]

Figure 30: Raw minus SAE probing score differences for Gemma-2-2B. Differences are stable across depth: positive for fam and syntax, and negative for phonology.

Summary. These detailed comparisons show that sparse autoencoding reshapes typological structure in a depth-, model-, and feature-dependent manner. Llama SAEs transiently enhance early-layer typological accessibility, Gemma SAEs selectively favor phonology features, and phonology consistently emerges as the most sensitive axis for cross-model differences – particularly in shallow raw representations.

[Figure 31 plots: x-axis Layer (1 to 16); y-axis Difference in Average Max $R^2$ (−0.4 to 0.4); panels Gemma-2-2B − LLaMA-3.2-1B (Raw) and Gemma-2-2B − LLaMA-3.2-1B (SAE); series Genealogy (Family), Syntax, Phonology.]

Figure 31: Cross-model comparison of probing performance. Top: Raw MLP activations. Bottom: SAE features. Raw representations show large early-layer differences, especially for phonology, followed by sharp convergence with depth, while SAE representations compress these disparities.

G Causal Interventions on Invariant Neuron Sets

This appendix reports causal intervention experiments designed to assess whether neuron subsets identified via invariance-based analyses are functionally necessary for multilingual language
Chunk 71 · 1,997 chars
p: Raw MLP activations. Bottom: SAE features. Raw representations show large early-layer differences, especially for phonology, followed by sharp convergence with depth, while SAE representations compress these disparities.

identified via invariance-based analyses are functionally necessary for multilingual language modeling. We intervene on neuron sets defined by their stability under controlled input perturbations: word-order shuffling and script romanization, introduced in Sections 5 and 4, respectively. All experiments are conducted on the first 100 examples per language from the FLORES+ dataset.

G.1 Neuron Selection via Shuffling and Romanization

Neuron subsets are derived from earlier analyses that characterize neuron behavior under targeted surface perturbations.

Shuffling-Based Neuron Sets. Using word-level shuffling experiments, neurons are categorized into:
(i) Overlap neurons: Neurons consistently identified under both normal and shuffled inputs. These neurons are invariant to word-order perturbations and are hypothesized to encode structurally necessary representations.
(ii) Only-unshuffled neurons: Neurons identified only under normal inputs and absent under shuffled conditions. These neurons are sensitive to surface word order and local syntactic structure.

Romanization-Based Neuron Sets. Using native-script versus romanized inputs, neurons are grouped into:
(i) Overlap neurons: Neurons shared across native and romanized scripts, hypothesized to encode script-invariant representations.
(ii) Only-native neurons: Neurons active only for native-script inputs and whose functional signature disappears under romanization, indicating sensitivity to surface orthography.

Across both regimes, overlap neurons are defined by invariance to the corresponding perturbation, while non-overlap neurons capture sensitivity to surface form. For all experiments, matched random control sets are constructed by sampling an equal number of neurons uniformly from
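The set definitions in G.1 amount to simple set algebra over the units selected under each input condition. The sketch below illustrates this in plain Python; it is not the authors' code. The function names, the `(layer, index)` unit encoding, and the seeded sampler are hypothetical, and the control set is drawn from the full pool without excluding the target set, which is an assumption.

```python
import random


def partition_units(normal_units, perturbed_units):
    """Split selected units into overlap and condition-specific subsets.

    Each unit is a hashable identifier, here a (layer, neuron_index) tuple.
    """
    normal, perturbed = set(normal_units), set(perturbed_units)
    return {
        "overlap": normal & perturbed,         # selected under both conditions
        "only_normal": normal - perturbed,     # e.g. only-unshuffled / only-native
        "only_perturbed": perturbed - normal,  # selected only after perturbation
    }


def matched_random_control(target_units, unit_pool, seed=0):
    """Sample a control set with the same cardinality as the target set."""
    rng = random.Random(seed)
    # Sorting first makes the draw reproducible regardless of set ordering.
    return set(rng.sample(sorted(unit_pool), len(target_units)))


# Toy example: 2 layers x 5 neurons, with hypothetical selections.
pool = [(layer, idx) for layer in range(2) for idx in range(5)]
selected_normal = [(0, 1), (0, 2), (1, 3)]
selected_shuffled = [(0, 2), (1, 3), (1, 4)]

subsets = partition_units(selected_normal, selected_shuffled)
control = matched_random_control(subsets["overlap"], pool)
```

Matching the control set's size to the target set keeps the comparison about which neurons are ablated rather than how many.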
Chunk 72 · 1,997 chars
itivity to surface orthography.

Across both regimes, overlap neurons are defined by invariance to the corresponding perturbation, while non-overlap neurons capture sensitivity to surface form. For all experiments, matched random control sets are constructed by sampling an equal number of neurons uniformly from the overall neuron pool of the model.

G.2 Intervention Protocol

All experiments are conducted on raw model activations.

Ablation Scope. To avoid layer-local confounds, we apply simultaneous ablation across all layers. For each layer $\ell$, activations of the selected neuron set are modified during the forward pass.

Ablation Types. We consider:
(i) Zero ablation for shuffling-based neuron sets: Activations are set to zero.
(ii) Cross-language mean ablation for romanization-based neuron sets: Activations are replaced by mean activation vectors computed from another language.
Mean vectors are computed over the corresponding FLORES+ split of the source language.

G.3 Evaluation Metrics and Statistical Testing

For each example, we compute clean and patched perplexities ($\mathrm{PPL}_{\text{clean}}$, $\mathrm{PPL}_{\text{patch}}$), perplexity ratios, and perplexity deltas ($\Delta\mathrm{PPL}$). Paired-sample t-tests compare targeted ablations against matched random controls over the 100 examples, with significance assessed at $p < 0.05$.

G.4 Causal Intervention Results: Llama-3.2-1B and Llama-3-8B

G.4.1 Shuffling-Based Zero Ablation

Table 5 reports results for shuffling-derived neuron sets. Across languages, ablation of overlap neurons induces the largest and most consistent degradations. For instance, Hindi in Llama-3.2-1B exhibits

| Model | Lang | Category | PPL_target ratio | PPL_ctrl ratio | p (ratio) | ΔPPL_target | ΔPPL_ctrl | p (Δ) |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | en | overlap | 1.116 | 0.954 | 1.5×10⁻⁵⁰ | +272.3 | −108.6 | 2.2×10⁻³⁵ |
| Llama-3.2-1B | en | only-unshuffled | 0.963 | 1.044 | 3.0×10⁻⁴⁸ | −87.1 | +99.6 | 1.9×10⁻³⁴ |
| Llama-3.2-1B | hi | overlap | 2.786 | 1.055 | 2.2×10⁻¹⁹ | +1914.0 | +204.9 | 1.6×10⁻¹² |
| Llama-3.2-1B | hi | only-unshuffled | 1.083 | 0.947 | 4.1×10⁻⁴⁰ | +228.6 | −133.3 | 4.6×10⁻¹⁰ |
| Llama-3.2-1B | fr | overlap | | | | | | |
Chunk 73 · 1,996 chars
| Model | Lang | Category | PPL_target ratio | PPL_ctrl ratio | p (ratio) | ΔPPL_target | ΔPPL_ctrl | p (Δ) |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | en | overlap | 1.116 | 0.954 | 1.5×10⁻⁵⁰ | +272.3 | −108.6 | 2.2×10⁻³⁵ |
| Llama-3.2-1B | en | only-unshuffled | 0.963 | 1.044 | 3.0×10⁻⁴⁸ | −87.1 | +99.6 | 1.9×10⁻³⁴ |
| Llama-3.2-1B | hi | overlap | 2.786 | 1.055 | 2.2×10⁻¹⁹ | +1914.0 | +204.9 | 1.6×10⁻¹² |
| Llama-3.2-1B | hi | only-unshuffled | 1.083 | 0.947 | 4.1×10⁻⁴⁰ | +228.6 | −133.3 | 4.6×10⁻¹⁰ |
| Llama-3.2-1B | fr | overlap | 1.118 | 1.030 | 5.4×10⁻⁶ | +114.6 | +74.6 | 0.145 |
| Llama-3.2-1B | fr | only-unshuffled | 0.935 | 0.957 | 2.8×10⁻¹⁰ | −130.7 | −85.8 | 5.5×10⁻⁷ |
| Llama-3.2-1B | zh | overlap | 1.217 | 0.960 | 2.4×10⁻¹⁷ | +952.6 | −523.2 | 2.4×10⁻²⁴ |
| Llama-3.2-1B | zh | only-unshuffled | 0.936 | 0.982 | 5.7×10⁻¹⁷ | −695.3 | −210.4 | 4.1×10⁻¹⁶ |
| Llama-3-8B | en | overlap | 0.925 | 0.992 | 3.7×10⁻⁸ | −23.3 | −1.7 | 5.3×10⁻⁶ |
| Llama-3-8B | en | only-unshuffled | 0.832 | 0.993 | 1.3×10⁻⁵⁹ | −40.1 | −1.1 | 1.8×10⁻²³ |
| Llama-3-8B | hi | overlap | 4.348 | 1.021 | 4.4×10⁻¹² | +289.7 | +10.5 | 2.2×10⁻¹⁰ |
| Llama-3-8B | hi | only-unshuffled | 0.859 | 1.017 | 4.0×10⁻³⁰ | −26.8 | +3.1 | 4.0×10⁻¹⁰ |
| Llama-3-8B | fr | overlap | 0.959 | 0.955 | 0.813 | −29.1 | −15.1 | 0.134 |
| Llama-3-8B | fr | only-unshuffled | 0.867 | 1.020 | 4.6×10⁻¹⁹ | −59.1 | +15.2 | 6.8×10⁻¹² |
| Llama-3-8B | zh | overlap | 1.236 | 0.990 | 1.1×10⁻¹⁷ | +53.0 | −5.0 | 4.6×10⁻¹⁷ |
| Llama-3-8B | zh | only-unshuffled | 0.934 | 0.982 | 2.2×10⁻¹⁴ | −20.9 | −6.4 | 8.0×10⁻¹¹ |

Table 5: Causal zero-ablation results for shuffling-derived neuron sets across models (Llama models, raw neurons). Values report means over the first 100 FLORES+ examples per language. Control sets consist of matched random neurons with identical cardinality. Llama-3-8B shows qualitatively similar directional effects to Llama-3.2-1B, with stronger amplification in Hindi and Chinese, while maintaining robustness patterns under only-unshuffled conditions.

the strongest effect, with PPL ratios approaching 2.8 and $\Delta\mathrm{PPL}$ exceeding +1900. Chinese shows similarly pronounced degradation, while English and French exhibit smaller but still significant effects. In all cases, overlap ablations degrade performance substantially more than matched random controls.

In contrast, only-unshuffled neurons yield weaker and sometimes inverted effects. For English and French, ablation leads to reductions in perplexity relative to clean runs. Hindi shows a small increase, but far
Chunk 74 · 1,991 chars
. In all cases, overlap ablations degrade performance substantially more than matched random controls.

In contrast, only-unshuffled neurons yield weaker and sometimes inverted effects. For English and French, ablation leads to reductions in perplexity relative to clean runs. Hindi shows a small increase, but far weaker than overlap ablations, while Chinese exhibits negative $\Delta\mathrm{PPL}$ despite statistical significance. These patterns indicate that only-unshuffled neurons encode order-sensitive surface regularities that are largely redundant for language modeling.

Moreover, the larger model (Llama-3-8B) produces stronger effects for Hindi and Chinese than the smaller one (Llama-3.2-1B).

Qualitative Effects Under Shuffling. As shown in Figure 32, ablation of overlap neurons induces systematic qualitative failures in Hindi. Most notably, we observe within-word script mixing, where individual lexical items combine Devanagari and Latin characters (e.g., mixed-script morphemes). This phenomenon is not observed under random or only-unshuffled ablations, where script switching – if present – occurs only at word boundaries. These qualitative failures align with the large perplexity degradation and indicate that overlap neurons play a role in maintaining subword-level orthographic coherence.

G.4.2 Romanization-Based Cross-Language Mean Ablation

Table 6 summarizes results for cross-language ablations between Hindi and English for both Llama models. Replacing overlap neuron activations across languages yields relatively mild effects. English shows a slight decrease in perplexity, while Hindi shows a modest increase. The small magnitude of these effects suggests that overlap neurons encode representations that are largely invariant to script and language identity.

In contrast, replacing only-native neuron activations leads to extreme effects. Generations in both languages are severely affected. Qualitative inspection for the Hindi-to-English ablations in
Chunk 75 · 1,992 chars
effects suggests that overlap neurons encode representations that are largely invariant to script and language identity. In contrast, replacing only-native neuron activa- tions leads to extreme effects. Generations in both languages are severely affected. Qualitative inspec- tion for the Hindi-to-English ablations in Llama- 3.2-1B reveals that these reductions arise from lan- guage switching rather than improved Hindi mod- eling: many generations abandon Hindi entirely and continue fluently in English which has higher likelihood under the model. Sometimes the genera- tions switch to other languages like Bengali, which -- 30 of 35 -- idx Prefix (prompt context) Clean continuation Patched continuation (overlap-set ablation) Inference 68 लॉकवुड गाडन के करायेदारों का मानना है क अन् य 40 परवार या उससे ज़् यादा बेदखली का सामना कर सकते हैं, चूँक उन् हें पता चला है क ओएचए पुलस ऑकलैंड में अन् य सावजनक आवास संपयों की जाँच कर रही है जो आवास घोटाले में पकड़े जा सकत ◌े हैं। लॉकवुड गाडन के करायेदारों न ◌े हैं। लockwood Garden के करायedar◌ों का मानना Within-word script mixing: Devanagari tokens corrupted by Latin (sometimes Arabic) characters inside a single word. 73 कुछ सप् ताह पहले, पत्र कार माकीस त्र नताफेलोपोस द् वारा अपने लोकप्र य टेलीवजन शो ज़ूंगला में अल् फा टीवी में प्र काशत जानकारी के बाद, संसद के सदस् य और वकील, पेट्र ोस मंतौवलोस को पदमुक् त कर दया गया था क् यों क उनके कायालय के सदस् य अवैध भ्रष् टाचार और भ भ्रष् टाचार के लए िजम् मेदार थे। पेट्र भ्रष् टाचार के साथ संबंधत कriminal charges were filed against them 58 पुलस ने कहा क लो पकालो की िस् थत मज़बूत थी क् यों क वह पलेम में प्र ोवेनज़ानो का दाहना हाथ था और उसके अनुभव ने उसे पुरानी पीढ़ी के मालकों का सम् मान दलवाया था क् यों क वे अपने शिक् त तंत्र को मजबूत करते हुए अपनी छव को न् यून बन ◌ाने के लए प्र ोवेनज़ानो के साथ मल ◌ाने के लए कوشish करते थे. The police said that the condition 10 11:29 बजे के आस-पास, वरोध करने वाले लोग व् हाइटहॉल की ओर बढ़ गए. इसके बाद, वे कनारे-कनारे चलते हुए ट्र ाफ़ल् गर स्क्
Chunk 76 · 1,997 chars
के मालकों का सम् मान दलवाया था क् यों क वे अपने शिक् त तंत्र को मजबूत करते हुए अपनी छव को न् यून बन ◌ाने के लए प्र ोवेनज़ानो के साथ मल ◌ाने के लए कوشish करते थे. The police said that the condition 10 11:29 बजे के आस-पास, वरोध करने वाले लोग व् हाइटहॉल की ओर बढ़ गए. इसके बाद, वे कनारे-कनारे चलते हुए ट्र ाफ़ल् गर स्क् वायर, एल् डवच, कंग् सवे से गुजरते हुए होलबन की ओर बढ़ गए, जहाँ कंजरवेटव पाट ◌ेस् ट के लए बैठे हुए थे। वे लोगों क estival के प्र दशन के लए सCHEDULED TO GO TO THE WHITE HOUSE 9 11:20 पर, पुलस ने प्र दशनकारयों को यह कहते हुए फ़ुटपाथ पर वापस जाने के लए कहा क उन् हें भीड़ इकट् ठा करने के साथ वरोध करने के अधकार को संतुलत करने की ज़रू रत है. #DelhiPolice #DelhiRiots #DelhiRiots2020 # 11:20 pm, पulwama, Jammu and Kashmir, India: A 0 सोमवार को, स् टैनफ़ोड यूनवसटी स् कूल ऑफ़ मेडसन के वैज्ञ ानकों ने एक नए डायग् नोिस् टक उपकरण के आवष् कार की घोषणा की जो कोशकाओं को उनके प्र कार के आधार पर छाँट सकता है: एक छोटी प्रं ट करने योग् य चप िजसे स् टैण् डड इंकजेट प्रं टर का उपयोग करक ◌े कया जा सकता है और एक सॉफ् टवेयर जो क ◌े कया जाएगा और कlonal DNA कlonal DNA कlonal DNA 6 कातलान की राजधानी (बासलोना) में जाने के बाद से, वडाल ने क् लब के लए 49 गेम खेले थे. क् लब के लए 49 गेम खेले हैं क The 49-year-old Spaniard has been a regular in the first team since 201 Complete or cross-word language collapse into English mid-generation. 5 28 वषय वडाल तीन सीजन पहले सेवला से बारका में शामल हुए थे। 28 वषय वडाल तीन सीजन पहले सेवला 28 वषय vidal tien, who was previously a member of the Seville 1 शोधकताओं ने कहा है क यह अल् प आय वाले देशों में कैंसर, टीवी, एचआईवी और मलेरया के रोगयों की आसानी से पहचान करेगा, जहाँ अमीर देशों की तुलना में स् तन कैंसर जैसी बीमारी में जीवत रहने की दर आधी हो सकती है. शोधकताओं ने कहा है क यह अल् प आय The study, published in the journal Cancer, found that the risk of cancer, HIV and Figure 32: Representative Hindi generations under shuffling-based overlap-neuron ablation (Llama-3.2-1B, raw). Each row shows the input prefix, clean continuation, and ablated
Chunk 77 · 1,993 chars
रहने की दर आधी हो सकती है. शोधकताओं ने कहा है क यह अल् प आय The study, published in the journal Cancer, found that the risk of cancer, HIV and

Figure 32: Representative Hindi generations under shuffling-based overlap-neuron ablation (Llama-3.2-1B, raw). Each row shows the input prefix, clean continuation, and ablated continuation for the same example. While clean generations preserve Devanagari script integrity, overlap-neuron ablations induce within-word mixed-script corruption, abrupt language switching, and topic drift. Such token-internal script mixing is not observed under matched random or only-unshuffled neuron ablations.

| Model | Lang | Category | PPL_target ratio | PPL_ctrl ratio | p (ratio) | ΔPPL_target | ΔPPL_ctrl | p (Δ) |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | en | overlap | 0.947 | 0.991 | 6.9×10⁻⁴⁵ | −127.1 | −21.1 | 9.7×10⁻²⁸ |
| Llama-3.2-1B | en | only-native | 1.498 | 0.955 | 5.0×10⁻⁸⁹ | +1176.9 | −104.9 | 3.1×10⁻⁴⁰ |
| Llama-3.2-1B | hi | overlap | 1.047 | 0.982 | 9.5×10⁻³⁴ | +79.1 | −53.8 | 1.7×10⁻⁷ |
| Llama-3.2-1B | hi | only-native | 0.312 | 0.970 | 1.2×10⁻³⁸ | −1800.5 | −92.5 | 7.7×10⁻¹¹ |
| Llama-3-8B | en | overlap | 0.946 | 0.989 | 2.9×10⁻³ | −15.4 | −2.0 | 1.1×10⁻³ |
| Llama-3-8B | en | only-native | 0.817 | 0.999 | 1.2×10⁻¹⁸ | −46.4 | +0.2 | 2.1×10⁻¹⁰ |
| Llama-3-8B | hi | overlap | 1.062 | 0.990 | 2.1×10⁻¹² | +7.8 | −3.4 | 3.6×10⁻³ |
| Llama-3-8B | hi | only-native | 7.738 | 0.948 | 3.5×10⁻²⁹ | +1326.6 | −15.4 | 2.8×10⁻¹¹ |

Table 6: Cross-language mean ablation results for romanization-derived neuron sets (raw neurons). Rows indicate forward-pass language; mean activations are taken from the opposite language. Results are averaged over the first 100 FLORES+ examples. Control sets consist of matched random neurons with identical cardinality. Llama-3-8B exhibits qualitatively similar patterns to Llama-3.2-1B: only-native Hindi neurons cause dramatic perplexity changes when ablated during Hindi inference (PPL_target ratio = 7.74), while overlap neurons produce modest effects across both models.

contributes to the drastic increase in perplexity for Llama-3-8B.

G.5 Causal Intervention Results: Gemma-2-2B

We repeat the same analyses on Gemma-2-2B to assess cross-model consistency.

G.5.1
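The ratio, delta, and p-value columns reported in these tables can be illustrated with a short sketch. It is a simplified reading of the evaluation in G.3, not the authors' code: the function names and toy numbers are hypothetical, the ratio is assumed to be patched perplexity divided by clean perplexity, and only the paired t statistic is computed (turning it into a p-value requires the Student t distribution with n − 1 degrees of freedom, for example via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev


def ppl_metrics(clean_ppl, patched_ppl):
    """Per-example perplexity ratios and deltas (ratio = patched / clean, assumed)."""
    ratios = [p / c for c, p in zip(clean_ppl, patched_ppl)]
    deltas = [p - c for c, p in zip(clean_ppl, patched_ppl)]
    return ratios, deltas


def paired_t_statistic(target_values, control_values):
    """t statistic of a paired-sample t-test over per-example differences.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [t - c for t, c in zip(target_values, control_values)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Toy per-example perplexities (hypothetical numbers, three examples).
clean = [10.0, 20.0, 40.0]
patched_target = [20.0, 30.0, 80.0]   # after ablating the targeted neuron set
patched_control = [11.0, 19.0, 42.0]  # after ablating a matched random set

target_ratios, target_deltas = ppl_metrics(clean, patched_target)
control_ratios, control_deltas = ppl_metrics(clean, patched_control)
t_ratio = paired_t_statistic(target_ratios, control_ratios)
```

Pairing by example controls for the large variation in baseline perplexity across sentences, so the test responds to the per-example gap between targeted and random ablation rather than to differences between sentences.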
Chunk 78 · 1,998 chars
g Hindi inference (PPL_target ratio = 7.74), while overlap neurons produce modest effects across both models.

contributes to the drastic increase in perplexity for Llama-3-8B.

G.5 Causal Intervention Results: Gemma-2-2B

We repeat the same analyses on Gemma-2-2B to assess cross-model consistency.

G.5.1 Shuffling-Based Zero Ablation

Table 7 reports shuffling-based interventions for Gemma-2-2B. As in Llama, ablation of overlap neurons produces the strongest disruptions across languages. English, French, and Chinese show large increases in perplexity relative to random controls, while Hindi exhibits weaker but directionally consistent effects. Paired tests confirm that these differences are statistically significant in nearly all cases.

Only-unshuffled neurons again yield weaker and more variable effects. In several languages, ablation produces smaller changes than random controls or even reduces perplexity, reinforcing the conclusion that these neurons encode word-order-level regularities rather than load-bearing structure, and that the robust shuffling-overlap neuron sets are more correlated with orthographic and subword-level structures.

Qualitative Effects Under Shuffling. Figure 33 presents representative Hindi and Chinese generations. Similar to Llama, overlap-neuron ablation induces script changes within words, including partial Latin insertions and mixed-script morphemes. Crucially, such intra-word script violations do not appear under random or only-unshuffled ablations, indicating that overlap neurons support low-level orthographic coordination during decoding. More importantly, fluency is not lost when the overlap neurons are ablated, indicating that these neurons are not responsible for syntactic behavior.

G.5.2 Romanization-Based Cross-Language Mean Ablation

Table 8 summarizes romanization-based interventions. Only-native neurons exhibit the largest sensitivity: English-to-Hindi replacement causes large perplexity increases,
Chunk 79 · 1,994 chars
rlapping neurons, indicating that these neurons are not responsible for syntactic behavior.

G.5.2 Romanization-Based Cross-Language Mean Ablation

Table 8 summarizes romanization-based interventions. Only-native neurons exhibit the largest sensitivity: English-to-Hindi replacement causes large perplexity increases, while Hindi-to-English replacement often yields perplexity reductions. As in Llama, inspection of generations reveals frequent language switching in the latter case, explaining the apparent improvement. Overlap neurons again show smaller and more symmetric effects, consistent with a script-invariant functional role.

G.6 Summary

Across both models and perturbation regimes, a consistent causal pattern emerges:
• Shuffling-overlap neurons – defined by invariance to shuffling – form a causally necessary backbone supporting stable, script-consistent generation. They are causally tied not to fluency but to script and subword-level regularity. Hence, the features tied strongly to script are more causally important for generation.
• Romanization-overlap neurons – defined by invariance to romanization – are largely script-insensitive. This suggests that representations that are not tied to script are not causally important in generation.
• Only-unshuffled neurons encode order-sensitive surface regularities that are largely redundant for orthographic structure.
• Only-native neurons anchor script-specific realization, and their disruption induces language switching rather than structured degradation. Again, this reinforces the hypothesis that script-related neurons are most causally important.

The convergence of quantitative metrics and qualitative failure modes across Llama and Gemma indicates that invariance-based neuron identification isolates functionally meaningful components of multilingual language models.

H Extended Scaling and Semantic Competence Analysis

This appendix provides a comprehensive evaluation of our findings on
Chunk 80 · 1,992 chars
ve metrics and qualitative failure modes across Llama and Gemma indicates that invariance-based neuron identification isolates functionally meaningful components of multilingual language models.

H Extended Scaling and Semantic Competence Analysis

This appendix provides a comprehensive evaluation of our findings on larger model architectures: Llama-3-8B and Gemma-2-9B. We analyze whether the observed representational fragmentation is a symptom of limited capacity or data sparsity, or whether it persists as a stable architectural trait at scale.

H.1 Translation Performance on Romanized Inputs

To rule out the hypothesis that low representational overlap is an artifact of undertraining or data sparsity, we evaluate translation performance from native and romanized inputs to English.

Experimental Setup. We use the dev split of FLORES+ in an 8-shot setting. Translation quality is measured using BERTScore (roberta-large) against gold English references across 10 diverse languages.

Results and Discussion. As shown in Table 9, larger models achieve robust translation performance on romanized inputs. For instance, Llama-3-8B obtains BERTScores of 0.67 for Romanized Russian and 0.42 for Hindi, confirming the models have learned the semantics of romanized text and are not treating it as out-of-distribution noise.

| Lang | Category | PPL_target ratio | PPL_ctrl ratio | p (ratio) | ΔPPL_target | ΔPPL_ctrl | p (Δ) |
|---|---|---|---|---|---|---|---|
| en | overlap | 3.045 | 0.312 | 3.5×10⁻⁸¹ | +588.9 | −191.2 | 2.5×10⁻³⁹ |
| en | only-unshuffled | 0.799 | 0.312 | 2.8×10⁻¹³⁰ | −56.5 | −191.2 | 1.1×10⁻⁴⁷ |
| hi | overlap | 1.109 | 0.953 | 0.173 | −34.2 | −7.6 | 1.0×10⁻³ |
| hi | only-unshuffled | 0.397 | 2.557 | 8.1×10⁻⁵⁰ | −102.7 | +289.9 | 3.6×10⁻²² |
| fr | overlap | 1.547 | 0.952 | 1.2×10⁻⁸⁹ | +116.1 | −10.0 | 4.4×10⁻³⁶ |
| fr | only-unshuffled | 1.260 | 1.403 | 3.1×10⁻⁴⁰ | +56.3 | +90.4 | 4.4×10⁻²⁴ |
| zh | overlap | 0.842 | 0.150 | 2.0×10⁻⁵² | −58.8 | −241.5 | 1.3×10⁻⁵⁶ |
| zh | only-unshuffled | 0.082 | 3.152 | 6.0×10⁻⁷⁶ | −259.8 | +645.7 | 8.5×10⁻³⁸ |

Table 7: Causal zero-ablation results for shuffling-derived neuron sets
Chunk 81 · 1,999 chars
(Gemma-2-2B, raw). Values report means over the first 100 FLORES+ examples per language. Control sets consist of matched random neurons with identical cardinality.

| Lang | Prompt Prefix | Clean Continuation | Patched Continuation (overlap-set zero ablation) | Inference |
|---|---|---|---|---|
| hi | सोमवार को, स् टैनफ़ोड यूनवसटी स् कूल ऑफ़ मेडसन के वैज्ञ ानकों ने एक नए डायग् नोिस् टक उपकरण… | इस उपकरण का उपयोग करने वाले वैज्ञ ानकों ने यह भी दावा कया है क यह… | “यह एक बहुत ही простой, सस् ती, और ﻞﺑﺎﻗ دﺎﻤﺘﻋا ﺖﺑا कार है,” Dr | Intra-word script mixing (Devanagari + Cyrillic + Arabic) |
| hi | शोधकताओं ने कहा है क यह अल् प आय वाले देशों में कैंसर, टीवी, एचआईवी और मलेरया… | अमेरकी शोधकताओं ने कहा है क यह अल् प आय वाले देशों … | “यह एकमात्र tool है, जो कैंसर, HIV, मलेria और…” | English characters injected inside Hindi words |
| hi | कातलान की राजधानी (बासलोना) में जाने के बाद से, वडाल ने क् लब के लए 49 गेम खेले थे। | उन् हों ने 2008–09 सीज़न में 12 गोल कए थे | 2011–12 UEFA Champions League Final में, वडal hat-trick | Partial Latin substitution inside proper noun |
| hi | व् हाइटहॉल पर लगभग 11:00 बजे स् थानीय समय (यूटीसी +1) पर वरोध शुरू हुआ… | वरोधयों ने एक-दूसरे को हलाते हुए… | “यह एकमात्र वoher मैं आ सकता हूं,” Prime Minister… | Orthographic corruption inside word |
| hi | “पनामा पेपस” पनामा की कानूनी फम मोसाक फों सेका से लगभग दस मलयन दस् तावेजों … | इन दस् तावेजों में से कुछ ने दुनया भर में… | “पनाma पेpers” के 11 मलयन दस् तावेजों में | Cross-script character insertion within named entity |
| hi | दस् तावेजों से पता चलता है क चौदह बैंकों ने अमीर ग्र ाहकों को… | अमेरकी बैंकों ने अरबों डॉलर की संप… | Amri Bank, Citibank, HSBC, JP Morgan Chase | Sudden switch to English entities mid-sentence |
Chunk 82 · 1,996 chars
| Lang | Prompt Prefix | Clean Continuation | Patched Continuation (overlap-set zero ablation) | Inference |
|---|---|---|---|---|
| hi | इराक की अबू ग़रीब जेल को दंगे के दौरान आग के हवाले कर दया गया। | इस जेल में 1000 से अधक कैदयों को… | Abu garib jail, Iraq | Script change aligned with entities but breaks Hindi flow |
| hi | पूवानुमान कहते हैं क यह तूफ़ान केप वड द् वीपों के पिचम में… | क् या तूफ़ान केप वड द् वीपों में आए हैं? | क् या तूफ़ान केप वर्de 2021 केप वर्de | English letters injected inside Devanagari word |
| hi | ऑस्ट्रे लया के ट्रे ज़रार और लबरल पाट के नेता… | उन् हों ने कहा क न् यूिलयर पॉवर इंडस्ट्र ी को… | उनका यह ﻊﺗalm है, “यह एकमात्र راه है…” | Multiscript blending within same clause |
| zh | 周一,斯坦福大学医学院的科学家宣布,他们发明了一种可以将细胞按类型分类的新型诊断工具… | 斯坦福大学医学院的科学家们开发了一种新型的微型芯片,可以将细胞… | 斯坦ford University Medical School scientists announced Monday… | Mid-word switch to English generation |
| zh | 自从转会到加泰罗尼亚的首府球队,维达尔已经为俱乐部踢了 49 场比赛。 | 在 2019–2020 赛季,维达尔在… | 在 2019/2020 ฤดูกาล, 维da | Thai and Latin characters injected inside Chinese |
| zh | “巴拿马文件” 是巴拿马莫萨克·冯塞卡律师事务所约 1,000 万份文件的总称… | 这些文件记录了 2000 年至 2015 年间… | “巴拿马文件” 是巴拿马莫萨ck Fonseca 律师事务所 | Script split inside named entity |
| zh | 谢长廷还声称马英九虽然很上镜,但中看不中用。 | 谢长廷还说,马英九的“中看不中用”… | 他认为,马英九的“上镜”和“上 manship”是两码 | English morpheme inserted into Chinese word |
| zh | 不过,一位知情人士透露,凶手是红湖部落主席之子… | 据悉,弗洛伊德·乔丹是红湖部落… | 路易斯·乔dan(Louis Jourdain)被控… | Partial Latin insertion inside Chinese name |

Figure 33: Qualitative examples of model behavior under shuffling-based overlap ablation (Gemma-2-2B, raw). Ablation of overlap neurons induces systematic script mixing, including partial Latin insertions and mixed-script morphemes occurring within words. Such orthographic violations are not observed for random or only-normal neuron ablations.

Critically, despite this functional competence, representational overlap remains comparatively low (∼0.25 for Llama-3-8B vs. ∼0.11 for
Chunk 83 · 1,985 chars
Llama-3.2-1B), which is significantly lower than the

| Lang | Category | PPL_target ratio | PPL_ctrl ratio | p (ratio) | ΔPPL_target | ΔPPL_ctrl | p (Δ) |
|---|---|---|---|---|---|---|---|
| en | only-native | 5.208 | 1.405 | 2.2×10⁻⁶⁵ | +1222.6 | +114.6 | 8.2×10⁻³⁵ |
| en | overlap | 0.899 | 0.822 | 6.2×10⁻⁹⁶ | −28.3 | −50.0 | 3.6×10⁻⁴³ |
| hi | only-native | 0.684 | 1.136 | 1.2×10⁻⁵⁴ | −55.9 | +24.1 | 2.4×10⁻²⁴ |
| hi | overlap | 2.228 | 1.036 | 5.8×10⁻⁴⁵ | +228.0 | +6.5 | 7.1×10⁻²¹ |

Table 8: Cross-language mean ablation results for romanization-derived neuron sets (Gemma-2-2B, raw). Rows indicate forward-pass language; mean activations are taken from the opposite language. Results are averaged over the first 100 FLORES+ examples.

| Language | Llama-3.2-1B Nat | Llama-3.2-1B Rom | Llama-3-8B Nat | Llama-3-8B Rom | Gemma-2-2B Nat | Gemma-2-2B Rom | Gemma-2-9B Nat | Gemma-2-9B Rom |
|---|---|---|---|---|---|---|---|---|
| Bengali | 0.40 | 0.08 | 0.67 | 0.23 | 0.60 | 0.11 | 0.72 | 0.33 |
| Bulgarian | 0.66 | 0.36 | 0.77 | 0.62 | 0.74 | 0.46 | 0.79 | 0.69 |
| Chinese | 0.60 | 0.12 | 0.69 | 0.29 | 0.68 | 0.12 | 0.72 | 0.33 |
| Hindi | 0.58 | 0.14 | 0.72 | 0.42 | 0.69 | 0.22 | 0.76 | 0.53 |
| Japanese | 0.54 | 0.08 | 0.67 | 0.14 | 0.65 | 0.11 | 0.71 | 0.19 |
| Korean | 0.53 | 0.08 | 0.68 | 0.14 | 0.64 | 0.10 | 0.71 | 0.20 |
| Marathi | 0.46 | 0.10 | 0.66 | 0.23 | 0.58 | 0.12 | 0.72 | 0.33 |
| Russian | 0.68 | 0.42 | 0.75 | 0.67 | 0.73 | 0.52 | 0.76 | 0.72 |
| Spanish | 0.68 | 0.67 | 0.73 | 0.73 | 0.72 | 0.71 | 0.74 | 0.74 |
| Urdu | 0.47 | 0.07 | 0.68 | 0.19 | 0.60 | 0.10 | 0.73 | 0.32 |

Table 9: Translation performance (BERTScore, roberta-large) from Native (Nat) and Romanized (Rom) inputs to English (8-shot). High BERTScores on romanized inputs indicate genuine semantic competence, suggesting that low neuron overlap is not a result of data sparsity.
Chunk 84 · 1,999 chars
[Figure 34 plot. x-axis: language (Bengali, Bulgarian, Chinese, Hindi, Japanese, Korean, Marathi, Russian, Spanish, Urdu); y-axis: Jaccard Similarity; series: SAE Features (vs Native), SAE Features (vs English), Raw Neurons (vs Native), Raw Neurons (vs English)]

Figure 34: Jaccard similarity between romanized and native-script or English units in Llama-3-8B. Representational isolation persists at the 8B scale.

∼0.60 overlap observed under word-order shuffling. This dissociation between competence and alignment indicates that processing different scripts through disjoint subspaces is a representational choice rather than a symptom of limited competence. This pattern persists even for high-resource languages like Russian and Spanish, where romanized performance nearly matches native-script performance, effectively ruling out undertraining as the sole explanation for representational fragmentation.

H.2 Script Fragmentation at Scale

We extend the representational overlap analysis to larger models to test whether increased parameter counts facilitate script unification.

Global Overlap Trends. Figures 34 and 35 report Jaccard similarities for Llama-3-8B and Gemma-2-9B. Across both models, romanized inputs maintain near-zero overlap with English and consistently low overlap with their native-script counterparts. This indicates that even with increased capacity, models do not converge toward a unified, script-invariant representation.

[Figure 35 plot. x-axis: language (Bengali, Bulgarian, Chinese, Hindi, Japanese, Korean, Marathi, Russian, Spanish, Urdu); y-axis: Jaccard Similarity; series: SAE Features (vs Native), SAE Features (vs English), Raw Neurons (vs Native), Raw Neurons (vs English)]

Figure 35: Jaccard similarity between romanized and native-script or English units in Gemma-2-9B. Fragmentation remains a dominant feature despite increased model capacity.

Layer-wise Persistence. Figures 36 and 37 illustrate layer-wise alignment. While a slight increase in overlap is observed in middle
Chunk 85 · 1,996 chars
layers, alignment remains far from convergence across the entire depth of the network. This indicates that representational separation is a stable trait that persists even as models become larger and more competent.

[Figure 36 plot. x-axis: Layer (ticks 1 to 31); y-axis: Jaccard Similarity (0.0 to 1.0); series: SAE Features, Raw Neurons]

Figure 36: Layer-wise alignment in Llama-3-8B. Mid-layer increases do not lead to cross-script convergence.

[Figure 37 plot. x-axis: Layer (ticks 1 to 42); y-axis: Jaccard Similarity (0.0 to 1.0); series: SAE Features, Raw Neurons]

Figure 37: Layer-wise alignment in Gemma-2-9B, showing consistent representational separation in raw neurons across depth.

H.3 Structural Robustness at Scale

Finally, we examine whether larger models maintain the high robustness to structural (word-order) perturbations observed in 1B and 2B models.

High Overlap Under Shuffling. As shown in Figure 38, both Llama-3-8B and Gemma-2-9B exhibit consistently high Jaccard overlap between units identified from original and shuffled inputs. This indicates that the models' reliance on token-level and distributional cues (rather than strict syntactic order) holds across the scales tested.

[Figure 38 plot. Panels: (a) Llama-3-8B, (b) Gemma-2-9B; x-axis: language (Bulgarian, Chinese, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish, Vietnamese); y-axis: Jaccard Similarity (0.0 to 1.0); series: SAE Features, Raw Neurons]

Figure 38: Jaccard similarity between units from
Chunk 86 · 469 chars
original and shuffled text at scale. Robustness to word-order perturbation remains consistently high across larger architectures.
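
As an illustration of the overlap measure reported in Figure 38, the following minimal Python sketch shuffles the words of an input and computes the Jaccard similarity between two sets of units. It is not the pipeline used in this work. The function names `shuffle_words` and `jaccard`, the whitespace tokenization, and the example unit sets (written as `(layer, unit_index)` pairs) are all assumptions made for illustration. Whitespace splitting in particular would not suit unsegmented scripts such as Chinese, Japanese, or Thai, and the selection of language-associated units (e.g., via LAPE) is not reproduced here.

```python
import random


def shuffle_words(sentence: str, seed: int = 0) -> str:
    """Return the sentence with its whitespace-separated tokens in random order."""
    tokens = sentence.split()
    rng = random.Random(seed)  # local RNG so the perturbation is reproducible
    rng.shuffle(tokens)
    return " ".join(tokens)


def jaccard(units_a: set, units_b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; taken to be 1.0 when both sets are empty."""
    union = units_a | units_b
    if not union:
        return 1.0
    return len(units_a & units_b) / len(union)


# Hypothetical unit sets, written as (layer, unit_index) pairs. In practice these
# would come from running a unit-selection procedure on original and shuffled text.
units_original = {(3, 17), (3, 42), (10, 5), (21, 128), (21, 300)}
units_shuffled = {(3, 17), (3, 42), (10, 5), (21, 128), (25, 9)}

overlap = jaccard(units_original, units_shuffled)
print(f"Jaccard overlap: {overlap:.3f}")  # 4 shared units out of 6 distinct -> 0.667
```

A value near 1 means that almost the same units are selected before and after shuffling, which is the pattern Figure 38 describes for word-order perturbation. A value near 0 corresponds to the near-disjoint unit sets reported for romanized inputs.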