Deep Language Geometry: Constructing a Metric Space from LLM Weights
Summary
This paper introduces a novel framework for constructing a metric space of languages using the internal weight activations of Large Language Models (LLMs). Unlike traditional methods that rely on hand-crafted linguistic features, the approach automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. These vectors capture intrinsic language characteristics and are transformed into a binary format to reduce storage requirements. A Hamming distance metric is then applied, and the high-dimensional space is projected into a lower-dimensional Euclidean space using classical multidimensional scaling. The method is validated across 106 languages using three multilingual LLMs and three datasets, revealing language clusters that align with established linguistic families while also uncovering unexpected inter-language connections, potentially indicating historical contact or evolution. The results show that the metric space supports meaningful clustering, with higher accuracy for finer-grained language groups. The authors also explore transfer learning but find no significant improvements in model performance. The work is open-sourced, providing language vectors, visualization tools, and analysis resources for further research. Limitations include computational expense and potential biases from the source models.
Maksym Shamrai
Institute of Mathematics of NASU, MacPaw
Kyiv, Ukraine
mshamrai@macpaw.com

Vladyslav Hamolia
MacPaw
Kyiv, Ukraine
vhamolya@macpaw.com

Abstract
We introduce a novel framework that utilizes the internal weight activations of modern Large Language Models (LLMs) to construct a metric space of languages. Unlike traditional approaches based on hand-crafted linguistic features, our method automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. Our approach captures intrinsic language characteristics that reflect linguistic phenomena. We validate our approach across diverse datasets and multilingual LLMs, covering 106 languages. The results align well with established linguistic families while also revealing unexpected inter-language connections that may indicate historical contact or language evolution. The source code, computed language latent vectors, and visualization tool are made publicly available at https://github.com/mshamrai/deep-language-geometry.

1 Introduction
Languages are complex systems with rich internal structures and dynamic evolution. Traditional linguistic classifications based on typological features, historical migration patterns, or lexical similarity have long served to group languages into families such as Indo-European, Uralic, and Turkic. However, these approaches typically capture only historical or static aspects of language similarity, potentially overlooking modern linguistic influences driven by technology and globalization. In an era where language use and structure are continuously reshaped, it is timely to develop methods that automatically capture both historical and current linguistic characteristics.
Recent advances in Natural Language Processing (NLP) have been largely driven by Large Language Models (LLMs), which have demonstrated remarkable capabilities in language modeling and a wide range of linguistic tasks (Devlin et al., 2018; Radford et al., 2019). These models, trained on vast multilingual corpora, learn representations that implicitly encode a wide variety of lexical, syntactic, and even phonological properties (Conneau et al., 2019).

Building on prior work (Shamrai, 2024), which empirically shows that the internal activations of LLM weights vary with the language of the input data, we hypothesize that the internal weights of LLMs encode valuable information about inter-language similarity and can serve as a foundation for quantifying relationships between languages.

Therefore, in this work, we propose a novel approach for constructing a metric space of languages by leveraging the weights of modern LLMs. Our method extracts high-dimensional vector representations from LLM weight activations, where the distance between any two vectors reflects the similarity between the underlying linguistic structures. Activations encode patterns of co-occurrence and contextual relationships specific to each language's grammar and lexical properties.

We construct a metric space $(X, d_h)$, where $X$ is the set of high-dimensional language vectors and $d_h$ is the Hamming distance between them. We then design a distance-preserving mapping that projects these high-dimensional vectors into a low-dimensional space $(Y, d_e)$, where distances are induced from the Euclidean ($L_2$) norm. This transformation provides deeper insight into the latent structures encoded by LLMs.
Furthermore, we calculate this latent representation for 106 languages. This makes it possible to visualize, cluster, and analyze the relationships between the languages. Our code, computed language latent vectors, and analysis tool are made publicly available. They are designed to assist researchers and practitioners in linguistic analysis and offer a valuable resource for further linguistic investigation.

The contributions of this work are as follows:
• We introduce a novel approach that constructs a metric space of languages using LLM weights and apply it to 106 languages, enabling automatic and data-driven measurement of linguistic distances.
• We demonstrate that the derived metric space supports meaningful clustering of languages, reflecting both historical relationships and modern linguistic features.
• We fully open-source our work along with a tool for preliminary analysis.

While not claiming linguistic expertise, this study introduces a novel toolset intended to support linguistic research. It offers a fresh view of language similarity by exploiting the latent knowledge embedded in LLMs.

arXiv:2508.11676v1 [cs.CL] 8 Aug 2025

2 Related Work
The quantification of language similarity has a rich history, beginning with early lexical approaches. Pioneering work (Swadesh, 1952) established methods for comparing languages using shared cognates, a practice later refined by Holman et al. (2011), who employ normalized Levenshtein distances over fixed word lists. Although these lexical methods have been successfully used to construct language family trees, they are handcrafted and require manual effort to select and curate appropriate word lists and features.
Also, resources such as the World Atlas of Language Structures (Dryer and Haspelmath, 2005) offer comprehensive typological data that allow languages to be represented as feature vectors. Distance measures computed over these vectors have been shown to reveal groupings consistent with established genetic relationships (O’Horan et al., 2016; De Gregorio et al., 2024). However, these methods are limited by the quality and coverage of available databases, their reliance on expert-curated features, and their inability to fully capture language-specific variations or recent evolutionary trends.

Phonological properties offer another valuable dimension for language comparison. Studies utilizing phoneme inventory data from resources like PHOIBLE (Moran et al., 2014) demonstrate that phonological distances, often measured by overlap indices such as the Jaccard similarity, can capture both genetic relationships and areal phenomena. However, phonological methods require reliable phoneme lists, are sensitive to how sounds are transcribed, and often miss language structure beyond the sound system.

Recent deep-learning work has popularised embedding-based measures of language distance. Multilingual encoders such as mBERT (Devlin et al., 2018), XLM-R (Conneau et al., 2020) and LASER (Artetxe and Schwenk, 2019; Heffernan et al., 2022) produce contextual token embeddings that implicitly encode lexical, syntactic and semantic features. LASER is trained to output a single sentence vector directly, whereas mBERT and XLM-R require a pooling step (e.g., mean pooling or the [CLS] token) to obtain a sentence-level embedding.
When sentence embeddings are averaged over large, balanced corpora, the resulting language-level representations have proved useful for quantifying cross-lingual similarity (Rama et al., 2020). However, because the underlying encoders operate at the token (and therefore sentence) level, their effectiveness still depends on corpus size and domain balance.

Overall, the literature on language distance metrics has evolved from classical lexicostatistical methods and handcrafted feature extraction to sophisticated neural representations. Each approach offers valuable insights into the relationships between languages, but they often suffer from labor-intensive preprocessing, limited database coverage, or sensitivity to input variations. This motivates our approach: rather than relying on manually curated features or sentence-based embeddings, we propose an automatic, data-driven method that leverages the internal weights of modern LLMs to construct a metric space of languages. Moreover, to the best of our knowledge, no study has attempted to derive a language metric space from decoder-only LLMs. The method introduced here is therefore the first to use weight-level signals in causal transformers for measuring cross-language similarity.

3 Methodology
The main hypothesis in this work is that Large Language Models are a good choice to measure internal language structure since they are trained to model languages. Formally, language modeling is typically framed as maximizing the log-likelihood of the observed sequence of tokens. Let $x_1, x_2, \ldots, x_T$ represent a sequence of tokens, where $x_t \in V$ and $V$ is the vocabulary.
The objective is to maximize the likelihood of the sequence under the model's parameters $\theta$:
$$L(\theta) = \sum_{t=1}^{T} \log p(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta),$$
where $p(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)$ is the conditional probability of the token $x_t$ given the previous tokens, modeled by a neural network or another probabilistic model.

3.1 Weight Importance Metric
We begin by revisiting classical pruning approaches such as Optimal Brain Damage (LeCun et al., 1989), which motivate the rationale behind our approach.
The typical pruning objective is to minimize the error introduced by approximating the original weight matrix. Consider the following objective function:
$$E = \lVert WX - \hat{W}X \rVert_2^2 \to \min, \tag{1}$$
where $W$ is the original weight matrix of a layer, $\hat{W}$ is the pruned (sparse) weight matrix, and $X$ is the input to that layer.
The variation of the error $E$ for a weight row $w$ can be expressed as:
$$\delta E = \left(\frac{\partial E}{\partial w}\right)^{\top} \delta w + \frac{1}{2}\, \delta w^{\top} H\, \delta w + O(\lVert \delta w \rVert^3),$$
where $H \equiv \frac{\partial^2 E}{\partial w^2}$ is the Hessian matrix. At a local minimum of the training error, we have $\frac{\partial E}{\partial w} \approx 0$, and higher order terms are neglected.
Our goal is to set one of the weights, say $w_q$, to zero while minimizing the increase in error. This introduces the constraint:
$$e_q^{\top} \delta w + w_q = 0,$$
where $e_q$ is the $q$th standard basis vector. Thus, the optimization problem in Equation (1) can be reformulated as:
$$\min_{\delta w} \frac{1}{2}\, \delta w^{\top} H\, \delta w, \quad \text{s.t.} \quad e_q^{\top} \delta w + w_q = 0. \tag{2}$$
This constrained problem can be solved using Lagrange multipliers. For the detailed derivation see Appendix A. The resulting increase in error is given by:
$$E_q = \frac{1}{2} \cdot \frac{w_q^2}{e_q^{\top} H^{-1} e_q}. \tag{3}$$
By computing $E_q$ for every weight $w_q$, one can prune the weight that causes the smallest increase in error, thereby minimally affecting the layer's output.
Intuitively, this means we identify which weights are most critical for the model's performance on a specific language. Weights with high importance scores are those whose removal would substantially degrade the model's ability to predict tokens in that language.
SparseGPT (Frantar and Alistarh, 2023) adopts this idea within an LLM pruning algorithm. They compute the importance metric $S_{ij}$ for a layer as follows (Sun et al., 2023):
$$S_{ij} = \left[ \frac{|W|^2}{\operatorname{diag}\left( (X^{\top} X + \lambda I)^{-1} \right)} \right]_{ij}. \tag{4}$$
As in SparseGPT, we build $X$ per linear sublayer by stacking the pre-activation hidden states of a small calibration set into an $N \times d_{\text{in}}$ matrix. For a weight matrix $W$ the local Hessian is $H = X^{\top} X$, and we invert $X^{\top} X + \lambda I$ once per layer. Thus, Equation (4) is simply a matrix-valued, regularised version of the scalar error-increase criterion in Equation (3).
Shamrai (2024) suggests that the SparseGPT algorithm provides statistically stable results across different LLMs and data subsets in a language-specific setting. Therefore, in our work, we adopt the algorithm to compute weight importance vectors.

3.2 Rationale Behind the Approach
By definition, $S_{ij}$ quantifies the importance of weight $W_{ij}$ for a given input. In our approach, we estimate the importance of the weights for a specific language by using datasets in that language. Consequently, $S_{ij}$ reflects the contribution of each weight to language modeling.
Assuming that the network is well-trained on language modeling, higher $S$ scores indicate greater contribution. If two languages yield similar patterns of important weights, it suggests that they are similar in terms of language modeling characteristics.
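The importance score of Equation (4) can be sketched for a single linear layer as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `importance_scores`, the default value of `lam`, and the random toy inputs are all assumptions made for the example.

```python
import numpy as np


def importance_scores(W, X, lam=1e-2):
    """Hypothetical helper: S_ij = W_ij^2 / [(X^T X + lam * I)^{-1}]_jj.

    W: (d_out, d_in) weight matrix of one linear layer.
    X: (n_tokens, d_in) calibration inputs stacked row-wise.
    lam: damping term (the value here is an assumption).
    """
    d_in = W.shape[1]
    H = X.T @ X + lam * np.eye(d_in)          # regularised local Hessian
    H_inv_diag = np.diag(np.linalg.inv(H))    # one value per input column
    return W**2 / H_inv_diag                  # broadcasts across rows of W


# Toy data, invented for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
X = rng.normal(size=(32, 6))
S = importance_scores(W, X)
```

Each column of `W` shares one denominator, which is why a single inversion per layer suffices.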
3.3 Constructing a Metric Space
To derive a vector representation from the importance metric, we treat the importance scores as coordinates in a high-dimensional space. Specifically, we define the vector
$$v = \left( S^0_{00}, S^0_{01}, \ldots, S^k_{ij}, \ldots, S^l_{nm} \right) \in \mathbb{R}^N,$$
where the set $\{W^k\}_{k=0}^{l}$ consists of weight matrices $W^k \in \mathbb{R}^{n_k \times m_k}$ for each layer $k$, and $N$ is the total number of parameters in the chosen LLM. In other words, the vector $v$ is obtained by flattening and concatenating all the importance matrices $S^k$ corresponding to each layer.
There are two challenges with using the raw importance matrix $S$ to form this vector representation:
1. The importance scores are not normalized across layers, meaning that they are only meaningful within the context of a single layer.
2. The resulting vector is high-dimensional, with each dimension represented by a floating-point number (typically 16 bits), leading to large memory requirements.
To address both challenges, we propose a thresholding approach analogous to binary quantization. Specifically, we assign a value of 1 only to the most important weights by thresholding $S_{ij}$ at the median of its importance matrix $S$:
$$\hat{S}_{ij} = \mathbb{1}\left[ S_{ij} > \operatorname{median}(S) \right].$$
This binary representation requires only 1 bit per value, a 16-fold reduction in storage compared to 16-bit floating-point representations.
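The thresholding and concatenation steps above can be sketched as follows. This is an illustrative NumPy example under stated assumptions: the median is taken separately for each layer's matrix, the helper names `binarize_layer` and `language_vector` are hypothetical, and the two tiny score matrices are invented.

```python
import numpy as np


def binarize_layer(S):
    """Mark entries strictly above the layer's own median as 1 (True)."""
    return (S > np.median(S)).ravel()


def language_vector(layer_scores):
    """Flatten, binarize, and concatenate per-layer scores, then pack 8 bits per byte."""
    bits = np.concatenate([binarize_layer(S) for S in layer_scores])
    return np.packbits(bits), bits.size


# Two toy "layers" with made-up importance scores.
layer_scores = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),  # median 2.5 -> bits 0, 0, 1, 1
    np.array([[10.0, 0.0, 5.0]]),        # median 5.0 -> bits 1, 0, 0
]
packed, n_bits = language_vector(layer_scores)
print(n_bits, packed)
```

`np.packbits` pads the last byte with zeros, so the true bit count is returned alongside the packed array.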
Let $X$ denote the set of language vectors (one per language) of length $N$. We then define a metric space on $X$ using the Hamming distance (i.e., the number of set bits in the XOR of two vectors) as the metric.
For $x, y \in X$ the Hamming distance is
$$d_h(x, y) = \sum_{i=1}^{N} \mathbb{1}\left[ x_i \neq y_i \right],$$
where $\mathbb{1}[\cdot]$ is the indicator function.
The function $d_h$ is non-negative, symmetric, equals 0 iff $x = y$, and satisfies the triangle inequality; therefore, $(X, d_h)$ is a metric space.
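As a small illustration of the Hamming metric on bit-packed vectors, the sketch below XORs two packed arrays and counts the set bits. The function names and the three 8-bit toy vectors are assumptions for the example; real language vectors have one bit per model parameter.

```python
import numpy as np


def hamming_distance(a, b):
    """Number of differing bits between two equally long packed uint8 arrays."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(vectors)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = hamming_distance(vectors[i], vectors[j])
    return D


# Invented toy vectors.
x = np.packbits([1, 0, 1, 1, 0, 0, 1, 0])
y = np.packbits([1, 1, 1, 0, 0, 0, 1, 1])
z = np.packbits([0, 1, 1, 0, 0, 1, 1, 1])
D = distance_matrix([x, y, z])
print(D)
```

Zero padding added by `np.packbits` is identical in every vector, so it never contributes to the distance.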
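Algorithm 1 Torgerson Scaling (Classical MDS)
Require: Distance matrix $D \in \mathbb{R}^{n \times n}$, $n = |X|$
Ensure: Coordinates $Y \in \mathbb{R}^{n \times d}$ representing points in $d$ dimensions
1: $J \leftarrow I_n - \frac{1}{n} \mathbf{1}_n$ ▷ Compute centering matrix ($\mathbf{1}_n$ is the $n \times n$ all-ones matrix)
2: $D^{(2)} \leftarrow D \odot D$ ▷ Element-wise square of $D$
3: $B \leftarrow -\frac{1}{2} J D^{(2)} J$ ▷ Compute Gram matrix
4: $(\lambda, V) \leftarrow \operatorname{eigh}(B)$ ▷ Compute the eigen-decomposition of $B$
5: $(\lambda, V) \leftarrow \operatorname{sort}((\lambda, V))$ ▷ Sort eigenvalues in descending order and reorder eigenvectors accordingly
6: $d \leftarrow \#\{\lambda_i \mid \lambda_i > \epsilon\}$ ▷ Select dimensions with significant eigenvalues ($\epsilon \approx 10^{-10}$)
7: $L \leftarrow \operatorname{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_d})$
8: $V_d \leftarrow [v_1, v_2, \ldots, v_d]$
9: return $Y \leftarrow V_d L$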
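Algorithm 1 translates almost line by line into NumPy. The sketch below is a minimal version under the assumption that `D` is a dense symmetric matrix. The function name `classical_mds` and the four rectangle corners are illustrative choices, not part of the paper.

```python
import numpy as np


def classical_mds(D, eps=1e-10):
    """Torgerson scaling: embed a distance matrix into Euclidean coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D * D) @ J                 # Gram matrix from squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eps                       # keep only significant dimensions
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


# Toy check: corners of a 3 x 4 rectangle, which are exactly embeddable in 2-D.
P = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(D)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

When the Gram matrix has negative eigenvalues, those directions are dropped by the threshold, and the embedding then reproduces the input distances only approximately.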
3.4 Isometry via Dimensionality Reduction
Even after quantization, the binary vectors remain high-dimensional due to the large number of model parameters, making distance computations and other latent space applications computationally expensive. To address this, we construct an isometry, a transformation that preserves distances between points when mapping from one metric space to another.
In our experiments, we employ different LLMs and multiple datasets. We compute the language-by-language distance matrix for each model and dataset, and then average them to obtain a robust distance measure:
$$D_{lk} \in \mathbb{R}^{|X| \times |X|}, \qquad D_{lk} = \{ d_h(v_i, v_j) : v_i, v_j \in X \},$$
$$\hat{D} = \mathbb{E}_{l \sim p_{\text{LLM}}}\, \mathbb{E}_{k \sim p_{\text{data}}} \left[ D_{lk} \right] \approx \frac{1}{nm} \sum_{l=1}^{n} \sum_{k=1}^{m} D_{lk},$$
where $D_{lk}$ is the distance matrix computed for the $l$th LLM and the $k$th dataset, $n$ is the number of LLMs, $m$ is the number of datasets, and $|X|$ is the number of languages.
This averaging process reduces noise and ensures that the final distances are not overly dependent on any particular dataset or model.
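The averaging step can be sketched as below. Because some corpora lack some languages, this example assumes missing pairs are marked with NaN and averaged over the observed entries only. That NaN convention, the function name, and the toy matrices are assumptions for illustration.

```python
import numpy as np


def average_distance_matrices(matrices):
    """Element-wise mean over model/dataset matrices, skipping NaN (unobserved) entries."""
    stacked = np.stack(matrices)          # shape: (n_matrices, n_langs, n_langs)
    return np.nanmean(stacked, axis=0)


nan = np.nan
# Invented 3-language example; the second matrix has no data for the third language.
D1 = np.array([[0.0, 2.0, 4.0], [2.0, 0.0, 6.0], [4.0, 6.0, 0.0]])
D2 = np.array([[0.0, 4.0, nan], [4.0, 0.0, nan], [nan, nan, nan]])
D_hat = average_distance_matrices([D1, D2])
print(D_hat)
```

Entries observed in only one matrix simply keep that matrix's value, so every language pair seen at least once receives a distance.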
Dataset      # Languages in Dataset   # Languages Used in Work
Wikipedia    323                      106
CulturaX     167                      102
fineweb-2    2051                     103
Table 1: Comparison of datasets: Wikipedia, CulturaX, and fineweb-2. The table reports the total number of languages in each dataset and the number of languages used in this work.

We then construct an isometry $f : X \to Y$, where $Y$ is a metric space endowed with the Euclidean metric $d_e(x, y) = \lVert x - y \rVert_2$.
To build $f$, we apply Torgerson scaling (classical multidimensional scaling) (Borg and Groenen, 2007). The result is a set of points $Y \in \mathbb{R}^{|X| \times d}$, where $d$ is the minimum number of dimensions required to preserve the distances in $\hat{D}$ (see Algorithm 1). Notably, $d$ is much smaller than the original dimensionality $N$ of the language vectors and satisfies $d \leq |X|$.
Therefore, our method leverages LLM weights to construct a language vector representation and embed it in a metric space that can be used for the analysis of language similarities.

4 Results
To analyze the metric space of languages, we vary both the clustering and the dimensionality reduction algorithms. In particular, for clustering, HDBSCAN (Campello et al., 2013), k-means (Lloyd, 1982), and predefined linguistic families with their subfamilies are used to highlight the correspondence between the derived metric space and established linguistic classifications. Throughout this paper we adhere to the language classification provided in Hammarström et al. (2024).
For two-dimensional visualizations, we reduce the dimensionality of the language vectors using t-SNE (van der Maaten and Hinton, 2008), UMAP (McInnes et al., 2018), and minimum spanning trees (Pettie and Ramachandran, 2002).
Although all methods yield valuable insights, we include in the main text only the minimum spanning tree (MST) visualizations colored by language families and subfamilies, as they most clearly represent the inter-language relationships. Additional figures are provided in Appendix D and are also available via our open-source tool¹.

4.1 Datasets and Models
In our experiments, we employ three LLMs and three datasets. The models used are Mistral 7B (Jiang et al., 2023), Gemma 3 4B (Team et al., 2025), and Llama 3.2 1B (Grattafiori et al., 2024). All models are multilingual and have been trained on more than 100 languages. Notably, although Llama officially supports only 8 languages, our results indicate that it still produces useful representations for our purposes.
As datasets, we selected those with a high number of languages: Wikipedia (Foundation), CulturaX (Nguyen et al., 2024), and fineweb-2 (Penedo et al., 2024).
We started with a target inventory of 106 languages and attempted to apply the same list across all corpora. Wikipedia contains material for every language in this set, but CulturaX omits Chinese (Traditional), Min Nan Chinese, Scots, and Crimean Tatar, whereas fineweb-2 lacks Chinese (Traditional), English², Serbo-Croatian, and Tagalog. Table 1 lists the total number of languages present in each dataset alongside the subset that could be retained from our 106-language list. For the full list of languages see Appendix B.
To compute the language vectors, we proceed as follows:
1. Calibration data. For every language in each corpus (Wikipedia, CulturaX, fineweb-2) we sample $2^{19} = 524{,}288$ tokens.
2. Weight-importance vectors. For each language–corpus pair and for each LLM (Mistral 7B, Gemma 3 4B, Llama 3.2 1B) we compute a binary weight importance vector whose length matches the model's parameter count, yielding $3(106 + 102 + 103) = 933$ vectors.
3. Distance matrices. Hamming distances between language vectors produce nine language-by-language matrices (one per model–corpus combination).
4. Aggregation. These nine matrices are averaged element-wise over the observed entries to form a single average distance matrix.
5. Embedding. Classical MDS on the average matrix embeds the language space in $\mathbb{R}^{104}$, where Euclidean distance defines the final language metric.

¹ https://huggingface.co/spaces/mshamrai/language-metric-analysis
² For the English subset, we use the fineweb dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb)

4.2 Evaluation of k-means Clustering Against Two Linguistic Categorizations
After we embed the $|X| = 106$ language vectors into $\mathbb{R}^{104}$ via classical MDS, we evaluate the language embeddings using k-means. The resulting partition is compared with two reference label sets: (i) high-level families (18 macro-families) and (ii) primary branches (35 sub-families). The number of clusters in k-means is equal to the number of labels in the reference sets. We compute the following metrics:
• Silhouette score (Rousseeuw, 1987): the mean, over all points, of the normalized difference between a point's average distance to the nearest neighboring cluster and its average distance to its own cluster. Values range from −1 (poor separation) to +1 (well-separated, compact clusters).
• Adjusted Rand Index (ARI) (Hubert and Arabie, 1985): agreement between two partitions, corrected for chance. 1 indicates perfect alignment, 0 indicates random overlap.
• Cluster purity (Schütze et al., 2008): the fraction of data points that share the majority label within their cluster. Values in [0, 1].

Reference          Sil.    ARI     Purity
Macro families     0.047   0.116   0.755
Primary branches   0.056   0.434   0.811
Table 2: Clustering metrics for the k-means solution against two standard language classifications. "Sil." is the internal silhouette score.

Table 2 shows that switching from broad families to primary branches raises the ARI from 0.116 to 0.434 and the purity from 0.755 to 0.811. Therefore, the metric space captures finer-grained language groups and can estimate similarity at a micro level. However, the internal silhouette remains low (about 0.05), meaning many languages lie almost as close to other clusters as to their own.

4.3 Language Trees
A minimum spanning tree (MST) connects all data points in the dataset with the smallest possible total edge weight, where the edge weight corresponds to the distance between language vectors. We employ the Kamada-Kawai layout, a force-directed algorithm where edge lengths are proportional to the distances (Kamada and Kawai, 1989). This layout effectively visualizes the structure and connectivity within the MST, revealing not only the clusters of closely related languages but also links between different language families.
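To make the MST construction concrete, the sketch below builds a minimum spanning tree from a dense distance matrix with Prim's algorithm. The paper does not state which MST algorithm was used, so Prim's algorithm is a stand-in chosen for brevity. The function name and the four-point distance matrix are invented for illustration.

```python
def minimum_spanning_tree(D):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (parent, child, weight) edges, n - 1 in total.
    """
    n = len(D)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge weight into each vertex
    parent = [-1] * n
    best[0] = 0.0               # start growing the tree from vertex 0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, D[parent[u]][u]))
        # Relax edges from the newly added vertex.
        for v in range(n):
            if not in_tree[v] and D[u][v] < best[v]:
                best[v] = D[u][v]
                parent[v] = u
    return edges


# Made-up distances between four hypothetical languages.
D = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]
edges = minimum_spanning_tree(D)
total = sum(w for _, _, w in edges)
print(edges, total)
```

The resulting edge list can be handed to any graph layout routine. Note that two languages may be close in distance without sharing a tree edge, since the MST keeps only one cheapest connection per vertex.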
Figure 1 shows the MST for all languages used in our work. The visualization highlights well-established clusters corresponding to known language families as well as some unexpected connections. For example, Tajik (an Indo-European language) appears linked to a cluster of Turkic languages, which can likely be explained by geographical proximity. Similarly, the branch containing Latvian and Lithuanian is connected to a cluster of Uralic languages, possibly due to regional contact with Finnish and Estonian. A less obvious connection is observed between Turkish and Hungarian, which might be attributed to historical interactions. Additionally, Vietnamese is found to be close to Chinese, despite Vietnamese using the Latin alphabet and Chinese employing logographic characters, indicating that our method captures internal language characteristics beyond mere orthographic features.

Figure 2 focuses on Indo-European and Turkic languages, with coloring based on their primary branches. This figure clearly illustrates that Crimean Tatar, although belonging to the Kipchak branch, is closely connected to Turkish, an Oghuz language. The MST also links English, a Germanic language, directly to Spanish, a Romance language, likely reflecting their close geographic and sociolinguistic contact in the Americas.

One intriguing observation is that Ukrainian does not exhibit a direct connection with Polish in the MST, which is unexpected. However, further analysis reveals that Polish consistently ranks among the top five closest languages to Ukrainian across all models and datasets, coming in third after averaging the distances.

Figure 1: Minimum spanning tree for all languages. Colors represent language families.
Figure 2: Minimum spanning tree for languages from the Indo-European and Turkic families. Colors represent language primary branches.
In summary, the minimum spanning trees reveal logical relationships among languages and their families. In addition, the presence of uncommon connections suggests potential historical contacts or convergent evolution. We leave further investigation of areal influences or language borrowing phenomena to professional linguists.

5 Conclusion
In this work, we introduced a novel framework for constructing a metric space that quantifies language similarity by leveraging the internal weight activations of Large Language Models.
Our approach, based on computing binary vectors from weight importance metrics and reducing their dimensionality via isometric mappings, captures linguistic features. The resulting metric space not only aligns with established linguistic families but also reveals intriguing inter-language connections.
Overall, this study lays the groundwork for a data-driven paradigm in language similarity analysis with significant implications for theoretical linguistics.

Limitations
While our approach offers a novel perspective on constructing a metric space for languages using LLM weight activations, several limitations remain:
1. Computational Expense: Computing the binary vectors is time-consuming. For example, on the Mistral 7B model, generating one binary vector requires approximately 20 minutes on an NVIDIA RTX 3090 GPU.
2. Scalability to Larger Models: We have not yet evaluated the method on LLMs with a significantly higher number of parameters due to resource constraints. It is possible that larger models might yield more accurate or robust representations.
3. Remaining Bias from Source Models: Averaging distances across three LLMs does not eliminate their shared weaknesses. In particular, the metric space can still reflect poor performance on low-resource languages, which may introduce inconsistencies with known language family relationships.

Additionally, we were unable to mathematically or empirically validate that the derived distance metric can serve as an effective guideline for fine-tuning and transfer learning of LLMs. Although the underlying hypothesis suggests that linguistic similarity may enhance the language modeling capabilities through transfer learning between related languages, our preliminary experiments (in which we fine-tuned an LLM on similar languages using various configurations) did not yield statistically significant improvements. This indicates that a more sophisticated approach may be required, and we leave this investigation for future work. For more details see Appendix C.
Another promising direction for future research is to identify which specific weights or layers contribute most to the observed similarities and dissimilarities. It is likely that only a subset of layers significantly influences the metric. By pinpointing these layers, we may reduce computational complexity and accelerate the metric computation without compromising accuracy.

Acknowledgment
We express our gratitude to the Armed Forces of Ukraine for their protection, which has made this research possible.
Ingwer Borg and Patrick JF Groenen. 2007. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media.
Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. 2013. Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining, pages 160–172, Berlin, Heidelberg. Springer Berlin Heidelberg.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. What unsupervised multilingual sentence representations learn about language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4598–4608. Association for Computational Linguistics.
Juan De Gregorio, Raúl Toral, and David Sánchez. 2024. Exploring language relations through syntactic distances and geographic proximity. EPJ Data Science, 13(1):61.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Matthew S. Dryer and Martin Haspelmath, editors. 2005. World Atlas of Language Structures. Oxford University Press, Oxford.
Wikimedia Foundation. Wikimedia downloads.
Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2024. Glottolog database 5.1. https://glottolog.org.
Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2101–2112, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Eric W. Holman, Søren Wichmann, and Christopher H. Brown. 2011. Automated dating of languages. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2010), pages 2052–2058. European Language Resources Association.
Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of classification, 2:193–218.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and 1 others. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Tomihisa Kamada and Satoru Kawai. 1989. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7–15.
Yann LeCun, John Denker, and Sara Solla. 1989. Optimal brain damage. Advances in neural information processing systems, 2.
Stuart Lloyd. 1982. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137.
Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Simon Moran, Daniel McCloy, and Sue Wright. 2014. Phonological similarity and its applications in cross-linguistic phonetics. Journal of Phonetics, 42:20–35.
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4226–4237, Torino, Italia. ELRA and ICCL.
Kris O’Horan, Steven Galle, and Nathalie Schneider. 2016. Syntactic variation in multilingual representations: Investigating cross-lingual transfer. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1–11. Association for Computational Linguistics.
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Martin Jaggi, Leandro von Werra, and Thomas Wolf. 2024. Fineweb2: A sparkling update with 1000s of languages.
Seth Pettie and Vijaya Ramachandran. 2002. An optimal minimum spanning tree algorithm. Journal of the ACM (JACM), 49(1):16–34.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. volume 1, page 9. OpenAI Blog, https://openai.com/blog/better-language-models/.
Taraka Rama, Lisa Beinborn, and Steffen Eger. 2020. Probing multilingual BERT for genetic and typological signals. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1214–1228, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.
Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008. Introduction to information retrieval, volume 39. Cambridge University Press Cambridge.
Maksym Shamrai. 2024. Language-specific pruning for efficient reduction of large language models. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP)@ LREC-COLING 2024, pages 135–140.
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.
Morris Swadesh. 1952. Lexico-statistic dating of prehistoric ethnic contacts: with special reference to north american indians and eskimos. Proceedings of the American philosophical society, 96(4):452–463.
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, and 1 others. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605.
A Derivation of weight importance metric
$$\min_{\delta w}\ \tfrac{1}{2}\,\delta w^{\top} H\,\delta w \quad \text{s.t.} \quad e_q^{\top}\delta w + w_q = 0.$$
The problem can be solved using a Lagrange multiplier. We begin with the Lagrangian:
$$L = \tfrac{1}{2}\,\delta w^{\top} H\,\delta w + \lambda\left(e_q^{\top}\delta w + w_q\right).$$
Taking the derivative with respect to $\delta w$ and setting it to zero:
$$\nabla_{\delta w} L = H\,\delta w + \lambda e_q = 0 \;\Rightarrow\; \delta w = -H^{-1} e_q \lambda.$$
Substituting into the constraint:
$$e_q^{\top}\left(-H^{-1} e_q \lambda\right) + w_q = 0,$$
we get:
$$\lambda = \frac{w_q}{e_q^{\top} H^{-1} e_q}.$$
Thus, the change in weights:
$$\delta w = -H^{-1} e_q \cdot \frac{w_q}{e_q^{\top} H^{-1} e_q}.$$
Notice that:
$$H\,\delta w = H\left(-H^{-1} e_q \frac{w_q}{e_q^{\top} H^{-1} e_q}\right) = -e_q \frac{w_q}{e_q^{\top} H^{-1} e_q},$$
and, using the symmetry of $H$,
$$\delta w^{\top} = \left(-H^{-1} e_q \frac{w_q}{e_q^{\top} H^{-1} e_q}\right)^{\top} = -\frac{w_q}{e_q^{\top} H^{-1} e_q}\, e_q^{\top} H^{-1}.$$
Now compute the increase in error:
$$E_q = \tfrac{1}{2}\,\delta w^{\top} H\,\delta w = \tfrac{1}{2}\left(\frac{w_q}{e_q^{\top} H^{-1} e_q}\, e_q^{\top} H^{-1}\right)\left(e_q \frac{w_q}{e_q^{\top} H^{-1} e_q}\right) = \tfrac{1}{2}\cdot\frac{w_q^{2}}{e_q^{\top} H^{-1} e_q}.$$
B Full list of languages used
Afrikaans, Albanian, Arabic, Aragonese, Armenian, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Chinese (Simplified), Chinese (Traditional), Chuvash, Crimean Tatar, Croatian, Czech, Danish, Dutch, Egyptian Arabic, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Kirghiz, Korean, Latin, Latvian, Lithuanian, Lombard, Low Saxon, Luxembourgish, Macedonian, Malagasy,
Malay, Malayalam, Marathi, Min Nan Chinese, Minangkabau, Nepali, Newar, Norwegian (Bokmål), Norwegian (Nynorsk), Occitan, Persian (Farsi), Piedmontese, Polish, Portuguese, Punjabi, Romanian, Russian, Scots, Serbian, Serbo-Croatian, Sicilian, Slovak, Slovenian, South Azerbaijani, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray-Waray, Welsh, West Frisian, Western Punjabi, Yoruba.
Wikipedia includes all these languages. CulturaX lacks Chinese (Traditional), Min Nan Chinese, Scots, and Crimean Tatar. The fineweb-2 dataset does not include Chinese (Traditional), English, Serbo-Croatian, or Tagalog. For the English subset in fineweb-2, we use the fineweb dataset (see footnote 3).
(Figure 3 plot: x-axis "Global step", y-axis "Evaluation loss"; series ukr-bg, ukr-only, ukr-pol.)
Figure 3: Evaluation loss on Ukrainian for three weighted-loss runs: ukr-only (baseline), ukr-bg (Ukrainian + Bulgarian), and ukr-pol (Ukrainian + Polish). Two-language datasets are twice as large, hence the longer training schedule.
C Transfer-Learning Experiments
We investigated whether adding data from a similar language can improve performance on a low-resource target language, where similarity is measured by the language-distance metric introduced in this paper. All experiments fine-tune Llama 3.2 1B and evaluate exclusively on a held-out set in the target language. We perform our experiments using the following strategies:
1. Mixed (size-matched). An equal amount of auxiliary-language text is concatenated to the low-resource corpus; the joint data are shuffled and used for fine-tuning.
2. Mixed (loss-weighted). The same joint corpus is used, but the loss is re-weighted, e.g. 0.8 for target-language tokens and 0.2 for auxiliary-language tokens.
3. Sequential. Fine-tune first on the auxiliary language, then continue training on the low-resource corpus.
Figure 3 shows that augmenting Ukrainian with the metrically close Bulgarian does not improve evaluation loss, and Polish yields only a minor reduction. A similar pattern emerges for sequential fine-tuning on Turkish followed by Crimean Tatar: perplexity drops from 5.48 (Crimean Tatar only) to 5.36, an insignificant change. Across all settings, none of the three transfer regimes produced a consistent, significant gain over single-language fine-tuning. Future work should revisit these transfer strategies with substantially larger models and much larger datasets, where the benefits of distance-based language pairing may emerge more clearly.
3 https://huggingface.co/datasets/HuggingFaceFW/fineweb
D Additional Figures
Figure 4 displays the MST coloured by k-means clusters. We set k = 18, one cluster for each category plotted in Figure 1 (15 natural families plus 3 constructed languages), so that the cluster colours can be compared directly with the family colours. Most clusters coincide with their expected families, but not all. Notably, Turkish is grouped with Hungarian and Finnish rather than with the other Turkic languages.
Figure 5 uses HDBSCAN with a minimum cluster size of two. This gives 24 clusters. Crimean Tatar is treated as an outlier, while Ukrainian now connects directly to Polish.
Figures 6 and 7 give two other views of the same data using t-SNE and UMAP. Like the MST, they highlight clear family groups.
Figure 8 shows the confusion matrix between k-means clusters and high-level language families. The clusters are first matched to families with the Hungarian algorithm for clearer alignment. Figure 9 presents the same matrix, but for the finer primary branches of each family.
Figure 4: MST of all languages. Colours show k-means clusters with k = 18 (one cluster for each language family).
Figure 5: MST of all languages. Colours show HDBSCAN clusters (minimum cluster size = 2). Points marked as outliers by the algorithm are left out.
Figure 6: t-SNE plot of all languages. Colours show language families.
Figure 7: UMAP plot of all languages. Colours show language families.
(Figure 8 heatmap: panel title "Aligned Confusion Matrix"; x-axis "Predicted Clusters (Aligned)"; y-axis "True Classes".)
Figure 8: Adjusted confusion matrix between clusters obtained by k-means and macro families of languages. Number of clusters equal to 18.
(Figure 9 heatmap: panel title "Aligned Confusion Matrix"; x-axis "Predicted Clusters (Aligned)"; y-axis "True Classes".)
Figure 9: Adjusted confusion matrix between clusters obtained by k-means and primary branches of language families. Number of clusters equal to 35.
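The cluster-to-class alignment behind the confusion matrices in Figures 8 and 9 can be illustrated with a small sketch. It assumes NumPy and SciPy are available and uses `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem the Hungarian algorithm addresses. The helper name `aligned_confusion_matrix` and the toy labels are hypothetical and are not taken from the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def aligned_confusion_matrix(true_labels, cluster_ids):
    """Count class/cluster overlaps and reorder cluster columns by best match.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    class_pos = {name: i for i, name in enumerate(classes)}
    cluster_pos = {cid: j for j, cid in enumerate(clusters)}

    # counts[i, j] = number of items of class i that fell into cluster j.
    counts = np.zeros((len(classes), len(clusters)), dtype=int)
    for name, cid in zip(true_labels, cluster_ids):
        counts[class_pos[name], cluster_pos[cid]] += 1

    # One-to-one matching that maximises total overlap (minimise the negative).
    rows, cols = linear_sum_assignment(-counts)
    matched = {int(r): int(c) for r, c in zip(rows, cols)}

    # Matched clusters first, in class order; any unmatched clusters go last.
    order = [matched[i] for i in range(len(classes)) if i in matched]
    order += [j for j in range(len(clusters)) if j not in order]

    aligned = counts[:, order]
    return classes, [clusters[j] for j in order], aligned


# Toy data (made up): seven languages, three families, three k-means clusters.
families = ["Slavic", "Slavic", "Slavic", "Turkic", "Turkic", "Uralic", "Uralic"]
cluster_ids = [2, 2, 0, 0, 0, 1, 1]

classes, cluster_order, aligned = aligned_confusion_matrix(families, cluster_ids)
print(classes)        # row labels
print(cluster_order)  # column labels after alignment
print(aligned)        # matched counts sit on the diagonal
```

When the number of clusters equals the number of classes, as in Figures 8 and 9, the aligned matrix is square, and well-recovered classes appear as large diagonal entries while off-diagonal entries mark languages placed in another class's cluster.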