Deep Language Geometry: Constructing a Metric Space from LLM Weights

Summary

This paper introduces a novel framework for constructing a metric space of languages using the internal weight activations of Large Language Models (LLMs). Unlike traditional methods that rely on hand-crafted linguistic features, the approach automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. These vectors capture intrinsic language characteristics and are transformed into a binary format to reduce storage requirements. A Hamming distance metric is then applied, and the high-dimensional space is projected into a lower-dimensional Euclidean space using classical multidimensional scaling. The method is validated across 106 languages using three multilingual LLMs and three datasets, revealing language clusters that align with established linguistic families while also uncovering unexpected inter-language connections, potentially indicating historical contact or evolution. The results show that the metric space supports meaningful clustering, with higher accuracy for finer-grained language groups. The authors also explore transfer learning but find no significant improvements in model performance. The work is open-sourced, providing language vectors, visualization tools, and analysis resources for further research. Limitations include computational expense and potential biases from the source models.

Maksym Shamrai
Institute of Mathematics of NASU
MacPaw
Kyiv, Ukraine
mshamrai@macpaw.com
Vladyslav Hamolia
MacPaw
Kyiv, Ukraine
vhamolya@macpaw.com
Abstract
We introduce a novel framework that utilizes
the internal weight activations of modern Large
Language Models (LLMs) to construct a met-
ric space of languages. Unlike traditional ap-
proaches based on hand-crafted linguistic fea-
tures, our method automatically derives high-
dimensional vector representations by comput-
ing weight importance scores via an adapted
pruning algorithm. Our approach captures in-
trinsic language characteristics that reflect lin-
guistic phenomena. We validate our approach
across diverse datasets and multilingual LLMs,
covering 106 languages. The results align well
with established linguistic families while also
revealing unexpected inter-language connections
that may indicate historical contact or language
evolution. The source code, computed language
latent vectors, and visualization tool are made
publicly available at
https://github.com/mshamrai/deep-language-geometry.
1 Introduction
Languages are complex systems with rich inter-
nal structures and dynamic evolution. Traditional
linguistic classifications based on typological fea-
tures, historical migration patterns, or lexical sim-
ilarity have long served to group languages into
families such as Indo-European, Uralic, and Tur-
kic. However, these approaches typically capture
only historical or static aspects of language sim-
ilarity, potentially overlooking modern linguistic
influences driven by technology and globalization.
In an era where language use and structure are con-
tinuously reshaped, it is timely to develop methods
that automatically capture both historical and cur-
rent linguistic characteristics.
Recent advances in Natural Language Processing
(NLP) have been largely driven by Large Language
Models (LLMs), which have demonstrated
remarkable capabilities in language modeling and
a wide range of linguistic tasks (Devlin et al., 2018;
Radford et al., 2019). These models, trained on vast
multilingual corpora, learn representations that im-
plicitly encode a wide variety of lexical, syntactic,
and even phonological properties (Conneau et al.,
2019).
Building on prior work (Shamrai, 2024), which
empirically shows that the internal activations of
LLM weights vary with the language of the in-
put data, we hypothesize that the internal weights
of LLMs encode valuable information about inter-
language similarity and can serve as a foundation
for quantifying relationships between languages.
Therefore, in this work, we propose a novel approach
for constructing a metric space of languages by
leveraging the weights of modern LLMs. Our method
extracts high-dimensional vector representations from
LLM weight activations, where the distance between
any two vectors reflects the similarity between the
underlying linguistic structures. Activations encode
patterns of co-occurrence and contextual relationships
specific to each language’s grammar and lexical
properties.
We construct a metric space (X, d_h), where X is the
set of high-dimensional language vectors and d_h is
the Hamming distance between them. We then design a
distance-preserving mapping that projects these
high-dimensional vectors into a low-dimensional space
(Y, d_e), where distances are induced from the
Euclidean (L2) norm. This transformation provides
deeper insight into the latent structures encoded by
LLMs.
Furthermore, we compute this latent representation
for 106 languages, which makes it possible to
visualize, cluster, and analyze the relationships
between the languages.
Our code, computed language latent vectors, and
analysis tool are made publicly available. They are
designed to assist researchers and practitioners in
linguistic analysis and offer a valuable resource for
further linguistic investigation.
The contributions of this work are as follows:
arXiv:2508.11676v1 [cs.CL] 8 Aug 2025
• We introduce a novel approach that con-
structs a metric space of languages using LLM
weights and apply it to 106 languages, en-
abling automatic and data-driven measure-
ment of linguistic distances.
• We demonstrate that the derived metric space
supports meaningful clustering of languages,
reflecting both historical relationships and
modern linguistic features.
• We fully open-source our work along with a
tool for preliminary analysis.
While not claiming linguistic expertise, this
study introduces a novel toolset intended to sup-
port linguistic research. It offers a fresh view of
language similarity by exploiting the latent knowl-
edge embedded in LLMs.
2 Related Work
The quantification of language similarity has a rich
history, beginning with early lexical approaches. Pi-
oneering work (Swadesh, 1952) established meth-
ods for comparing languages using shared cognates,
a practice later refined by Holman et al. (2011),
which employs normalized Levenshtein distances
over fixed word lists. Although these lexical meth-
ods have been successfully used to construct lan-
guage family trees, they are handcrafted and re-
quire manual effort to select and curate appropriate
word lists and features.
Also, resources such as the World Atlas of Language
Structures (Dryer and Haspelmath, 2005)
offer comprehensive typological data that allow
languages to be represented as feature vectors. Dis-
tance measures computed over these vectors have
been shown to reveal groupings consistent with
established genetic relationships (O’Horan et al.,
2016; De Gregorio et al., 2024). However, these
methods are limited by the quality and coverage
of available databases, their reliance on expert-
curated features, and their inability to fully capture
language-specific variations or recent evolutionary
trends.
Phonological properties offer another valuable
dimension for language comparison. Studies uti-
lizing phoneme inventory data from resources like
PHOIBLE (Moran et al., 2014) demonstrate that
phonological distances – often measured by overlap
indices such as the Jaccard similarity – can capture
both genetic relationships and areal phenomena.
But phonological methods need reliable phoneme
lists, are affected by how sounds are written, and
often miss language structure beyond sounds.
Recent deep-learning work has popularised
embedding-based measures of language distance.
Multilingual encoders such as mBERT (Devlin et al.,
2018), XLM-R (Conneau et al., 2020) and LASER
(Artetxe and Schwenk, 2019; Heffernan et al., 2022)
produce contextual token embeddings that implicitly
encode lexical, syntactic and semantic features.
LASER is trained to output a single sentence vector
directly, whereas mBERT and XLM-R require a pooling
step (e.g., mean pooling or the [CLS] token) to obtain
a sentence-level embedding. When sentence embeddings
are averaged over large, balanced corpora, the
resulting language-level representations have proved
useful for quantifying cross-lingual similarity (Rama
et al., 2020). However, because the underlying
encoders operate at the token (and therefore sentence)
level, their effectiveness still depends on corpus
size and domain balance.
Overall, the literature on language distance met-
rics has evolved from classical lexicostatistical
methods and handcrafted feature extraction to so-
phisticated neural representations. Each approach
offers valuable insights into the relationships be-
tween languages, but they often suffer from labor-
intensive preprocessing, limited database coverage,
or sensitivity to input variations. This motivates our
approach: rather than relying on manually curated
features or sentence-based embeddings, we pro-
pose an automatic, data-driven method that lever-
ages the internal weights of modern LLMs to con-
struct a metric space of languages.
Moreover, to the best of our knowledge, no study has
attempted to derive a language metric space from
decoder-only LLMs. The method introduced here is
therefore the first to use weight-level signals in
causal transformers for measuring cross-language
similarity.
3 Methodology
The main hypothesis in this work is that Large
Language Models are a good choice to measure
internal language structure since they are trained
to model languages. Formally, this is typically
framed as maximizing the log-likelihood of the
observed sequence of tokens. Let x_1, x_2, ..., x_T
represent a sequence of tokens, where x_t ∈ V and
V is the vocabulary. The objective is to maximize
the likelihood of the sequence under the model’s
parameters θ:
\[ L(\theta) = \sum_{t=1}^{T} \log p(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta), \]
where p(x_t | x_1, x_2, ..., x_{t−1}; θ) is the
conditional probability of the token x_t given the
previous tokens, modeled by a neural network or
another probabilistic model.
3.1 Weight Importance Metric
We begin by revisiting classical pruning approaches
such as Optimal Brain Damage (LeCun et al.,
1989), which motivate the rationale behind our ap-
proach.
The typical pruning objective is to minimize
the error introduced by approximating the original
weight matrix. Consider the following objective
function:
\[ E = \lVert WX - \hat{W}X \rVert_2^2 \to \min, \tag{1} \]
where W is the original weight matrix of a layer,
\hat{W} is the pruned (sparse) weight matrix, and X is
the input to that layer.
The variation of the error E for a weight row w
can be expressed as:
\[ \delta E = \left( \frac{\partial E}{\partial w} \right)^{\top} \delta w + \frac{1}{2}\, \delta w^{\top} H\, \delta w + O(\lVert \delta w \rVert^3), \]
where H ≡ ∂²E/∂w² is the Hessian matrix.
At a local minimum of the training error, we have
∂E/∂w ≈ 0, and higher-order terms are neglected.
Our goal is to set one of the weights, say w_q, to
zero while minimizing the increase in error. This
introduces the constraint:
\[ e_q^{\top} \delta w + w_q = 0, \]
where e_q is the qth standard basis vector. Thus,
the optimization problem in Equation (1) can be
reformulated as:
\[ \min_{\delta w} \; \frac{1}{2}\, \delta w^{\top} H\, \delta w \quad \text{s.t.} \quad e_q^{\top} \delta w + w_q = 0. \tag{2} \]
This constrained problem can be solved using
Lagrange multipliers. For the detailed derivation
see Appendix A.
The resulting increase in error is given by:
\[ E_q = \frac{1}{2} \cdot \frac{w_q^2}{e_q^{\top} H^{-1} e_q}. \tag{3} \]
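As a brief sketch of how the constrained problem in Equation (2) leads to Equation (3), the steps can be written out as follows. This is a standard Lagrange-multiplier argument, not a reproduction of Appendix A. It assumes H is symmetric and invertible, and the multiplier symbol μ is our own notation, chosen to avoid a clash with the regulariser λ used later.

```latex
\begin{align*}
\mathcal{L}(\delta w, \mu) &= \tfrac{1}{2}\,\delta w^{\top} H\,\delta w + \mu\,\bigl(e_q^{\top}\delta w + w_q\bigr), \\
\nabla_{\delta w}\mathcal{L} = 0 \;&\Rightarrow\; \delta w = -\mu\, H^{-1} e_q, \\
e_q^{\top}\delta w + w_q = 0 \;&\Rightarrow\; \mu = \frac{w_q}{e_q^{\top} H^{-1} e_q}, \\
E_q = \tfrac{1}{2}\,\delta w^{\top} H\,\delta w
  &= \tfrac{1}{2}\,\mu^{2}\, e_q^{\top} H^{-1} e_q
   = \frac{1}{2}\cdot\frac{w_q^{2}}{e_q^{\top} H^{-1} e_q}.
\end{align*}
```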
By computing E_q for every weight w_q, one can
prune the weight that causes the smallest increase
in error, thereby minimally affecting the layer’s
output. Intuitively, this means we identify which
weights are most critical for the model’s performance
on a specific language. Weights with high importance
scores are those whose removal would substantially
degrade the model’s ability to predict tokens in that
language.
SparseGPT (Frantar and Alistarh, 2023) adopts
this idea within an LLM pruning algorithm. They
compute the importance metric Sij for a layer as
follows (Sun et al., 2023):
\[ S_{ij} = \left[ \frac{|W|^2}{\operatorname{diag}\!\left( (X^{\top} X + \lambda I)^{-1} \right)} \right]_{ij}. \tag{4} \]
As in SparseGPT, we build X per linear sub-layer by
stacking the pre-activation hidden states of a small
calibration set into an N × d_in matrix. For a weight
matrix W the local Hessian is H = X⊤X, and we invert
X⊤X + λI once per layer. Thus, Equation (4) is simply
a matrix-valued, regularised version of the scalar
error-increase criterion in Equation (3).
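To make the shape of Equation (4) concrete, here is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions rather than the authors' implementation: the function name `importance_scores`, the damping value `lam`, and the random toy inputs are all hypothetical, and W is assumed to have shape (d_out, d_in) so that the diagonal of the inverse is broadcast over input columns.

```python
import numpy as np

def importance_scores(W, X, lam=1e-2):
    """Score S_ij = W_ij^2 / [(X^T X + lam * I)^{-1}]_jj for one linear layer.

    W: (d_out, d_in) weight matrix.
    X: (N, d_in) calibration inputs stacked row-wise.
    lam: damping constant (an arbitrary illustrative value).
    """
    d_in = W.shape[1]
    H = X.T @ X + lam * np.eye(d_in)          # regularised local Hessian
    H_inv_diag = np.diag(np.linalg.inv(H))    # one value per input dimension
    return W**2 / H_inv_diag[None, :]         # broadcast over output rows

# Toy data only; real inputs would be hidden states from calibration text.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(64, 8))
S = importance_scores(W, X)
```

When X is all zeros and `lam` is 1, the regularised Hessian is the identity and the score reduces to the squared weight magnitude, which is a convenient sanity check.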
Shamrai (2024) suggests that the SparseGPT
algorithm provides statistically stable results
across different LLMs and subsets of data in a
language-specific setting. Therefore, in our work, we
adopt the algorithm to compute weight importance
vectors.
3.2 Rationale Behind the Approach
By definition, Sij quantifies the importance of
weight Wij for a given input. In our approach, we
estimate the importance of the weights for a spe-
cific language by using datasets in that language.
Consequently, Sij reflects the contribution of each
weight to language modeling.
Assuming that the network is well-trained on lan-
guage modeling, higher S scores indicate greater
contribution. If two languages yield similar pat-
terns of important weights, it suggests that they are
similar in terms of language modeling characteris-
tics.
3.3 Constructing a Metric Space
To derive a vector representation from the importance
metric, we treat the importance scores as coordinates
in a high-dimensional space. Specifically, we define
the vector
\[ v = \left( S^{0}_{00}, S^{0}_{01}, \ldots, S^{k}_{ij}, \ldots, S^{l}_{nm} \right) \in \mathbb{R}^{N}, \]
where the set {W^k}_{k=0}^{l} consists of weight
matrices W^k ∈ R^{n_k × m_k} for each layer k, and N is
the total number of parameters in the chosen LLM. In
other words, the vector v is obtained by flattening
and concatenating all the importance matrices S^k
corresponding to each layer.
There are two challenges with using the raw im-
portance matrix S to form this vector representa-
tion:
1. The importance scores are not normalized
across layers, meaning that they are only
meaningful within the context of a single
layer.
2. The resulting vector is high-dimensional, with
each dimension represented by a floating-
point number (typically 16 bits), leading to
large memory requirements.
To mitigate these issues, we propose a thresholding
approach analogous to binary quantization.
Specifically, we assign a value of 1 only to the most
important weights by thresholding S_ij at its median:
\[ \hat{S}_{ij} = \mathbb{1}\left[ S_{ij} > \operatorname{median}(S) \right]. \]
This binary representation requires only 1 bit per
value, reducing the storage requirement substan-
tially compared to 16-bit floating-point representa-
tions.
Let X denote the set of binary language vectors (one
per language) of length N. We then define a metric
space on X using the Hamming distance, i.e., the
number of set bits in the element-wise XOR of two
vectors.
For x, y ∈ X the Hamming distance is
\[ d_h(x, y) = \sum_{i=1}^{N} \mathbb{1}[x_i \neq y_i], \]
where 1[·] is the indicator function.
The function d_h is non-negative, symmetric, equals 0
if and only if x = y, and satisfies the triangle
inequality; therefore, (X, d_h) is a metric space.
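The thresholding and distance steps can be sketched in a few lines of NumPy. This is a toy illustration, not the released code: the helper names `binarize_layers` and `hamming` are hypothetical, the random matrices stand in for real importance scores, and taking the median separately for each layer is our reading of the text (scores are described as comparable only within a layer).

```python
import numpy as np

def binarize_layers(score_matrices):
    """Threshold each layer's scores at that layer's own median,
    then flatten and concatenate into one boolean language vector."""
    bits = [(S > np.median(S)).ravel() for S in score_matrices]
    return np.concatenate(bits)

def hamming(x, y):
    """Number of positions at which two equal-length boolean vectors differ."""
    return int(np.count_nonzero(x != y))

# Toy stand-ins for the per-layer importance matrices of two languages.
rng = np.random.default_rng(0)
lang_a = [rng.random((4, 8)), rng.random((8, 2))]
lang_b = [rng.random((4, 8)), rng.random((8, 2))]

va, vb = binarize_layers(lang_a), binarize_layers(lang_b)
d_ab = hamming(va, vb)

# Stored form: np.packbits keeps one bit per weight (8 weights per byte).
packed = np.packbits(va)
```

Because the threshold is the median, about half of the entries in each layer are set to 1, so every language vector has roughly the same number of set bits and distances are driven by which weights are selected rather than by how many.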
Algorithm 1 Torgerson Scaling (Classical MDS)
Require: Distance matrix D ∈ R^{n×n}, n = |X|
Ensure: Coordinates Y ∈ R^{n×d} representing points in d dimensions
1: J ← I_n − (1/n) 1_n 1_n^⊤   ▷ Compute centering matrix
2: D^(2) ← D ⊙ D   ▷ Element-wise square of D
3: B ← −(1/2) J D^(2) J   ▷ Compute Gram matrix
4: (λ, V) ← eigh(B)   ▷ Compute the eigen-decomposition of B
5: (λ, V) ← sort((λ, V))   ▷ Sort eigenvalues in descending order and reorder eigenvectors accordingly
6: d ← #{λ_i | λ_i > ϵ}   ▷ Select dimensions with significant eigenvalues (ϵ ≈ 10^{−10})
7: L ← diag(√λ_1, √λ_2, ..., √λ_d)
8: V_d ← [v_1, v_2, ..., v_d]
9: return Y ← V_d L
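Algorithm 1 translates almost line by line into NumPy. The sketch below is one possible rendering, with the function name `classical_mds` and the toy point set being our own assumptions; it checks itself on distances that are Euclidean by construction, where the embedding should reproduce the input distances.

```python
import numpy as np

def classical_mds(D, eps=1e-10):
    """Torgerson scaling: embed points so Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D * D) @ J                 # Gram matrix from squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eps                       # keep significant positive eigenvalues
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

# Toy check: 6 random points in 3-D, so D is exactly Euclidean.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(D)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

One caveat worth keeping in mind: if the Gram matrix B has negative eigenvalues, which can happen when D is not exactly Euclidean, those components are discarded by the threshold and the embedded distances only approximate the originals.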
3.4 Isometry via Dimensionality Reduction
Even after quantization, the binary vectors remain
high-dimensional due to the large number of model
parameters, making distance computations and
other latent space applications computationally ex-
pensive. To address this, we construct an isometry
– a transformation that preserves distances between
points when mapping from one metric space to
another.
In our experiments, we employ different LLMs and
multiple datasets. We compute the language-by-language
distance matrix for each model and dataset, and then
average them to obtain a robust distance measure:
\[ D_{lk} \in \mathbb{R}^{|X| \times |X|}, \qquad (D_{lk})_{ij} = d_h(v_i, v_j), \quad v_i, v_j \in X, \]
\[ \hat{D} = \mathbb{E}_{l \sim p_{\mathrm{LLM}}} \, \mathbb{E}_{k \sim p_{\mathrm{data}}} [D_{lk}] \approx \frac{1}{nm} \sum_{l=1}^{n} \sum_{k=1}^{m} D_{lk}, \]
where D_lk is the distance matrix computed for the
lth LLM and the kth dataset, n is the number of
LLMs, m is the number of datasets, and |X| is the
number of languages.
This averaging process reduces noise and ensures that
the final distances are not overly dependent on any
particular dataset or model.
Dataset      # Languages in Dataset   # Languages Used in Work
Wikipedia    323                      106
CulturaX     167                      102
fineweb-2    2051                     103
Table 1: Comparison of datasets: Wikipedia, CulturaX, and fineweb-2. The table reports the total number of
languages in each dataset and the number of languages used in this work.
We then construct an isometry
f : X → Y,
where Y is a metric space endowed with the Euclidean
metric d_e(x, y) = ∥x − y∥_2.
To build f, we apply Torgerson scaling (classical
multidimensional scaling) (Borg and Groenen, 2007).
The result is a set of points Y ∈ R^{|X|×d}, where d
is the minimum number of dimensions required to
preserve the distances in \hat{D} (see Algorithm 1).
Notably, d is much smaller than the original
dimensionality N of the language vectors
and satisfies d ≤ |X|.
Therefore, our method leverages LLM weights
to construct a language vector representation and
embed it in a metric space that can be used for
the analysis of language similarities.
4 Results
To analyze the metric space of languages, we vary
both the clustering algorithms and the dimensionality
reduction methods. For clustering, we use HDBSCAN
(Campello et al., 2013) and k-means (Lloyd, 1982), and
we compare against predefined linguistic families and
their subfamilies to highlight the correspondence
between the derived metric space and established
linguistic classifications. Throughout this paper we
adhere to the language classification provided in
Hammarström et al. (2024).
For two-dimensional visualizations, we reduce
the dimensionality of the language vectors using
t-SNE (van der Maaten and Hinton, 2008), UMAP
(McInnes et al., 2018), and minimum spanning
trees (Pettie and Ramachandran, 2002). Although
all methods yield valuable insights, we include in
the main text only the minimum spanning trees
(MST) visualizations colored by language families
and subfamilies, as they most clearly represent the
inter-language relationships. Additional figures are
provided in Appendix D and are also available via our
open-source tool (footnote 1).
4.1 Datasets and Models
In our experiments, we employ three LLMs and
three datasets. The models used are Mistral 7B
(Jiang et al., 2023), Gemma 3 4B (Team et al.,
2025), and Llama 3.2 1B (Grattafiori et al., 2024).
All models are multilingual and have been trained
on more than 100 languages. Notably, although
Llama officially supports only 8 languages, our re-
sults indicate that it still produces useful represen-
tations for our purposes. As datasets, we selected
those with a high number of languages: Wikipedia
(Foundation), CulturaX (Nguyen et al., 2024), and
fineweb-2 (Penedo et al., 2024).
We started with a target inventory of 106 languages
and attempted to apply the same list across all
corpora. Wikipedia contains material for every
language in this set, but CulturaX omits Chinese
(Traditional), Min Nan Chinese, Scots, and Crimean
Tatar, whereas fineweb-2 lacks Chinese (Traditional),
English (footnote 2), Serbo-Croatian, and Tagalog.
Table 1 lists the total number of languages present
in each dataset alongside the subset that could be
retained from our 106-language list. For the full
list of languages see Appendix B.
Footnote 1: https://huggingface.co/spaces/mshamrai/language-metric-analysis
Footnote 2: For the English subset, we use the fineweb dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb).
To compute the language vectors, we proceed as
follows:
1. Calibration data. For every language in each
corpus (Wikipedia, CulturaX, fineweb-2) we sample
2^19 = 524,288 tokens.
2. Weight-importance vectors. For each language–corpus
pair and for each LLM (Mistral 7B, Gemma 3 4B,
Llama 3.2 1B) we compute a binary weight importance
vector whose length matches the model’s parameter
count, yielding 3(106 + 102 + 103) = 933 vectors.
3. Distance matrices. Hamming distances between
language vectors produce nine language-by-language
matrices (one per model–corpus combination).
4. Aggregation. These nine matrices are averaged
element-wise over the observed entries to form a
single average distance matrix.
5. Embedding. Classical MDS on the average matrix
embeds the language space in R^104, where Euclidean
distance defines the final language metric.
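The aggregation step averages only over observed entries, because some languages are absent from some corpora. A minimal sketch of that idea follows. It assumes, purely for illustration, that missing language pairs are marked with NaN in each matrix; the function name `average_observed` and the toy matrices are hypothetical.

```python
import numpy as np

def average_observed(matrices):
    """Element-wise mean over distance matrices, skipping NaN (unobserved) entries."""
    stack = np.stack(matrices)                         # (num_matrices, L, L)
    observed = ~np.isnan(stack)
    counts = observed.sum(axis=0)                      # how many matrices saw each pair
    totals = np.where(observed, stack, 0.0).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Pairs never observed in any matrix stay NaN.
        return np.where(counts > 0, totals / counts, np.nan)

nan = np.nan
# Toy example: the third language is missing from the second corpus.
D1 = np.array([[0.0, 2.0, 4.0],
               [2.0, 0.0, 6.0],
               [4.0, 6.0, 0.0]])
D2 = np.array([[0.0, 4.0, nan],
               [4.0, 0.0, nan],
               [nan, nan, nan]])
avg = average_observed([D1, D2])
```

In this toy case the pair (0, 1) is averaged over both matrices, while every entry involving the third language comes from the first matrix alone.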
4.2 Evaluation of k-means Clustering Against
Two Linguistic Categorizations
After embedding the |X| = 106 language vectors into
R^104 via classical MDS, we evaluate the language
embeddings using k-means. The resulting partition is
compared with two reference label sets: (i) high-level
families (18 macro-families) and (ii) primary branches
(35 sub-families). The number of clusters in k-means
is set equal to the number of labels in the
corresponding reference set.
We compute the following metrics:
• Silhouette score (Rousseeuw, 1987): the mean over
all points of (b − a) / max(a, b), where a is a
point’s average distance to the other members of its
own cluster and b is its average distance to the
nearest neighboring cluster. Values range from −1
(poor separation) to +1 (well-separated, compact
clusters).
• Adjusted Rand Index (ARI) (Hubert and Arabie,
1985): agreement between two partitions, corrected
for chance. A value of 1 indicates perfect alignment;
a value near 0 indicates chance-level overlap.
• Cluster purity (Schütze et al., 2008): the fraction
of data points that share the majority label within
their cluster. Values lie in [0, 1].
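Of the three metrics, purity is the simplest to compute by hand, so a short standard-library sketch may help fix the definition. The function name `cluster_purity` and the six toy assignments are hypothetical and not taken from the paper's data.

```python
from collections import Counter

def cluster_purity(cluster_ids, labels):
    """Fraction of points whose reference label is the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)

# Toy assignment: cluster 0 mixes two reference labels, clusters 1 and 2 are pure.
clusters = [0, 0, 0, 1, 1, 2]
families = ["Slavic", "Slavic", "Baltic", "Turkic", "Turkic", "Uralic"]
purity = cluster_purity(clusters, families)   # (2 + 2 + 1) / 6
```

Purity tends to rise as the number of clusters grows, since smaller clusters are more easily dominated by one label, which is one reason to read it alongside the chance-corrected ARI.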
Reference          Sil.    ARI     Purity
Macro families     0.047   0.116   0.755
Primary branches   0.056   0.434   0.811
Table 2: Clustering metrics for the k-means solution
against two standard language classifications. "Sil."
is the internal silhouette score.
Table 2 shows that switching from broad families
to primary branches raises the ARI from 0.116
to 0.434 and the purity from 0.755 to 0.811.
Therefore, the metric space captures finer-grained
language groups and can estimate similarity at a micro
level. However, the internal silhouette remains low
(about 0.05), meaning many languages lie almost
as close to other clusters as to their own.
4.3 Language Trees
A minimum spanning tree (MST) connects all data
points in the dataset with the smallest possible total
edge weight, where the edge weight corresponds to
the distance between language vectors. We employ
the Kamada-Kawai layout, a force-directed algo-
rithm where edge lengths are proportional to the
distances (Kamada and Kawai, 1989). This layout
effectively visualizes the structure and connectivity
within the MST, revealing not only the clusters of
closely related languages but also links between
different language families.
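For readers who want to see how such a tree is built from a distance matrix, here is a small pure-Python sketch using Prim's algorithm. Note that this swaps in Prim's method for simplicity; the paper cites Pettie and Ramachandran (2002) and does not state which implementation was used. The function name and the 4×4 toy distances are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (i, j, weight) edges forming the tree.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n     # cheapest known edge weight linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0                 # start growing the tree from node 0
    edges = []
    for _ in range(n):
        # Pick the not-yet-added node with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

# Toy distances between four placeholder languages (made-up values).
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]
tree = minimum_spanning_tree(dist)
```

Each language ends up with at least one edge, so a language whose nearest neighbors are all in another family will still be attached somewhere, which is worth remembering when interpreting surprising cross-family links.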
Figure 1 shows the MST for all languages used
in our work. The visualization highlights
well-established clusters corresponding to known
language families as well as some unexpected
connections. For example, Tajik (an Indo-European
language) appears linked to a cluster of Turkic
languages, which can likely be explained by
geographical proximity. Similarly, the branch containing
Latvian and Lithuanian is connected to a cluster of
Uralic languages, possibly due to regional contact
with Finnish and Estonian. A less obvious connec-
tion is observed between Turkish and Hungarian,
which might be attributed to historical interactions.
Additionally, Vietnamese is found to be close to
Chinese, despite Vietnamese using the Latin alpha-
bet and Chinese employing logographic characters,
indicating that our method captures internal lan-
guage characteristics beyond mere orthographic
features.
Figure 2 focuses on Indo-European and Tur-
kic languages, with coloring based on their pri-
mary branches. This figure clearly illustrates that
Crimean Tatar, although belonging to the Kipchak
branch, is closely connected to Turkish, an Oghuz
language. The MST also links English, a Germanic
language, directly to Spanish, a Romance language,
likely reflecting their close geographic and sociolin-
guistic contact in the Americas.
One intriguing observation is that Ukrainian
does not exhibit a direct connection with Polish
in the MST, which is unexpected. However, further
analysis reveals that Polish consistently ranks
among the top five closest languages to Ukrainian
across all models and datasets, coming in third after
averaging the distances.
Figure 1: Minimum spanning tree for all languages. Colors represent language families.
Figure 2: Minimum spanning tree for languages from the Indo-European and Turkic families. Colors represent
language primary branches.
In summary, the minimum spanning trees reveal
logical relationships among languages and their
families. In addition, the presence of uncommon
connections suggests potential historical contacts
or convergent evolution. We leave further investi-
gation of areal influences or language borrowing
phenomena to professionals.
5 Conclusion
In this work, we introduced a novel framework
for constructing a metric space that quantifies lan-
guage similarity by leveraging the internal weight
activations of Large Language Models.
Our approach, based on computing binary vec-
tors from weight importance metrics and reducing
their dimensionality via isometric mappings, cap-
tures linguistic features, and the resulting metric
space not only aligns with established linguistic
families but also reveals intriguing inter-language
connections.
Overall, this study lays the groundwork for a
data-driven paradigm in language similarity anal-
ysis with significant implications for theoretical
linguistics.
Limitations
While our approach offers a novel perspective on
constructing a metric space for languages using
LLM weight activations, several limitations re-
main:
1. Computational Expense: Computing the bi-
nary vectors is time-consuming. For exam-
ple, on the Mistral 7B model, generating one
binary vector requires approximately 20 min-
utes on an NVIDIA RTX 3090 GPU.
2. Scalability to Larger Models: We have not
yet evaluated the method on LLMs with a sig-
nificantly higher number of parameters due to
resource constraints. It is possible that larger
models might yield more accurate or robust
representations.
3. Remaining Bias from Source Models: Av-
eraging distances across three LLMs does

Chunk 15 · 1,997 chars

bility to Larger Models: We have not
yet evaluated the method on LLMs with a sig-
nificantly higher number of parameters due to
resource constraints. It is possible that larger
models might yield more accurate or robust
representations.
3. Remaining Bias from Source Models: Av-
eraging distances across three LLMs does not
eliminate their shared weaknesses. In particu-
lar, the metric space can still reflect poor per-
formance on low resource languages, which
may introduce inconsistencies with known
language family relationships.
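To put the first limitation in perspective, the reported figure can be turned into a rough total. The sketch below assumes that the 20-minute cost applies uniformly to every vector and that one vector is computed per language and per dataset; both are assumptions for illustration, and the helper `total_gpu_hours` is hypothetical.

```python
def total_gpu_hours(n_languages: int, minutes_per_vector: float, n_datasets: int = 1) -> float:
    """Rough GPU time, assuming every binary vector costs the same number of minutes."""
    return n_languages * n_datasets * minutes_per_vector / 60.0

# 106 languages at about 20 minutes each (the Mistral 7B figure quoted above).
single_pass = total_gpu_hours(106, 20.0)
# Same estimate if vectors are recomputed for each of three datasets (an assumption).
three_datasets = total_gpu_hours(106, 20.0, n_datasets=3)
```

Under these assumptions, one pass over all 106 languages takes a little over 35 GPU hours for a single model, and three datasets bring this to about 106 GPU hours.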
Additionally, we were unable to mathematically
or empirically validate that the derived distance
metric can serve as an effective guideline for fine-
tuning and transfer learning of LLMs. Although
the underlying hypothesis suggests that linguistic
similarity may enhance the language modeling ca-
pabilities through transfer learning between related
languages, our preliminary experiments – where
we fine-tuned an LLM on similar languages using
various configurations – did not yield statistically
significant improvements. This indicates that a
more sophisticated approach may be required, and
we leave this investigation for future work. For
more details, see Appendix C.
Another promising direction for future research
is to identify which specific weights or layers con-
tribute most to the observed similarities and dis-
similarities. It is likely that only a subset of layers
significantly influences the metric. By pinpoint-
ing these layers, we may reduce computational
complexity and accelerate the metric computation
without compromising accuracy.
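One possible way to start on this question is to split the Hamming distance between two languages into per-layer contributions. The sketch below is only an illustration under assumed inputs: the layer names, the toy masks, and the helpers `per_layer_hamming` and `layer_shares` are hypothetical.

```python
from typing import Dict, List

def per_layer_hamming(a: Dict[str, List[int]], b: Dict[str, List[int]]) -> Dict[str, int]:
    """Hamming distance between two binary masks, computed separately for each layer."""
    if a.keys() != b.keys():
        raise ValueError("both languages must cover the same layers")
    return {layer: sum(x != y for x, y in zip(a[layer], b[layer])) for layer in a}

def layer_shares(contributions: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the total distance contributed by each layer."""
    total = sum(contributions.values())
    return {layer: (d / total if total else 0.0) for layer, d in contributions.items()}

# Toy masks for two languages over two hypothetical layers (not real data).
lang_a = {"layer_0": [1, 0, 1, 1], "layer_1": [0, 0, 1, 0]}
lang_b = {"layer_0": [1, 0, 1, 0], "layer_1": [1, 1, 0, 0]}

contributions = per_layer_hamming(lang_a, lang_b)
shares = layer_shares(contributions)
```

Because the Hamming distance is a sum over positions, the per-layer values add up to the full distance, so layers with consistently small shares would be natural candidates for omission.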
Acknowledgment
We express our gratitude to the Armed Forces of
Ukraine for their protection, which has made this
research possible.
References
Mikel Artetxe and Holger Schwenk. 2019. Mas-
sively multilingual sentence embeddings for zero-
shot cross-lingual transfer and beyond. Transactions
of the Association for Computational Linguistics,
7:597–610.
Ingwer Borg and Patrick JF Groenen. 2007. Modern
multidimensional scaling: Theory and applications.
Springer Science & Business Media.
Ricardo JGB Campello, Davoud Moulavi, and Jörg
Sander. 2013. Density-based clustering based on hi-
erarchical density estimates. In Advances in Knowl-
edge Discovery and Data Mining, pages 160–172,
Berlin, Heidelberg. Springer Berlin Heidelberg.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Lin-
guistics.
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ran-
zato, Ludovic Denoyer, and Hervé Jégou. 2019.
What unsupervised multilingual sentence represen-
tations learn about language. In Proceedings of the
57th Annual Meeting of the Association for Compu-
tational Linguistics, pages 4598–4608. Association
for Computational Linguistics.
Juan De Gregorio, Raúl Toral, and David Sánchez. 2024.
Exploring language relations through syntactic dis-
tances and geographic proximity. EPJ Data Science,
13(1):61.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers), pages 4171–
4186. Association for Computational Linguistics.
Matthew S. Dryer and Martin Haspelmath, editors. 2005.
World Atlas of Language Structures. Oxford Univer-
sity Press, Oxford.
Wikimedia Foundation. Wikimedia downloads.
Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Mas-
sive language models can be accurately pruned in
one-shot. In International Conference on Machine
Learning, pages 10323–10337. PMLR.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri,
Abhinav Pandey, Abhishek Kadian, Ahmad Al-
Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten,
Alex Vaughan, and 1 others. 2024. The llama 3 herd
of models. arXiv preprint arXiv:2407.21783.
Harald Hammarström, Robert Forkel, Martin Haspel-
math, and Sebastian Bank. 2024. Glottolog database
5.1. https://glottolog.org.
Kevin Heffernan, Onur Çelebi, and Holger Schwenk.
2022. Bitext mining using distilled sentence repre-
sentations for low-resource languages. In Findings
of the Association for Computational Linguistics:
EMNLP 2022, pages 2101–2112, Abu Dhabi, United
Arab Emirates. Association for Computational Lin-
guistics.
Eric W. Holman, Søren Wichmann, and Christopher H.
Brown. 2011. Automated dating of languages. In
Proceedings of the 9th International Conference on
Language Resources and Evaluation (LREC 2010),
pages 2052–2058. European Language Resources
Association.
Lawrence Hubert and Phipps Arabie. 1985. Comparing
partitions. Journal of classification, 2:193–218.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, and 1 others. 2023.
Mistral 7b. arXiv preprint arXiv:2310.06825.
Tomihisa Kamada and Satoru Kawai. 1989. An algo-
rithm for drawing general undirected graphs. Infor-
mation Processing Letters, 31(1):7–15.
Yann LeCun, John Denker, and Sara Solla. 1989. Opti-
mal brain damage. Advances in neural information
processing systems, 2.
Stuart Lloyd. 1982. Least squares quantization in
pcm. IEEE Transactions on Information Theory,
28(2):129–137.
Leland McInnes, John Healy, and James Melville. 2018.
Umap: Uniform manifold approximation and pro-
jection for dimension reduction. arXiv preprint
arXiv:1802.03426.
Simon Moran, Daniel McCloy, and Sue Wright. 2014.
Phonological similarity and its applications in cross-
linguistic phonetics. Journal of Phonetics, 42:20–35.
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai,
Hieu Man, Nghia Trung Ngo, Franck Dernoncourt,
Ryan A. Rossi, and Thien Huu Nguyen. 2024. Cul-
turaX: A cleaned, enormous, and multilingual dataset
for large language models in 167 languages. In Pro-
ceedings of the 2024 Joint International Conference
on Computational Linguistics, Language Resources
and Evaluation (LREC-COLING 2024), pages 4226–
4237, Torino, Italia. ELRA and ICCL.
Kris O’Horan, Steven Galle, and Nathalie Schneider.
2016. Syntactic variation in multilingual represen-
tations: Investigating cross-lingual transfer. In Pro-
ceedings of the 2016 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
1–11. Association for Computational Linguistics.
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec,
Bettina Messmer, Negar Foroutan, Martin Jaggi, Le-
andro von Werra, and Thomas Wolf. 2024. Fineweb2:
A sparkling update with 1000s of languages.
Seth Pettie and Vijaya Ramachandran. 2002. An opti-
mal minimum spanning tree algorithm. Journal of
the ACM (JACM), 49(1):16–34.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, Ilya Sutskever, and 1 others.
2019. Language models are unsupervised multi-
task learners. volume 1, page 9. OpenAI Blog,
https://openai.com/blog/better-language-models/.
Taraka Rama, Lisa Beinborn, and Steffen Eger. 2020.
Probing multilingual BERT for genetic and typo-
logical signals. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics,
pages 1214–1228, Barcelona, Spain (Online). Inter-
national Committee on Computational Linguistics.
Peter J Rousseeuw. 1987. Silhouettes: a graphical aid
to the interpretation and validation of cluster analysis.
Journal of computational and applied mathematics,
20:53–65.
Hinrich Schütze, Christopher D Manning, and Prab-
hakar Raghavan. 2008. Introduction to information
retrieval, volume 39. Cambridge University Press
Cambridge.
Maksym Shamrai. 2024. Language-specific pruning for
efficient reduction of large language models. In Pro-
ceedings of the Third Ukrainian Natural Language
Processing Workshop (UNLP)@ LREC-COLING
2024, pages 135–140.
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico
Kolter. 2023. A simple and effective pruning ap-
proach for large language models. arXiv preprint
arXiv:2306.11695.
Morris Swadesh. 1952. Lexico-statistic dating of prehis-
toric ethnic contacts: with special reference to north
american indians and eskimos. Proceedings of the
American philosophical society, 96(4):452–463.
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya
Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin,
Tatiana Matejovicova, Alexandre Ramé, Morgane
Rivière, and 1 others. 2025. Gemma 3 technical
report. arXiv preprint arXiv:2503.19786.
Laurens van der Maaten and Geoffrey Hinton. 2008.
Visualizing data using t-sne. Journal of Machine
Learning Research, 9(86):2579–2605.
A Derivation of weight importance metric
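$$\min_{\delta w} \; \frac{1}{2}\, \delta w^\top H\, \delta w, \qquad \text{s.t.} \quad e_q^\top \delta w + w_q = 0.$$

The problem can be solved using a Lagrange multiplier. We begin with the Lagrangian:

$$L = \frac{1}{2}\, \delta w^\top H\, \delta w + \lambda \left( e_q^\top \delta w + w_q \right).$$

Taking the derivative with respect to $\delta w$ and setting it to zero:

$$\nabla_{\delta w} L = H\, \delta w + \lambda e_q = 0 \;\Rightarrow\; \delta w = -H^{-1} e_q \lambda.$$

Substituting into the constraint:

$$e_q^\top \left( -H^{-1} e_q \lambda \right) + w_q = 0,$$

we get:

$$\lambda = \frac{w_q}{e_q^\top H^{-1} e_q}.$$

Thus, the change in weights:

$$\delta w = -H^{-1} e_q \cdot \frac{w_q}{e_q^\top H^{-1} e_q}.$$

Notice that:

$$H\, \delta w = H \left( -H^{-1} e_q \frac{w_q}{e_q^\top H^{-1} e_q} \right) = -e_q \frac{w_q}{e_q^\top H^{-1} e_q},$$

and

$$\delta w^\top = \left( -H^{-1} e_q \frac{w_q}{e_q^\top H^{-1} e_q} \right)^\top = -\frac{w_q}{e_q^\top H^{-1} e_q}\, e_q^\top H^{-1}.$$

Now compute the increase in error:

$$E_q = \frac{1}{2}\, \delta w^\top H\, \delta w = \frac{1}{2} \cdot \frac{w_q}{e_q^\top H^{-1} e_q}\, e_q^\top H^{-1} e_q\, \frac{w_q}{e_q^\top H^{-1} e_q} = \frac{1}{2} \cdot \frac{w_q^2}{e_q^\top H^{-1} e_q}.$$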
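The closed form can be checked numerically on a small example. The sketch below uses exact rational arithmetic from the Python standard library; the 2×2 matrix `H`, the weights `w`, and the helper names are made up for illustration and are not taken from the paper's code.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve matrix @ x = rhs exactly with Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def prune_update(hessian, weights, q):
    """Return (delta_w, E_q) for removing weight q, using the closed form derived above."""
    e_q = [1 if i == q else 0 for i in range(len(weights))]
    h_inv_col = solve(hessian, e_q)               # H^{-1} e_q
    denom = h_inv_col[q]                          # e_q^T H^{-1} e_q
    delta_w = [-Fraction(weights[q]) / denom * v for v in h_inv_col]
    error = Fraction(weights[q]) ** 2 / (2 * denom)
    return delta_w, error

def half_quadratic_form(hessian, d):
    """Evaluate (1/2) d^T H d directly."""
    n = len(d)
    return sum(d[i] * hessian[i][j] * d[j] for i in range(n) for j in range(n)) / 2

H = [[2, 1], [1, 3]]   # toy symmetric positive-definite Hessian
w = [1, 2]             # toy weights
delta_w, E_q = prune_update(H, w, q=0)
```

For this toy input the update is (-1, 1/3), which drives the first weight to zero, and both the closed form and the direct evaluation of the quadratic form give an error of 5/6.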
B Full list of languages used
Afrikaans, Albanian, Arabic, Aragonese, Arme-
nian, Asturian, Azerbaijani, Bashkir, Basque,
Bavarian, Belarusian, Bengali, Bishnupriya Ma-
nipuri, Bosnian, Breton, Bulgarian, Burmese, Cata-
lan, Cebuano, Chechen, Chinese (Simplified), Chi-
nese (Traditional), Chuvash, Crimean Tatar, Croa-
tian, Czech, Danish, Dutch, Egyptian Arabic, En-
glish, Esperanto, Estonian, Finnish, French, Gali-
cian, Georgian, German, Greek, Gujarati, Haitian,
Hebrew, Hindi, Hungarian, Icelandic, Ido, Indone-
sian, Irish, Italian, Japanese, Javanese, Kannada,
Kazakh, Kirghiz, Korean, Latin, Latvian, Lithua-
nian, Lombard, Low Saxon, Luxembourgish, Mace-
donian, Malagasy, Malay, Malayalam, Marathi,
Min Nan Chinese, Minangkabau, Nepali, Newar,
Norwegian (Bokmål), Norwegian (Nynorsk), Oc-
citan, Persian (Farsi), Piedmontese, Polish, Por-
tuguese, Punjabi, Romanian, Russian, Scots, Ser-
bian, Serbo-Croatian, Sicilian, Slovak, Slovenian,
South Azerbaijani, Spanish, Sundanese, Swahili,
Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu,
Turkish, Ukrainian, Urdu, Uzbek, Vietnamese,
Volapük, Waray-Waray, Welsh, West Frisian, West-
ern Punjabi, Yoruba.
Wikipedia includes all these languages. CulturaX
lacks Chinese (Traditional), Min Nan Chinese,
Scots, and Crimean Tatar. The fineweb-2 dataset does
not include Chinese (Traditional), English,
Serbo-Croatian, or Tagalog. For the English subset in
fineweb-2, we use the fineweb dataset (footnote 3).
Figure 3: Evaluation loss (y-axis) against global step
(x-axis) on Ukrainian for three weighted-loss runs:
ukr-only (baseline), ukr-bg (Ukrainian + Bulgarian),
and ukr-pol (Ukrainian + Polish). Two-language datasets
are twice as large, hence the longer training schedule.
C Transfer-Learning Experiments
We investigated whether adding data from a similar
language can improve performance on a low-resource
target language, where similarity is measured by the
language-distance metric introduced in this paper.
All experiments fine-tune Llama 3.2 1B and evaluate
exclusively on a held-out set in the target language.
We perform our experiments using the following
strategies:
1. Mixed (size-matched). An equal amount of
auxiliary-language text is concatenated to the
low-resource corpus; the joint data are shuffled
and used for fine-tuning.
2. Mixed (loss-weighted). The same joint corpus
is used, but the loss is re-weighted: e.g.,
0.8 for target-language tokens and 0.2 for
auxiliary-language tokens.
3. Sequential. Fine-tune first on the auxiliary
language, then continue training on the
low-resource corpus.
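The loss-weighted strategy can be illustrated with a small sketch. The weights 0.8 and 0.2 come from the description above; the normalisation by the sum of the weights, the toy per-token losses, and the function name `weighted_token_loss` are assumptions made for the example, not details confirmed by the paper.

```python
from typing import Sequence

def weighted_token_loss(
    token_losses: Sequence[float],
    is_target: Sequence[bool],
    target_weight: float = 0.8,
    auxiliary_weight: float = 0.2,
) -> float:
    """Weighted mean of per-token losses, with a larger weight on target-language tokens."""
    weights = [target_weight if t else auxiliary_weight for t in is_target]
    # Normalising by the weight sum is an assumption; it keeps the scale comparable
    # to an unweighted mean loss.
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)

# Toy batch: two target-language tokens followed by two auxiliary-language tokens.
losses = [2.0, 1.0, 3.0, 1.0]
target_mask = [True, True, False, False]
batch_loss = weighted_token_loss(losses, target_mask)
```

With these toy numbers the weighted loss is about 1.6, lower than the unweighted mean of 1.75, because the high-loss auxiliary token counts for less.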
Figure 3 shows that augmenting Ukrainian with
the metrically close Bulgarian does not improve
evaluation loss, and Polish yields only a minor
reduction.
A similar pattern emerges for sequential
fine-tuning on Turkish followed by Crimean Tatar:
perplexity drops from 5.48 (Crimean Tatar only) to
5.36, an insignificant change.
Footnote 3: https://huggingface.co/datasets/HuggingFaceFW/fineweb
Across all settings, none of the three transfer
regimes produced a consistent, significant gain over
single-language fine-tuning. Future work should
revisit these transfer strategies with substantially
larger models and much larger datasets, where the
benefits of distance-based language pairing may
emerge more clearly.
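The size of the Crimean Tatar perplexity change reported above can be made concrete. The sketch below assumes the usual definition of perplexity as the exponential of the mean per-token cross-entropy in nats; the paper does not state which definition it uses.

```python
import math

ppl_single = 5.48       # Crimean Tatar only
ppl_sequential = 5.36   # Turkish first, then Crimean Tatar

# Relative reduction in perplexity.
relative_drop = (ppl_single - ppl_sequential) / ppl_single
# Equivalent gap in mean cross-entropy, assuming perplexity = exp(loss).
loss_gap = math.log(ppl_single) - math.log(ppl_sequential)
```

Under that assumption, the change corresponds to a relative perplexity reduction of roughly 2.2%, or about 0.022 nats per token.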
D Additional Figures
Figure 4 displays the MST coloured by k-means
clusters. We set k = 18 – one cluster for each
category plotted in Figure 1 (15 natural families
plus 3 constructed languages) – so that the cluster
colours can be compared directly with the family
colours. Most clusters coincide with their expected
families, but not all. Notably, Turkish is grouped
with Hungarian and Finnish rather than with the
other Turkic languages.
Figure 5 uses HDBSCAN with a minimum cluster
size of two. This gives 24 clusters. Crimean
Tatar is treated as an outlier, while Ukrainian now
connects directly to Polish.
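The minimum spanning tree that Figures 4 and 5 colour can be built from any pairwise distance matrix. The sketch below uses Prim's algorithm in plain Python as a simple stand-in; it is not necessarily the algorithm used by the authors, and the four-point distance matrix is a made-up toy input rather than data from the paper.

```python
from typing import List, Tuple

def minimum_spanning_tree(dist: List[List[float]]) -> List[Tuple[int, int, float]]:
    """Prim's algorithm on a dense symmetric distance matrix; returns (i, j, weight) edges."""
    n = len(dist)
    in_tree = {0}   # grow the tree from node 0
    edges = []
    while len(in_tree) < n:
        # Cheapest edge from a node inside the tree to a node outside it.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda edge: dist[edge[0]][edge[1]],
        )
        in_tree.add(j)
        edges.append((i, j, dist[i][j]))
    return edges

# Toy distances between four unnamed points (not real language distances).
toy_dist = [
    [0, 2, 3, 9],
    [2, 0, 4, 8],
    [3, 4, 0, 7],
    [9, 8, 7, 0],
]
tree = minimum_spanning_tree(toy_dist)
total_weight = sum(weight for _, _, weight in tree)
```

Each node ends up linked to the tree through its cheapest available connection, which is why a change in the clustering or in the underlying distances can alter which neighbour a language attaches to.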
Figures 6 and 7 give two other views of the same
data using t-SNE and UMAP. Like the MST, they
highlight clear family groups.
Figure 8 shows the confusion matrix between
k-means clusters and high-level language families.
The clusters are first matched to families with the
Hungarian algorithm for clearer alignment. Figure 9
presents the same matrix, but for the finer
primary branches of each family.
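The matching step can be illustrated on a tiny confusion matrix. The sketch below searches all column permutations by brute force, which finds the same optimum as the Hungarian algorithm but is only feasible for very small matrices; for 18 or 35 clusters a polynomial-time solver such as SciPy's `linear_sum_assignment` would be needed. The toy matrix and helper names are hypothetical.

```python
from itertools import permutations
from typing import List

def align_clusters(confusion: List[List[int]]) -> List[int]:
    """Column order that maximises the diagonal sum (brute-force assignment)."""
    n = len(confusion)
    best = max(
        permutations(range(n)),
        key=lambda order: sum(confusion[i][order[i]] for i in range(n)),
    )
    return list(best)

def reorder_columns(confusion: List[List[int]], order: List[int]) -> List[List[int]]:
    """Rearrange cluster columns so that matched clusters sit on the diagonal."""
    return [[row[j] for j in order] for row in confusion]

# Toy matrix: rows are true classes, columns are arbitrary cluster ids (not real data).
toy_confusion = [
    [0, 5, 1],
    [4, 0, 0],
    [1, 0, 3],
]
order = align_clusters(toy_confusion)
aligned = reorder_columns(toy_confusion, order)
```

After alignment the largest counts lie on the diagonal, so off-diagonal entries can be read directly as languages assigned to a cluster matched with a different class.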
Figure 4: MST of all languages. Colours show k-means clusters with k = 18 (one cluster for each language family).
Figure 5: MST of all languages. Colours show HDBSCAN clusters (minimum cluster size = 2). Points marked as
outliers by the algorithm are left out.
Figure 6: t-SNE plot of all languages. Colours show language families.
Figure 7: UMAP plot of all languages. Colours show language families.
Figure 8: Adjusted confusion matrix between clusters obtained by k-means and macro families of languages.
Number of clusters equal to 18. (Heatmap not reproduced. Rows, true classes: Afroasiatic, Austroasiatic,
Austronesian, Constructed (Esperanto), Constructed (Ido), Constructed (Volapük), Creole, Dravidian,
Indo-European, Japonic, Kartvelian, Koreanic, Language Isolate, Niger-Congo, Northeast Caucasian,
Sino-Tibetan, Turkic, Uralic. Columns: predicted clusters, aligned.)
Figure 9: Adjusted confusion matrix between clusters obtained by k-means and primary branches of language
families. Number of clusters equal to 35. (Heatmap not reproduced. Rows, true classes: Albanian, Armenian,
Atlantic-Congo, Baltic, Celtic, Constructed (Esperanto), Constructed (Ido), Constructed (Volapük), Finnic,
French-based Creole, Germanic, Hellenic, Indo-Aryan, Iranian, Italic, Japonic, Karluk, Kartvelian, Kipchak,
Koreanic, Language Isolate, Malayo-Polynesian, Nakh, Oghur, Oghuz, Philippine, Romance, Semitic, Sinitic,
Slavic, South Dravidian, South-Central Dravidian, Tibeto-Burman, Ugric, Vietic. Columns: predicted clusters,
aligned.)