The Geometry of Multilingual Language Models: An Equality Lens

Analyzes the geometric representation of multilingual language models from an equality perspective. Provides insights into how different languages, including low-resource Southeast Asian languages, are represented differently in the hidden spaces of LLMs. Reveals representation disparities that support the thesis of representation gaps between high-resource and SEA languages.

Summary

This study examines the geometric properties of multilingual language models (mBERT, MiniLM, XLMR) to understand how different languages are represented in shared embedding spaces. Using the XNLI-15 dataset, the authors analyze language representations through PCA visualization, a Cross-Lingual Similarity Index (Γ), and a Geometric Separability Index (Φ). Results show that while languages from the same linguistic family tend to cluster together, they remain largely separable from languages in other families. The Γ values indicate that high-resource languages and those from the same family have higher cross-lingual similarity, whereas low-resource languages are less well-represented. The Φ values reveal that languages often form near-isolated clusters, with high separability within families but limited intersection, which may hinder cross-lingual transfer. The findings highlight a data dependency in multilingual models and suggest that better representation of low-resource languages is needed to improve cross-lingual performance.

Published as a Tiny Paper at ICLR 2023
THE GEOMETRY OF MULTILINGUAL LANGUAGE
MODELS: AN EQUALITY LENS
Cheril Shah, Yashashree Chandak
PICT
{shahcheril311,chandakyashashree304}@gmail.com
Manan Suri
NSUT
manansuri27@gmail.com
ABSTRACT
Understanding the representations of different languages in multilingual language models is essential for comprehending their cross-lingual properties, predicting their performance on downstream tasks, and identifying any biases across languages. In our study, we analyze the geometry of three multilingual language models in Euclidean space and find that all languages are represented by unique geometries. Using a geometric separability index, we find that although languages tend to be closer according to their linguistic family, they are almost separable from languages of other families. We also introduce a Cross-Lingual Similarity Index to measure the distance between languages in the semantic space. Our findings indicate that low-resource languages are not represented as well as high-resource languages in any of the models.
1 METHODOLOGY
We use the XNLI-15way dataset (Conneau et al., 2018) and sample 300 parallel sentences across the 15 languages for our analysis. We use common multilingual transformers: mBERT (Devlin et al., 2019), MiniLM (Wang et al., 2020), and XLMR (Conneau et al., 2020a). More details about the models and dataset are available in Appendix A. We study the geometric properties of multilingual models using three methods:
1) We visualise the embedding space of a group of languages by taking the top 3 PCA components.
2) Cross-lingual Similarity Index Γ: There have been many approaches to computing cross-lingual similarity, such as Liu et al. (2020); however, due to the extremely high anisotropy (the average cosine similarity of any two randomly sampled words in the dataset; Ethayarajh, 2019) of models like XLMR, it is difficult to draw conclusions from cosine similarity alone. Thus we introduce a metric to quantify cross-lingual similarity. For a pair of languages l1, l2, we take the embeddings s(l1,i), s(l2,i) of the n = 300 parallel sentences, i = 1, …, n, using model M and calculate the Cross-lingual Similarity Index as follows:
$$\Gamma_{l_1,l_2} = \frac{\frac{1}{n}\sum_{i=1}^{n} \operatorname{cosine}\left(s_{(l_1,i)}, s_{(l_2,i)}\right)}{\mathrm{Anisotropy}(M)} \qquad (1)$$
For a model M, Γ ranges from −1/Anisotropy(M) to 1/Anisotropy(M). Low model anisotropy and high average cosine similarity are ideal, so a higher Γ is better. The sign of Γ reflects the average directional orientation between the paired embeddings. |Γ| ≤ 1 would mean that the average similarity of translation pairs is no higher than the similarity of randomly sampled pairs.
3) Language Separability Φ: We study the separation of different languages in embedding space
by treating all the points belonging to a language as a single cluster and calculating the pairwise
Geometric Separability Index Thornton (2008) between two languages.
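The Cross-lingual Similarity Index of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy embeddings, and the anisotropy value of 0.4 are all assumptions made for the example.

```python
import numpy as np

def cosine_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays of aligned embeddings."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def gamma_index(emb_l1, emb_l2, anisotropy):
    """Eq. (1): mean cosine similarity of aligned pairs divided by the model anisotropy."""
    return float(np.mean(cosine_rows(emb_l1, emb_l2)) / anisotropy)

# Toy data (hypothetical): 300 "sentences" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb_en = rng.normal(size=(300, 8))
emb_fr = emb_en + 0.1 * rng.normal(size=(300, 8))  # noisy copy standing in for translations
assumed_anisotropy = 0.4  # placeholder value, not taken from the paper

gamma_self = gamma_index(emb_en, emb_en, assumed_anisotropy)  # upper bound 1/Anisotropy
gamma_pair = gamma_index(emb_en, emb_fr, assumed_anisotropy)
print(gamma_self, gamma_pair)
```

Comparing a language with itself gives the upper bound 1/Anisotropy(M), which is why the diagonal of every Γ table in the appendix holds a single constant per model.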
arXiv:2305.07839v1 [cs.CL] 13 May 2023

2 RESULTS
Figure 1: PCA, Γ and Φ plots for mBERT. Panels: (a) PCA-3D of ar, de, en, hi, ur (axes: first, second and third principal components); (b) Cross-lingual Similarity (lang1 × lang2 heatmap); (c) Geometric Separability (lang1 × lang2 heatmap).
PCA Analysis: Figure 1a depicts the plot for one group of languages (Hindi, Urdu, German, English, Arabic); the remaining plots are available in Appendix A.5. mBERT’s word vectors demonstrate significant isolation, with the languages occupying different orientations in space. Conversely, MiniLM and XLMR’s word embeddings appear less spread out, owing to their high anisotropy, but the languages still occupy different affine subspaces relative to one another. It is interesting to note that in XLMR, low-resource languages such as Urdu and Swahili are significantly more dispersed than high-resource languages.
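The projection behind such plots can be sketched as follows. This is a hedged illustration only: it assumes the embeddings of all languages in a group are stacked and projected jointly, and the function name `pca_top3` and the toy clusters are hypothetical.

```python
import numpy as np

def pca_top3(embeddings):
    """Project (n, d) embeddings onto their top 3 principal components."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of vt are principal directions, sorted by variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:3].T
    explained = (s[:3] ** 2) / np.sum(s ** 2)  # share of variance per component
    return coords, explained

# Toy data (hypothetical): two "languages" occupying shifted regions of a 16-d space.
rng = np.random.default_rng(0)
lang_a = rng.normal(loc=0.0, size=(50, 16))
lang_b = rng.normal(loc=3.0, size=(50, 16))
coords, explained = pca_top3(np.vstack([lang_a, lang_b]))
print(coords.shape, explained)
```

In this toy setup the first component separates the two clusters, which mirrors how isolated languages show up as distinct point clouds in a 3D PCA plot.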
Cross-lingual Similarity: Since the sentences are perfect translations of each other and the model embeds different languages in a shared region of the embedding space, the Γ for all language pairs should be close to 1/Anisotropy. However, this is not the case, as can be seen in Figure 1b for mBERT and in Appendix A.5 for the other models. We observe that the language model starts by representing a language in relative isolation, and the more text it sees, the more it contextualizes that language in terms of the other languages, leading to higher Γ for high-resource languages and languages from the same families. The pre-training procedure also makes a difference to the language representations; for example, English has high Γ on XLMR, possibly because of the cross-lingual pretraining procedure.
Geometric Separability: It is thought that a language model sharing the same vector space to represent different languages must form an interlingua; however, this is not observed. The intra-family Φ is relatively low for all linguistic families. In the case of mBERT (Figure 1c) and XLMR (Appendix A.5), the Φ of language clusters is extremely high, indicating that the languages form near-isolated vector spaces. In the case of MiniLM (Appendix A.5), the Φ for the Germanic, Romance, Slavic and Hellenic families is low, indicating that the word vectors of these languages are assimilated; however, this trend is not observed for other families.
3 DISCUSSION AND CONCLUSION
The Γ and Φ values of all the multilingual models clearly show that not all languages are equally represented in the shared embedding space. The low average Γ values for low-resource languages indicate a serious shortcoming of multilingual models with respect to their cross-lingual capabilities and point to the heavy dependence of these models on data. The high Φ values for languages belonging to the same language families show that, while it is trivial for language subspaces to be separable for individual masked modelling, it is also important for them to have some degree of intersection for better cross-lingual transfer. This observation on Φ also relates to the results of Philippy et al. (2023), who show that the distance between language representations in the embedding space of a multilingual model correlates with the model’s cross-lingual performance. Our results are in line with the work of Conneau et al. (2020b), which demonstrates the correlation between the geometric properties of representations and their results when finetuned on downstream tasks.
URM STATEMENT
“The authors acknowledge that all the authors of this work meet the URM criteria of ICLR 2023
Tiny Papers Track.”
REFERENCES
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6022–6034, Online, July 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.536. URL https://aclanthology.org/2020.acl-main.536.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 55–65, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1006. URL https://aclanthology.org/D19-1006.
Jiapeng Liu, Xiao Zhang, Dan Goldwasser, and Xiao Wang. Cross-lingual document retrieval with smooth learning. pp. 3616–3629, December 2020. doi: 10.18653/v1/2020.coling-main.323. URL https://aclanthology.org/2020.coling-main.323.
Fred Philippy, Siwen Guo, and Shohreh Haddadan. Identifying the correlation between language distance and cross-lingual transfer in a multilingual representation space, 2023.
Chris Thornton. Separability is a learner’s best friend. 08 2008. doi: 10.1007/978-1-4471-1546-5_4.
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
A APPENDIX
A.1 DATASET DETAILS
The XNLI-15 dataset has 112,500 parallel sentences in 15 languages: English (en), French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw) and Urdu (ur). We sample the first 300 sentences from each language's dataset and perform our analysis on a vocabulary of 45,889 words.
A.2 MODEL DETAILS
The transformer-based pre-trained models used in our study are as follows:
• mBERT: mBERT refers to multilingual BERT, trained on Wikipedia in 104 languages. It is pretrained in a self-supervised fashion with two objectives: 1) Masked Language Modelling (MLM), where 15% of the words of a sentence are randomly masked and the model predicts the masked words; 2) Next Sentence Prediction (NSP), where, given a pair of sentences, the model has to predict whether one sentence follows the other. We use the 'bert-base-multilingual-cased' implementation of mBERT [1].
• XLM-RoBERTa: XLM-RoBERTa is a multilingual version of RoBERTa, pre-trained on 2.5TB of filtered CommonCrawl data in 100 languages. RoBERTa builds on BERT’s architecture but uses more data and modifies key hyperparameters, such as larger mini-batches and learning rates. It does not pretrain on the NSP task. We use the 'xlm-roberta-base' implementation of the XLMR model from HuggingFace [2].
• MiniLM: MiniLM generalises deep self-attention distillation by using self-attention relation distillation for task-agnostic compression of pre-trained Transformers. This eliminates the restriction on the number of attention heads in the student model. The model we use is distilled from XLM-RoBERTa base. MiniLM is significantly faster than its teacher model while giving competitive performance. We use the 'Multilingual-MiniLM-L12-H384' implementation of the MiniLM model from HuggingFace [3].
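The paper does not spell out how sentence embeddings are obtained from these checkpoints. One common choice, shown here purely as an assumption, is to mean-pool the last hidden states over non-padding tokens. The helper names `mean_pool` and `embed_sentences` are hypothetical; `embed_sentences` is only defined, not called, since it needs the `transformers` and `torch` packages and a model download.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    h = np.asarray(hidden_states, dtype=float)               # (batch, tokens, dim)
    m = np.asarray(attention_mask, dtype=float)[..., None]   # (batch, tokens, 1)
    return (h * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)

def embed_sentences(sentences, model_name="bert-base-multilingual-cased"):
    """Hypothetical helper: encode sentences with a HuggingFace checkpoint.

    Assumes mean pooling of the last layer; the paper may use a different scheme.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden.numpy(), batch["attention_mask"].numpy())

# Toy check of the pooling step: the third token is padding and is ignored.
toy_hidden = np.array([[[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]]])
toy_mask = np.array([[1, 1, 0]])
pooled = mean_pool(toy_hidden, toy_mask)
print(pooled)
```

Some checkpoints may require a specific tokenizer class rather than the automatic one, so the model card of each checkpoint is worth checking before reusing this sketch.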
A.3 LINGUISTIC FAMILIES
Table 1: Languages and their Linguistic Families
Language     Family        Code
English      Germanic      en
German       Germanic      de
Hindi        Hindustani    hi
Urdu         Hindustani    ur
Arabic       Arabic        ar
Spanish      Romance       es
French       Romance       fr
Russian      Slavic        ru
Bulgarian    Slavic        bg
Swahili      Niger-Congo   sw
Thai         Tai           th
Vietnamese   Vietic        vi
Chinese      Chinese       zh
Greek        Hellenic      el
Turkish      Turkic        tr
A.4 METRICS
Anisotropy: The anisotropy of a model is defined as the cosine similarity of any two word embeddings selected at random. For a set of languages L, with n sentences in each language, and a model M_l, the anisotropy is estimated as the average cosine similarity over all possible sentence pairs drawn from the |L| ∗ n sentences. The absolute value is taken.
$$\mathrm{Anisotropy}(M_l) = \left| \frac{\sum_{l_u, l_v \in L} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \operatorname{cosine}\left(s_{l_u,i}, s_{l_v,j}\right)}{\binom{|L| \ast n}{2}} \right| \qquad (2)$$
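A rough sketch of the anisotropy estimate follows. It implements the verbal definition above (the mean cosine similarity over all unordered pairs of the pooled sentences, in absolute value); the function name and the toy data are assumptions for illustration.

```python
import numpy as np

def anisotropy(embeddings):
    """Absolute mean cosine similarity over all unordered pairs of rows."""
    x = np.asarray(embeddings, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T                      # all pairwise cosine similarities
    upper = np.triu_indices(len(x), k=1)      # each unordered pair once, no self-pairs
    return float(abs(sims[upper].mean()))

# Toy data (hypothetical): isotropic vectors versus vectors sharing a large common offset.
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(200, 16))
shifted = rng.normal(loc=5.0, size=(200, 16))

aniso_iso = anisotropy(isotropic)
aniso_shifted = anisotropy(shifted)
print(aniso_iso, aniso_shifted)
```

Vectors that share a common offset point in nearly the same direction, so their anisotropy approaches 1, which is the regime in which raw cosine similarity stops being informative.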
[1] https://huggingface.co/bert-base-multilingual-cased
[2] https://huggingface.co/xlm-roberta-base
[3] https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384
Geometric Separability Index Φ: calculates the proportion of instances that share the same class label as their nearest neighbour.
$$\mathrm{GSI}(f) = \frac{\sum_{i=1}^{n} \left(f(x_i) + f(x'_i) + 1\right) \bmod 2}{n} \qquad (3)$$
Here, f is a target function that maps instances to their binary cluster label, x'_i is the nearest neighbour of x_i, and n is the total number of data points.
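Eq. (3) for a pair of languages can be sketched as below. The sketch assumes Euclidean nearest neighbours and labels 0 and 1 for the two languages; the function name `gsi` and the toy clusters are hypothetical.

```python
import numpy as np

def gsi(points, labels):
    """Eq. (3): share of points whose nearest neighbour carries the same 0/1 label."""
    x = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=int)
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                      # a point is not its own neighbour
    nearest = np.argmin(d2, axis=1)
    return float(np.mean((y + y[nearest] + 1) % 2))   # term is 1 when labels agree

# Toy data (hypothetical): one well-separated pair and one fully mixed pair of clusters.
rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)
separated = np.vstack([rng.normal(0.0, 1.0, (100, 4)), rng.normal(20.0, 1.0, (100, 4))])
mixed = rng.normal(0.0, 1.0, (200, 4))

phi_separated = gsi(separated, labels)
phi_mixed = gsi(mixed, labels)
print(phi_separated, phi_mixed)
```

A value near 1 means the two languages occupy isolated regions, while a value near 0.5 means their points are interleaved.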
A.5 ADDITIONAL INFORMATION
Γ for XLMR and MiniLM
Figure 2: Cross-lingual Similarity Γ. Panels: (a) MiniLM; (b) XLMR (lang1 × lang2 heatmaps).
Φ for XLMR and MiniLM
Figure 3: Language Separability Φ. Panels: (a) MiniLM; (b) XLMR (lang1 × lang2 heatmaps).
Other PCA plots
Figure 4: PCA plots of group (Hindi, Urdu, English, German, Arabic). Panels: (a) MiniLM; (b) XLMR. Axes: first, second and third principal components.
Figure 5: PCA plots of group (Greek, Turkish, Thai, Vietnamese, Chinese). Panels: (a) mBERT; (b) MiniLM; (c) XLMR. Axes: first, second and third principal components.
Figure 6: PCA plots of group (Bulgarian, Russian, Spanish, French, Swahili). Panels: (a) mBERT; (b) MiniLM; (c) XLMR. Axes: first, second and third principal components.
Cross-lingual Similarity Index (Γ) values: These are the exact Γ values
Table 2: Γ values of mBert
lang ar bg de el en es fr hi ru sw th tr ur vi zh
ar 2.455 1.377 1.295 1.427 1.272 1.375 1.278 1.374 1.389 1.119 1.325 1.277 1.384 1.251 0.552
bg 1.377 2.455 1.483 1.519 1.409 1.485 1.43 1.361 1.652 1.121 1.262 1.358 1.232 1.389 0.523
de 1.295 1.483 2.455 1.439 1.595 1.541 1.498 1.297 1.527 1.106 1.144 1.345 1.211 1.398 0.584
el 1.427 1.519 1.439 2.455 1.351 1.498 1.425 1.379 1.475 1.196 1.323 1.364 1.273 1.456 0.539
en 1.272 1.409 1.595 1.351 2.455 1.59 1.574 1.236 1.495 1.048 1.096 1.283 1.173 1.473 0.606
es 1.375 1.485 1.541 1.498 1.59 2.455 1.592 1.272 1.564 1.147 1.197 1.376 1.188 1.451 0.612
fr 1.278 1.43 1.498 1.425 1.574 1.592 2.455 1.266 1.487 1.019 1.197 1.301 1.154 1.423 0.563
hi 1.374 1.361 1.297 1.379 1.236 1.272 1.266 2.455 1.354 1.122 1.263 1.311 1.448 1.302 0.418
ru 1.389 1.652 1.527 1.475 1.495 1.564 1.487 1.354 2.455 1.045 1.226 1.3 1.212 1.447 0.552
sw 1.119 1.121 1.106 1.196 1.048 1.147 1.019 1.122 1.045 2.455 1.167 1.206 1.235 1.143 0.367
th 1.325 1.262 1.144 1.323 1.096 1.197 1.197 1.263 1.226 1.167 2.455 1.214 1.214 1.219 0.364
tr 1.277 1.358 1.345 1.364 1.283 1.376 1.301 1.311 1.3 1.206 1.214 2.455 1.254 1.343 0.563
ur 1.384 1.232 1.211 1.273 1.173 1.188 1.154 1.448 1.212 1.235 1.214 1.254 2.455 1.159 0.392
vi 1.251 1.389 1.398 1.456 1.473 1.451 1.423 1.302 1.447 1.143 1.219 1.343 1.159 2.455 0.593
zh 0.552 0.523 0.584 0.539 0.606 0.612 0.563 0.418 0.552 0.367 0.364 0.563 0.392 0.593 2.455
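As a small worked example of reading Table 2, the snippet below averages the off-diagonal Γ values of a four-language sub-matrix (en, es, sw, zh) copied from the table. The choice of sub-matrix and the helper name `mean_gamma` are illustrative assumptions; averaging over all 14 partners of each language would follow the same pattern.

```python
# Off-diagonal Γ values copied from Table 2 (mBERT); the diagonal is excluded.
pair_gamma = {
    ("en", "es"): 1.59,
    ("en", "sw"): 1.048,
    ("en", "zh"): 0.606,
    ("es", "sw"): 1.147,
    ("es", "zh"): 0.612,
    ("sw", "zh"): 0.367,
}

def mean_gamma(lang):
    """Average Γ between `lang` and every other language in the sub-matrix."""
    values = [g for pair, g in pair_gamma.items() if lang in pair]
    return sum(values) / len(values)

means = {lang: mean_gamma(lang) for lang in ("en", "es", "sw", "zh")}
for lang, value in means.items():
    print(lang, round(value, 3))
```

Within this sub-matrix, the Swahili average falls below the English and Spanish averages.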
Table 3: Γ values of MiniLM
lang ar bg de el en es fr hi ru sw th tr ur vi zh
ar 1.12 1.041 1.038 1.042 1.04 1.037 1.033 1.036 1.036 1.014 1.023 1.022 1.027 1.036 1.028
bg 1.041 1.12 1.061 1.057 1.058 1.061 1.055 1.035 1.068 1.0 1.002 1.03 1.021 1.037 1.016
de 1.038 1.061 1.12 1.05 1.075 1.064 1.067 1.033 1.055 1.009 1.008 1.035 1.025 1.04 1.025
el 1.042 1.057 1.05 1.12 1.045 1.052 1.052 1.034 1.051 0.995 1.008 1.027 1.018 1.037 1.015
en 1.04 1.058 1.075 1.045 1.12 1.07 1.069 1.038 1.054 1.009 1.016 1.033 1.023 1.057 1.036
es 1.037 1.061 1.064 1.052 1.07 1.12 1.072 1.028 1.051 1.012 1.0 1.03 1.018 1.04 1.019
fr 1.033 1.055 1.067 1.052 1.069 1.072 1.12 1.028 1.048 1.002 0.995 1.022 1.016 1.038 1.009
hi 1.036 1.035 1.033 1.034 1.038 1.028 1.028 1.12 1.035 0.998 1.017 1.026 1.053 1.041 1.029
ru 1.036 1.068 1.055 1.051 1.054 1.051 1.048 1.035 1.12 0.991 1.004 1.029 1.014 1.037 1.023
sw 1.014 1.0 1.009 0.995 1.009 1.012 1.002 0.998 0.991 1.12 0.988 1.004 1.009 1.002 0.999
th 1.023 1.002 1.008 1.008 1.016 1.0 0.995 1.017 1.004 0.988 1.12 1.006 1.001 1.035 1.045
tr 1.022 1.03 1.035 1.027 1.033 1.03 1.022 1.026 1.029 1.004 1.006 1.12 1.02 1.028 1.024
ur 1.027 1.021 1.025 1.018 1.023 1.018 1.016 1.053 1.014 1.009 1.001 1.02 1.12 1.026 1.015
vi 1.036 1.037 1.04 1.037 1.057 1.04 1.038 1.041 1.037 1.002 1.035 1.028 1.026 1.12 1.047
zh 1.028 1.016 1.025 1.015 1.036 1.019 1.009 1.029 1.023 0.999 1.045 1.024 1.015 1.047 1.12
Table 4: Γ values of XLMR
lang ar bg de el en es fr hi ru sw th tr ur vi zh
ar 1.019 1.011 1.01 1.012 1.011 1.01 1.01 1.011 1.011 1.009 1.011 1.009 1.01 1.011 1.011
bg 1.011 1.019 1.01 1.011 1.011 1.01 1.01 1.011 1.011 1.008 1.009 1.009 1.01 1.011 1.01
de 1.01 1.01 1.019 1.014 1.015 1.014 1.014 1.013 1.014 1.011 1.012 1.012 1.012 1.013 1.013
el 1.012 1.011 1.014 1.019 1.015 1.014 1.014 1.014 1.015 1.011 1.013 1.013 1.012 1.014 1.013
en 1.011 1.011 1.015 1.015 1.019 1.015 1.015 1.014 1.015 1.012 1.013 1.013 1.012 1.015 1.013
es 1.01 1.01 1.014 1.014 1.015 1.019 1.014 1.013 1.014 1.011 1.012 1.012 1.012 1.014 1.013
fr 1.01 1.01 1.014 1.014 1.015 1.014 1.019 1.013 1.014 1.01 1.012 1.012 1.011 1.014 1.012
hi 1.011 1.011 1.013 1.014 1.014 1.013 1.013 1.019 1.014 1.011 1.013 1.012 1.014 1.014 1.014
ru 1.011 1.011 1.014 1.015 1.015 1.014 1.014 1.014 1.019 1.011 1.013 1.012 1.012 1.014 1.013
sw 1.009 1.008 1.011 1.011 1.012 1.011 1.01 1.011 1.011 1.019 1.011 1.01 1.01 1.011 1.011
th 1.011 1.009 1.012 1.013 1.013 1.012 1.012 1.013 1.013 1.011 1.019 1.012 1.012 1.013 1.014
tr 1.009 1.009 1.012 1.013 1.013 1.012 1.012 1.012 1.012 1.01 1.012 1.019 1.012 1.013 1.012
ur 1.01 1.01 1.012 1.012 1.012 1.012 1.011 1.014 1.012 1.01 1.012 1.012 1.019 1.013 1.012
vi 1.011 1.011 1.013 1.014 1.015 1.014 1.014 1.014 1.014 1.011 1.013 1.013 1.013 1.019 1.014
zh 1.011 1.01 1.013 1.013 1.013 1.013 1.012 1.014 1.013 1.011 1.014 1.012 1.012 1.014 1.019
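Because the cosine similarity of an embedding with itself is 1, Eq. (1) gives Γ = 1/Anisotropy(M) on the diagonal. The diagonal entries of Tables 2 to 4 can therefore be inverted to recover the anisotropy each model implies. The snippet below is a back-of-the-envelope check using only the rounded table values, so the results are approximate.

```python
# Diagonal Γ entries copied from Tables 2, 3 and 4.
diagonal_gamma = {"mBERT": 2.455, "MiniLM": 1.12, "XLMR": 1.019}

# Γ(l, l) = 1 / Anisotropy(M), so the implied anisotropy is the reciprocal.
implied_anisotropy = {model: 1.0 / g for model, g in diagonal_gamma.items()}

for model, value in implied_anisotropy.items():
    print(model, round(value, 3))
```

The implied values are roughly 0.41 for mBERT, 0.89 for MiniLM and 0.98 for XLMR, consistent with the remark in the results that MiniLM and XLMR are highly anisotropic. It also explains why the Γ ranges of those two models are so narrow.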
Language Separability (Φ) Values: These are the exact Φ values
Table 5: Φ values of mBert
lang ar bg de el en es fr hi ru sw th tr ur vi zh
ar 1.0 0.986 0.987 0.985 0.989 0.985 0.988 0.991 0.986 0.991 0.988 0.99 0.991 0.991 0.989
bg 0.986 1.0 0.96 0.963 0.967 0.963 0.966 0.989 0.876 0.983 0.983 0.975 0.994 0.977 0.983
de 0.987 0.96 1.0 0.965 0.904 0.929 0.934 0.989 0.963 0.967 0.975 0.954 0.992 0.957 0.976
el 0.985 0.963 0.965 1.0 0.967 0.952 0.962 0.991 0.979 0.98 0.978 0.969 0.994 0.963 0.984
en 0.989 0.967 0.904 0.967 1.0 0.9 0.905 0.989 0.969 0.968 0.974 0.949 0.992 0.945 0.978
es 0.985 0.963 0.929 0.952 0.9 1.0 0.869 0.99 0.958 0.969 0.978 0.958 0.993 0.952 0.975
fr 0.988 0.966 0.934 0.962 0.905 0.869 1.0 0.99 0.967 0.973 0.978 0.964 0.993 0.953 0.976
hi 0.991 0.989 0.989 0.991 0.989 0.99 0.99 1.0 0.993 0.99 0.991 0.986 0.979 0.989 0.99
ru 0.986 0.876 0.963 0.979 0.969 0.958 0.967 0.993 1.0 0.995 0.993 0.985 0.996 0.982 0.985
sw 0.991 0.983 0.967 0.98 0.968 0.969 0.973 0.99 0.995 1.0 0.974 0.964 0.992 0.971 0.985
th 0.988 0.983 0.975 0.978 0.974 0.978 0.978 0.991 0.993 0.974 1.0 0.971 0.994 0.981 0.96
tr 0.99 0.975 0.954 0.969 0.949 0.958 0.964 0.986 0.985 0.964 0.971 1.0 0.99 0.962 0.966
ur 0.991 0.994 0.992 0.994 0.992 0.993 0.993 0.979 0.996 0.992 0.994 0.99 1.0 0.995 0.995
vi 0.991 0.977 0.957 0.963 0.945 0.952 0.953 0.989 0.982 0.971 0.981 0.962 0.995 1.0 0.972
zh 0.989 0.983 0.976 0.984 0.978 0.975 0.976 0.99 0.985 0.985 0.96 0.966 0.995 0.972 1.0
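To make the intra-family versus cross-family comparison concrete, the snippet below averages a handful of Φ values copied from Table 5, with families taken from Table 1. The selection of pairs is an illustrative subset chosen for this example, not the full analysis.

```python
# Φ values copied from Table 5 (mBERT); family membership follows Table 1.
intra_family = {
    ("en", "de"): 0.904,  # Germanic
    ("es", "fr"): 0.869,  # Romance
    ("bg", "ru"): 0.876,  # Slavic
    ("hi", "ur"): 0.979,  # Hindustani
}
cross_family = {
    ("en", "ur"): 0.992,
    ("ar", "zh"): 0.989,
    ("ru", "sw"): 0.995,
    ("hi", "th"): 0.991,
}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

intra_mean = mean(intra_family.values())
cross_mean = mean(cross_family.values())
print(round(intra_mean, 3), round(cross_mean, 3))
```

For this subset the intra-family average is lower than the cross-family average, although both remain close to 1, in line with the description of near-isolated language clusters in mBERT.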
Table 6: Φ values of MiniLM
lang ar bg de el en es fr hi ru sw th tr ur vi zh
ar 1.0 0.821 0.859 0.831 0.814 0.841 0.85 0.893 0.833 0.894 0.871 0.895 0.887 0.89 0.911
bg 0.821 1.0 0.685 0.684 0.681 0.687 0.726 0.876 0.536 0.883 0.898 0.833 0.895 0.868 0.927
de 0.859 0.685 1.0 0.764 0.615 0.703 0.712 0.884 0.707 0.874 0.896 0.817 0.899 0.863 0.919
el 0.831 0.684 0.764 1.0 0.726 0.704 0.726 0.892 0.742 0.895 0.916 0.867 0.912 0.859 0.95
en 0.814 0.681 0.615 0.726 1.0 0.602 0.634 0.847 0.71 0.847 0.882 0.787 0.872 0.787 0.912
es 0.841 0.687 0.703 0.704 0.602 1.0 0.576 0.895 0.746 0.855 0.918 0.84 0.908 0.833 0.943
fr 0.85 0.726 0.712 0.726 0.634 0.576 1.0 0.899 0.764 0.871 0.921 0.858 0.913 0.853 0.945
hi 0.893 0.876 0.884 0.892 0.847 0.895 0.899 1.0 0.875 0.937 0.911 0.859 0.696 0.922 0.931
ru 0.833 0.536 0.707 0.742 0.71 0.746 0.764 0.875 1.0 0.905 0.89 0.84 0.908 0.874 0.902
sw 0.894 0.883 0.874 0.895 0.847 0.855 0.871 0.937 0.905 1.0 0.922 0.894 0.932 0.905 0.946
th 0.871 0.898 0.896 0.916 0.882 0.918 0.921 0.911 0.89 0.922 1.0 0.887 0.939 0.901 0.732
tr 0.895 0.833 0.817 0.867 0.787 0.84 0.858 0.859 0.84 0.894 0.887 1.0 0.882 0.891 0.896
ur 0.887 0.895 0.899 0.912 0.872 0.908 0.913 0.696 0.908 0.932 0.939 0.882 1.0 0.933 0.955
vi 0.89 0.868 0.863 0.859 0.787 0.833 0.853 0.922 0.874 0.905 0.901 0.891 0.933 1.0 0.94
zh 0.911 0.927 0.919 0.95 0.912 0.943 0.945 0.931 0.902 0.946 0.732 0.896 0.955 0.94 1.0
Table 7: Φ values of XLMR
lang ar bg de el en es fr hi ru sw th tr ur vi zh
ar 1.0 0.946 0.98 0.966 0.971 0.971 0.978 0.984 0.967 0.985 0.935 0.984 0.978 0.98 0.939
bg 0.946 1.0 0.92 0.916 0.9 0.916 0.95 0.971 0.805 0.971 0.943 0.961 0.978 0.958 0.948
de 0.98 0.92 1.0 0.932 0.845 0.911 0.916 0.973 0.919 0.948 0.949 0.936 0.981 0.939 0.958
el 0.966 0.916 0.932 1.0 0.892 0.897 0.922 0.971 0.925 0.965 0.95 0.954 0.981 0.939 0.967
en 0.971 0.9 0.845 0.892 1.0 0.8 0.841 0.96 0.89 0.931 0.927 0.916 0.972 0.899 0.952
es 0.971 0.916 0.911 0.897 0.8 1.0 0.828 0.976 0.914 0.94 0.944 0.935 0.981 0.926 0.952
fr 0.978 0.95 0.916 0.922 0.841 0.828 1.0 0.977 0.934 0.953 0.951 0.951 0.985 0.937 0.97
hi 0.984 0.971 0.973 0.971 0.96 0.976 0.977 1.0 0.966 0.983 0.957 0.972 0.931 0.975 0.965
ru 0.967 0.805 0.919 0.925 0.89 0.914 0.934 0.966 1.0 0.975 0.935 0.96 0.98 0.95 0.919
sw 0.985 0.971 0.948 0.965 0.931 0.94 0.953 0.983 0.975 1.0 0.955 0.953 0.983 0.951 0.97
th 0.935 0.943 0.949 0.95 0.927 0.944 0.951 0.957 0.935 0.955 1.0 0.946 0.969 0.939 0.916
tr 0.984 0.961 0.936 0.954 0.916 0.935 0.951 0.972 0.96 0.953 0.946 1.0 0.972 0.948 0.949
ur 0.978 0.978 0.981 0.981 0.972 0.981 0.985 0.931 0.98 0.983 0.969 0.972 1.0 0.983 0.965
vi 0.98 0.958 0.939 0.939 0.899 0.926 0.937 0.975 0.95 0.951 0.939 0.948 0.983 1.0 0.955
zh 0.939 0.948 0.958 0.967 0.952 0.952 0.97 0.965 0.919 0.97 0.916 0.949 0.965 0.955 1.0