On the limited utility of parallel data for learning shared multilingual representations

Summary

This study investigates the impact of parallel data on learning shared multilingual representations in language models. The authors train four 1.4 billion-parameter models on a 200 billion-token corpus with 0%, 1%, 2%, and 5% parallel data. They evaluate cross-lingual alignment using principal component analysis, cosine similarity, projection weighted canonical correlation analysis, language-specific neuron analysis, and language control vectors. Results show that all models, including the one with no parallel data, develop shared representations, particularly in middle layers. The amount of parallel data has little effect on overall alignment but may accelerate early-phase representation sharing and reduce language-specific neurons. The study concludes that parallel data provides minimal benefit for cross-lingual alignment in fully trained models, suggesting other factors like parameter capacity or incidental bilingualism may drive representation sharing. The findings challenge the common assumption that parallel data is essential for multilingual alignment.

On the limited utility of parallel data for learning shared multilingual
representations
Julius Leino
University of Helsinki
leino.julius2@gmail.com
Jörg Tiedemann
University of Helsinki
jorg.tiedemann@helsinki.fi
Abstract
Shared multilingual representations are essen-
tial for cross-lingual tasks and knowledge trans-
fer across languages. This study looks at the
impact of parallel data, i.e. translated sentences,
in pretraining as a signal to trigger representa-
tions that are aligned across languages. We
train reference models with different propor-
tions of parallel data and show that parallel data
seem to have only a minimal effect on the cross-
lingual alignment. Based on multiple evalua-
tion methods, we find that the effect is limited
to potentially accelerating the representation
sharing in the early phases of pretraining, and
to decreasing the amount of language-specific
neurons in the model. Cross-lingual alignment
seems to emerge on similar levels even without
the explicit signal from parallel data.1
1 Introduction
A desirable property of multilingual language mod-
els is the cross-lingual sharing of representations,
which enables cross-lingual transfer (Pires et al.,
2019; Artetxe et al., 2019; Conneau et al., 2020), a
phenomenon where representations learned for one
language can be leveraged to boost performance
in another. A common method thought to improve
this sharing is to add parallel data, that is, trans-
lated texts, typically aligned at the sentence level,
to the pretraining corpus of multilingual language
models (Conneau and Lample, 2019; Anil et al.,
2023; Luukkonen et al., 2024; Gemma Team et al.,
2025). However, while intuitively attractive, it is
still unclear to what extent parallel data actually
helps in representation sharing.
In this paper, we systematically assess the as-
sumption that including parallel data in the pre-
training corpus increases the level of representa-
tion sharing across languages. To this end, we
train from scratch four 1.4 billion-parameter transformer
language models on a 200 billion token
multilingual corpus with varying amounts of parallel
data: 0%, 1%, 2%, and 5%. We then assess the
level of representation sharing in the models using
principal component projections, cosine similarity,
projection weighted canonical correlation analysis
(PWCCA), language-specific neuron analysis, and
cross-lingual control vectors.
1 Code for the experiments is available at https://github.com/shiftleino/limited-utility-of-parallel-data.
The primary findings of this study are as follows:
• Our results across the different methods show
that a partially shared representation space
exists in all of the trained models, including
the model with no parallel data added to the
training corpus. The sharing of representa-
tions is strongest in the middle layers of all
the models (Figure 1), aligning well with the
observations of previous research on the mid-
dle layers containing more semantic represen-
tations (Wendler et al., 2024; Kojima et al.,
2024).
• We discover that the amount of parallel data in
the pretraining corpus has little to no effect on
the overall level of cross-lingual alignment in
the fully trained models, with most of the eval-
uation methods showing no significant rela-
tionship with the amount of parallel data used
in training. Instead, the effects of parallel data
are limited to potentially accelerating the rep-
resentation sharing during the early phase of
the pretraining and to decreasing the number
of language-specific neurons in the models.
2 Background
Prior research has suggested that multilingual lan-
guage models develop shared cross-lingual repre-
sentations. Numerous studies have, for example,
shown that multilingual language models are capable
of cross-lingual transfer, where a pretrained
model trained for a task in one language achieves
performance improvements in the same task in another
language (Pires et al., 2019; Artetxe et al.,
2019; Conneau et al., 2020). Similar results have
also been obtained from more direct analysis of the
representations. For example, a study found that
Llama 2 uses English as a pivot language when
translating sentences from one language to another
(Wendler et al., 2024), while another study found
that for many multilingual language models, neurons
activating with a single language were predominantly
located in the initial and final layers
(Kojima et al., 2024). Furthermore, research has
shown that the model output language can easily
be changed using simple control vectors (Leino
and Karlgren, 2025), suggesting that the language
direction is disentangled from the other concept directions.
arXiv:2603.29026v1 [cs.CL] 30 Mar 2026
[Figure 1 plot: grid of PCA scatter plots, one row per model (lm-0, lm-1, lm-2, lm-5) and one column per layer (1, 12, 13, 14, 15, 16, 17, 24); horizontal axis PC 1, vertical axis PC 2.]
Figure 1: English and Finnish translation sentence pair representations of each model projected to the first two
principal components. Finnish sentences are in blue and English sentences in yellow. Each model is exposed to
different amounts of parallel data from no parallel data in lm-0 (top row) to 5% parallel data in lm-5 (bottom row).
Beyond merely exploring how multilingual lan-
guage models internally represent various lan-
guages, prior research has also examined methods
to align these representations, often involving the
usage of parallel data, i.e., aligned translated text,
typically in the form of sentence pairs. In pretrain-
ing, parallel data has been involved, for example,
with some auxiliary training objectives, such as a
machine translation objective (Kale et al., 2021). A
common technique has also been to simply add a
small portion of concatenated parallel sentences to
the pretraining mix of language models without any
additional training objective (Conneau and Lam-
ple, 2019; Anil et al., 2023; Luukkonen et al., 2024;
Gemma Team et al., 2025). Intuitively, the attention
blocks in the transformer architecture could pro-
vide the necessary communication mechanism be-
tween the translation sentences for the cross-lingual
signal, thus improving cross-lingual representation
sharing in the models. However, it is not yet fully
understood whether this actually occurs in practice
and how dependent it is on the amount of parallel
data used during pretraining.
3 Model training
To study the effects of parallel data, we train four
language models of 1.4 billion parameters using a
bilingual corpus of 200 billion tokens with vary-
ing proportions of parallel data, using the training
setting described in the following sections.
3.1 Language models
For the models, we closely follow the open source
OLMo decoder-only transformer language model
architecture (Groeneveld et al., 2024). Our version
of the model has 24 transformer layers in total, a
residual stream of 2048 dimensions, and 16 attention
heads with dimension 128 (see Table 4 in
Appendix A for full details).
3.2 Training data
To train each model, we construct a 200 billion
token training corpus consisting of tokens in two
languages, English and Finnish. Due to English
and Finnish forming a typologically distant lan-
guage pair, this setting provides a robust founda-
tion for the generalization of the results to more
similar language pairs. For each of the four models,
we dedicate a different proportion of the data
to translated sentence pairs (name of the model in
parentheses): 0% (lm-0), 1% (lm-1), 2% (lm-2),
and 5% (lm-5).

Model   English      Finnish     Parallel
lm-0    160 × 10^9   40 × 10^9   0
lm-1    159 × 10^9   39 × 10^9   2 × 10^9
lm-2    158 × 10^9   38 × 10^9   4 × 10^9
lm-5    155 × 10^9   35 × 10^9   10 × 10^9
Table 1: The distributions of the training data in tokens
for the four language models.

We used a skewed language mix of 80% English tokens
and 20% Finnish tokens. We assumed that the
parallel data splits equally between the two
languages and adjusted the monolingual corpus
sizes to preserve this ratio. The distributions of
the training corpora for the four models are shown
in Table 1.
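The token counts in Table 1 follow from simple arithmetic, which the sketch below reproduces. It assumes a 200 billion token budget, an 80/20 English/Finnish ratio, and parallel data that counts half toward each language, as described above. The function name `corpus_mix` is hypothetical and is not taken from the authors' code.

```python
def corpus_mix(parallel_pct, total_tokens=200_000_000_000, english_pct=80):
    """Return (english, finnish, parallel) monolingual/parallel token counts.

    Assumption: parallel tokens split equally between the two languages,
    so half of them count toward each language's share of the budget.
    """
    parallel = total_tokens * parallel_pct // 100
    english = total_tokens * english_pct // 100 - parallel // 2
    finnish = total_tokens * (100 - english_pct) // 100 - parallel // 2
    return english, finnish, parallel


for name, pct in [("lm-0", 0), ("lm-1", 1), ("lm-2", 2), ("lm-5", 5)]:
    en, fi, par = corpus_mix(pct)
    print(f"{name}: English {en // 10**9}B, Finnish {fi // 10**9}B, parallel {par // 10**9}B")
```

Integer arithmetic is used so that the three parts always sum exactly to the total budget.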
We utilized the FineWeb (Penedo et al., 2024)
and FineWeb2 (Penedo et al., 2025) datasets as
the sources for our English and Finnish monolin-
gual corpora, respectively, subject to additional
custom quality filtering and upsampling steps de-
tailed in Appendix B. For parallel data, we utilized
three separate datasets from the OPUS collection
(Tiedemann, 2012): HPLT2 (29,067,875 Finnish-
English sentence pairs) (de Gibert et al., 2024),
CCMatrix (35,982,562 sentence pairs) (Fan et al.,
2020; Schwenk et al., 2021), and the 2024 version
of the OpenSubtitles dataset3 (110,093 sentence
pairs) (Tiedemann, 2016). Our final parallel dataset
contains a total of 65,160,530 parallel sentences,
which results in approximately 4.5B tokens. We
then up-sample this dataset to reach the required
amount. To control for potential domain differ-
ences, a fixed baseline of 5% of the full training
corpora is sourced from the parallel dataset for all
the models.
Instead of feeding the parallel data only in a sin-
gle format, we apply to each parallel pair a format
sampled from 60 different instruction or comple-
tion formats covering both English and Finnish
(example in Appendix C). The intuition is that the
variability in the format helps the model avoid over-
fitting to the specific format structure and instead
allows learning solely the relationships between
the languages.
To tokenize the collected text corpus, we train
a SentencePiece tokenizer (Kudo and Richardson,
2018) using Byte Pair Encoding (BPE) (Sennrich
et al., 2016) with a balanced corpus and a vocabulary
size of 50,280 tokens.
2 https://hplt-project.org/
3 http://www.opensubtitles.org/
3.3 Training setup
We trained the models largely following the code-
base used for training the OLMo models (Groen-
eveld et al., 2024), utilizing the LUMI supercom-
puter infrastructure4 and a context window of 2048
tokens. To avoid any unintended cross-lingual signals
during training, we did not concatenate sequences
across the three corpora into the same chunks,
but still shuffled the chunks in the batches. A de-
tailed description of the training configuration is
outlined in Appendix D.
4 Evaluation experiments
We use five methods in total to evaluate the level
of representation sharing in the models to obtain a
comprehensive view on the potential effects.
4.1 Principal component projections
Our first method involves visualizing the repre-
sentation space in two dimensions, similar to
many other studies on multilingual language mod-
els (Marks and Tegmark, 2024; Rimsky et al.,
2024). More specifically, we use the dev-split of
the Finnish-English subset of the FLORES+ ma-
chine translation dataset (NLLB Team et al., 2024),
which contains 997 translated sentences in total,
run these sentences through the model and collect
the mean pooled representations of each sentence
from the residual stream after each layer of the
network. After obtaining the representations for
each sentence, we stack the representations for both
languages and apply principal component analysis
(PCA) to the resulting dataset. We then project the
sentence representations to the first two principal
components and visualize them for all layers. If
the Finnish and English sequences align visually,
the primary sources of variation are not language-
specific, indicating that the sequences share some
common representations. This will, therefore, pro-
vide first evidence of the existence of shared repre-
sentations.
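The projection procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the helper names `mean_pool` and `pca_project` are hypothetical, the 16-dimensional states stand in for real residual-stream activations, and PCA is computed with a plain SVD of the centered, stacked matrix.

```python
import numpy as np


def mean_pool(token_states):
    """Average token representations (tokens x dim) into one sentence vector."""
    return np.asarray(token_states).mean(axis=0)


def pca_project(finnish, english, n_components=2):
    """Stack both languages, center, and project onto the top principal components."""
    stacked = np.vstack([finnish, english])
    centered = stacked - stacked.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    return projected[: len(finnish)], projected[len(finnish):]


# Synthetic stand-in for one layer: 997 sentence pairs with 16-dimensional states.
rng = np.random.default_rng(0)
shared = rng.normal(size=(997, 16))
fi_layer = shared + 0.1 * rng.normal(size=(997, 16))
en_layer = shared + 0.1 * rng.normal(size=(997, 16))
fi_2d, en_2d = pca_project(fi_layer, en_layer)
print(fi_2d.shape, en_2d.shape)
```

Plotting `fi_2d` and `en_2d` in two colors for each layer would give one panel of a figure such as Figure 1.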
4.2 Cosine similarity of parallel sentences
To directly compare the representations across languages,
we compute the average cosine similarities
between representations of 2000 translation
sentences sampled from the OPUS WMT-News
dataset (Tiedemann, 2012). As with the
projections, we use the mean pooled representations
extracted from the residual stream after each
layer of the model. Since cosine similarity measures
the angular closeness of vectors, we expect
higher scores between the parallel pairs when the
representation space of the model is more cross-lingually
aligned. We also compute the average cosine
similarities between random sentences across
the parallel data to act as a reference point. More
specifically, we compute for the representation of
each Finnish sentence in the dataset the cosine similarity
with 50 randomly sampled English sentences, take
the average over these 50 cosine similarities, and
finally take the average over all the Finnish sentences,
resulting in one score representing the
baseline cosine similarity.
4 https://lumi-supercomputer.eu
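The two scores described above, the translation-pair similarity and the random baseline, can be sketched as below. The function name `cosine_scores` is hypothetical and the input is synthetic. Two details are assumptions because the text does not specify them: the random English sentences are drawn without replacement, and the true translation is not excluded from the draw.

```python
import numpy as np


def cosine_scores(finnish, english, n_random=50, seed=0):
    """Return (mean translation-pair cosine, mean random-pair cosine).

    Row i of `finnish` and row i of `english` are assumed to be translations.
    """
    fi = finnish / np.linalg.norm(finnish, axis=1, keepdims=True)
    en = english / np.linalg.norm(english, axis=1, keepdims=True)
    translation = float(np.mean(np.sum(fi * en, axis=1)))

    rng = np.random.default_rng(seed)
    per_sentence = []
    for i in range(len(fi)):
        # Assumption: sample without replacement, true translation not excluded.
        idx = rng.choice(len(en), size=n_random, replace=False)
        per_sentence.append(np.mean(en[idx] @ fi[i]))
    baseline = float(np.mean(per_sentence))
    return translation, baseline


# Synthetic sentence vectors: each pair shares a common component plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 32))
fi_states = shared + 0.1 * rng.normal(size=(200, 32))
en_states = shared + 0.1 * rng.normal(size=(200, 32))
translation_score, baseline_score = cosine_scores(fi_states, en_states)
print(f"translation: {translation_score:.3f}, random baseline: {baseline_score:.3f}")
```

Comparing the two numbers, rather than reading the translation score alone, is what guards against mistaking a globally narrow representation space for alignment.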
4.3 Projection weighted canonical correlation
analysis
We also perform projection weighted canonical cor-
relation analysis (PWCCA) (Morcos et al., 2018)
between 20,000 translation pairs, similar to a study
analyzing multilingual representations of mBERT
(Singh et al., 2019). We again sample sentence
pairs from the OPUS WMT-News dataset (Tiede-
mann, 2012). The advantage of PWCCA is that
it is invariant to linear transformations of the rep-
resentations, thus making it suitable for analyzing
multilingual representations, where some language-
specific linear transformations might be applied on
top of the shared representation space. Intuitively,
in case the parallel data helps in cross-lingual repre-
sentation sharing, we would expect higher PWCCA
scores with larger amounts of parallel data used in
training. To calculate the PWCCA scores, we use
the implementation of the original paper (Morcos
et al., 2018) available on GitHub5.
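To make the invariance property concrete, the following is a simplified PWCCA sketch. It is not the referenced GitHub implementation: it assumes more samples than dimensions and full column rank, computes CCA through QR decompositions and an SVD, and weights each canonical correlation by how strongly the first view projects onto the corresponding canonical variable. The function name `pwcca` and the synthetic inputs are illustrative.

```python
import numpy as np


def pwcca(x, y):
    """Projection weighted CCA score between x (n x d1) and y (n x d2).

    Sketch only: assumes n > max(d1, d2) and full column rank.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Orthonormal bases for the column spaces of the two views.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the canonical correlations.
    u, rho, _ = np.linalg.svd(qx.T @ qy, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)
    # Canonical variables of x, one per column.
    h = qx @ u
    # Weight of each canonical variable: total absolute projection of x onto it.
    alpha = np.abs(h.T @ x).sum(axis=1)
    alpha = alpha / alpha.sum()
    return float(alpha @ rho)


rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 8))          # arbitrary linear transformation
score_linear = pwcca(x, x @ mixing)       # same information, different basis
score_random = pwcca(x, rng.normal(size=(500, 8)))
print(f"linear transform: {score_linear:.3f}, unrelated data: {score_random:.3f}")
```

A score near 1 for the linearly transformed copy illustrates why the measure suits representations that may differ across languages only by a linear map.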
4.4 Distribution of language-specific neurons
To assess the representation space from the point of
view of neuron activations, we identify language-specific
neurons and assess their number and distribution
across layers using a method proposed by
Kojima et al. (2024). More specifically, we
first collect the activation values for each
neuron in the model when feeding the model with
both English and Finnish text from the Finnish-English
development subset of the FLORES+
dataset (NLLB Team et al., 2024). We then take
the mean activation value of each neuron over the
tokens of each sequence, capturing one activation
value for each neuron per sequence in the dataset.
Using these activation values and language labels
per sentence, we regard the activation values of
each neuron as the predicted scores for the language
of an input sequence, and the original language
of the sequence as the ground truth. We then
measure the average precision (AP), which should
capture the language sensitivity of the neuron. Having
obtained the AP score for each neuron, we use
these scores as a measure of language specificity in
two analyses: 1) we plot the distribution of the top
k scoring neurons across layers to understand the
overall distribution of language-specific neurons
and 2) we bucket neurons by AP thresholds for
better model-to-model comparability. We expect
the number of less language-specific neurons to
be higher, i.e. more neurons to be cross-lingually
shared, when more parallel data is used in training.
5 https://github.com/google/svcca/blob/master/pwcca.py
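The scoring step can be illustrated with a small average precision routine. This is a sketch under simple assumptions: the per-sequence mean activations of one neuron are given as a list, labels are 1 for the target language and 0 otherwise, and ties are broken by input order. The helper names `average_precision` and `count_at_least` are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision of `scores` for binary `labels` (1 = target language)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def count_at_least(ap_scores, threshold):
    """Number of neurons whose AP meets or exceeds `threshold`."""
    return sum(score >= threshold for score in ap_scores)


# One neuron's mean activation on four sequences; the first two are Finnish.
is_finnish = [1, 1, 0, 0]
selective = average_precision([0.9, 0.8, 0.2, 0.1], is_finnish)
mixed = average_precision([0.9, 0.1, 0.8, 0.2], is_finnish)
print(selective, mixed)  # 1.0 0.75
print(count_at_least([selective, mixed], 0.9))  # 1
```

A neuron that ranks every Finnish sequence above every English one reaches an AP of 1.0 for Finnish, while a neuron whose activation is unrelated to language scores lower.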
4.5 Effectiveness of language control vectors
We also evaluate the effectiveness of language con-
trol vectors (Leino and Karlgren, 2025) that we em-
ploy to force output in Finnish for the completions
of 200 English context sentences sampled from the
ROCStories dataset (Mostafazadeh et al., 2016).
When employing the control vectors, we use a sim-
ilar approach to that of Leino and Karlgren (2025),
creating the control vectors from representations
of 997 Finnish-English parallel sentences sampled
from FLORES+ (NLLB Team et al., 2024), con-
ducting a hyperparameter search for the best scal-
ing factor a, and applying these vectors to the resid-
ual stream after each layer of the network at each
token position of both the prompt and generated
tokens.
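A rough sketch of how such a control vector might be built and applied is given below. The exact construction is not spelled out here, so the sketch assumes a difference of mean sentence representations (Finnish minus English), which may differ from the method of Leino and Karlgren (2025). The function names, the synthetic states, and the value of the scaling factor are all hypothetical.

```python
import numpy as np


def language_control_vector(target_states, source_states):
    """Difference of mean sentence representations (target minus source).

    Assumption for this sketch: a plain difference of means at one layer.
    """
    return target_states.mean(axis=0) - source_states.mean(axis=0)


def steer(residual, control_vector, a):
    """Add the scaled control vector at every token position (tokens x dim)."""
    return residual + a * control_vector


rng = np.random.default_rng(0)
dim = 8
# Synthetic stand-ins for 997 mean-pooled sentence states per language.
finnish_states = rng.normal(loc=1.0, size=(997, dim))
english_states = rng.normal(loc=-1.0, size=(997, dim))
vector = language_control_vector(finnish_states, english_states)

residual = rng.normal(size=(5, dim))  # five token positions at one layer
a = 0.5  # hypothetical scaling factor, normally found by a search
steered = steer(residual, vector, a)
print(steered.shape)
```

In a real model the addition would be applied after each layer, for example through forward hooks, for both prompt and generated tokens.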
For each story in the dataset, we let the models
generate five continuations and evaluate using an
LLM-judge, Mistral Medium 3 (Mistral AI team,
2025), both the Finnish fluency and the coherence
of the completion. More specifically, the task of the
LLM-judge is for both of the metrics to map each
completion to one of four categories provided in the
prompt, with the first category representing a com-
plete failure, while the fourth category a perfectly
valid completion with the rest of the categories
falling in between (Appendix E). Furthermore, to
provide baselines for both of these metrics, we also
generate completions without control vectors for
the same English stories and for the corresponding
Finnish translations obtained from Gemini 2.5 Pro
(Gemini Team, Google, 2025). Intuitively, we ex-
pect to get better fluency and coherence scores with
more aligned representations because the language
direction captured by the control vectors becomes
more decoupled from the other representations.
5 Results
5.1 Fully trained models
The visualizations of the projections to the first two
principal components show that even the model
without parallel data achieves some degree of cross-
lingual alignment in the middle layers, with the
English and Finnish data points heavily overlap-
ping each other, as shown in Figure 1. The overlap
seems to happen in the same middle layers across
all the models, with only some variation in the de-
gree of overlap in the layers. Overall, these results
provide the first evidence that cross-lingual repre-
sentation sharing is occurring in each of the models,
regardless of parallel data.
From Figure 2, we can see that for each model
the average cosine similarity of translation pairs in-
creases quickly between layers 4 and 6, stays high
during the middle layers, and decreases sharply af-
ter layer 22. Although these results alone indicate
that the translation pairs share common representa-
tions in the middle layers, we can see that the co-
sine similarities of Finnish sentences with random
English sentences also follow this pattern. There-
fore, instead of simply sharing semantic representa-
tions between languages and therefore pushing the
cosine similarity of translation pairs close to 1.0,
the model seems to push all representations into
a small region in the representation space. This
phenomenon is a well-known property of many
language models, often referred to as anisotropy
(Ethayarajh, 2019).
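The effect of anisotropy on cosine similarity can be illustrated with synthetic vectors. The sketch below is not based on the models in this study: it simply adds one large shared offset to random vectors, which is an assumed, simplified stand-in for representations concentrated in a small region.

```python
import numpy as np


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(vectors)
    # Exclude the diagonal (each vector compared with itself).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


rng = np.random.default_rng(0)
isotropic = rng.normal(size=(1000, 64))
# A large shared offset pushes every vector into a narrow cone.
anisotropic = isotropic + 10.0 * np.ones(64)

iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(anisotropic)
print(f"isotropic: {iso_score:.3f}, anisotropic: {aniso_score:.3f}")
```

Unrelated vectors score near zero in the isotropic case but close to one once the offset dominates, which is why a random-pair baseline is needed to interpret the translation-pair similarities.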
Furthermore, while we can still see from Fig-
ure 2 that the average cosine similarity between
translation pairs seems to be significantly higher
for all models compared to random sentence pairs,
even the untrained model seems to show a clear gap
between the average cosine similarities of transla-
tion pairs and random Finnish-English sentence
pairs. Therefore, it remains difficult to obtain a
definite answer as to whether the higher cosine sim-
ilarity of translation pairs is due to representation
sharing or a residual effect from token embeddings
that might be shared across languages through, e.g.,
named entities.
[Figure 2 plot: x-axis Layer (1 to 24), y-axis Cosine Similarity (0.0 to 1.0); series for lm-0, lm-1, lm-2, lm-5, and the untrained model, each shown for translation pairs and for random pairs.]
Figure 2: The average cosine similarities across the lay-
ers for the fully trained models and the untrained model.
Translation refers to the average cosine similarity be-
tween the Finnish-English translation pairs, whereas
random refers to the average cosine similarity between
random Finnish and English sentence pairs.
However, as shown in Figure 3, the PWCCA
scores are well beyond the scores of the untrained
model and the scores with shuffled pairs, thus sup-
porting the findings of the PCA projections that
all models learn to share representations between
Finnish and English. Furthermore, the PWCCA
scores show a clear trend of increasing first slowly
during the early layers, then rapidly in the early
middle layers, and staying close to the peak value
of 0.7 until layer 22, from where the scores then
drop only slightly. Intuitively, these results indi-
cate that in the early middle layers the different
components of the models start to write increasing
amounts of cross-lingually shared representations
into the residual stream, thus making the linear
transformations of the representations of the paral-
lel pairs correlate more with each other.
Furthermore, the results do not show any rela-
tionship between the amount of parallel data in
the training corpus and the PWCCA scores since
each model obtains a similar peak score with a similar
increasing pattern in the early middle layers.
These results indicate that with our models, after
around 200B tokens, the general level of alignment
between languages is invariant to the amount of
parallel data used in the pretraining.
[Figure 3 plot: x-axis Layer (1 to 24), y-axis PWCCA Score (0.30 to 0.70); series for lm-0, lm-1, lm-2, lm-5, and the untrained model, each shown for translation pairs and for shuffled (random) pairs.]
Figure 3: The PWCCA scores for the final checkpoints
of all the models and the untrained model. Random
refers to the scores obtained when shuffling the parallel
dataset, thus breaking the possible correlation.
Figure 4 shows the distribution of the 1000
highest-scoring neurons for both Finnish and English
across the models. From the figure, one
can see that for all the models the most language-specific
neurons are distributed mainly to the first
few layers and the last few layers, with little to
no language-specific neurons in the middle layers,
aligning well with previous results on multilingual
neurons (Kojima et al., 2024). The lack
of language-specific neurons in the middle layers
indicates that in these middle layers the models’
neurons are more focused on shared features in the
input representations rather than on features bound
to either Finnish or English.
To better understand and compare the distribu-
tions of cross-lingually shared neurons, we visu-
alize in Figure 5 the percentage of neurons in
each layer of the models that have an average
precision of less than or equal to 0.55 for both
Finnish and English. From the figure, we can see
that for each model, the portion of these highly
language-agnostic neurons peaks in layers 14 and
15 with close to 60% of the neurons being language-agnostic.
Furthermore, there seem to be no significant
differences between the models in the proportions
of language-agnostic neurons, indicating that
increasing the amount of parallel data does not have
a notable effect on the number of cross-lingually
shared neurons.
Table 2 shows the total number of language-specific
neurons at varying thresholds, where lm-0
has the most language-specific neurons for both
Finnish and English at the first three thresholds.
Therefore, even though parallel data has limited
effect on completely language-agnostic neurons,
including such data seems to decrease the overall
number of language-specific neurons, with the
exception of the most language-specific neurons.
Figure 4: The distribution of the 1000 most language-
specific neurons across the layers of each model.
Figure 5: The percentage of neurons on each layer that
score lower than or equal to 0.55 average precision in
predicting both of the languages.
Finnish     lm-0     lm-1     lm-2     lm-5
>= 0.75     4666     4485     4538     4531
>= 0.9       496      423      398      437
>= 0.95      134      106       92      108
>= 0.99       16       19       16       15
English     lm-0     lm-1     lm-2     lm-5
>= 0.75     3448     3208     3165     3201
>= 0.9       317      274      243      279
>= 0.95       92       78       67       78
>= 0.99        8       10        9       11
Both        lm-0     lm-1     lm-2     lm-5
<= 0.55   50,716   50,719   51,051   50,696
Table 2: The total number of neurons in each model
at varying average precision thresholds. Finnish
refers to average precision in predicting Finnish,
English to average precision in predicting English,
and Both to neurons at or below the threshold for
both languages. The total number of neurons in each
model is 132,096.
Finally, Table 3 shows that the control vectors
with the best scaling factor found during the
initial hyperparameter search (a=0.09) seem to be
highly effective in changing the language of the
completions from English to Finnish, obtaining
mean fluency scores slightly below the maximum
of 4 (grammatically perfect Finnish) and almost
the same as when generating completions to the
corresponding Finnish translations. However, there
seems to be no positive correlation between
fluency scores and the amount of parallel data.
Furthermore, although controlled completions obtain
lower coherence scores than the uncontrolled ones,
the scores are still well above the minimum score
of 1 (completely unrelated continuations to the
preceding story context). Therefore, even when
shifting the language to Finnish, all the models can
generate completions that are somewhat related to
the previous English context, suggesting
cross-lingually shared representations (example
shown in Table 5 in Appendix F). Between the models,
lm-5 seems to obtain a slightly higher coherence
score than the rest of the models; however, as shown
by the higher scores for the baseline continuations,
this might just be due to better overall coherence
of the model’s outputs.
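To make the role of the scaling factor a concrete, the following is a minimal sketch of activation steering with a control vector. It assumes, purely for illustration, that the vector is the difference between mean hidden states collected on Finnish and English inputs and that it is added to a hidden state scaled by a. The helper names and the difference-of-means construction are assumptions in the spirit of contrastive activation addition, and may differ from the construction used in the experiments.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equally sized vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def control_vector(target_acts: List[Vector], source_acts: List[Vector]) -> Vector:
    """Assumed construction: mean target-language state minus mean source-language state."""
    mu_target = mean_vector(target_acts)
    mu_source = mean_vector(source_acts)
    return [t - s for t, s in zip(mu_target, mu_source)]


def apply_control(hidden: Vector, vector: Vector, a: float = 0.09) -> Vector:
    """Shift one hidden state along the control vector, scaled by the factor a."""
    return [h + a * v for h, v in zip(hidden, vector)]
```

In such a setup, a larger a pushes generations more strongly toward the target language at the risk of degrading coherence, which is why the factor is tuned in a hyperparameter search.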
5.2 Throughout training
We also evaluated the level of cross-lingual repre-
sentation sharing in the checkpoints of the models
throughout the training. Based on the evaluation
results, the cross-lingual alignment starts to emerge
relatively early in the training. For example, Fig-
ure 6 shows that for all the models, the PWCCA
scores start to increase very early in training with a
bump in the scores appearing in the middle layers
even as early as after 5000 steps. There seems to
be no clear correlation with the amount of parallel
data.
                Fluency         Coherence
Model  Setting  μ      σ        μ      σ
lm-0   FI       3.818  0.221    2.899  0.479
lm-0   EN       1.008  0.052    3.123  0.425
lm-0   0.09     3.713  0.266    2.324  0.636
lm-1   FI       3.827  0.211    2.92   0.478
lm-1   EN       1.002  0.020    3.147  0.448
lm-1   0.09     3.705  0.270    2.231  0.663
lm-2   FI       3.83   0.192    2.99   0.425
lm-2   EN       1.009  0.054    3.152  0.433
lm-2   0.09     3.694  0.273    2.343  0.623
lm-5   FI       3.805  0.214    3.03   0.402
lm-5   EN       1.005  0.042    3.175  0.426
lm-5   0.09     3.678  0.267    2.461  0.622
Table 3: The mean (μ) scores with standard deviations
(σ) from the LLM-judge for fluency and coherence for
the continuations to the 200 stories. 0.09 refers to
the controlled continuations, FI to the Finnish
baseline continuations without any control vectors
applied, and EN to English baseline continuations.
On the other hand, the mean fluency and coherence
scores in the language control experiment reveal a
more noticeable distinction. To control for general
improvements in model capabilities, we normalized
the scores by subtracting them from the Finnish
baseline scores. Figure 7 shows these differences
across the model checkpoints (a=0.09). We can see
that, while the differences in mean fluency scores
do not show any trend, a clear distinction emerges
in the coherence score differences. Assuming that
perfectly aligned representations would produce a
coherence difference of zero in the experiment,
lm-5 appears to have the most aligned
representations during the initial 20,000 steps,
and lm-0 the least. These results suggest that
adding parallel data helps in sharing
representations in the early phases of the
training.
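As a worked example of this normalization, the sketch below applies the same subtraction (Finnish baseline minus controlled score) to the final-model means reported in Table 3. Figure 7 plots the corresponding quantity for intermediate checkpoints, whose values are not reproduced here. The dictionary layout and function name are illustrative.

```python
# Mean (fluency, coherence) scores from Table 3 for the fully trained models.
TABLE_3 = {
    "lm-0": {"FI": (3.818, 2.899), "controlled": (3.713, 2.324)},
    "lm-1": {"FI": (3.827, 2.92), "controlled": (3.705, 2.231)},
    "lm-2": {"FI": (3.83, 2.99), "controlled": (3.694, 2.343)},
    "lm-5": {"FI": (3.805, 3.03), "controlled": (3.678, 2.461)},
}


def baseline_gap(model: str) -> tuple:
    """Finnish baseline minus controlled score, as (fluency gap, coherence gap)."""
    fi_fluency, fi_coherence = TABLE_3[model]["FI"]
    c_fluency, c_coherence = TABLE_3[model]["controlled"]
    return (round(fi_fluency - c_fluency, 3), round(fi_coherence - c_coherence, 3))


for name in TABLE_3:
    fluency_gap, coherence_gap = baseline_gap(name)
    print(f"{name}: fluency gap {fluency_gap:.3f}, coherence gap {coherence_gap:.3f}")
```

For the final models the coherence gaps of lm-0 (0.575) and lm-5 (0.569) are nearly identical, which is consistent with the observation that the advantage of parallel data is confined to the early phase of training.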
We also examined the number of language-
specific neurons throughout the training with dif-
ferent average precision thresholds as shown in
Figure 8. From the figure, we can see that lm-0
has for most thresholds the most language-specific
neurons throughout the training. However, what is
even more interesting is that after an initial phase
of rapid decrease, the number of language-specific
neurons increases for all of the models throughout
the whole training without showing any signs of
convergence. This calls into question the use of the
number of language-specific neurons as a measure
for cross-lingual alignment, as the models seem to
naturally converge toward a higher number of these
neurons over time.
Figure 6: PWCCA scores computed across layers for
early checkpoints of each model. [Plot omitted:
four panels (Step 1000, Step 5000, Step 10000,
Step 15000); x-axis: Layer (1 to 24); y-axis:
PWCCA Score (0.55 to 0.70); lines: lm-0, lm-1,
lm-2, lm-5.]
Figure 7: Difference in mean fluency and coherence
scores between the controlled and Finnish baseline
completions for checkpoints during training. [Plot
omitted: two panels (Mean Fluency, Mean Coherence);
x-axis: Training steps (10000 to 50000); y-axis:
Difference; lines: lm-0, lm-1, lm-2, lm-5.]
6 Conclusions
Overall, all of our experiments show evidence
for the shared cross-lingual representation space
emerging in the middle layers of all models, with
the exception of cosine similarity, which provided
somewhat unreliable results. This observation of
cross-lingual representation sharing in the mid-
dle layers also aligns well with previous research
(Wendler et al., 2024; Kojima et al., 2024), and was
therefore somewhat expected.
Figure 8: The development of the number of neurons
given four thresholds for average precisions. [Plot
omitted: four panels (Average precision >= 0.75,
>= 0.9, >= 0.95, >= 0.99); x-axis: Training step
(0 to 95367); y-axis: Count of Neurons; lines:
English and Finnish for each of lm-0, lm-1, lm-2,
lm-5.]
The more surprising result of the experiments
is that, contrary to general belief, parallel data
appears to have little effect on overall
cross-lingual alignment in language models. In
general, most of the evaluation experiments did not
show any significant differences between the
models. Therefore,
it seems likely that some other factor encourages
even the model without parallel data to share rep-
resentations between languages, such as a limited
parameter capacity budget (Conneau et al., 2020).
It is also possible that, despite the language filtering
steps (Penedo et al., 2024, 2025), the monolingual
corpora still contain parallel data similar to some
other training corpora (Blevins and Zettlemoyer,
2022; Briakou et al., 2023). We leave the examina-
tion of these potentially overshadowing factors for
future research.
Despite not finding evidence of parallel data
helping in the representation sharing in fully trained
models, we still observed some evidence of acceler-
ated cross-lingual representation sharing during the
early stages of pretraining. Furthermore, it seems
that including parallel data in the training corpus
might push some of the language-specific neurons
to be less language-specific. Interestingly, we also
found that the number of language-specific neurons
increases across all models as training progresses.
To summarize, the results of this study show
that, although it is an intuitively appealing
technique, including parallel data in the
pretraining corpus does not seem to provide any
major benefit to the cross-lingual alignment of
language models.
Limitations
A primary limitation of this study is the restric-
tion of our experiments to a single language pair,
English and Finnish, which could limit the ability
to generalize the findings to typologically closer
pairs or other distant pairs. However, we selected
English-Finnish specifically because it represents
a "hard case" for cross-lingual alignment due to
significant typological differences. The results
are therefore unlikely to be biased by the
similarity of the languages and are likely to
generalize to other language pairs as well.
Another limitation of the study is the scale
of the trained language models. Constrained by
computational resources, we trained only 1.4B-
parameter models on 200B tokens, thus obtaining
performance-wise limited models. Consequently,
the generalizability of our findings to the large-
scale models remains an open question as phenom-
ena observed with small models might not always
generalize to larger ones. However, training smaller
models from scratch allowed us to exercise precise
control over the data mixture, specifically the exact
ratio of parallel tokens, thus allowing us to conduct
the experiments in a more controlled manner.
Finally, our models ranged from leveraging 0%
to 5% parallel data. While this range might be
insufficient to observe potential effects that might
emerge at higher ratios (e.g., 20% or 50%), we
restricted our scope to this range to reflect real-
istic constraints in pretraining language models.
For many language pairs, particularly low-resource
ones, acquiring high-quality parallel data constitut-
ing more than 5% of a massive pretraining corpus
is often infeasible. Our objective was to evaluate
the utility of parallel data within a regime of re-
alistic scarcity, rather than in synthetic scenarios
dominated by this parallel data.
Acknowledgments
The research was supported by the Technology In-
dustries of Finland Centennial Foundation.
References
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin John-
son, Dmitry Lepikhin, Alexandre Passos, Siamak
Shakeri, Emanuel Taropa, Paige Bailey, Z. Chen, Eric
Chu, J. Clark, Laurent El Shafey, Yanping Huang,
Kathleen S. Meier-Hellstern, Gaurav Mishra, Er-
ica Moreira, Mark Omernick, Kevin Robinson, and
109 others. 2023. PaLM 2 technical report. arXiv
preprint arXiv:2305.10403v3.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama.
2019. On the cross-lingual transferability of mono-
lingual representations. In Annual Meeting of the
Association for Computational Linguistics.
Terra Blevins and Luke Zettlemoyer. 2022. Language
contamination helps explains the cross-lingual capa-
bilities of English pretrained models. In Proceedings
of the 2022 Conference on Empirical Methods in Nat-
ural Language Processing, pages 3563–3574, Abu
Dhabi, United Arab Emirates. Association for Com-
putational Linguistics.
Eleftheria Briakou, Colin Cherry, and George F. Foster.
2023. Searching for needles in a haystack: On the
role of incidental bilingualism in PaLM’s translation
capability. In Annual Meeting of the Association for
Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-
lingual language model pretraining. In Proceedings
of the 33rd International Conference on Neural In-
formation Processing Systems, Red Hook, NY, USA.
Curran Associates Inc.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Emerging cross-
lingual structure in pretrained language models. In
Proceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 6022–
6034, Online. Association for Computational Lin-
guistics.
Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta
Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume
Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-
Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan
Oepen, and Jörg Tiedemann. 2024. A new massive
multilingual dataset for high-performance language
technologies. In Proceedings of the 2024 Joint In-
ternational Conference on Computational Linguis-
tics, Language Resources and Evaluation (LREC-
COLING 2024), pages 1116–1128, Torino, Italia.
ELRA and ICCL.
Kawin Ethayarajh. 2019. How contextual are contextu-
alized word representations? Comparing the geom-
etry of BERT, ELMo, and GPT-2 embeddings. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 55–65,
Hong Kong, China. Association for Computational
Linguistics.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi
Ma, Ahmed El-Kishky, Siddharth Goyal, Man-
deep Baines, Onur Celebi, Guillaume Wenzek,
Vishrav Chaudhary, Naman Goyal, Tom Birch, Vi-
taliy Liptchinsky, Sergey Edunov, Edouard Grave,
Michael Auli, and Armand Joulin. 2020. Beyond
English-centric multilingual machine translation. J.
Mach. Learn. Res., 22:107:1–107:48.
Gemini Team, Google. 2025. Gemini 2.5: Pushing
the frontier with advanced reasoning, multimodality,
long context, and next generation agentic capabilities.
arXiv preprint arXiv:2507.06261v4.
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya
Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin,
Tatiana Matejovicova, Alexandre Ramé, Morgane
Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey
Cideron, Jean bastien Grill, Sabela Ramos, Edouard
Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev,
and 197 others. 2025. Gemma 3 technical report.
arXiv preprint arXiv:2503.19786v1.
Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita
Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya
Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang,
Shane Arora, David Atkinson, Russell Authur, Khy-
athi Chandu, Arman Cohan, Jennifer Dumas, Yanai
Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024.
OLMo: Accelerating the science of language mod-
els. In Proceedings of the 62nd Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 15789–15809, Bangkok,
Thailand. Association for Computational Linguistics.
Mihir Kale, Aditya Siddhant, Rami Al-Rfou, Linting
Xue, Noah Constant, and Melvin Johnson. 2021.
nmT5 - Is parallel data still relevant for pre-training
massively multilingual language models? In Pro-
ceedings of the 59th Annual Meeting of the Asso-
ciation for Computational Linguistics and the 11th
International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 683–
691.
Takeshi Kojima, Itsuki Okimura, Yusuke Iwasawa, Hit-
omi Yanaka, and Yutaka Matsuo. 2024. On the multi-
lingual ability of decoder-based pre-trained language
models: Finding and controlling language-specific
neurons. In Proceedings of the 2024 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies (Volume 1: Long Papers), pages 6919–6971,
Mexico City, Mexico. Association for Computational
Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece:
A simple and language independent subword tok-
enizer and detokenizer for neural text processing. In
Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing: System
Demonstrations, pages 66–71, Brussels, Belgium.
Association for Computational Linguistics.
Julius Leino and Jussi Karlgren. 2025. Controlling lan-
guage and style of multi-lingual generative language
models with control vectors. Northern European
Journal of Language Technology, 11(1):1–26.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In International Confer-
ence on Learning Representations.
Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne
Talman, Ville Komulainen, Väinö Hatanpää, Peter
Sarlin, and Sampo Pyysalo. 2024. Poro 34B and
the blessing of multilinguality. In Conference on
Language Modeling.
Samuel Marks and Max Tegmark. 2024. The geometry
of truth: Emergent linear structure in large language
model representations of true/false datasets. arXiv
preprint arXiv:2310.06824v3.
Mistral AI team. 2025. Medium is the
new large. https://mistral.ai/news/
mistral-medium-3/.
Ari S. Morcos, Maithra Raghu, and Samy Bengio. 2018.
Insights on representational similarity in neural net-
works with canonical correlation. In Neural Informa-
tion Processing Systems.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong
He, Devi Parikh, Dhruv Batra, Lucy Vanderwende,
Pushmeet Kohli, and James Allen. 2016. A corpus
and cloze evaluation for deeper understanding of
commonsense stories. In Proceedings of the 2016
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies (NAACL HLT).
Niklas Muennighoff, Alexander M. Rush, Boaz Barak,
Teven Le Scao, Aleksandra Piktus, Nouamane Tazi,
Sampo Pyysalo, Thomas Wolf, and Colin Raffel.
2023. Scaling data-constrained language models. In
Proceedings of the 37th International Conference on
Neural Information Processing Systems, NIPS ’23,
Red Hook, NY, USA. Curran Associates Inc.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur
Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef-
fernan, Elahe Kalbassi, Janice Lam, Daniel Licht,
Jean Maillard, Anna Sun, Skyler Wang, Guillaume
Wenzek, Al Youngblood, Bapi Akula, Loic Barrault,
Gabriel Mejia Gonzalez, Prangthip Hansanti, and 20
others. 2024. Scaling neural machine translation to
200 languages. Nature, 630(8018):841–846.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal,
Anton Lozhkov, Margaret Mitchell, Colin Raffel,
Leandro Von Werra, and Thomas Wolf. 2024. The
FineWeb datasets: Decanting the web for the finest
text data at scale. In The Thirty-eight Conference on
Neural Information Processing Systems Datasets and
Benchmarks Track.
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec,
Bettina Messmer, Negar Foroutan, Amir Hossein
Kargaran, Colin Raffel, Martin Jaggi, Leandro Von
Werra, and Thomas Wolf. 2025. FineWeb2: One
pipeline to scale them all – adapting pre-training
data processing to every language. arXiv preprint
arXiv:2506.20920v1.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
How multilingual is multilingual BERT? In Proceed-
ings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 4996–5001. As-
sociation for Computational Linguistics.
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong,
Evan Hubinger, and Alexander Turner. 2024. Steer-
ing Llama 2 via contrastive activation addition. In
Proceedings of the 62nd Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 15504–15522.
Holger Schwenk, Guillaume Wenzek, Sergey Edunov,
Edouard Grave, Armand Joulin, and Angela Fan.
2021. CCMatrix: Mining billions of high-quality
parallel sentences on the web. In Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 6490–6500, Online. As-
sociation for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In Proceedings of the 54th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 1715–1725,
Berlin, Germany. Association for Computational Lin-
guistics.
Jasdeep Singh, Bryan McCann, Richard Socher, and
Caiming Xiong. 2019. BERT is not an interlingua
and the bias of tokenization. In Conference on Em-
pirical Methods in Natural Language Processing.
Jörg Tiedemann. 2012. Parallel data, tools and inter-
faces in OPUS. In Proceedings of the Eighth In-
ternational Conference on Language Resources and
Evaluation (LREC’12), pages 2214–2218, Istanbul,
Turkey. European Language Resources Association
(ELRA).
Jörg Tiedemann. 2016. Finding alternative translations
in a large corpus of movie subtitle. In Proceedings
of the Tenth International Conference on Language
Resources and Evaluation (LREC’16), pages 3518–
3522, Portorož, Slovenia. European Language Re-
sources Association (ELRA).
Chris Wendler, Veniamin Veselovsky, Giovanni Monea,
and Robert West. 2024. Do Llamas work in English?
On the latent language of multilingual transformers.
In Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume
1: Long Papers), pages 15366–15394. Association
for Computational Linguistics.
A Model architecture hyperparameters
Table 4 shows a detailed breakdown of the model
architecture used in this study.
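The parameter counts reported in Table 4 can be cross-checked with simple arithmetic. The sketch below assumes that the linear layers have no bias terms, that the SwiGLU MLP uses three weight matrices, that the non-parametric LayerNorm contributes no parameters, and that the output projection is untied from the input embedding and counted among the non-embedding parameters. These assumptions are not stated explicitly in the table; they are the ones under which the reported totals are reproduced exactly.

```python
# Architecture values from Table 4.
D_MODEL = 2048
N_LAYERS = 24
N_HEADS = 16
D_HEAD = 128
D_MLP = 5504
EMBEDDING_ROWS = 50_304  # padded embedding size, not the 50,280-entry vocabulary

# Attention: query, key, value, and output projections (assumed bias-free).
attention = 4 * D_MODEL * (N_HEADS * D_HEAD)
# SwiGLU MLP: gate, up, and down projections (assumed bias-free).
mlp = 3 * D_MODEL * D_MLP
per_layer = attention + mlp

embedding = EMBEDDING_ROWS * D_MODEL
# Assumption: an untied output projection of the same shape as the embedding
# is counted as a non-embedding parameter.
non_embedding = N_LAYERS * per_layer + embedding
total = non_embedding + embedding

# MLP neurons across all layers.
mlp_neurons = N_LAYERS * D_MLP

print(f"non-embedding parameters: {non_embedding:,}")
print(f"total parameters: {total:,}")
print(f"MLP neurons: {mlp_neurons:,}")
```

Under these assumptions the computed totals equal the two parameter counts in Table 4, and the number of MLP neurons (24 × 5504) equals the 132,096 neurons reported in Table 2.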
Hyperparameter              Value
Parameters                  1,420,296,192
Non-embedding parameters    1,317,273,600
Number of layers            24
Model dimension             2048
Attention type              Multi-headed attention
Number of heads             16
Q/K/V dimension             128
Activation function         SwiGLU
MLP dimension               5504
Layer normalization         Non-parametric LayerNorm
Position information        RoPE
RoPE base θ                 10,000
Vocabulary size             50,280
Embedding size              50,304
Table 4: Hyperparameters for the model architecture
used in this study.
B Quality filtering and upsampling steps
applied to the monolingual corpora
As we noticed that the raw FineWeb2 dataset
contained significant portions of low-quality
content, we applied our own keyword- and URL-based
content filtering routines tailored for Finnish,
resulting in the removal of approximately 8.8% of
the documents. To avoid a potential content
imbalance between the Finnish and English corpora,
we then also applied a similar content filtering
step to the monolingual English corpus, leading to
the removal of approximately 1.4% of the documents.
Due to the resulting Finnish dataset being only
around 20B tokens, we further up-sampled the corpus
accordingly to fit the Finnish data requirements of
the models. This decision is supported by previous
studies suggesting that training with the same data
even up to four epochs has negligible effect on
the downstream performance (Muennighoff et al.,
2023).
C Example instruction format
Translate the English sentence ’{example_en}’
into Finnish. Finnish: {example_fi}.
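For illustration, the template above can be filled in as follows. This is a minimal sketch: the function name and the example sentence pair are hypothetical, and plain ASCII apostrophes are used in place of the typographic quotes of the typeset template.

```python
# Instruction template from this appendix, with ASCII quotes (an assumption).
TEMPLATE = "Translate the English sentence '{example_en}' into Finnish. Finnish: {example_fi}."


def format_pair(example_en: str, example_fi: str) -> str:
    """Render one English-Finnish sentence pair in the instruction format."""
    return TEMPLATE.format(example_en=example_en, example_fi=example_fi)


# Hypothetical sentence pair, not taken from the training data.
print(format_pair("Good morning", "Hyvää huomenta"))
```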
D Training hyperparameters and
hardware setup
For monolingual corpora, we concatenated the doc-
uments that were shorter than the context window
of 2048 tokens together to fill the whole context
window in one chunk, using special beginning of
sequence <s> and end of sequence </s> tokens
to separate the sequences. For parallel data, we
concatenated the maximum number of formatted
parallel sequences without overflowing the context
window, again using the special tokens as sepa-
rators, and applied padding to fill the remaining
context window. To avoid any unintended cross-
lingual signals, we kept all the datasets in separate
chunks. However, we still applied shuffling to the
chunks, thus allowing one batch to contain chunks
from all the datasets. We trained the models with
a batch size of approximately two million tokens
(1024 chunks per batch), resulting in 95,367 total
training steps.
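The packing of parallel data described above can be sketched as a greedy procedure over tokenized examples. The token IDs for the beginning-of-sequence, end-of-sequence, and padding tokens, the function name, and the choice to skip examples that cannot fit into a single context window are all illustrative assumptions.

```python
from typing import List


def pack_parallel(
    examples: List[List[int]],
    context: int = 2048,
    bos: int = 1,  # hypothetical ID of <s>
    eos: int = 2,  # hypothetical ID of </s>
    pad: int = 0,  # hypothetical padding ID
) -> List[List[int]]:
    """Greedily pack whole sequences into fixed-size chunks, padding the remainder."""
    chunks: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        sequence = [bos] + tokens + [eos]
        if len(sequence) > context:
            continue  # assumption: drop examples longer than one context window
        if len(current) + len(sequence) > context:
            # The next sequence would overflow, so close the chunk with padding.
            chunks.append(current + [pad] * (context - len(current)))
            current = []
        current.extend(sequence)
    if current:
        chunks.append(current + [pad] * (context - len(current)))
    return chunks
```

Unlike the monolingual documents, which are concatenated to fill the window completely, sequences here are never split across chunks, so every formatted translation pair stays intact within one training example.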
For the optimizer, we used AdamW (Loshchilov
and Hutter, 2019) with beta coefficients of 0.9 and
0.95, and a weight decay coefficient of 0.1. For the
learning rate, we employed a cosine learning rate
scheduler with a warmup of 2000 steps, a maximum
learning rate of 2.0 × 10⁻⁴, and a minimum
learning rate of 2.0 × 10⁻⁵. We also employed
gradient clipping to a maximum norm of 1.0.
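The step count and the learning rate schedule can be illustrated with a short sketch. It assumes a training budget of exactly 200 billion tokens, a linear warmup starting from zero, and a cosine decay that reaches the minimum learning rate at the final step. The shape of the warmup and the decay horizon are assumptions, as only the warmup length and the two learning rate bounds are specified.

```python
import math

TOTAL_TOKENS = 200_000_000_000  # 200B-token training budget (assumed exact)
TOKENS_PER_STEP = 1024 * 2048   # 1024 chunks of 2048 tokens per batch
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP


def learning_rate(
    step: int,
    total_steps: int = TOTAL_STEPS,
    warmup: int = 2000,
    lr_max: float = 2.0e-4,
    lr_min: float = 2.0e-5,
) -> float:
    """Linear warmup (assumed) followed by cosine decay from lr_max to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = min(1.0, (step - warmup) / (total_steps - warmup))
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"total steps: {TOTAL_STEPS:,}")
```

With 2,097,152 tokens per step, the integer division yields the 95,367 training steps stated above.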
From a hardware perspective, we trained the
models with a cluster of 32 AMD Instinct MI250X
GPUs (64 Graphics Compute Dies, GCDs) divided
across 8 nodes (4 physical GPUs per node, 8 GCDs)
hosted on the GPU partition of the LUMI super-
computer.
E LLM-judge evaluation categories
Fluency
1. The text is grammatically malformed to the
point of being incomprehensible (gibberish),
is not in Finnish, or contains only a few
Finnish words in a sea of another language.
2. The text is understandable and mostly in
Finnish, but contains many significant gram-
matical errors or words from other languages.
Grammatical errors may include fundamen-
tally wrong verb conjugations or noun inflec-
tions, severe phrasing issues, or highly repeti-
tive words or phrases.
3. The text is fully in Finnish and grammatically
correct for the most part, but contains minor,
non-critical errors. Examples include unnatu-
ral phrasing that suggests a literal translation
from another language (calque), occasional
incorrect word inflections, or awkward word
choices.
4. The text is fully in Finnish and grammatically
perfect, idiomatic, and reads as if written by
a native Finnish speaker. All word choices,
inflections, and structures are natural.
Coherence
1. The continuation is completely unrelated to
the previous sentences, with no semantically
similar words, concepts, or structure indicating
relatedness.
2. The continuation is related to the overarching
topic or themes of the previous sentences (e.g.,
’dogs’, ’shopping’) but fails to connect to the
specific events or characters of the sentences.
3. The continuation correctly identifies and uses
specific elements (characters, objects, loca-
tions) from the previous sentences. However,
it contains some major logical inconsistencies,
e.g., by mixing up the subject of some actions
or presenting conflicting facts about the ele-
ments.
4. The continuation provides a coherent continu-
ation to the previous sentences without illogi-
cal or inconsistent elements.
F Example text completions from
language control experiment
Table 5 shows controlled Finnish text completions
to an example English story context.
Input: Kia had always had a distaste for
mushrooms. But then one day she ordered pasta at a
restaurant. The sauce was wonderful, and she asked
what was in it. To her surprise, the waiter said it
was mostly mushrooms!
lm-0: Kia ei ollut ainoa, joka oli ihastunut tähän
ruokaan. Pian muutkin asiakkaat alkoivat kertoa
syövänsä sieniruokaa. Sitten Kia päätti
Translation: Kia wasn’t the only one who was in
love with this dish. Soon other customers started
telling me they were eating mushroom food. Then Kia
decided
lm-1: Kia oli täysin järkyttynyt. Hän ei ollut
koskaan syönyt mitään niin hyvää. Mikä ihme on
tämä? Mutta siinä vaiheessa hän tajusi, että
Translation: Kia was completely shocked. She had
never eaten anything so good. What the hell is
this? But at that moment she realized that
lm-2: Kia ei ollut tästä moksiskaan, vaan ajatteli,
että totta kai hän söi tuon. Mutta sitten hän
huomasi, että ruoka oli
Translation: Kia wasn’t at all surprised by this,
and thought that of course he ate that. But then he
noticed that the food was
lm-5: Kia ei ollut koskaan aiemmin ostanut
ravintolasta mitään muuta kuin ruokaa, joten hän ei
ollut kovin yllättynyt kuullessaan, että ateria oli
tehty sienistä
Translation: Kia had never bought anything other
than food from a restaurant before, so she wasn’t
too surprised to hear that the meal was made with
mushrooms.
Table 5: Example text completions from the language
control vector experiment. Each completion is
followed by its English translation (set in italics
in the original layout).