On the limited utility of parallel data for learning shared multilingual representations
Summary
This study investigates the impact of parallel data on learning shared multilingual representations in language models. The authors train four 1.4 billion-parameter models on a 200 billion-token corpus with 0%, 1%, 2%, and 5% parallel data. They evaluate cross-lingual alignment using principal component analysis, cosine similarity, projection weighted canonical correlation analysis, language-specific neuron analysis, and language control vectors. Results show that all models, including the one with no parallel data, develop shared representations, particularly in middle layers. The amount of parallel data has little effect on overall alignment but may accelerate early-phase representation sharing and reduce language-specific neurons. The study concludes that parallel data provides minimal benefit for cross-lingual alignment in fully trained models, suggesting other factors like parameter capacity or incidental bilingualism may drive representation sharing. The findings challenge the common assumption that parallel data is essential for multilingual alignment.
On the limited utility of parallel data for learning shared multilingual representations

Julius Leino, University of Helsinki, leino.julius2@gmail.com
Jörg Tiedemann, University of Helsinki, jorg.tiedemann@helsinki.fi

Abstract

Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating the representation sharing in the early phases of pretraining, and to decreasing the amount of language-specific neurons in the model. Cross-lingual alignment seems to emerge on similar levels even without the explicit signal from parallel data.¹

1 Introduction

A desirable property of multilingual language models is the cross-lingual sharing of representations, which enables cross-lingual transfer (Pires et al., 2019; Artetxe et al., 2019; Conneau et al., 2020), a phenomenon where representations learned for one language can be leveraged to boost performance in another. A common method thought to improve this sharing is to add parallel data, that is, translated texts, typically aligned at the sentence level, to the pretraining corpus of multilingual language models (Conneau and Lample, 2019; Anil et al., 2023; Luukkonen et al., 2024; Gemma Team et al., 2025). However, while intuitively attractive, it is still unclear to what extent parallel data actually helps in representation sharing.

In this paper, we systematically assess the assumption that including parallel data in the pretraining corpus increases the level of representation sharing across languages. To this end, we
train from scratch four 1.4 billion-parameter transformer language models on a 200 billion token multilingual corpus with varying amounts of parallel data: 0%, 1%, 2%, and 5%. We then assess the level of representation sharing in the models using principal component projections, cosine similarity, projection weighted canonical correlation analysis (PWCCA), language-specific neuron analysis, and cross-lingual control vectors.

The primary findings of this study are as follows:

• Our results across the different methods show that a partially shared representation space exists in all of the trained models, including the model with no parallel data added to the training corpus. The sharing of representations is strongest in the middle layers of all the models (Figure 1), aligning well with the observations of previous research on the middle layers containing more semantic representations (Wendler et al., 2024; Kojima et al., 2024).

• We discover that the amount of parallel data in the pretraining corpus has little to no effect on the overall level of cross-lingual alignment in the fully trained models, with most of the evaluation methods showing no significant relationship with the amount of parallel data used in training. Instead, the effects of parallel data are limited to potentially accelerating the representation sharing during the early phase of the pretraining and to decreasing the number of language-specific neurons in the models.

¹ Code for the experiments is available at https://github.com/shiftleino/limited-utility-of-parallel-data.

2 Background

Prior research has suggested that multilingual language models develop shared cross-lingual
representations. Numerous studies have, for example, shown that multilingual language models are capable of cross-lingual transfer, where a pretrained model trained for a task in one language achieves performance improvements in the same task in another language (Pires et al., 2019; Artetxe et al., 2019; Conneau et al., 2020). Similar results have also been obtained from more direct analysis of the representations. For example, a study found that Llama 2 uses English as a pivot language when translating sentences from one language to another (Wendler et al., 2024), while another study found that for many multilingual language models, neurons activating with a single language were predominantly located in the initial and final layers (Kojima et al., 2024). Furthermore, research has shown that the model output language can easily be changed using simple control vectors (Leino and Karlgren, 2025), suggesting a disentangled language direction from the other concept directions.

arXiv:2603.29026v1 [cs.CL] 30 Mar 2026

[Figure 1 plot: grid of PCA projection panels, one row per model (lm-0, lm-1, lm-2, lm-5) and one column per layer (1, 12, 13, 14, 15, 16, 17, 24), with axes PC 1 and PC 2.]
Figure 1: English and Finnish translation sentence pair representations of each model projected to the first two principal components. Finnish sentences are in blue and English sentences in yellow. Each model is exposed to different amounts of parallel data from no parallel data in lm-0 (top row) to 5% parallel data in lm-5 (bottom row).

Beyond merely exploring how multilingual language models internally represent various languages, prior research has also examined methods to align these representations, often involving the usage of parallel data, i.e., aligned translated text, typically in the form of sentence pairs. In pretraining, parallel data has been involved, for example, with some auxiliary training objectives, such as a machine translation
objective (Kale et al., 2021). A common technique has also been to simply add a small portion of concatenated parallel sentences to the pretraining mix of language models without any additional training objective (Conneau and Lample, 2019; Anil et al., 2023; Luukkonen et al., 2024; Gemma Team et al., 2025). Intuitively, the attention blocks in the transformer architecture could provide the necessary communication mechanism between the translation sentences for the cross-lingual signal, thus improving cross-lingual representation sharing in the models. However, it is not yet fully understood whether this actually occurs in practice and how dependent it is on the amount of parallel data used during pretraining.

3 Model training

To study the effects of parallel data, we train four language models of 1.4 billion parameters using a bilingual corpus of 200 billion tokens with varying proportions of parallel data, using the training setting described in the following sections.

3.1 Language models

For the models, we closely follow the open source OLMo decoder-only transformer language model architecture (Groeneveld et al., 2024). Our version of the model has 24 transformer layers in total, a residual stream of 2048 dimensions, and 16 attention heads with dimension 128 (see Table 4 in Appendix A for full details).

3.2 Training data

To train each model, we construct a 200 billion token training corpus consisting of tokens in two languages, English and Finnish. Due to English and Finnish forming a typologically distant language pair, this setting provides a robust foundation for the generalization of the results to more similar language pairs.
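As a worked check of the data mixtures, the sketch below recomputes the token budget of each of the four models. It assumes the 80%/20% English/Finnish target mix and the equal split of parallel tokens between the two languages that the paper states for its training data. The helper name `token_budget` and the use of whole billions as the unit are illustrative choices, not taken from the authors' code.

```python
TOTAL_B = 200  # total training tokens, in billions (from the paper)
ENGLISH_SHARE_PCT = 80  # target share of English tokens (from the paper)


def token_budget(parallel_pct: int) -> dict:
    """Return monolingual and parallel token counts in billions (hypothetical helper)."""
    parallel = TOTAL_B * parallel_pct // 100
    # Assumption stated in the paper: parallel tokens split evenly between languages.
    per_language = parallel // 2
    english_total = TOTAL_B * ENGLISH_SHARE_PCT // 100
    finnish_total = TOTAL_B - english_total
    return {
        "english": english_total - per_language,
        "finnish": finnish_total - per_language,
        "parallel": parallel,
    }


budgets = {f"lm-{pct}": token_budget(pct) for pct in (0, 1, 2, 5)}
for name, budget in budgets.items():
    print(name, budget)
```

Under these assumptions, the monolingual corpora shrink by half of the parallel budget each, which keeps the overall language ratio fixed across the four models.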
For each of the four models, we dedicate a different proportion of the data to translated sentence pairs (name of the model in parentheses): 0% (lm-0), 1% (lm-1), 2% (lm-2), and 5% (lm-5).

Model   English      Finnish     Parallel
lm-0    160 × 10^9   40 × 10^9   0
lm-1    159 × 10^9   39 × 10^9   2 × 10^9
lm-2    158 × 10^9   38 × 10^9   4 × 10^9
lm-5    155 × 10^9   35 × 10^9   10 × 10^9

Table 1: The distributions of the training data in tokens for the four language models.

We used a skewed language mix of 80% tokens in English and 20% in Finnish, with the assumption of parallel data having an equal split between the languages and then adjusting the monolingual corpus sizes to preserve this ratio. The distributions of the training corpora for the four models are shown in Table 1.

We utilized the FineWeb (Penedo et al., 2024) and FineWeb2 (Penedo et al., 2025) datasets as the sources for our English and Finnish monolingual corpora, respectively, subject to additional custom quality filtering and upsampling steps detailed in Appendix B. For parallel data, we utilized three separate datasets from the OPUS collection (Tiedemann, 2012): HPLT² (29,067,875 Finnish-English sentence pairs) (de Gibert et al., 2024), CCMatrix (35,982,562 sentence pairs) (Fan et al., 2020; Schwenk et al., 2021), and the 2024 version of the OpenSubtitles dataset³ (110,093 sentence pairs) (Tiedemann, 2016). Our final parallel dataset contains a total of 65,160,530 parallel sentences, which results in approximately 4.5B tokens. We then up-sample this dataset to reach the required amount. To control for potential domain differences, a fixed baseline of 5% of the full training corpora is sourced from
the parallel dataset for all the models.

Instead of feeding the parallel data only in a single format, we apply to each parallel pair a format sampled from 60 different instruction or completion formats covering both English and Finnish (example in Appendix C). The intuition is that the variability in the format helps the model avoid overfitting to the specific format structure and instead allows learning solely the relationships between the languages.

To tokenize the collected text corpus, we train a SentencePiece tokenizer (Kudo and Richardson, 2018) using Byte Pair Encoding (BPE) (Sennrich et al., 2016) with a balanced corpus and a vocabulary size of 50,280 tokens.

² https://hplt-project.org/
³ http://www.opensubtitles.org/

3.3 Training setup

We trained the models largely following the codebase used for training the OLMo models (Groeneveld et al., 2024), utilizing the LUMI supercomputer infrastructure⁴ and a context window of 2048 tokens. To avoid any unintended cross-lingual signals during the training, we did not concatenate sequences across the three corpora into the same chunks, but still shuffled the chunks in the batches. A detailed description of the training configuration is outlined in Appendix D.

4 Evaluation experiments

We use five methods in total to evaluate the level of representation sharing in the models to obtain a comprehensive view on the potential effects.

4.1 Principal component projections

Our first method involves visualizing the representation space in two dimensions, similar to many other studies on multilingual language models (Marks and Tegmark, 2024; Rimsky et al., 2024). More specifically, we use the
dev-split of the Finnish-English subset of the FLORES+ machine translation dataset (NLLB Team et al., 2024), which contains 997 translated sentences in total, run these sentences through the model and collect the mean pooled representations of each sentence from the residual stream after each layer of the network. After obtaining the representations for each sentence, we stack the representations for both languages and apply principal component analysis (PCA) to the resulting dataset. We then project the sentence representations to the first two principal components and visualize them for all layers. If the Finnish and English sequences align visually, the primary sources of variation are not language-specific, indicating that the sequences share some common representations. This will, therefore, provide first evidence of the existence of shared representations.

⁴ https://lumi-supercomputer.eu

4.2 Cosine similarity of parallel sentences

To directly compare the representations across languages, we compute the average cosine similarities between representations of 2000 translation sentences sampled from the OPUS WMT-News dataset (Tiedemann, 2012). Similarly as with the projections, we use the mean pooled representations extracted from the residual stream after each layer of the model. Since cosine similarity measures the angular closeness of vectors, we expect higher scores between the parallel pairs when the representation space of the model is more cross-lingually aligned. We also compute the average cosine similarities between random sentences across the parallel data to act as a reference point. More specifically,
we compute for each representation of a Finnish sentence in the dataset the cosine similarity with 50 randomly sampled English sentences, take the average over these 50 cosine similarities, and finally take the average over all the Finnish sentences, thus resulting in one score representing the baseline cosine similarity.

4.3 Projection weighted canonical correlation analysis

We also perform projection weighted canonical correlation analysis (PWCCA) (Morcos et al., 2018) between 20,000 translation pairs, similar to a study analyzing multilingual representations of mBERT (Singh et al., 2019). We again sample sentence pairs from the OPUS WMT-News dataset (Tiedemann, 2012). The advantage of PWCCA is that it is invariant to linear transformations of the representations, thus making it suitable for analyzing multilingual representations, where some language-specific linear transformations might be applied on top of the shared representation space. Intuitively, in case the parallel data helps in cross-lingual representation sharing, we would expect higher PWCCA scores with larger amounts of parallel data used in training. To calculate the PWCCA scores, we use the implementation of the original paper (Morcos et al., 2018) available on GitHub⁵.

4.4 Distribution of language-specific neurons

To assess the representation space from the point of view of neuron activations, we identify language-specific neurons and assess their amount and distribution across layers using a method proposed by Kojima et al. (2024). More specifically, we will first collect the neuron activation values for each neuron in the model when feeding the model with both English
and Finnish text from the Finnish-English development subset of the FLORES+ dataset (NLLB Team et al., 2024). We then take the mean activation value of each neuron over the tokens of each sequence, capturing one activation value for each neuron per sequence in the dataset. Using these activation values and language labels per sentence, we regard the activation values of each neuron as the predicted scores for the language of an input sequence, and the original language of the sequence as the ground truth. We then measure the average precision (AP), which should capture the language sensitivity of the neuron. Having obtained the AP score for each neuron, we use these scores as a measure of language specificity in two analyses: 1) we plot the distribution of the top k scoring neurons across layers to understand the overall distribution of language-specific neurons and 2) we bucket neurons by AP thresholds for better model-to-model comparability. We expect the number of less language-specific neurons to be higher, i.e. more neurons to be cross-lingually shared, when more parallel data is used in training.

⁵ https://github.com/google/svcca/blob/master/pwcca.py

4.5 Effectiveness of language control vectors

We also evaluate the effectiveness of language control vectors (Leino and Karlgren, 2025) that we employ to force output in Finnish for the completions of 200 English context sentences sampled from the ROCStories dataset (Mostafazadeh et al., 2016). When employing the control vectors, we use a similar approach to that of Leino and Karlgren (2025), creating the control vectors from representations of 997 Finnish-English parallel sentences
sampled from FLORES+ (NLLB Team et al., 2024), conducting a hyperparameter search for the best scaling factor a, and applying these vectors to the residual stream after each layer of the network at each token position of both the prompt and generated tokens. For each story in the dataset, we let the models generate five continuations and evaluate using an LLM-judge, Mistral Medium 3 (Mistral AI team, 2025), both the Finnish fluency and the coherence of the completion. More specifically, the task of the LLM-judge is for both of the metrics to map each completion to one of four categories provided in the prompt, with the first category representing a complete failure, while the fourth category a perfectly valid completion with the rest of the categories falling in between (Appendix E). Furthermore, to provide baselines for both of these metrics, we also generate completions without control vectors for the same English stories and for the corresponding Finnish translations obtained from Gemini 2.5 Pro (Gemini Team, Google, 2025). Intuitively, we expect to get better fluency and coherence scores with more aligned representations because the language direction captured by the control vectors becomes more decoupled from the other representations.

5 Results

5.1 Fully trained models

The visualizations of the projections to the first two principal components show that even the model without parallel data achieves some degree of cross-lingual alignment in the middle layers, with the English and Finnish data points heavily overlapping each other, as shown in Figure 1. The overlap seems to happen in the same middle layers across all
the models, with only some variation in the degree of overlap in the layers. Overall, these results provide the first evidence that cross-lingual representation sharing is occurring in each of the models, regardless of parallel data.

From Figure 2, we can see that for each model the average cosine similarity of translation pairs increases quickly between layers 4 and 6, stays high during the middle layers, and decreases sharply after layer 22. Although these results alone indicate that the translation pairs share common representations in the middle layers, we can see that the cosine similarities of Finnish sentences with random English sentences also follow this pattern. Therefore, instead of simply sharing semantic representations between languages and therefore pushing the cosine similarity of translation pairs close to 1.0, the model seems to push all representations into a small region in the representation space. This phenomenon is a well-known property of many language models, often referred to as anisotropy (Ethayarajh, 2019).

Furthermore, while we can still see from Figure 2 that the average cosine similarity between translation pairs seems to be significantly higher for all models compared to random sentence pairs, even the untrained model seems to show a clear gap between the average cosine similarities of translation pairs and random Finnish-English sentence pairs. Therefore, it remains difficult to obtain a definite answer as to whether the higher cosine similarity of translation pairs is due to representation sharing or a residual effect from token embeddings that might be shared across languages through,
e.g., named entities.

[Figure 2 plot: average cosine similarity (0.0 to 1.0) against layer (1 to 24); series for lm-0, lm-1, lm-2, lm-5, and the untrained model, each shown for translation pairs and for random pairs.]
Figure 2: The average cosine similarities across the layers for the fully trained models and the untrained model. Translation refers to the average cosine similarity between the Finnish-English translation pairs, whereas random refers to the average cosine similarity between random Finnish and English sentence pairs.

However, as shown in Figure 3, the PWCCA scores are well beyond the scores of the untrained model and the scores with shuffled pairs, thus supporting the findings of the PCA projections that all models learn to share representations between Finnish and English. Furthermore, the PWCCA scores show a clear trend of increasing first slowly during the early layers, then rapidly in the early middle layers, and staying close to the peak value of 0.7 until layer 22, from where the scores then drop only slightly. Intuitively, these results indicate that in the early middle layers the different components of the models start to write increasing amounts of cross-lingually shared representations into the residual stream, thus making the linear transformations of the representations of the parallel pairs correlate more with each other.

Furthermore, the results do not show any relationship between the amount of parallel data in the training corpus and the PWCCA scores since each model obtains a similar peak score with a similar
increasing pattern in the early middle layers. These results indicate that with our models, after around 200B tokens, the general level of alignment between languages is invariant to the amount of parallel data used in the pretraining.

Figure 4 shows the distribution of the 1000 highest-scoring neurons for both Finnish and English across the models. From the figure, one can see that for all the models the most language-specific neurons are distributed mainly to the first few layers and the last few layers, with little to no language-specific neurons in the middle layers, aligning well with previous results on multilingual neurons (Kojima et al., 2024). The lack of language-specific neurons in the middle layers indicates that in these middle layers the models' neurons are more focused on shared features in the input representations rather than on features bound to either Finnish or English.

[Figure 3 plot: PWCCA score (0.30 to 0.70) against layer (1 to 24); series for lm-0, lm-1, lm-2, lm-5, and the untrained model, each shown for aligned pairs and for random (shuffled) pairs.]
Figure 3: The PWCCA scores for the final checkpoints of all the models and the untrained model. Random refers to the scores obtained when shuffling the parallel dataset, thus breaking the possible correlation.

To better understand and compare the distributions of cross-lingually shared neurons, we visualize in Figure 5 the percentage of neurons in each layer of the models that have an average precision of less than or equal to 0.55 for both Finnish and English. From the figure, we can see that for each model,
the portion of these highly language-agnostic neurons peaks in layers 14 and 15 with close to 60% of the neurons being language-agnostic. Furthermore, there seem to be no significant differences between the models in the proportions of language-agnostic neurons, indicating that increasing the amount of parallel data does not have a notable effect on the number of cross-lingually shared neurons.

Table 2 shows the total sum of language-specific neurons with varying thresholds, where lm-0 has the most language-specific neurons for both Finnish and English with the first three thresholds. Therefore, even though parallel data has limited effect on completely language-agnostic neurons, including such data seems to decrease the overall number of language-specific neurons, with the exception of the most language-specific neurons.

Figure 4: The distribution of the 1000 most language-specific neurons across the layers of each model.

Figure 5: The percentage of neurons on each layer that score lower than or equal to 0.55 average precision in predicting both of the languages.

Finnish    lm-0     lm-1     lm-2     lm-5
>= 0.75    4666     4485     4538     4531
>= 0.9     496      423      398      437
>= 0.95    134      106      92       108
>= 0.99    16       19       16       15

English    lm-0     lm-1     lm-2     lm-5
>= 0.75    3448     3208     3165     3201
>= 0.9     317      274      243      279
>= 0.95    92       78       67       78
>= 0.99    8        10       9        11

Both       lm-0     lm-1     lm-2     lm-5
<= 0.55    50,716   50,719   51,051   50,696

Table 2: The total number of neurons in each model with varying average precision thresholds. Finnish refers to average precision predicting Finnish, whereas English to average precision predicting English. The total number of neurons in each model is 132,096.

Finally, Table 3 shows that the control vectors with the best scaling factor found during the initial hyperparameter search (a=0.09) seem to be highly effective in changing the language of the completions from English to Finnish, obtaining mean fluency scores slightly below the maximum of 4 (grammatically perfect Finnish) and almost the same as when generating completions to the corresponding Finnish translations. However, there seems to be no positive correlation between fluency scores and the amount of parallel data. Furthermore, although controlled completions obtain lower coherence scores than the uncontrolled ones, the scores are still well above the baseline score of 1 (completely unrelated continuations to the preceding story context). Therefore, even when shifting the language to Finnish, all the models can generate completions that are somewhat related to the previous English context, suggesting cross-lingually shared representations (example shown in Table 5 in Appendix F). Between the models, lm-5 seems to obtain a slightly higher coherence score than the rest of the models; however, as shown by the higher scores for the baseline continuations, this might just be due to better overall coherence of the model's outputs.

5.2 Throughout training

We also evaluated the level of cross-lingual representation sharing in the checkpoints of the models throughout the training. Based on the evaluation results, the cross-lingual alignment starts to emerge relatively early in the training. For example, Figure 6 shows that for all the models, the PWCCA scores start to increase very early in training with a bump in the scores appearing in the middle
layers even as early as after 5000 steps. There seems to be no clear correlation with the amount of parallel data.

                 Fluency          Coherence
Model  Setting   μ      σ         μ      σ
lm-0   FI        3.818  0.221     2.899  0.479
lm-0   EN        1.008  0.052     3.123  0.425
lm-0   0.09      3.713  0.266     2.324  0.636
lm-1   FI        3.827  0.211     2.92   0.478
lm-1   EN        1.002  0.020     3.147  0.448
lm-1   0.09      3.705  0.270     2.231  0.663
lm-2   FI        3.83   0.192     2.99   0.425
lm-2   EN        1.009  0.054     3.152  0.433
lm-2   0.09      3.694  0.273     2.343  0.623
lm-5   FI        3.805  0.214     3.03   0.402
lm-5   EN        1.005  0.042     3.175  0.426
lm-5   0.09      3.678  0.267     2.461  0.622

Table 3: The mean (μ) scores with standard deviations (σ) from the LLM-judge for fluency and coherence for the continuations to the 200 stories. 0.09 refers to the controlled continuations, FI to the Finnish baseline continuations without any control vectors applied, and EN to English baseline continuations.

On the other hand, the mean fluency and coherence scores in the language control experiment reveal a more noticeable distinction. To isolate the effects of general improvements in model capabilities, we normalized the scores by subtracting them from the Finnish baseline scores. Figure 7 shows these differences across the model checkpoints (a=0.09). We can see that, while the differences in mean fluency scores do not show any trend, a clear distinction emerges in the coherence score differences. Assuming that perfectly aligned representations would produce a coherence difference of zero in the experiment, lm-5 appears to have the most aligned representations during the initial 20,000 steps, while lm-0 has the least.
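As a rough illustration of this normalization, the sketch below recomputes the baseline-minus-controlled differences from the means reported in Table 3. It is not the authors' evaluation code: the dictionary layout and the function name `baseline_minus_controlled` are assumptions made for this example, and only the numbers are taken from the table.

```python
# Mean scores copied from Table 3 as (fluency, coherence).
# "FI" is the Finnish baseline, "controlled" is the 0.09 control-vector setting.
# The data layout and names below are illustrative assumptions.
TABLE_3_MEANS = {
    "lm-0": {"FI": (3.818, 2.899), "controlled": (3.713, 2.324)},
    "lm-1": {"FI": (3.827, 2.920), "controlled": (3.705, 2.231)},
    "lm-2": {"FI": (3.830, 2.990), "controlled": (3.694, 2.343)},
    "lm-5": {"FI": (3.805, 3.030), "controlled": (3.678, 2.461)},
}


def baseline_minus_controlled(scores):
    """Return (fluency difference, coherence difference) for one model."""
    fi_fluency, fi_coherence = scores["FI"]
    ctrl_fluency, ctrl_coherence = scores["controlled"]
    return (
        round(fi_fluency - ctrl_fluency, 3),
        round(fi_coherence - ctrl_coherence, 3),
    )


differences = {
    model: baseline_minus_controlled(scores)
    for model, scores in TABLE_3_MEANS.items()
}

for model, (d_fluency, d_coherence) in differences.items():
    print(f"{model}: fluency diff = {d_fluency:.3f}, coherence diff = {d_coherence:.3f}")
```

Under the stated assumption that a coherence difference of zero corresponds to perfectly aligned representations, smaller values in the second column indicate closer alignment for that checkpoint.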
These results suggest that adding parallel data helps in sharing representations in the early phases of the training. We
also examined the number of language-specific neurons throughout the training with different average precision thresholds, as shown in Figure 8. From the figure, we can see that, for most thresholds, lm-0 has the most language-specific neurons throughout the training. However, what is even more interesting is that after an initial phase of rapid decrease, the number of language-specific neurons increases for all of the models throughout the whole training without showing any signs of convergence. This calls into question the use of the number of language-specific neurons as a measure for cross-lingual alignment, as the models seem to naturally move toward a higher number of these neurons over time.

[Figure 6: PWCCA scores computed across layers for early checkpoints of each model. Panels: Step 1000, Step 5000, Step 10000, Step 15000. x-axis: Layer (1 to 24); y-axis: PWCCA Score; series: lm-0, lm-1, lm-2, lm-5.]

[Figure 7: Difference in mean fluency and coherence scores between the controlled and Finnish baseline completions for checkpoints during training. Panels: Difference Mean Fluency, Difference Mean Coherence. x-axis: Training steps; series: lm-0, lm-1, lm-2, lm-5.]

6 Conclusions
Overall, all of our experiments show
evidence for the shared cross-lingual representation space emerging in the middle layers of all models, with the exception of cosine similarity, which provided somewhat unreliable results. This observation of cross-lingual representation sharing in the middle layers also aligns well with previous research (Wendler et al., 2024; Kojima et al., 2024), and was therefore somewhat expected.

[Figure 8: The development of the number of neurons given four thresholds for average precisions. Panels: Average precision >= 0.75, >= 0.9, >= 0.95, >= 0.99. x-axis: Training step (0 to 95367); y-axis: Count of Neurons; series: lm-0, lm-1, lm-2, lm-5, each for English and Finnish.]

The more surprising result of the experiments is that, contrary to general belief, parallel data appears to have little effect on overall cross-lingual alignment in language models. In general, most of the evaluation experiments did not show any
significant differences between the models. Therefore, it seems likely that some other factor, such as a limited parameter capacity budget (Conneau et al., 2020), encourages even the model without parallel data to share representations between languages. It is also possible that, despite the language filtering steps (Penedo et al., 2024, 2025), the monolingual corpora still contain incidental parallel data, as has been observed in some other training corpora (Blevins and Zettlemoyer, 2022; Briakou et al., 2023). We leave the examination of these potentially overshadowing factors for future research.
Despite not finding evidence of parallel data helping in the representation sharing in fully trained models, we still observed some evidence of accelerated cross-lingual representation sharing during the early stages of pretraining. Furthermore, it seems that including parallel data in the training corpus might push some of the language-specific neurons to be less language-specific. Interestingly, we also found that the number of language-specific neurons increases across all models as training progresses.
To summarize, the results of this study show that, while intuitively appealing, including parallel data in the pretraining corpus does not seem to provide any major benefit to the cross-lingual alignment of language models.

Limitations
A primary limitation of this study is the restriction of our experiments to a single language pair, English and Finnish, which could limit the ability to generalize the findings to typologically closer pairs or other distant pairs. However, we selected English-Finnish specifically because it represents a
"hard case" for cross-lingual alignment due to significant typological differences, which makes the results less likely to be biased by the similarity of the languages and more likely to generalize to other language pairs.
Another limitation of the study is the scale of the trained language models. Constrained by computational resources, we trained only 1.4B-parameter models on 200B tokens, thus obtaining models with limited performance. Consequently, the generalizability of our findings to large-scale models remains an open question, as phenomena observed with small models might not always generalize to larger ones. However, training smaller models from scratch allowed us to exercise precise control over the data mixture, specifically the exact ratio of parallel tokens, and thus to conduct the experiments in a more controlled manner.
Finally, our models ranged from leveraging 0% to 5% parallel data. While this range might be insufficient to observe potential effects that might emerge at higher ratios (e.g., 20% or 50%), we restricted our scope to this range to reflect realistic constraints in pretraining language models. For many language pairs, particularly low-resource ones, acquiring high-quality parallel data constituting more than 5% of a massive pretraining corpus is often infeasible. Our objective was to evaluate the utility of parallel data within a regime of realistic scarcity, rather than in synthetic scenarios dominated by this parallel data.

Acknowledgments
The research was supported by the Technology Industries of Finland Centennial Foundation.

References
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin
Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Z. Chen, Eric Chu, J. Clark, Laurent El Shafey, Yanping Huang, Kathleen S. Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, and 109 others. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403v3.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual transferability of monolingual representations. In Annual Meeting of the Association for Computational Linguistics.

Terra Blevins and Luke Zettlemoyer. 2022. Language contamination helps explains the cross-lingual capabilities of English pretrained models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563–3574, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Eleftheria Briakou, Colin Cherry, and George F. Foster. 2023. Searching for needles in a haystack: On the role of incidental bilingualism in PaLM's translation capability. In Annual Meeting of the Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc.

Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6022–6034, Online. Association for Computational Linguistics.

Ona de Gibert, Graeme Nail, Nikolay
Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann. 2024. A new massive multilingual dataset for high-performance language technologies. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1116–1128, Torino, Italia. ELRA and ICCL.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2020. Beyond English-centric multilingual machine translation. J. Mach. Learn. Res., 22:107:1–107:48.

Gemini Team, Google. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261v4.

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela
Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786v1.

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. OLMo: Accelerating the science of language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15789–15809, Bangkok, Thailand. Association for Computational Linguistics.

Mihir Kale, Aditya Siddhant, Rami Al-Rfou, Linting Xue, Noah Constant, and Melvin Johnson. 2021. nmT5 - Is parallel data still relevant for pre-training massively multilingual language models? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 683–691.

Takeshi Kojima, Itsuki Okimura, Yusuke Iwasawa, Hitomi Yanaka, and Yutaka Matsuo. 2024. On the multilingual ability of decoder-based pre-trained language models: Finding and controlling language-specific neurons. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6919–6971, Mexico City, Mexico. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer
and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Julius Leino and Jussi Karlgren. 2025. Controlling language and style of multi-lingual generative language models with control vectors. Northern European Journal of Language Technology, 11(1):1–26.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, and Sampo Pyysalo. 2024. Poro 34B and the blessing of multilinguality. In Conference on Language Modeling.

Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824v3.

Mistral AI team. 2025. Medium is the new large. https://mistral.ai/news/mistral-medium-3/.

Ari S. Morcos, Maithra Raghu, and Samy Bengio. 2018. Insights on representational similarity in neural networks with canonical correlation. In Neural Information Processing Systems.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT).

Niklas Muennighoff, Alexander M. Rush, Boaz
Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. 2023. Scaling data-constrained language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, and 20 others. 2024. Scaling neural machine translation to 200 languages. Nature, 630(8018):841–846.

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. 2025. FineWeb2: One pipeline to scale them all – adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920v1.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001. Association for Computational Linguistics.

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024.
Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Jasdeep Singh, Bryan McCann, Richard Socher, and Caiming Xiong. 2019. BERT is not an interlingua and the bias of tokenization. In Conference on Empirical Methods in Natural Language Processing.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Jörg Tiedemann. 2016. Finding alternative translations in a large corpus of movie subtitle. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3518–3522, Portorož, Slovenia. European Language Resources
Association (ELRA).

Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do Llamas work in English? On the latent language of multilingual transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15366–15394. Association for Computational Linguistics.

A Model architecture hyperparameters
Table 4 shows a detailed breakdown of the model architecture used in this study.

Hyperparameter              Value
Parameters                  1,420,296,192
Non-embedding parameters    1,317,273,600
Number of layers            24
Model dimension             2048
Attention type              Multi-headed attention
Number of heads             16
Q/K/V dimension             128
Activation function         SwiGLU
MLP dimension               5504
Layer normalization         Non-parametric LayerNorm
Position information        RoPE
RoPE base θ                 10,000
Vocabulary size             50,280
Embedding size              50,304

Table 4: Hyperparameters for the model architecture used in this study.

B Quality filtering and upsampling steps applied to the monolingual corpora
As we noticed that the raw FineWeb2 dataset contained significant portions of low-quality content, we applied our own keyword- and URL-based content filtering routines tailored for Finnish, resulting in the removal of approximately 8.8% of the documents. To avoid a potential content imbalance between the Finnish and English corpora, we
also applied a similar content filtering step to the
monolingual English corpus, leading to the removal
of approximately 1.4% of the documents. Because
the resulting Finnish dataset was only around 20B
tokens, we further up-sampled the corpus to fit
the Finnish data requirements of the models.
This decision is supported by previous studies
suggesting that training on the same data for
up to four epochs has a negligible effect on
downstream performance (Muennighoff et al., 2023).
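The keyword- and URL-based filtering described in this appendix could look roughly like the sketch below. This is only an illustration under strong assumptions: the paper does not list its actual keywords, URLs, or thresholds, so the blocklists, the `max_keyword_hits` parameter, and the function names here are all hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical blocklists for illustration only; the actual keyword and URL
# lists used for the Finnish and English corpora are not given in the paper.
BLOCKED_DOMAINS = {"spam.example", "ads.example"}
BLOCKED_KEYWORDS = {"placeholder-keyword-1", "placeholder-keyword-2"}


def keep_document(text, url, max_keyword_hits=0):
    """Return True if a document passes both the URL and keyword checks."""
    host = (urlparse(url).hostname or "").lower()
    # Reject the blocked domain itself and any of its subdomains.
    for domain in BLOCKED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return False
    # Count occurrences of blocked keywords in the lowercased text.
    lowered = text.lower()
    hits = sum(lowered.count(keyword) for keyword in BLOCKED_KEYWORDS)
    return hits <= max_keyword_hits


def removed_fraction(documents):
    """Fraction of (text, url) pairs that the filter would remove."""
    if not documents:
        return 0.0
    removed = sum(1 for text, url in documents if not keep_document(text, url))
    return removed / len(documents)


sample = [
    ("Tavallinen uutisartikkeli.", "https://news.example/artikkeli"),
    ("Tavallinen teksti.", "https://www.spam.example/sivu"),
    ("Sisältää placeholder-keyword-1 sanan.", "https://news.example/toinen"),
]
print(f"Removed fraction: {removed_fraction(sample):.2f}")
```

A helper like `removed_fraction` is one way to check a filter against a target removal rate such as the approximately 8.8% (Finnish) and 1.4% (English) reported above.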
C Example instruction format
Translate the English sentence "{example_en}" into Finnish. Finnish: {example_fi}.
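As a small illustration, the template above can be filled with a sentence pair using standard string formatting. The function name `format_parallel_example` and the example sentences are assumptions made for this sketch; only the template wording and the placeholder names come from the appendix.

```python
# Template wording and placeholder names follow the instruction format above.
INSTRUCTION_TEMPLATE = (
    'Translate the English sentence "{example_en}" into Finnish. '
    "Finnish: {example_fi}."
)


def format_parallel_example(example_en, example_fi):
    """Fill the instruction template with one English-Finnish sentence pair."""
    return INSTRUCTION_TEMPLATE.format(example_en=example_en, example_fi=example_fi)


formatted = format_parallel_example("Good morning", "Hyvää huomenta")
print(formatted)
```

Note that the template itself ends with a period, so a Finnish sentence that already carries final punctuation would end up with two punctuation marks unless it is stripped first.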
D Training hyperparameters and
hardware setup
For monolingual corpora, we concatenated the
documents that were shorter than the context window
of 2048 tokens together to fill the whole context
window in one chunk, using special beginning of
sequence <s> and end of sequence </s> tokens
to separate the sequences. For parallel data, we
concatenated the maximum number of formatted
parallel sequences without overflowing the context
window, again using the special tokens as
separators, and applied padding to fill the remaining
context window. To avoid any unintended
cross-lingual signals, we kept all the datasets in
separate chunks. However, we still applied shuffling
to the chunks, thus allowing one batch to contain
chunks from all the datasets. We trained the models
with a batch size of approximately two million tokens
(1024 chunks per batch), resulting in 95,367 total
training steps.
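The packing of parallel sequences and the step count above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the token ids for <s>, </s>, and padding are hypothetical, wrapping every sequence in both special tokens is an assumption, and skipping sequences that are too long for one chunk is a choice made only for this example. The 200B-token budget is the figure reported in the paper.

```python
CONTEXT_LEN = 2048          # context window in tokens
CHUNKS_PER_BATCH = 1024     # chunks per batch
TOTAL_TOKENS = 200_000_000_000  # 200B-token training budget

# Hypothetical token ids; the real tokenizer ids are not given in the text.
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0


def pack_parallel(sequences, context_len=CONTEXT_LEN):
    """Greedily pack tokenized sequences into fixed-length, padded chunks."""
    chunks, current = [], []
    for tokens in sequences:
        wrapped = [BOS_ID] + list(tokens) + [EOS_ID]
        if len(wrapped) > context_len:
            continue  # assumption: overlong sequences are skipped
        if len(current) + len(wrapped) > context_len:
            # The next sequence would overflow: pad and close the chunk.
            chunks.append(current + [PAD_ID] * (context_len - len(current)))
            current = []
        current.extend(wrapped)
    if current:
        chunks.append(current + [PAD_ID] * (context_len - len(current)))
    return chunks


# Tiny demonstration with an 8-token context window.
demo_chunks = pack_parallel([[5, 5, 5], [6, 6, 6, 6]], context_len=8)

# Batch size and step count implied by the numbers in the text.
tokens_per_batch = CONTEXT_LEN * CHUNKS_PER_BATCH
total_steps = TOTAL_TOKENS // tokens_per_batch
print(tokens_per_batch, total_steps)
```

With these numbers, one batch holds 2,097,152 tokens, and dividing the 200B-token budget by that batch size gives the 95,367 full steps reported above.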
For the optimizer, we used AdamW (Loshchilov
and Hutter, 2019) with beta coefficients of 0.9 and
0.95, and a weight decay coefficient of 0.1. For the
learning rate, we employed a cosine learning rate
scheduler with a warmup of 2000 steps, a maximum
learning rate of 2.0 × 10^-4, and a minimum
learning rate of 2.0 × 10^-5. We also employed
gradient clipping to a maximum norm of 1.0.
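The learning rate schedule described above can be sketched in a few lines. The text gives only the warmup length and the maximum and minimum learning rates, so two details here are assumptions: the warmup is taken to be linear from zero, and the cosine decay is taken to span all remaining steps up to the final step 95,367.

```python
import math

MAX_LR = 2.0e-4
MIN_LR = 2.0e-5
WARMUP_STEPS = 2000
TOTAL_STEPS = 95_367


def learning_rate(step):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 to MAX_LR.
        return MAX_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for step in (0, 1000, 2000, 50_000, 95_367):
    print(step, f"{learning_rate(step):.3e}")
```

The function returns the peak rate exactly at the end of warmup and the minimum rate at the final step; the optimizer settings themselves (betas, weight decay, gradient clipping) are not modeled in this sketch.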
From a hardware perspective, we trained the
models with a cluster of
32 AMD Instinct MI250X GPUs (64 Graphics Compute Dies, GCDs) divided across 8 nodes (4 physical GPUs per node, 8 GCDs) hosted on the GPU partition of the LUMI supercomputer.

E LLM-judge evaluation categories
Fluency
1. The text is grammatically malformed to the point of being incomprehensible (gibberish), is not in Finnish, or contains only a few Finnish words in a sea of another language.
2. The text is understandable and mostly in Finnish, but contains many significant grammatical errors or words from other languages. Grammatical errors may include fundamentally wrong verb conjugations or noun inflections, severe phrasing issues, or highly repetitive words or phrases.
3. The text is fully in Finnish and grammatically correct for the most part, but contains minor, non-critical errors. Examples include unnatural phrasing that suggests a literal translation from another language (calque), occasional incorrect word inflections, or awkward word choices.
4. The text is fully in Finnish and grammatically perfect, idiomatic, and reads as if written by a native Finnish speaker. All word choices, inflections, and structures are natural.
Coherence
1. The continuation is completely unrelated to the previous sentences with no semantically resembling words, concepts, or structure indicating relatedness.
2. The continuation is related to the overarching topic or themes of the previous sentences (e.g., "dogs", "shopping") but fails to connect to the specific events or characters of the sentences.
3. The continuation correctly identifies and uses specific elements (characters, objects, locations) from the previous sentences. However, it contains
some major logical inconsistencies, e.g., by mixing up the subject of some actions or presenting conflicting facts about the elements.
4. The continuation provides a coherent continuation to the previous sentences without illogical or inconsistent elements.

F Example text completions from language control experiment
Table 5 shows controlled Finnish text completions to an example English story context.

Input: Kia had always had a distaste for mushrooms. But then one day she ordered pasta at a restaurant. The sauce was wonderful, and she asked what was in it. To her surprise, the waiter said it was mostly mushrooms!

lm-0: Kia ei ollut ainoa, joka oli ihastunut tähän ruokaan. Pian muutkin asiakkaat alkoivat kertoa syövänsä sieniruokaa. Sitten Kia päätti
English translation: Kia wasn't the only one who was in love with this dish. Soon other customers started telling me they were eating mushroom food. Then Kia decided

lm-1: Kia oli täysin järkyttynyt. Hän ei ollut koskaan syönyt mitään niin hyvää. Mikä ihme on tämä? Mutta siinä vaiheessa hän tajusi, että
English translation: Kia was completely shocked. She had never eaten anything so good. What the hell is this? But at that moment she realized that

lm-2: Kia ei ollut tästä moksiskaan, vaan ajatteli, että totta kai hän söi tuon. Mutta sitten hän huomasi, että ruoka oli
English translation: Kia wasn't at all surprised by this, and thought that of course he ate that. But then he noticed that the food was

lm-5: Kia ei ollut koskaan aiemmin ostanut ravintolasta mitään muuta kuin ruokaa, joten hän ei ollut kovin yllättynyt kuullessaan, että ateria oli tehty sienistä
English translation: Kia had never bought anything other than food from a restaurant before,
so she wasn't too surprised to hear that the meal was made with mushrooms.

Table 5: Example text completions from the language control vector experiment. English translation in italics.