Benchmarking Concept-Spilling Across Languages in LLMs
Summary
This paper introduces a framework for evaluating multilingual semantic robustness in large language models (LLMs) by measuring "language spilling", a phenomenon in which models default to English-based semantic representations when generating content in other languages. The methodology translates 100 high-polysemy English words into nine languages and prompts models to generate five distinct meanings, with examples, for each translated word. A judge model validates these outputs against authoritative dictionaries, and the spilling rate is the proportion of meanings judged invalid. Stronger models produce more valid meanings before resorting to dominant-language semantics, while weaker models do so earlier. Evaluations across 16 models revealed substantial variation in average spilling rates, from 41% for llama-3.1-8b-instruct down to 20% for Apertus-70B-2509. Spanish and Hindi exhibited higher spilling rates than German. In a validation experiment with LLaMA models, approximately 70% of the meanings rejected in the target language were valid in English, supporting the hypothesis of English semantic interference. The framework provides a principled ranking system for model comparison without requiring causal attribution of errors. The authors contribute a scalable benchmark and validation pipeline to advance linguistically balanced AI systems.
Benchmarking Concept-Spilling Across Languages in LLMs

Ilia Badanin (EPFL, ilia.badanin@epfl.ch), Daniil Dzenhaliou (EPFL, daniil.dzenhaliou@epfl.ch), Imanol Schlag (ETH AI Center, ischlag@ethz.ch)

arXiv:2601.12549v1 [cs.CL] 18 Jan 2026

Abstract
Multilingual Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often show a systematic bias toward representations from other languages, resulting in semantic interference when generating content in non-English languages, a phenomenon we define as language spilling. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by systematically measuring how models handle polysemous words across languages. Our methodology provides a relative measure of model performance: when required to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later in the generation sequence, producing more true meanings from the target language before failing, while weaker models resort to dominant-language meanings earlier in the sequence. We evaluate a diverse set of open and closed multilingual LLMs using a structured meaning generation task across nine languages, employing a carefully curated benchmark of 100 high-polysemy English words. Our findings reveal significant variation in semantic robustness across both models and languages, providing a principled ranking system for model comparison without requiring definitive causal attribution of error sources. We contribute both a scalable comparative benchmark for multilingual semantic evaluation and a rigorous validation pipeline, critical tools for developing more linguistically balanced AI systems.

1 Introduction
The advent of Large Language Models (LLMs) has marked a transformative moment in the field of natural language processing, revolutionizing how humans interact with and process information across languages. These models, trained on massive text corpora spanning trillions of tokens, demonstrate remarkable capabilities in generating coherent and contextually relevant text in multiple languages (Brown et al., 2020; Anthropic, 2024; Google DeepMind, 2025). Their multilingual abilities have enabled diverse applications including machine translation, cross-lingual information retrieval, content generation, and sophisticated dialogue systems in various languages. This advancement represents a significant step toward breaking down linguistic barriers in global digital communication and knowledge access.

Figure 1: Overview of our methodology: Starting from polysemous word selection, we generate structured meanings across languages, validate them against dictionary references, and compute spilling rates to assess semantic interference.

Despite these advances, a growing body of research suggests that LLMs often exhibit a subtle but pervasive bias toward English (Rigouts Terryn and de Lhoneux, 2024; Mihaylov and Shtedritski, 2024; Wendler et al., 2024), the language that typically dominates their training data both in volume and quality. Recent work has shown that multilingual LLMs fundamentally make key decisions in a representation space closest to English, regardless of input language (Schut et al., 2025), and employ an English-pivot multilingual mechanism where non-English queries are converted to English at bottom layers before performing reasoning (Huo et al., 2025). English, due to its dominance on the internet and in various digital resources, typically constitutes the vast majority of the training data.
This imbalance has significant implications for how these models learn and represent language. For example, LLaMA-2 was trained primarily on English data: English tokens made up 89.70% of all training tokens (Touvron et al., 2023b).

This English-centric bias manifests in multiple dimensions of language processing: influencing grammatical structures such as word order and agreement patterns (Papadimitriou et al., 2023), stylistic preferences favoring constructions common in English (Enomoto et al., 2025), and most critically, semantic interpretations employed when processing and generating text in non-English languages (Cahyawijaya et al., 2025; Iyer et al., 2023). The resulting asymmetry in language representation creates fundamental challenges in achieving true multilingual parity in language model performance.

This study focuses on a specific and important aspect of this bias that we term language spilling. We define language spilling as the phenomenon where multilingual LLMs, when tasked with understanding or generating content in a target language, inadvertently default to semantic representations from other languages, most commonly English due to its dominance in training data, but potentially from any high-resource language with similar representational patterns.

For example, a model might use the French word “cravate”, which typically means “necktie”, in contexts where an English writer would use the word “tie”. This likely happens because of close latent representations of English and French words. We illustrate this phenomenon in Figure 2.

Word:               cravate (French)
Core Meaning:       necktie (clothing)
Spilling Example 1: “Il a utilisé une cravate pour tirer la selle du cheval”
Literal Trans.:     “He used a necktie to pull the horse’s saddle”
Intended Meaning:   He used a rope/tie to pull... (English interference: tie → cravate)
Spilling Example 2: “Les arbres étaient liés avec des cravates”
Literal Trans.:     “The trees were tied with neckties”
Intended Meaning:   The trees were tied together with ties. (English interference: tie → cravate)

Figure 2: Example of Language Spilling: The model uses the French word “cravate” (necktie) in contexts appropriate for the English word “tie” (fasten/bind), resulting in semantically unnatural sentences.

This phenomenon, also observed in related studies (Li et al., 2025; Wang et al., 2023a), results in outputs that may appear grammatically correct but contain subtle semantic errors that native speakers immediately recognize as unnatural or incorrect. Additional examples across nine languages are provided in Appendix D.

The prevalence of language spilling undermines the reliability of multilingual LLMs in real-world applications. When models fail to capture language-specific semantics and cultural context, they risk misinterpreting user intent and generating misleading content (Veselovsky et al., 2025). This is particularly concerning in high-stakes domains such as legal translation, medical information retrieval, and cross-cultural communication, where semantic precision is essential.
Related work has demonstrated that LLMs exhibit significant inconsistencies in generating text in intended languages (Marchisio et al., 2024) and produce more unsafe responses for non-English queries than English ones (Wang et al., 2023b). As LLMs increasingly serve as interfaces between languages, their ability to maintain semantic fidelity across linguistic boundaries becomes crucial to ensure equitable access to AI technologies worldwide.

In this paper, we present a novel comparative methodology to systematically evaluate multilingual semantic robustness in LLMs. Our approach leverages English words with multiple meanings (high-polysemy words) to assess how well models handle semantic complexity across target languages and to understand why polysemic challenges require eliciting stronger multilingual representations. We instruct models to generate structured meanings and examples for translated terms, then validate these outputs against authoritative target language dictionaries using a judge model with dictionary context. Critically, our methodology provides a relative measure of semantic robustness: when forced to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but stronger models do so later in the generation sequence, producing a higher number of valid meanings before defaulting to dominant-language semantics, while weaker models resort to them earlier, regardless of whether these errors stem from interference from English, other high-resource languages, hallucinations, or other failure modes.

Through this method, we establish comparative rankings across different languages and models, providing a principled framework for multilingual model selection and development without requiring definitive causal attribution of semantic errors.
Our work makes three key contributions: (1) a novel comparative benchmark for evaluating multilingual semantic robustness across nine diverse languages and multiple models, providing principled rankings for model selection; (2) a rigorous methodology for stress-testing semantic validity through structured meaning generation tasks, validated through extensive human evaluation achieving 77.43% agreement with native speakers; and (3) comprehensive empirical analysis demonstrating significant variation in semantic robustness across models and languages, with consistent rankings verified through judge model concordance (Kendall’s W = 0.9176). These contributions provide essential tools for developing more linguistically balanced and culturally inclusive AI systems.

2 Method
Our methodology, illustrated in Figure 1, systematically quantifies semantic robustness in multilingual LLMs through four stages: (1) selecting the top 100 high-polysemy English words as semantic probes; (2) prompting models to generate five distinct meanings with examples for each translated word across target languages; (3) validating responses using a judge model with dictionary definitions to determine semantic validity; and (4) calculating spilling rates as the proportion of invalid meanings. This approach provides a comparative framework for evaluating how well models maintain semantic validity. Our validation experiments confirm that this metric captures meaningful semantic interference patterns (Section 4).
2.1 Polysemous Word Selection and Translation
We identified English words susceptible to semantic misinterpretation by leveraging WordNet’s lexical database (Miller, 1994) to select 100 words with the highest number of different meanings. These high-polysemy words serve as effective probes for language spilling, as their multiple meanings increase the likelihood of semantic confusion across languages. We used GPT-4o (OpenAI et al., 2024) to translate these words into target languages. For example, the English word “tie” (with meanings including necktie, knot, draw in a game, etc.) was translated to French as “cravate”. This translation choice is deliberate: GPT-4o selects translations according to the model’s own understanding of meaning, helping us find translated words that stay closely related to the original English terms, making it easier to detect when models inappropriately transfer semantics across languages. We specifically instructed the translation model to avoid Japanese katakana, as katakana usage for foreign words represents an intended borrowing phenomenon that would confound our analysis.

2.2 Multilingual Meaning Generation
We implemented a structured JSON-based prompting methodology requiring models to generate five distinct meanings with three examples per meaning for each translated word. This “5 meanings” format represents a controlled stress test designed to reveal when models begin fabricating content or resorting to representations from other languages. This standardized challenge creates comparable conditions across models and languages, though we acknowledge it may not reflect natural semantic distributions.
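A minimal sketch of how a response in the "five meanings, three examples each" format might be checked before it is passed to a judge. The JSON schema used here (a top-level `meanings` list whose items carry `definition` and `examples`) and the function name `parse_generation` are assumptions made for illustration; the authors' actual template is in Appendix A.1 and is not reproduced here. Distinctness is approximated by exact string comparison, which is a crude stand-in for a semantic check.

```python
import json

N_MEANINGS = 5   # the task requires exactly five meanings
N_EXAMPLES = 3   # and three example sentences per meaning


def parse_generation(raw: str):
    """Return the list of meanings, or None if the response is malformed.

    Assumed (hypothetical) schema:
    {"meanings": [{"definition": str, "examples": [str, str, str]}, ...]}
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    meanings = data.get("meanings") if isinstance(data, dict) else None
    if not isinstance(meanings, list) or len(meanings) != N_MEANINGS:
        return None
    seen = set()
    for m in meanings:
        if not isinstance(m, dict):
            return None
        definition = m.get("definition")
        examples = m.get("examples")
        if not isinstance(definition, str) or not definition.strip():
            return None
        if not isinstance(examples, list) or len(examples) != N_EXAMPLES:
            return None
        if not all(isinstance(e, str) and e.strip() for e in examples):
            return None
        key = definition.strip().lower()
        if key in seen:  # exact-duplicate definitions are not "distinct"
            return None
        seen.add(key)
    return meanings
```

Returning `None` rather than raising keeps the caller's filtering step simple: malformed generations can be dropped in one pass.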
The prompts were translated into all target languages to minimize bias from English, ensuring that both the prompt and the target words were presented in the same language. Our comparative analysis in Section 4.1 confirmed that while structured formatting did not affect spilling rates, it significantly reduced the noise of formatting inconsistencies and facilitated downstream processing. The complete prompt template for meaning generation is available in Appendix A.1.

2.3 Dictionary-Based Validation
To check if the generated meanings were accurate in the target languages, we used a more powerful model as a judge, incorporating relevant dictionary definitions in its context for each word in the target language to make more informed decisions. This validation process objectively evaluates the generated meanings by leveraging dictionary entries from authoritative sources as the ground truth. For each target language, we carefully selected and integrated comprehensive dictionary excerpts from reliable sources to provide a robust reference framework (see Appendix E for complete details on dictionary sources and validation methodology). Our final dictionary excerpts were reviewed and refined by native speakers. To ensure the reliability of our automated approach, we conducted extensive human evaluation with native speakers, achieving strong agreement between human annotators and our judge model.

The judge prompt instructs the model to pay attention to the information provided from the dictionary and return a set of five True/False values, one for each of the definitions generated by the smaller model.
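The judge's reply of five True/False values has to be parsed, and non-standard replies set aside. The sketch below assumes, purely for illustration, that the judge answers with a JSON array of booleans; the real output format is defined by the prompt in Appendix A.3, and `parse_verdicts` is a hypothetical helper name.

```python
import json


def parse_verdicts(raw: str, expected: int = 5):
    """Parse a judge reply into a list of booleans, or None if non-standard.

    Assumes (hypothetically) that the judge answers with a JSON array such as
    [true, false, true, true, false]; True means the meaning is valid.
    """
    try:
        values = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(values, list) or len(values) != expected:
        return None
    # bool is checked explicitly so that 0/1 or "true" strings are rejected
    if not all(isinstance(v, bool) for v in values):
        return None
    return values
```

Being strict about types here is a design choice: an ambiguous reply is treated as unusable rather than coerced into a verdict.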
This approach transforms the complex task of determining whether a generated meaning is a valid sense into a simplified matching operation against established lexical references. The complete prompt template for meaning validation is available in Appendix A.3.

By aggregating the judge model’s assessments across words, models, and languages, we calculate a spilling rate for each instance, representing the percentage of meanings determined to be invalid. This quantitative measure reflects how often a model produces incorrect or English-influenced meanings in other languages, providing a robust metric to compare different models in terms of multilingual comprehension. Please refer to Section 4.2 for more details on how the choice of prompt and judge affects the alignment with human preferences.

2.4 Spilling Rate Quantification
Our methodology provides a relative measure: when forced to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later, producing more true meanings from the target language before resorting to dominant-language semantics. For example, on average llama-3.1-8b-instruct shows a 41% spilling rate while Apertus-70B-2509 shows 20%, meaning the former exhausts valid semantic content much faster. For each word-language-model combination, we calculate this rate as the percentage of meanings deemed invalid by the judge model when compared against authoritative dictionary definitions.
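Written as a formula, under notation introduced here for illustration only (the symbols are not the authors'): let $v_{w,\ell,m,i} \in \{0,1\}$ be the judge's verdict on the $i$-th of the five meanings that model $m$ generated for word $w$ in language $\ell$, with 1 meaning valid. The per-instance spilling rate is then the share of rejected meanings.

```latex
\[
  s(w,\ell,m) \;=\; \frac{1}{5}\sum_{i=1}^{5}\bigl(1 - v_{w,\ell,m,i}\bigr)
\]
```

For instance, if the judge rejects two of the five meanings, $s = 2/5 = 0.4$, that is, a 40% spilling rate for that word-language-model combination.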
A meaning is considered invalid when it does not appear in dictionary entries for the corresponding word. This rejection may result from English semantic interference, hallucinations, or other failure modes, but the key insight is that stronger models should be more robust across all error types. Our validation experiment in Section 4.3 confirms that approximately 70% of dictionary-rejected meanings in foreign languages are indeed supported by English dictionaries, validating the presence of cross-lingual interference while acknowledging other error sources. This binary classification approach (valid/invalid) enables principled model comparison at multiple levels: per word (revealing which concepts challenge models most), per language, and per model (establishing semantic robustness rankings). The spilling rate functions as a comparative ranking system rather than an absolute measure of English interference.

To ensure measurement consistency, we discard any responses where the judge model returns non-standard outputs or where the generative model fails to produce five distinct meanings in the required format. This filtering mechanism maintains data quality while providing statistically robust comparisons across the evaluation dimensions.

3 Results
This section presents the key findings of our evaluation of language spilling in multilingual LLMs using the methodology described in the previous section. Our analysis focuses on quantifying the average performance of the generative models and the detailed performance across different target languages. The impact of different factors on the evaluation results is explored in Section 4.
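The per-word, per-language, and per-model comparisons described above amount to averaging per-instance spilling rates over different groupings, after discarded responses are dropped. The following is a minimal sketch under assumed names: the record fields (`word`, `language`, `model`, `verdicts`) and both functions are hypothetical, and whether the authors average per instance or pool all meanings is not stated in the text.

```python
from collections import defaultdict
from statistics import mean


def spilling_rate(verdicts):
    """Share of meanings judged invalid (False) among the verdicts."""
    return sum(1 for v in verdicts if not v) / len(verdicts)


def aggregate(records, key):
    """Average the per-instance spilling rate over all records sharing `key`.

    `records` is an iterable of dicts with hypothetical fields
    "word", "language", "model", and "verdicts" (five booleans, or None
    when the response was discarded as malformed).
    """
    buckets = defaultdict(list)
    for r in records:
        if r["verdicts"] is None:  # discarded responses are skipped
            continue
        buckets[r[key]].append(spilling_rate(r["verdicts"]))
    return {k: mean(v) for k, v in buckets.items()}


# Toy data, not taken from the paper.
records = [
    {"word": "w1", "language": "fr", "model": "A",
     "verdicts": [True, True, True, False, False]},
    {"word": "w2", "language": "de", "model": "A",
     "verdicts": [True, True, True, True, True]},
    {"word": "w1", "language": "fr", "model": "B", "verdicts": None},
    {"word": "w1", "language": "fr", "model": "B",
     "verdicts": [True, False, False, False, False]},
]
by_model = aggregate(records, "model")
by_language = aggregate(records, "language")
```

Calling `aggregate` with `"word"` instead gives the per-word view, which is where the hardest concepts would show up.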
3.1 Models and Languages
In this work, we analyze nine languages: Russian, French, Chinese, Japanese, Italian, German, Portuguese, Spanish, and Hindi. We use 16 different models for evaluation from 3 different classes: small open-source models, big open-source models, and closed-source models. Please refer to Appendix B for detailed information about the models. We used Gemini 2.5-Flash (Comanici et al., 2024) as the main judging model. The impact of the choice of judge model is covered in Section 4.2.

3.2 Average Performance of Generative Models
Figure 3 illustrates the average language spilling rate observed across all target languages for each of the generative models evaluated in this study. This provides a high-level comparison of how different models tend to default to English-based representations when generating meanings for polysemous words in other languages. As shown in Figure 3, model llama-3.1-8b-instruct exhibited the highest average spilling rate at 41%, while model Apertus-70B-2509 demonstrated the lowest at 20%.

Figure 3: Average Spilling Rate Across Generative Models. Lower is better. Bar values: gpt-4.1-2025-04-14 0.23; Apertus-70B-2509 0.20; c4ai-aya-expanse-8b 0.24; claude-sonnet-4-20250514 0.23; deepseek-reasoner 0.24; gemini-2.0-flash 0.27; gemma-3-12b-it 0.24; grok-3-latest 0.21; llama-4-scout-17b-16e-instruct 0.28; Llama-3.3-70B-Instruct 0.28; llama-3.1-8b-instruct 0.41; Mistral-Nemo-Instruct-2407 0.34; o3-mini-2025-01-31 0.21; phi-4 0.26; Qwen3-8B 0.21; qwen3-235b-a22b-04-28 0.24.

3.3 Average Language Performance
Figure 4 summarizes the average language spilling rate observed for each of the nine target languages, aggregated across all generative models. This perspective highlights which languages, on average, are more susceptible to eliciting English-based semantic defaults in the LLMs under investigation.
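As a consistency check, the language-level averages reported for Figure 4 appear to match the unweighted mean of the sixteen per-model rates in the corresponding column of Table 1. That the authors averaged in exactly this way is an assumption; the sketch only verifies the arithmetic for the Spanish and German columns, with values copied from Table 1 in row order.

```python
from statistics import mean

# Per-model spilling rates (%) from the Spanish and German columns of Table 1,
# listed in the table's row order (llama-3.1-8b-instruct first,
# Apertus-70B-2509 last).
spanish = [46.9, 41.8, 43.2, 36.8, 41.4, 39.4, 38.6, 42, 47, 35.6,
           42.6, 44, 42, 32.8, 42.4, 34.4]
german = [32.8, 26.8, 17.8, 17, 17.4, 18, 19.4, 19.4, 16, 18.8,
          17.4, 19.2, 12.8, 19.6, 13.4, 12.2]

for name, column in (("Spanish", spanish), ("German", german)):
    # Unweighted mean over the 16 models, printed with one decimal place.
    print(f"{name}: {mean(column):.1f}%")
```

The two means come out at roughly 41% and 19%, in line with the values attached to Figure 4.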
Interestingly, we observed a significant disparity among Romance languages: while French (20%), Portuguese (21%), and Italian (19%) show nearly identical low spilling rates, Spanish is a distinct outlier. We hypothesize this could reflect: (1) differential training data quality and quantity, (2) potential differences in the degree of English semantic borrowing in modern Spanish usage compared to other Romance languages, or (3) variations in our dictionary coverage quality across these languages. Table 1 reveals significant variations in spilling rates across languages. For instance, model DeepSeek-R1-0528 showed a particularly high spilling rate for Spanish (47%), while its performance on German was considerably better (16%). For most models, German is the language with the lowest spilling rate. For a more granular view of the phenomenon of language spilling, please refer to Appendix C.

Figure 4: Average Spilling Rate Across Languages. Lower is better. Bar values: German 0.19; Spanish 0.41; French 0.20; Hindi 0.32; Italian 0.19; Japanese 0.30; Portuguese 0.21; Russian 0.22; Chinese 0.27.

Table 1: Aggregated Language Spilling Rate (%) by Generative Model and Language. Lower is better.

Generative Model                Spanish  Hindi  Japanese  Chinese  Russian  Portuguese  French  Italian  German
llama-3.1-8b-instruct           46.9     53.5   47.7      41.3     45.0     35.4        34.1    35.9     32.8
Mistral-Nemo-Instruct-2407      41.8     49.1   44.5      37.3     29.1     29.5        25.6    24.2     26.8
llama-4-scout-17b-16e-instruct  43.2     32.4   33.3      39.0     26.3     21.2        21.4    20.8     17.8
Llama-3.3-70B-Instruct          36.8     35.2   37.2      42.8     30.2     18.6        20.0    17.2     17.0
gemini-2.0-flash                41.4     38.6   29.2      35.6     20.0     21.8        20.8    17.6     17.4
phi-4                           39.4     31.4   32.8      26.2     21.4     23.4        20.8    19.2     18.0
gemma-3-12b-it                  38.6     33.6   27.2      26.3     19.2     17.8        18.6    18.8     19.4
qwen3-235b-a22b-04-28           42.0     26.8   24.3      23.4     20.6     20.6        20.2    18.4     19.4
deepseek-reasoner               47.0     30.2   23.0      18.2     19.4     22.0        17.4    19.4     16.0
c4ai-aya-expanse-8b             35.6     25.8   28.0      25.8     18.0     22.2        20.6    17.5     18.8
gpt-4.1-2025-04-14              42.6     30.8   23.4      24.6     20.6     19.4        18.0    14.4     17.4
claude-sonnet-4-20250514        44.0     31.0   23.0      23.0     19.2     18.6        17.4    14.6     19.2
grok-3-latest                   42.0     25.4   23.8      23.2     18.6     19.2        14.8    12.4     12.8
Qwen3-8B                        32.8     20.6   23.6      16.6     20.6     16.8        19.2    20.8     19.6
o3-mini-2025-01-31              42.4     23.2   22.4      18.4     18.0     20.4        15.4    14.2     13.4
Apertus-70B-2509                34.4     27.4   36.6      13.2      8.6     17.2        14.8    13.2     12.2

4 Ablation Study
We conducted several ablation studies to evaluate the robustness and performance of our methodology. These experiments focused on the impact of the generation format and the verification process.

4.1 Impact of Generation Format (JSON)
We tested prompting with and without JSON formatting to assess its influence on eliciting semantic spillings and the ease of extracting distinct meanings.

We evaluated the top 15 Russian words manually using the LLaMA-3.3-70B model under the same conditions as in our automatic quantification. The results showed identical spilling rates of 54.67% (41 out of 75 produced meanings) for both JSON and non-JSON formats. The model demonstrated exactly the same spilling cases and exactly the same correct usages of a word in both conditions, essentially replicating its generation patterns while adding syntax symbols for the JSON case.
Despite the identical spilling rates, we opted to use JSON formatting for three key reasons: (1) it kept responses shorter and more structured, (2) it made downstream parsing significantly easier, and (3) it facilitated the judge model’s ability to read and evaluate the generated text.

4.2 Impact of Judge Model Variation
To assess the potential influence of the choice of the evaluation model on the measured spilling rates, we conducted an ablation study comparing the average spilling rates obtained when using different judge models and prompts.

Initially, we explored automated prompt optimization using TextGrad (Yuksekgonul et al., 2024), but found that it did not improve performance on our specific validation task. Instead, we manually designed the 11 alternative prompts for semantic judgment.

We tested 4 high-performing models (DeepSeek-R1-0528, Gemini-2.5-Flash and Gemini-2.5-Pro, and OpenAI’s o3-mini) with 11 different prompts each (Table 2), identifying the best prompt for each judge model.

Table 2: Model Alignment with User Preferences

Prompt  DeepSeek-R1-0528  gemini-2.5-flash  gemini-2.5-pro  o3-mini
1       0.7340            0.6964            -               0.7036
2       0.7407            0.7298            0.7395          0.7564
3       0.7415            0.7523            0.7560          0.7321
4       0.7352            0.7456            0.7341          0.7209
5       0.7389            0.7491            0.7395          0.7561
6       0.6952            0.6338            0.7184          0.7250
7       0.7467            0.7743            0.7589          0.7564
8       0.7578            0.7468            0.7395          0.7615
9       0.7560            0.7450            0.7486          0.7596
10      0.7519            0.7371            0.7743          0.7376
11      0.7744            0.7505            0.7660          0.7106
Min     0.6952            0.6338            0.7184          0.7036
Max     0.7744            0.7743            0.7743          0.7615
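Identifying the best prompt for each judge is an argmax over a column of Table 2. The sketch below copies the table's values and picks, for each judge, the prompt index with the highest human-agreement score; `best_prompt` is a hypothetical helper, and ties (such as o3-mini's prompts 2 and 7) do not affect the maximum here.

```python
# Agreement with human annotators per prompt (rows 1..11), copied from Table 2.
# None marks the missing gemini-2.5-pro entry for prompt 1.
table2 = {
    "DeepSeek-R1-0528": [0.7340, 0.7407, 0.7415, 0.7352, 0.7389, 0.6952,
                         0.7467, 0.7578, 0.7560, 0.7519, 0.7744],
    "gemini-2.5-flash": [0.6964, 0.7298, 0.7523, 0.7456, 0.7491, 0.6338,
                         0.7743, 0.7468, 0.7450, 0.7371, 0.7505],
    "gemini-2.5-pro": [None, 0.7395, 0.7560, 0.7341, 0.7395, 0.7184,
                       0.7589, 0.7395, 0.7486, 0.7743, 0.7660],
    "o3-mini": [0.7036, 0.7564, 0.7321, 0.7209, 0.7561, 0.7250,
                0.7564, 0.7615, 0.7596, 0.7376, 0.7106],
}


def best_prompt(scores):
    """Return (1-based prompt index, score) of the highest available score."""
    scored = [(s, i) for i, s in enumerate(scores, start=1) if s is not None]
    score, index = max(scored, key=lambda t: t[0])
    return index, score


best = {judge: best_prompt(scores) for judge, scores in table2.items()}
```

For gemini-2.5-flash this selects prompt 7 with a score of 0.7743, which matches the Max row of the table.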
The human agreement scores were as follows: DeepSeek 77.44%, Gemini 2.5 Flash 77.43%, Gemini 2.5 Pro 76.60%, and o3-mini 76.15%. On the same subset, the pairwise agreement between the models ranges from 84.59% to 92.61%, which is a strong indicator of the similarity of the models’ judgement. Regardless of the specific judge model used, judges order generative models almost identically, as evidenced by Kendall’s coefficient of concordance (W = 0.9176), indicating very strong agreement among judges. In our final experiments, we ultimately chose Gemini 2.5 Flash as our primary judge model, with prompt 7.

4.3 Language Spilling Validation Experiment
To validate that our metric captures actual language spilling rather than general semantic errors, we conducted an additional experiment using LLaMA models, which are known to be heavily English-biased, since 89.70% of the training data consisted of English (Touvron et al., 2023b).

Using meaning generations for 100 words in 9 languages, we filtered for foreign language meanings that were rejected by the foreign dictionary according to our judge. The incorrect meanings were then translated into English and evaluated using the same judge with an English dictionary (WordReference). In this setting, we found that approximately 70% of the meanings that were rejected in the foreign language were accepted in English. This finding indicates that the majority of invalid meanings are not random hallucinations but rather instances of cross-lingual semantic transfer.
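The validation experiment reduces to a single ratio: of the meanings rejected against the target-language dictionary, how many are accepted once translated and checked against an English dictionary. A minimal sketch follows, with the second judge pass abstracted as a caller-supplied function; `english_valid_share` and `is_valid_in_english` are hypothetical names, and the toy lookup below stands in for the judge and is not real dictionary data.

```python
def english_valid_share(rejected, is_valid_in_english):
    """Fraction of target-language rejections that an English check accepts.

    `rejected` holds meanings (already translated into English) that the judge
    rejected against the target-language dictionary. `is_valid_in_english` is
    a caller-supplied callable standing in for the second judge pass with an
    English dictionary.
    """
    rejected = list(rejected)
    if not rejected:
        return 0.0
    accepted = sum(1 for meaning in rejected if is_valid_in_english(meaning))
    return accepted / len(rejected)


# Toy stand-in for the English judge: a fixed set of accepted glosses.
toy_english_senses = {"a cord used for fastening", "a draw in a game"}
share = english_valid_share(
    ["a cord used for fastening", "a draw in a game", "a kind of fish"],
    lambda meaning: meaning in toy_english_senses,
)
```

In this toy run two of the three rejected meanings pass the English check, so `share` is two thirds.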
While this does not definitively prove causation at the representational level, the high proportion of English-valid meanings (70%) provides strong empirical support for the hypothesis that English semantic interference is a dominant source of error in multilingual generation.

5 Related Work
5.1 English Bias in Multilingual LLMs
Multilingual LLMs often inherit English-centric semantic biases, producing outputs in other languages that reflect English grammar or word senses. Papadimitriou et al. (2023) first documented “grammatical structure bias”, demonstrating that multilingual BERT preferred English-like syntax (explicit pronouns, SVO word order) even when modeling Spanish or Greek. This work shows that higher-resource language structures “bleed into” lower-resource ones, a phenomenon similar to the language spilling we investigate. Similarly, Chollampatt et al. (2025) note that most LLMs are trained on English corpora and “are not able to replicate similar success in other languages”. In machine translation, Zhang and Toral (2019) find that translationese (text translated from English) is systematically easier for MT systems. Using a translationese-free evaluation protocol, Enomoto et al. (2025) specifically compare English versus native instructions for multilingual LLMs, finding that models still “adhere more closely to English instructions” even on non-English tasks. Together, these works document an English-default bias in grammar and prompt-following across multilingual models.

5.2 Cross-Lingual Polysemy and Meaning Preservation
A major facet of language spilling is polysemy: using an English-based sense where the target language would use a different sense.
differ in meaning) and shows that LLM outputs tend to be biased toward the “higher-resource” sense. For example, “Angel” has a false friend in German, where the same spelling means “fishing rod”. Iyer et al. (2023) similarly study ambiguous sentences with highly polysemous English words. They translate such sentences into Spanish, Italian, German, Russian, and Chinese and ask LLMs to disambiguate. Strikingly, very large LLMs (e.g., BLOOMZ-176B (Muennighoff et al., 2022), LLaMA-65B (Touvron et al., 2023a)) matched or outperformed top MT systems, resolving ambiguous senses correctly most of the time. However, Campolungo et al. (2022) found that conventional NMT systems fail on 40–50% of these cases due to entrenched biases in training data. These works highlight that multilingual models struggle to accurately preserve meaning across languages when faced with polysemy, often defaulting to an English-centric interpretation of word senses.
5.3 Structured Prompting and Error Diagnosis
Finally, several methods use structured prompting to force LLMs into explicit semantics and detect errors. For example, ASPIRO (Vejvar and Fujimoto, 2023) uses a JSON-format output prompt to verbalize RDF triples, enforcing named-entity-agnostic templates and including parser-based error checks. By repeating generation until the output parses correctly, it significantly reduces generation errors. This illustrates how parsing-error-induced reprompting can steer an LLM to produce the intended semantics. In a similar spirit, MLPrompt (Wang et al., 2025) translates error-prone constraints
into a non-dominant language to help catch mistakes. Since LLMs often overlook rules stated in a dominant language like English, rewriting a violated rule in, say, Spanish, can strengthen the model’s understanding and reasoning. Such cross-lingual prompting can expose English-default assumptions: the model is forced to engage with the semantics of another language when verifying its output. Chain-of-Dictionary prompting (Lu et al., 2024) likewise adds multilingual dictionary chains into prompts to improve translation by explicitly including lexical semantics.
6 Limitations
This study faces several inherent limitations, primarily rooted in the challenges of evaluating semantic spillings across different languages. While the methodology simplifies the process through automation, the accuracy of the approach depends on having a carefully designed set of prompts and words that are likely to induce spillings.
An additional limitation stems from the reliance on dictionary definitions as the sole source of ground truth. Dictionaries may not always capture the full spectrum of a word’s usage, including idiomatic expressions or culturally specific connotations, which could lead to instances being classified as spillings when they represent valid but less common interpretations. Furthermore, our analysis primarily focused on polysemous words. Semantic spilling might manifest differently with other linguistic phenomena such as idiomatic expressions, metaphors, or words with strong cultural connotations.
7 Future Work
In our extended work, we aim to collaborate with linguistics departments to refine and expand the dictionary resources used
in our evaluation pipeline. Additionally, we plan to incorporate human evaluations to more directly assess the accuracy and nuance of the model-generated meanings. Future work should also incorporate sense-aligned multilingual dictionaries to better handle translation ambiguity, as sense distinctions differ across languages and our current approach uses single default translations. Furthermore, we plan to extend our evaluation to context-dependent usage (pragmatic competence), examining spilling behavior when words are presented within sentences rather than in isolation.
Beyond English-centric spilling, we will also explore cross-language spilling between closely related or script-sharing languages, such as Russian to Bulgarian, Spanish to Basque, or Hindi to Sanskrit. These cases are particularly interesting because shared scripts or token overlap may cause models to incorrectly assume semantic similarity, leading to subtle but systematic errors. We expect that these forms of structural or token-level interference will reveal further challenges in multilingual semantic representation.
8 Conclusion
This study introduced a novel methodology for systematically investigating and quantifying the phenomenon of language spilling in multilingual Large Language Models. By employing a structured approach involving the generation of JSON-formatted meanings for translated polysemous English words, automated validation against target language dictionaries using a larger LLM with dictionary inclusion, and comprehensive human evaluation for validation, we were able to identify and analyze instances where models inadvertently
defaulted to English-based semantic representations.
Our findings revealed significant variations in language spilling rates across different generative models and target languages, highlighting the nuanced challenges these models face in truly mastering multilingual semantics. The ablation studies confirmed that JSON formatting did not affect spilling rates but improved parsing and readability for the judge LLM. The exploration of different judge models in the ablation study provided insights into the stability of our evaluation framework.
Ultimately, this research contributes to a deeper understanding of the complexities inherent in multilingual LLMs and provides a foundation for future work aimed at enhancing their cross-lingual semantic accuracy and reliability, paving the way for more robust and trustworthy multilingual natural language processing systems. To ensure reproducibility and support future research, we will make our complete codebase, benchmark datasets, dictionary resources, and human evaluation annotations publicly available.
References
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, and 8 others. 2024. Phi-4 technical report. Preprint, arXiv:2412.08905.
Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Technical report, Anthropic. Available at: https://www.anthropic.com/news/claude-3-family.
Anthropic. 2025. Claude sonnet 4 (claude-sonnet-4-20250514).
https://www.anthropic.com/news/claude-4. Release date: 22 May 2025; access version “claude-sonnet-4-20250514”; accessed 2025-07-28.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Samuel Cahyawijaya, Ruochen Zhang, Jan Christian Blaise Cruz, Holy Lovenia, Elisa Gilbert, Hiroki Nomoto, and Alham Fikri Aji. 2025. Thank you, stingray: Multilingual large language models can not (yet) disambiguate cross-lingual word senses. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3228–3250, Albuquerque, New Mexico. Association for Computational Linguistics.
Niccolò Campolungo, Federico Martelli, Francesco Saina, and Roberto Navigli. 2022. DiBiMT: A novel benchmark for measuring Word Sense Disambiguation biases in Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4331–4352, Dublin, Ireland. Association for Computational Linguistics.
Shamil Chollampatt, Minh Quang Pham, Sathish Reddy Indurthi, and Marco Turchi. 2025. Cross-lingual evaluation of multilingual text generation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 7766–7777, Abu Dhabi, UAE. Association for Computational Linguistics.
Gheorghe Comanici, Eric Bieber, and 1 others. 2024. Gemini 2.5: Pushing the frontier with advanced reasoning,
multimodality, long context, and next generation agentic capabilities. Technical Report 2507.06261, arXiv.
John Dang, Shivalika Singh, Daniel D’souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, and 26 others. 2024. Aya expanse: Combining research breakthroughs for a new multilingual frontier. Preprint, arXiv:2412.04261.
Google DeepMind. 2024. Gemini 2.0 flash. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/. Introduced Dec 11 2024; image and audio multimodal abilities; accessed 2025-07-28.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
Taisei Enomoto, Hwichan Kim, Zhousi Chen, and Mamoru Komachi. 2025. A fair comparison without translationese: English vs. target-language instructions for multilingual LLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 649–670, Albuquerque, New Mexico. Association for Computational Linguistics.
Google Cloud. 2024. Structure prompts | generative ai on vertex ai | google
cloud. https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/structure-prompts. Accessed: 2025-05-14.
Google DeepMind. 2025. Gemini 2.5: Our newest Gemini model with thinking. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/. Accessed: May 15, 2025.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, and 82 others. 2025. Apertus: Democratizing open and compliant llms for global language environments. Preprint, arXiv:2509.14233.
Wenshuai Huo, Xiaocheng Feng, Yichong Huang, Chengpeng Fu, Baohang Li, Yangfan Ye, Zhirui Zhang, Dandan Tu, Duyu Tang, Yunfei Lu, Hui Wang, and Bing Qin. 2025. Enhancing non-english capabilities of english-centric large language models through deep supervision fine-tuning. arXiv preprint. Accepted at AAAI 2025.
Vivek Iyer, Pinzhen Chen, and Alexandra Birch. 2023. Towards effective disambiguation for machine translation with large language models. In Proceedings of the Eighth Conference on Machine Translation,
pages 482–495, Singapore. Association for Computational Linguistics.
Yafu Li, Ronghao Zhang, Zhilin Wang, Huajian Zhang, Leyang Cui, Yongjing Yin, Tong Xiao, and Yue Zhang. 2025. Lost in literalism: How supervised training shapes translationese in llms. Preprint, arXiv:2503.04369.
Hongyuan Lu, Haoran Yang, Haoyang Huang, Dongdong Zhang, Wai Lam, and Furu Wei. 2024. Chain-of-dictionary prompting elicits translation in large language models. Preprint, arXiv:2305.06575.
Kelly Marchisio, Wei-Yin Ko, Alexandre Berard, Théo Dehaze, and Sebastian Ruder. 2024. Understanding and mitigating language confusion in LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6653–6677, Miami, Florida, USA. Association for Computational Linguistics.
Meta Platforms, Inc. 2025. Llama 4 scout. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Compact MoE model with 17B active parameters, 10M-token context window; access under Llama 4 Community License; accessed 2025-07-28.
Viktor Mihaylov and Aleksandar Shtedritski. 2024. What an elegant bridge: Multilingual LLMs are biased similarly in different languages. In Proceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 16–23, Miami, FL, USA. Association for Computational Linguistics.
George A. Miller. 1994. WordNet: A lexical database for English. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
Mistral AI. 2024. Mistral nemo: Our new best small model. Accessed: 2025-07-28.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven
Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, and 1 others. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. Gpt-4o system card. Preprint, arXiv:2410.21276.
OpenAI. 2025a. Gpt-4.1. https://openai.com/index/gpt-4-1/. Includes GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano; accessed 2025-07-28.
OpenAI. 2025b. Openai o3-mini. https://openai.com/index/openai-o3-mini/. General-availability release of the reasoning model; accessed 2025-07-28.
Isabel Papadimitriou, Kezia Lopez, and Dan Jurafsky. 2023. Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1194–1200, Dubrovnik, Croatia. Association for Computational Linguistics.
Ayla Rigouts Terryn and Miryam de Lhoneux. 2024. Exploratory study on the impact of English bias of generative large language models in Dutch and French. In Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024, pages 12–27, Torino, Italia. ELRA and ICCL.
Lisa Schut, Yarin Gal, and Sebastian Farquhar. 2025. Do multilingual llms think in english?
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis
Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. Gemma 3 technical report. Preprint, arXiv:2503.19786.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023b. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Martin Vejvar and Yasutaka Fujimoto. 2023. ASPIRO: Any-shot structured parsing-error-induced ReprOmpting for consistent data-to-text generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3550–3563, Singapore. Association for Computational Linguistics.
Veniamin Veselovsky, Berke Argin, Benedikt Stroebl, Chris Wendler, Robert West, James Evans, Thomas L. Griffiths, and Arvind Narayanan. 2025. Localized cultural knowledge is conserved and controllable in large language models. Preprint, arXiv:2504.10191.
Jiaan Wang, Fandong Meng, Yunlong Liang, Tingyi Zhang, Jiarong Xu, Zhixu Li, and Jie Zhou. 2023a. Understanding translationese in cross-lingual summarization.
Preprint, arXiv:2212.07220.
Teng Wang, Zhenqi He, Wing-Yin Yu, Xiaojin Fu, and Xiongwei Han. 2025. Large language models are good multi-lingual learners: When LLMs meet cross-lingual prompts. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4442–4456, Abu Dhabi, UAE. Association for Computational Linguistics.
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael R. Lyu. 2023b. All languages matter: On the multilingual safety of large language models. arXiv preprint. Accepted by ACL 2024 Findings.
Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. Do llamas work in english? on the latent language of multilingual transformers. Preprint, arXiv:2402.10588.
xAI. 2025. Grok 3. https://grok.com/ (xAI official announcement). Released Feb 17 2025; advanced reasoning (“Think”/“Big Brain”), DeepSearch tool, available via X Premium+ and SuperGrok tiers; accessed 2025-07-28.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. Textgrad: Automatic “differentiation” via text. Preprint, arXiv:2406.07496.
Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages
73–81, Florence, Italy. Association for Computational Linguistics.
A Prompts
A.1 Meanings Generation Prompt
Create a list of 5 distinct meanings for the word '%s'. For each meaning,
provide exactly 3 examples of its usage. Your response must be in JSON format
only. Do not include anything else in your response. The JSON should follow
this structure: meaning: [examples], meaning: [examples], ...
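The prompt above can be exercised with a few lines of Python. The following is a minimal sketch, not the authors' code: the helper names `build_prompt` and `parse_meanings` are hypothetical, and the parser assumes the reply contains a single JSON object mapping each meaning to its list of examples, possibly wrapped in extra text.

```python
import json

# Prompt text copied from A.1; '%s' is replaced by the target-language word.
MEANINGS_PROMPT = (
    "Create a list of 5 distinct meanings for the word '%s'. For each meaning, "
    "provide exactly 3 examples of its usage. Your response must be in JSON format "
    "only. Do not include anything else in your response. The JSON should follow "
    "this structure: meaning: [examples], meaning: [examples], ..."
)


def build_prompt(word):
    """Insert the word into the A.1 template (hypothetical helper)."""
    return MEANINGS_PROMPT % word


def parse_meanings(reply, n_meanings=5, n_examples=3):
    """Parse a reply into {meaning: [examples]} and check its shape.

    Assumption: the reply holds one JSON object, possibly wrapped in extra
    text such as a code fence, so only the outermost {...} span is kept.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    if not isinstance(data, dict) or len(data) != n_meanings:
        raise ValueError("expected %d meanings" % n_meanings)
    for meaning, examples in data.items():
        if not isinstance(examples, list) or len(examples) != n_examples:
            raise ValueError("expected %d examples for %r" % (n_examples, meaning))
    return data
```

Malformed JSON also surfaces as `ValueError`, since `json.JSONDecodeError` is a subclass of it, so a caller can treat every rejected reply the same way.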
A.2 Translation Generation Prompt
The following prompt was used to generate translations for the polysemous word set:
You are a multilingual translator. Given a list of English words, translate each word
into: **Russian (ru)**, **French (fr)**, ..., and **Hindi (hi)**. Ensure that
translations are **nouns whenever possible**. Ensure that Japanese translations are
**not** in katakana. Return the translations as a JSON object in the following format:
{"translations": {"word1": {"ru": "Russian translation", "fr": "French translation",
..., "hi": "Hindi translation"}, "word2": {"ru": "Russian translation", ...} ...}}
Here is the list of words to translate: {words_batch}
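As an illustration of how a reply to this prompt might be checked, here is a minimal Python sketch. The helper names `fill_words` and `check_translations` are hypothetical, and serializing the word batch as a JSON list is an assumption; the paper does not state how `{words_batch}` was filled. The language codes are passed in by the caller because the prompt elides most of them.

```python
import json

# Last line of the A.2 prompt; the full prompt also contains literal JSON braces.
PROMPT_TAIL = "Here is the list of words to translate: {words_batch}"


def fill_words(template, words):
    """Substitute the word batch without str.format, which would trip on the
    literal braces of the JSON example in the full prompt.

    Assumption: the batch is serialized as a JSON list.
    """
    return template.replace("{words_batch}", json.dumps(words, ensure_ascii=False))


def check_translations(reply, words, langs):
    """Return the translations table, raising if any word/language is missing."""
    table = json.loads(reply)["translations"]
    missing = []
    for word in words:
        entry = table.get(word)
        for lang in langs:
            if not isinstance(entry, dict) or not entry.get(lang):
                missing.append((word, lang))
    if missing:
        raise ValueError("missing translations: %r" % missing)
    return table
```

Checking completeness before generation starts keeps a missing translation from silently shrinking the benchmark for one language.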
A.3 Judging LLM Prompt
We structured the prompt using XML tags to enhance response quality, as recommended in
Google Cloud’s prompt design guidelines (Google Cloud, 2024).
You are an AI assistant specializing in verifying meanings of terms using a structured
knowledge base. Your primary task is to evaluate whether the meanings provided in the
<QUERY> section match, partially align with, or differ from the ground truth meanings stored in
the <DICTIONARY> section. The <QUERY> section may contain unstructured text, and your goal is to extract
relevant terms and assess their meanings
accurately.
<DICTIONARY> %s </DICTIONARY>
<INSTRUCTIONS>
0. Knowledge to match to is DICTIONARY
1. Queries that may not be structured
2. List of `true/false` values for each query. No other output
3. Matching Logic:
- Check if the 'query' matches any 'knowledge entry'. Return 'true' if match found, 'false' if not.
- The number of output entries should match the number of query entries
- It should be true if it's close
- The response should be as a JavaScript list [], with true, false
</INSTRUCTIONS>
<QUERY> %s </QUERY>
B Models
Model Name | Type | # Parameters
Qwen3-8B (Yang et al., 2025) | Open Source | 8B
meta-llama-3.1-8b-instruct (Grattafiori et al., 2024) | Open Source | 8B
aya-expanse-8b (Dang et al., 2024) | Open Source | 8B
Mistral-Nemo-Instruct-2407 (Mistral AI, 2024) | Open Source | 12B
google-gemma-3-12b-it (Team et al., 2025) | Open Source | 12B
phi-4 (Abdin et al., 2024) | Open Source | 14B
LLaMA-3.3-70B-Instruct (Grattafiori et al., 2024) | Open Source | 70B
Apertus-70B-2509 (Hernández-Cano et al., 2025) | Open Source | 70B
meta-llama-4-scout-17b-16e-instruct (Meta Platforms, 2025) | Open Source | 109B
qwen3-235b-a22b-04-28 (Yang et al., 2025) | Open Source | 235B
DeepSeek-R1-0528 (DeepSeek-AI et al., 2025) | Open Source | 671B
grok-3-latest (xAI, 2025) | Closed Source | Unknown
gpt-4.1 (OpenAI, 2025a) | Closed Source | Unknown
o3-mini (OpenAI, 2025b) | Closed Source | Unknown
claude-sonnet-4-20250514 (Anthropic, 2025) | Closed Source | Unknown
gemini-2.0-flash (DeepMind, 2024) | Closed Source | Unknown
Table 3: Evaluated models with type and parameter count.
Among the evaluated models, only LLaMA, Aya, and Apertus explicitly disclose training on multilingual data and languages used for training.
Although the LLaMA-2 family was not included in this exploration, it is worth noting that it remains one of the few popular models that has disclosed detailed statistics on language distribution in its training data. For the remaining models, although they demonstrate capabilities in multilingual tasks and are frequently evaluated on multilingual benchmarks, no information is provided regarding the actual linguistic composition of their training data. As such, we cannot say what languages these models were exposed to during pretraining. This lack of transparency limits interpretability when comparing cross-lingual behavior and bias, particularly with respect to phenomena like language spilling. With language distribution data for just one model, drawing rigorous correlations with training data composition is not statistically sound. That is why we focus on linguistic and architectural factors we can reliably analyze across all models.
OpenAI, DeepSeek, Cohere’s Aya, Gemini, Grok, and Claude models were accessed through their official APIs. Qwen3-235B, LLaMA-4 Scout, Gemma 3, and Mistral Nemo were used via the OpenRouter API. Apertus-70B-2509, LLaMA-3.3-70B, Qwen3-8B, and Phi-4 were launched on a local cluster. No additional generation parameters were passed; all models used default generation and tokenizer parameters. Our API usage complies with the respective terms of service and does not violate any terms of agreement.
C Supplementary Figures
Figure 5: Spilling Heatmap of Different Language Models Across Various Languages. Scale indicates spilling
rate.
[Figure 5 plot: heatmap titled “Language Model Performance Across Languages”; axes are the evaluated models and the languages (Spanish, Hindi, Japanese, Chinese, Russian, Portuguese, French, Italian, German); color scale from 0.1 to 0.55.]
[Figure 6 plot: “Detailed Histogram of Spilling Rate for Each Language and Model”; evaluated models on one axis, Spilling Rate from 0 to 1 on the other, languages in the legend.]
Figure 6: Detailed Language Spilling Rate for Each Generative Model
In this section we provide a more detailed quantitative analysis of the spilling in models and languages. Figure 6 presents a breakdown of the spilling rates for each generative model across the nine target languages. This allows for the identification of specific language pairs where language spilling is more or less pronounced for each model.
D Qualitative Analysis of the Responses
D.1 German
Model: gpt-4o, Word: slip / Rutsch
Model Response:
"Eindruck, Etwas Günstiges zu Erhalten": [
"Er bekam den Job mit einem Rutsch.",
"Die Beförderung kam mehr oder weniger mit einem Rutsch.",
"Der Eintritt in die neue Firma
verlief mit einem Rutsch."
],
Explanation: Using “Rutsch” for “getting something easily” is incorrect. While “slip” in English can mean something obtained easily (like “it slipped through”), “Rutsch” in German doesn’t carry this meaning. The idiomatic use doesn’t transfer.
Model: LLaMA-3.3-70B-Instruct, Word: charge / Ladung
Model Response:
"Aufgabe oder Pflicht": [
"Die Ladung des Chefs ist es, die Firma zu führen.",
"Die Ladung des Lehrers ist es, die Schüler zu unterrichten.",
"Die Ladung des Politikers ist es, das Wohl der Bürger zu vertreten."
]
Explanation: In English, “charge” can mean a responsibility or duty (e.g., “in charge of”), but “Ladung” in German does not carry this meaning. “Ladung” means electrical charge, cargo/freight, or explosive charge, not responsibility. German would use “Aufgabe,” “Verantwortung,” or “Pflicht” instead.
D.2 French
Model: gpt-4o, Word: collar / col
Model Response:
"Nœud ou raccord entre deux tronçons dans un tuyau": [
"Le plombier a réparé le col qui fuyait dans le sous-sol.",
"Il faut vérifier régulièrement le col des tuyaux pour éviter les fuites.",
"L'installation comprenait plusieurs cols pour relier les différentes sections."
],
Explanation: The English sense of “collar” as a mechanical component (a restraining or connecting band, ring, or pipe) has no counterpart in standard French usage: “col” is not used for a pipe joint. This discrepancy indicates a potential instance where English vocabulary might be misleading.
Model: meta-llama-3.1-8b-instruct, Word: tie / cravate
Model Response:
"Cordelette" : [
"Il a utilisé une cravate pour tirer la selle du cheval",
"Les arbres étaient liés avec des
Model: meta-llama-3.1-8b-instruct, Word: tie / cravate
Model Response: "Cordelette" : [ "Il a utilisé une cravate pour tirer la selle du cheval", "Les arbres étaient liés avec des cravates pour empêcher la chute des branches", "La cravate était trop serrée et a causé une douleur" ],
Explanation: The target word is “cravate”, which means a necktie, an item of clothing. However, the model uses it in the sense of a tie as a knot in a rope. The sentences translate as: “He used a [neck]tie to pull the horse saddle”, “The trees were tied with [neck]ties”, “The necktie was too tight and caused pain”. Only the last sentence is compatible with the necktie sense; the model appears to have interpreted the word “tie” in the sense of a knot.

D.3 Italian

Model: DeepSeek-R1-Distill-Llama-8B, Word: lift / ascensore
Model Response: "significato": ["un termine religioso o spirituale per indicare un' elevazione o un rinnovamento", ["L'ascensore durante la meditazione lo portò a una nuova consapevolezza.", "La sua fede lo ha guidato verso un ascensore spirituale.", "L'ascensore della sua anima lo portò a un livello di consapevolezza superiore."]]
Explanation: The use of “ascensore” as a metaphor for spiritual elevation is non-native and incorrect in Italian. The model appears to have mistakenly transferred a metaphor from English (such as the phrase “lift spirits”) into Italian without considering idiomatic usage. Alternatively, the spilling might come from French, from the expression “être élevé spirituellement” (to be spiritually lifted).

Model: gpt-4o, Word: cast / cast
Model Response: "gettare o lanciare qualcosa": [ "Il pescatore ha cast la rete in mare.", "Ha cast il dado e sperava in un numero fortunato.", "Ha cast la palla al suo compagno di squadra." ],
Explanation: The English meaning of “cast” as in “to cast a net” is mistranslated in the Italian sentence. While
in English, “to cast a net” is perfectly idiomatic and standard, this sense does not transfer directly into Italian using the verb “cast”. In Italian, “cast” is a borrowed word that usually means the group of actors in a film, play, or show.

D.4 Portuguese

Model: meta-llama-3.1-8b-instruct, Word: link / link
Model Response: "Relacionamento": [ "O link entre eles era de amizade, mas agora é de rivalidade.", "O link entre a mãe e o filho é muito forte.", "O link entre as duas culturas é fundamental para o entendimento mútuo." ],
Explanation: In the three Portuguese examples in this subsection, three different models use “link” to mean a relationship, whereas in Portuguese the word only means a link on the internet or in a document. The relational sense reflects English influence.

Model: LLaMA-3.3-70B-Instruct, Word: link / link
Model Response: "Relação": ["Há um link entre a dieta e a saúde", "O link entre a educação e o desenvolvimento econômico é bem estabelecido", "O link entre a poluição e as mudanças climáticas é um tema de debate"],
Explanation: Same as for the first example in this subsection.

Model: google-gemma-3-12b-it, Word: link / link
Model Response: "relação (pessoas)": [ "Ela sentiu um link imediato com o novo colega de trabalho.", "Eles compartilham um link profundo através de suas experiências em comum.", "Apesar da distância, eles mantiveram um link forte." ],
Explanation: Same as for the first example in this subsection.

D.5 Spanish

Model: meta-llama-3.1-8b-instruct, Word: passage / pasaje
Model Response: "Intermedio o etapa en un proceso": [ "La empresa tiene varios pasajes importantes en su historia.", "El pasaje de la vida de la empresa ha sido muy difícil.", "El pasaje de la infancia a la adolescencia es un momento crucial." ],
Explanation: Even though the word “pasaje” has a large variety of meanings,
you cannot use this word to mean a step in a process. Nor can “passage” be used to mean a step in English. This suggests that the latent representations for “passage” and “step” are close to each other.

Model: gemini-1.5-flash-8b, Word: pack / paquete
Model Response: "Grupo de personas o cosas que actúan juntas": [ "El equipo era un paquete de profesionales.", "El paquete de inversores estaba muy interesado en el proyecto.", "Era un paquete de amigos muy unidos." ],
Explanation: The word “paquete” cannot be used in Spanish to mean a group of people. In English, however, “pack” can denote a group of wolves or, informally, a group of people, usually with a negative connotation, as in “pack of hooligans” or “A pack of reporters surrounded the celebrity”.

Model: gpt-4o, Word: grip / agarre
Model Response:
"Influencia o control sobre una situación": [ "El nuevo director tiene un fuerte agarre sobre la empresa.", "Parece que la organización perdió el agarre sobre sus filiales en el extranjero.", "Ella siempre tuvo un agarre firme sobre su equipo, lo que le permitió liderar con éxito." ],
"Entendimiento o comprensión de un concepto": [ "Tiene un buen agarre de las teorías matemáticas avanzadas.", "Necesito mejorar mi agarre del idioma francés para poder comunicarme mejor.", "Su agarre sobre la historia medieval le hizo destacar en la clase." ],
Explanation: The word “agarre” cannot be used in Spanish to mean power, influence, or control, although a similar expression (“having a grip”) exists in English. There is also a related expression in French, “prendre en grippe”, which may be an additional source of spilling. The second example uses the English meaning of the word “grip” as “(of a feeling or emotion) deeply affect (someone)”, as in the phrase “she was gripped by a feeling of excitement”; the word “agarre” has no such meaning in Spanish.

D.6 Russian

Model: Apertus-70B-2509, Word: collar / воротник
Model Response: "значение": "Часть механизма или машины", "примеры": [ "Воротник насоса был поврежден, и его нужно было заменить.", "Он регулировал воротник на шланге, чтобы вода текла медленнее.", "Воротник дробилки был забит, и ее нужно было прочистить." ]
Explanation: The Russian word воротник typically refers only to the part of clothing that surrounds the neck (e.g., shirt collar), and cannot be used to describe a mechanical component like a clamp or ring. In English, however, “collar” can mean a mechanical part (e.g., “shaft collar”, “pipe collar”). This usage does not transfer into Russian, which makes this an example of spilling.

Model: meta-llama-3.1-8b-instruct, Word: catch / улов
Model Response: "уловляю": [ "Я уловляю каждое слово, что говорит он.", "Он уловлял каждое движение противника.", "На эту игру я всегда уловляю." ],
Explanation: In Russian, the word улов cannot be used to mean “to understand”, which is the sense used in this response. The English verb “catch” and synonyms such as “grasp” can mean to understand or comprehend. The noun улов refers specifically to a catch of fish or animals, not to comprehension or perception. This is an instance of spilling.
D.7 Chinese

Model: claude-sonnet-4-20250514, Word: match / 火柴
Model Response: "体育比赛或竞赛": [ "今晚有一场重要的足球火柴", "这场火柴的胜负关系到总冠军", "观众们都期待这场火柴的结果" ],
Explanation: The word 火柴 means only matches (sticks used to start a fire). However, the model uses it to mean a sports match. The response translates as “Sports match or competition”: [“There is an important football match tonight”, “The outcome of this match is related to the championship”, “The audience is looking forward to the result of this match”].

Model: DeepSeek-R1-0528, Word: mantle / 地幔
Model Response: "比喻责任或重担": [ "他接过了领导团队的地幔,开始新的项目规划", "作为长子,他早早承担起家庭的地幔", "这份协议将环保地幔转移给了生产企业" ],
Explanation: The only meaning of 地幔 is the Earth's mantle layer, whereas here the model uses an English sense of “mantle”: “an important role or responsibility that passes from one person to another”. The generated meaning translates as “Metaphor for responsibility or burden”, and the examples follow that sense.

D.8 Japanese

Model: gpt-4o, Word: withdrawal / 引き出し
Model Response: "5. 戦略的に後退または撤退すること": [ "彼はすぐに話題から引き出した。", "部隊は安全のために戦線から引き出された。", "彼らは計画から引き出すことを決定した。" ]
Explanation: The Japanese word means only a drawer or the act of withdrawal, while the generated meaning translates as “to retreat or withdraw strategically”, with examples in that sense. This is a direct spilling from the English word.

Model: gpt-4o-mini, Word: bank / 銀行
Model Response:
"川の土手や岸辺": ( "春になると、銀行に桜の花が咲きます。", "子供たちは銀行で遊ぶのが大好きです。", "銀行沿いの遊歩道を散歩しました。" ),
"データの保管場所": ( "このデータは専用の銀行に保存されています。", "情報セキュリティのため、銀行は暗号化されています。", "分析用のデータ銀行から情報を取得しました。" ),
Explanation: In Japanese, this word means only the financial institution, while the model used the meanings of a river bank and a data bank (a term from informatics). The generated meanings translate as “River banks and shorelines” and “data storage locations”, respectively.
D.9 Hindi

Model: gpt-4o, Word: space / अंतरिक्ष
Model Response:
"स्थान/जगह": [ "कृपया अपनी किताबें इस अलमारी में व्यवस्थित करके रखें ताकि और अधिक अंतरिक्ष बच सके।", "हमारे घर में मेहमानों के लिए कोई अतिरिक्त अंतरिक्ष नहीं है।", "इस पार्किंग स्थल में आपकी गाड़ी के लिए पर्याप्त अंतरिक्ष है।" ],
"अंतराल": [ "संगीत के इस अंतरिक्ष के दौरान दर्शक तालियाँ बजाते हैं।", "इंटरव्यू के अंतरिक्ष में उम्मीदवारों ने अपने अनुभव साझा किए।", "अध्याय के विषयों के बीच एक अंतरिक्ष होना चाहिए ताकि पाठ समझने में आसानी हो।"
"माध्यमिक क्षेत्र या स्थिति": [ "उसके निर्णय ने उसे उसके करियर में एक नए अंतरिक्ष में पहुँचाया।", "इस वैश्विक समस्या को देखते हुए अब हमें पर्यावरण संरक्षण के एक नए अंतरिक्ष पर विचार करना होगा।", "समाज में महिलाओं की बढ़ती भूमिका एक सकारात्मक अंतरिक्ष में बदल रही है।" ]
Explanation: The first meaning refers to physical space, such as the space inside a room or ordinary spatial gaps. The second meaning represents a gap or interval, which the English word “space” can also convey, but in Hindi it is expressed using words like अंतराल or विराम, not अंतरिक्ष. The third meaning refers to an abstract domain or sphere (for example, political space or creative space), meaning a new domain or area of activity. In Hindi, however, the word अंतरिक्ष refers only to cosmic space, the vast void of the universe beyond the Earth, and should not be used for these other meanings. This example shows how a single word can produce a range of different meanings taken from English.
E Dictionary Sources and Validation

We employed authoritative lexicographic sources for each target language to provide robust reference frameworks for semantic validation:

German: DWDS (Digitales Wörterbuch der deutschen Sprache) - Digital Dictionary maintained by the Berlin-Brandenburg Academy of Sciences (https://www.dwds.de/)
French: Larousse - Leading French lexicographic publisher with comprehensive coverage (https://www.larousse.fr/)
Portuguese: Dicio - Comprehensive Portuguese dictionary (https://www.dicio.com.br/)
Spanish: WordReference - Multilingual dictionary platform with extensive Spanish coverage (http://wordreference.com/)
English: WordReference - Multilingual dictionary platform for English validation (http://wordreference.com/)
Italian: Corriere Dizionari - Authoritative Italian dictionary by Corriere della Sera (https://dizionari.corriere.it/)
Chinese: ZDIC - Comprehensive Chinese character and word dictionary (https://www.zdic.net/)
Japanese: Kotobank - Major Japanese dictionary aggregator (https://kotobank.jp/)
Russian: Wiktionary Russian - Community-maintained multilingual dictionary (https://ru.wiktionary.org/)
Hindi: Wiktionary Hindi - Comprehensive Hindi dictionary (https://hi.wiktionary.org/)

These sources serve as principled baselines for comparative evaluation across multilingual models.

F Top-100 Highly Polysemous Words

Our benchmark consists of the following 100 highly polysemous English words: head, line, point, case, base, center, field, lead, stock, form, position, order, service, key, bar, plate, slip, life, hand, end, section, round, spot, pound, face, figure, job, hall, piece, stone, opening, band, twist, way, part, thing, house, side, rule, balance, division, release, defense, ball, extension, crown, stroke, shaft, man, number, body, control, book, voice, court, land, top, foot, issue, step, movement, bit, medium, game, union, sign, exchange, operation, resistance,
corner, relief, frame, resolution, track, card, variation, score, split, strain, defence, wing, grain, draft, carrier, bull, pit, horn, pin, time, day, second, study, word, heart, view, action, front, king, force, review.

G Human Evaluation Instructions

The following instructions were provided to human evaluators for the semantic validation task. All participants provided informed consent for their data to be used for research purposes and model evaluation.

G.1 Task 1: Semantic Validation

1. Select the <language> and add the labels.
2. Mark each label as true if the meaning is correct, or false if it’s not.
3. Provide the results as a string of t (true) and f (false), e.g., tfttf.
4. Access the evaluation interface at: <url>

G.2 Task 2: Dictionary Review

1. Select the <language> and review the dictionary entries.
2. If you feel any definitions are incomplete or missing important meanings, you should add more.
3. You can check online dictionaries or other reliable sources to find additional meanings.
4. Copy and paste the additional meanings in the <language> from the online sources into the form.
5. Access the dictionary interface at: <url>
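
The label format requested in G.1 (a string of t and f, e.g., tfttf) lends itself to simple automated checking. The following is a minimal sketch of how such strings might be parsed and scored, not the authors' pipeline. The function names `parse_labels` and `invalid_fraction` and the constant `EXPECTED_MEANINGS` are hypothetical, and the default of five labels per word is an assumption based on the task asking models for five meanings.

```python
from typing import List

# Assumption: one label per generated meaning, five meanings per word.
EXPECTED_MEANINGS = 5


def parse_labels(labels: str, expected: int = EXPECTED_MEANINGS) -> List[bool]:
    """Convert a label string such as 'tfttf' into booleans (t -> True, f -> False)."""
    cleaned = labels.strip().lower()
    if len(cleaned) != expected:
        raise ValueError(f"expected {expected} labels, got {len(cleaned)}")
    unknown = set(cleaned) - {"t", "f"}
    if unknown:
        raise ValueError(f"unexpected label characters: {sorted(unknown)}")
    return [ch == "t" for ch in cleaned]


def invalid_fraction(flags: List[bool]) -> float:
    """Proportion of meanings marked false (hypothetical per-word score)."""
    if not flags:
        raise ValueError("no labels to score")
    return flags.count(False) / len(flags)


if __name__ == "__main__":
    flags = parse_labels("tfttf")
    print(flags)                    # [True, False, True, True, False]
    print(invalid_fraction(flags))  # 0.4
```

Rejecting malformed strings early (wrong length or stray characters) keeps annotation typos from silently shifting the proportion of invalid meanings.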