Unlocking Multilingual Reasoning Capabilities of LLMs and LVLMs via Representation Engineering
Summary
This paper introduces MRRE (Multilingual Reasoning via Representation Engineering), a training-free inference-time method to enhance multilingual reasoning in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs). It addresses the performance gap between English and low-resource languages without requiring additional training data or tools. MRRE works by injecting two precomputed vectors during inference: cross-lingual reasoning enhancement vectors, which align non-English hidden states with English representations to improve reasoning, and target-language output anchoring vectors, which restore the target language's output distribution to maintain input-output consistency. Experiments on six advanced LLMs and LVLMs across four reasoning benchmarks show MRRE improves non-English reasoning by an average of 5.48%, with gains up to 7.54% in low-resource languages like Thai and Swahili, while enhancing language consistency by 3.78%. The method is effective across textual and vision-language tasks, diverse models, and out-of-distribution datasets, demonstrating strong generalizability.
Unlocking Multilingual Reasoning Capabilities of LLMs and LVLMs via Representation Engineering

Qiming Li¹, Xiaocheng Feng¹·², Yixuan Ma¹, Zekai Ye¹, Ruihan Chen¹, Xiachong Feng³, Bing Qin¹·²
¹Harbin Institute of Technology ²Peng Cheng Laboratory ³The University of Hong Kong
qmli@ir.hit.edu.cn

Abstract

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward the English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.

1 Introduction

With the rapid development of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), foundational models such as Llama-3.1-8B-Instruct (Grattafiori et al., 2024) and Qwen2.5-VL-7B-Instruct (Team, 2025) have demonstrated impressive capabilities in complex reasoning.
However, high-resource languages such as English exhibit substantially stronger reasoning performance than low-resource languages, raising concerns about fairness in multilingual applications under global deployment.

Figure 1: MRRE adopts a two-stage intervention strategy to unlock multilingual reasoning capabilities.

To address the above issues, prior works primarily focus on two directions to enhance non-English reasoning capabilities: (1) Data-driven training methods, which align multilingual embeddings (Arora et al., 2024) or construct multilingual reasoning datasets for instruction tuning (Fan et al., 2025), but inevitably depend on expensive data and incur considerable computational costs. (2) Prompting-based methods, which rely on external translation tools or models (Khandelwal et al., 2024; Liu et al., 2024), but are highly sensitive to translation quality and prompt design, and incur high latency. Moreover, existing methods are rarely effective across both LLMs and LVLMs. Therefore, a unified and efficient paradigm for enhancing multilingual reasoning across both LLMs and LVLMs remains underexplored.

Prior studies (Zhao et al., 2024; Tang et al., 2025; Li et al., 2025a,b) have revealed the internal mechanism of multilingual reasoning: hidden states are transformed into high-resource language representations (e.g., English) in early layers, then exploited for reasoning from middle to later layers, and finally restored to target-language features in late layers. However, as shown in Figure 2, we observe that hidden states in reasoning-related layers of both LLMs and LVLMs still exhibit
significant differences between English and non-English inputs.

arXiv:2511.23231v1 [cs.CV] 28 Nov 2025

Figure 2: t-SNE hidden state visualization and reasoning performance of Qwen2.5-7B-Instruct and Qwen2.5-VL-7B-Instruct. Reasoning capability in English is substantially stronger than in other languages.

This consistent cross-architecture phenomenon motivates us to steer the distribution of non-English hidden states toward English, endowing both LLMs and LVLMs with English-level reasoning capabilities under a general framework. As shown in Figure 1, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during the forward pass: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English distributions to strengthen reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency.

Experimental results of six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.

2 Related Works

2.1 Multilingual Foundation Models
To address the multilingual demands of real-world global applications, recent studies have extended foundation models to multilingual models. Advanced LLMs like Qwen3-8B (Yang et al., 2025) and Llama-3.1-8B-Instruct (Grattafiori et al., 2024) are pretrained and instruction-tuned on massive multilingual data, enabling them to respond across diverse languages. Furthermore, advanced LVLMs like Qwen2.5-VL-7B-Instruct (Team, 2025) and InternVL3.5-8B-Chat (Wang et al., 2025a) integrate visual encoders with these language backbones, thereby inheriting comparable multilingual capabilities. However, the unbalanced distribution of training data across languages results in a significant performance disparity between high-resource languages (e.g., English) and low-resource languages (e.g., Swahili), raising fairness concerns.

2.2 Multilingual Reasoning Enhancement

To enhance multilingual reasoning, previous research can be divided into two categories: (1) Data-driven training methods enhance multilingual reasoning by aligning cross-lingual representations or fine-tuning with multilingual supervision, including contrastive alignment (Li et al., 2023; Huang et al., 2024; Arora et al., 2024) and reasoning-specific instruction tuning (Zhang et al., 2024a; Geng et al., 2024; Lai and Nissim, 2024; Fan et al., 2025). Although effective, these approaches incur substantial data and computational costs. (2) Prompting methods mitigate language imbalance without parameter updates, including through direct multilingual inputs (Sakai et al., 2024; Khandelwal et al., 2024), pivot-language translation (Liu et al., 2024), or diverse chain-of-thought prompting (Wang et al., 2024), although
their effectiveness remains sensitive to translation quality and prompt design and incurs high latency. In contrast to previous work, MRRE is the first training-free inference-time method based on representation engineering that uses no additional data or tools, which clearly distinguishes it from existing methods.

3 Methods

3.1 Task Formulation

We restrict our scope to models based on the auto-regressive Transformer architecture (Vaswani et al., 2017), as it is adopted by most SOTA LLMs and LVLMs. The input sequence of LLMs and LVLMs is processed through $L$ Transformer layers of the language decoder, each consisting of multi-head self-attention (MHSA), a feed-forward network (FFN) that is usually a multilayer perceptron (MLP), and a residual stream between the components. The hidden state $h^{(l)} \in \mathbb{R}^d$ for token $t$ at layer $l$ under input sequence $R(x)$ can be computed from the previous layer:

$$h^{(l)}(R(x), t) = h^{(l-1)}(R(x), t) + a^{(l)} + m^{(l)}, \quad (1)$$

where $a^{(l)}$ and $m^{(l)}$ are the outputs of the MHSA and FFN components at layer $l$. Finally, the model predicts the next token in an auto-regressive manner based on the hidden state of the last layer.

In this paper, to bridge the significant performance gap between English and non-English languages in reasoning tasks, we propose a representation engineering approach that applies a two-stage hierarchical intervention on the hidden states of specific layers within the language decoder, unlocking non-English reasoning capabilities while preserving input-output language consistency.

3.2 Cross-Lingual Reasoning Enhancement

Prior studies (Zhao et al., 2024; Tang et al., 2025) have shown that the
mid-layer hidden states play a critical role in shaping the latent representation of reasoning. Furthermore, as shown in Figure 2, we observe that the hidden states from middle to deeper layers under different languages exhibit significant differences in the latent space across both LLMs and LVLMs. These findings motivate us to propose a representation engineering strategy that aligns the reasoning capability of non-English languages with English, which exhibits stronger reasoning performance.

Since LLMs and LVLMs generate tokens in an auto-regressive manner, we focus on the hidden state of the last token, which aggregates the most comprehensive visual and textual information. To precisely estimate the reasoning enhancement direction for hidden states, we define cross-lingual reasoning enhancement vectors, which align the hidden states of the reasoning chain in the target non-English language with the stronger reasoning chain in English. These vectors are computed by comparing the last token's hidden states on a set of reasoning problems $\mathcal{X}$. For each problem $x \in \mathcal{X}$, we feed the model both the English prompt and the parallel non-English prompt, generating the corresponding reasoning responses $R_E(x)$ and $R_T(x)$. We define a difference vector $\Delta h^{(l)}_x$ to estimate the latent difference between English and non-English:

$$\Delta h^{(l)}_x = h^{(l)}(R_E(x), t_{\text{last}}) - h^{(l)}(R_T(x), t_{\text{last}}), \quad (2)$$

The cross-lingual reasoning enhancement vector at the $l$-th layer, $v^{(l)}_r$, is then computed as the mean of these difference vectors over the entire set $\mathcal{X}$:

$$v^{(l)}_r = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \Delta h^{(l)}_x. \quad (3)$$

By steering the target-language hidden states along $v^{(l)}_r$, we shift the non-English reasoning chain
to a stronger reasoning chain shaped by English. This intervention enables the model to exhibit English-level reasoning capability when answering questions in the non-English language.

3.3 Target-Language Output Anchoring

However, merely applying the cross-lingual enhancement vectors may compromise input-output language consistency. When the steered hidden states pass through the layers responsible for constructing the output language representation, the model processes them as if they were English inputs, resulting in English outputs rather than the target language. To address this issue, we propose target-language output anchoring vectors, which guide the model to steer the English-like output distribution toward the target non-English language in the layers responsible for constructing the output language representation, thereby ensuring input-output language consistency. The target-language output anchoring vector at the $l'$-th layer, $v^{(l')}_a$, is computed by taking the difference between the last-token representations of two fixed language-forcing
Figure 3: An overview of our proposed MRRE method. Each rectangle represents the model's hidden state during the forward pass. MRRE consists of three key stages: a) Cross-Lingual Reasoning Enhancement Vectors (§3.2) are derived from the hidden-state differences between English and non-English reasoning responses. b) Target-Language Output Anchoring Vectors (§3.3) are derived from the hidden-state differences between non-English and English language-forcing prompts. c) Hierarchical Inference-Time Intervention (§3.4): precomputed vectors are sequentially injected into the last-token representations at specific layers during the forward pass, thereby enhancing non-English reasoning capabilities while preserving input-output language consistency.
prompts: $P_T$ ("Please answer this question in <target language>" translated into <target language>) and $P_E$ ("Please answer this question in English"):

$$v^{(l')}_a = h^{(l')}(P_T, t_{\text{last}}) - h^{(l')}(P_E, t_{\text{last}}). \quad (4)$$
This vector provides a precise estimate of the hidden-state distribution shift from English to the target language, enabling the model to recover the target-language output distribution in later layers and ensuring input-output language consistency.
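Both steering vectors defined above reduce to differences of last-token hidden states: a mean difference over a problem set for $v^{(l)}_r$ (Eqs. 2-3) and a single prompt-pair difference for $v^{(l')}_a$ (Eq. 4). A minimal numpy sketch, using toy random arrays as stand-ins for real model activations (the shapes and variable names here are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_problems = 16, 8  # toy hidden size and |X|

# Hypothetical stand-ins for the model's last-token hidden states at layer l:
# one row per reasoning problem x, for the English response R_E(x) and the
# parallel target-language response R_T(x).
h_en = rng.normal(size=(n_problems, d))
h_tgt = rng.normal(size=(n_problems, d))

# Eqs. (2)-(3): cross-lingual reasoning enhancement vector = mean difference.
v_r = (h_en - h_tgt).mean(axis=0)

# Eq. (4): target-language output anchoring vector from the last-token states
# of the two language-forcing prompts P_T and P_E at layer l'.
h_pt = rng.normal(size=d)
h_pe = rng.normal(size=d)
v_a = h_pt - h_pe
```

By construction, adding `v_r` to the mean target-language state recovers the mean English state, which is exactly the alignment direction the method exploits.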
3.4 Hierarchical Inference-Time Intervention

Considering the autoregressive decoding mechanism of language models, we propose a two-stage, hierarchical inference-time intervention method. First, we apply the cross-lingual enhancement vectors to the last token of the current reasoning response at a middle layer, steering the non-English hidden state $h^{(l)}$ to an English-like hidden state $\tilde{h}^{(l)}$:

$$\hat{h}^{(l)} = h^{(l)} + \alpha_1 \cdot v^{(l)}_r, \qquad \tilde{h}^{(l)} = \hat{h}^{(l)} \cdot \frac{\|h^{(l)}\|_2}{\|\hat{h}^{(l)}\|_2}, \quad (5)$$

where $\alpha_1$ denotes a scaling coefficient and $\|\cdot\|_2$ denotes the $\ell_2$ norm of the activation vectors.
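The norm-preserving update of Eq. (5) can be sketched in a few lines of numpy; the arrays below are toy stand-ins for real activations, and `inject` is an illustrative helper name rather than anything from the paper's code:

```python
import numpy as np

def inject(h, v, alpha):
    """Add a steering vector, then rescale the result back to the
    original l2 norm, as in Eq. (5) (and Eq. (6) for the anchoring vector)."""
    h_hat = h + alpha * v
    return h_hat * (np.linalg.norm(h) / np.linalg.norm(h_hat))

rng = np.random.default_rng(0)
h = rng.normal(size=16)    # toy last-token hidden state h^(l)
v_r = rng.normal(size=16)  # toy precomputed enhancement vector v_r^(l)

h_tilde = inject(h, v_r, alpha=1.0)

# The intervention changes the direction of the hidden state
# but preserves its magnitude.
print(np.isclose(np.linalg.norm(h_tilde), np.linalg.norm(h)))  # True
```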
The normalization strategy ensures that the vector scale remains consistent before and after the intervention, preventing undesired magnitude shifts that may distort downstream representations.

The model then exhibits English-level reasoning capability from middle to later layers, as if processing an English problem, producing English output hidden states $h^{(l')}$ at the $l'$-th layer. We then apply the target-language output anchoring vectors to steer the English output hidden state back to the target $\tilde{h}^{(l')}$:

$$\hat{h}^{(l')} = h^{(l')} + \alpha_2 \cdot v^{(l')}_a, \qquad \tilde{h}^{(l')} = \hat{h}^{(l')} \cdot \frac{\|h^{(l')}\|_2}{\|\hat{h}^{(l')}\|_2}, \quad (6)$$

where $\alpha_2$ is the corresponding scaling coefficient. Finally, the newly generated token is appended to the current input sequence. In the middle layers, the hidden state of the updated input sequence can be "translated" into English, activating stronger reasoning capabilities under English-like states; in later layers, it can be "back-translated" to the target language, ensuring input-output language consistency. This information flow continues until the end-of-sequence token is produced.

4 Experiments

4.1 Experimental Setup

Baseline Models. We evaluate our proposed MRRE method on six SOTA models to demonstrate its broad applicability across LLMs and LVLMs.

• LLMs: Qwen2.5-7B-Instruct (Team, 2024), Qwen3-8B (Yang et al., 2025), and Llama-3.1-8B-Instruct (Grattafiori et al., 2024).
Model (LLMs) | MGSM: En Zh Ja Es Ru Fr De Th Sw LC | MSVAMP: En Zh Ja Es Ru Fr De Th Sw LC
Qwen2.5-7B-Instruct | 88.8 78.4 65.2 81.2 82.0 77.2 74.8 71.6 12.4 84.7 | 91.3 88.1 83.6 89.6 85.7 88.4 86.4 79.0 44.9 86.3
+ Language forcing | – 77.2 67.6 79.2 82.0 76.0 76.8 71.6 16.0 92.2 | – 86.2 82.0 88.8 85.1 87.8 86.4 79.6 45.7 89.4
+ MRRE | – 81.6 71.2 81.2 84.7 79.2 78.4 74.8 17.2 92.7 | – 88.7 84.7 90.0 86.5 90.1 87.4 82.6 52.3 90.3
Qwen3-8B | 90.8 80.4 76.4 83.2 88.4 82.4 84.0 85.6 40.4 89.3 | 91.9 88.1 87.4 90.7 87.0 90.1 89.1 83.1 67.8 89.4
+ Language forcing | – 82.0 76.8 84.0 88.8 79.6 82.0 84.0 37.6 93.2 | – 88.6 86.8 90.0 87.7 89.5 88.7 83.1 62.3 91.8
+ MRRE | – 84.7 81.4 84.2 88.8 84.1 85.5 86.0 45.8 95.6 | – 89.1 88.2 90.9 88.3 90.3 89.4 85.4 73.4 92.5
Llama-3.1-8B-Instruct | 77.2 61.2 38.0 64.8 53.2 63.6 67.2 54.4 44.8 95.3 | 78.4 65.7 53.8 71.4 56.6 76.4 72.9 55.8 50.8 94.9
+ Language forcing | – 57.6 44.8 63.6 62.4 68.0 66.4 53.2 45.6 95.1 | – 58.5 59.0 71.5 66.0 73.0 69.9 56.5 50.5 95.1
+ MRRE | – 66.6 47.7 68.7 64.5 70.0 71.2 64.3 53.9 96.6 | – 70.6 64.3 73.2 65.4 76.4 74.2 63.9 57.6 96.4

Model (LVLMs) | MMathVerse: En Zh Ja Es Ru Fr De Th Sw LC | MMathVista: En Zh Ja Es Ru Fr De Th Sw LC
LLaVA-Onevision-7B | 29.9 28.9 25.8 29.2 26.2 26.8 25.3 23.5 11.3 76.9 | 61.9 51.1 43.7 58.0 49.2 55.8 48.0 41.7 16.3 75.0
+ Language forcing | – 26.6 22.7 25.9 22.9 20.3 22.0 26.3 4.0 77.8 | – 44.6 38.6 45.1 41.1 41.8 37.1 26.7 10.8 78.4
+ MRRE | – 32.9 30.0 30.0 28.2 30.0 29.5 26.5 19.1 77.8 | – 52.3 48.2 57.0 55.7 56.6 54.7 48.3 32.1 80.3
Qwen2.5-VL-7B-Instruct | 45.4 38.4 36.4 42.5 39.6 38.1 36.5 32.0 16.4 92.5 | 67.2 58.1 52.0 62.2 57.9 59.6 58.2 51.1 16.7 94.1
+ Language forcing | – 36.8 37.7 43.6 40.6 39.6 40.0 43.5 17.8 91.8 | – 60.0 53.0 61.2 58.3 59.4 56.9 53.7 10.1 94.4
+ MRRE | – 45.9 44.6 47.5 44.6 45.8 45.7 48.4 33.8 93.1 | – 62.3 61.1 64.2 62.8 61.2 64.1 57.3 28.6 95.5
InternVL3.5-VL-8B-Chat | 57.6 52.3 48.9 52.3 49.0 48.5 46.6 34.4 32.3 76.3 | 72.2 59.7 60.2 67.9 68.7 67.9 68.7 62.5 40.6 79.6
+ Language forcing | – 54.7 50.8 51.4 49.4 49.1 49.8 35.5 36.3 84.2 | – 62.4 61.1 64.1 62.9 61.5 64.1 57.8 34.8 85.6
+ MRRE | – 55.6 51.2 53.3 50.1 50.3 52.3 44.6 42.3 81.9 | – 63.5 66.9 70.7 70.5 70.1 69.2 66.3 46.6 86.4

Table 1: Accuracy (%) and Language Consistency (LC, %) of three advanced LLMs and three advanced LVLMs with different settings across 8 languages (Zh, Ja, Es, Ru, Fr, De, Th, Sw) and 4 reasoning benchmarks: MGSM, MSVAMP, MMathVerse, MMathVista. Best performances for each experimental setting are bolded.

Model | MMathVerse: T-D T-L V-I V-D V-O Overall | MMathVista: General Math Overall
Qwen2.5-VL-7B | 39.9 35.1 32.1 32.5 34.6 34.9 | 53.5 50.7 52.0
+ Language forcing | 44.0 37.2 34.9 35.1 37.0 37.7 | 52.6 50.8 51.6
+ MRRE | 51.2 (+11.3) 45.8 (+10.7) 42.2 (+10.1) 41.2 (+8.7) 41.9 (+7.3) 44.5 (+9.6) | 55.5 (+2.0) 58.5 (+7.8) 57.1 (+5.1)
LLaVA-Onevision-7B | 27.5 28.4 24.8 22.6 22.4 24.6 | 44.7 46.1 45.5
+ Language forcing | 28.4 25.7 24.2 19.8 8.5 29.3 | 21.3 41.0 35.8
+ MRRE | 37.4 (+9.9) 32.1 (+3.7) 30.9 (+6.1) 28.3 (+5.7) 21.2 (-1.2) 30.0 (+5.4) | 48.6 (+3.9) 52.3 (+6.2) 50.6 (+5.1)
InternVL3.5-8B | 48.3 41.4 33.3 39.4 37.5 40.1 | 62.0 61.0 61.5
+ Language forcing | 47.8 40.3 36.1 38.1 41.7 40.8 | 60.8 61.8 61.4
+ MRRE | 53.0 (+4.7) 47.2 (+5.8) 41.7 (+8.4) 44.3 (+4.9) 48.0 (+10.5) 46.8 (+6.7) | 65.3 (+3.3) 64.0 (+3.0) 63.8 (+2.3)

Table 2: Mean performance (%) of three advanced LVLMs with diverse experimental settings across 8 languages (Zh, Ja, Es, Ru, Fr, De, Th, Sw) and 2 math-related benchmarks: MMathVerse (Text-Dominant, Text-Lite, Vision-Integrated, Vision-Dominant, and Vision-Only categories, plus Overall) and MMathVista (General and Math categories, plus Overall). Numbers in green denote absolute improvements of MRRE over the baseline.
• LVLMs: Qwen2.5-VL-7B-Instruct (Team, 2025), LLaVA-OneVision (Li et al., 2024), and InternVL3.5-8B-Chat (Wang et al., 2025a).

Benchmarks. We select four challenging benchmarks to evaluate multilingual reasoning of both LLMs and LVLMs, covering both mathematical and general reasoning capabilities.

• MGSM (Shi et al., 2022): A multilingual benchmark designed to evaluate mathematical reasoning capabilities of LLMs. It contains 250 grade-school math problems from the English GSM8K dataset, professionally translated into diverse non-English languages.

• MSVAMP (Chen et al., 2023): A multilingual challenge set created from the original English SVAMP dataset, designed to expose LLMs that rely on shallow, unreliable shortcuts.

• MMathVerse, MMathVista: To evaluate reasoning of LVLMs, we construct multilingual versions of MathVerse (Zhang et al., 2024b) and MathVista (Lu et al., 2023). English datasets are translated into target non-English languages using Google Translate v2. Translation quality is evaluated via human checking and back-translation. See Appendix A for more details.

We adopt Accuracy as the evaluation metric. Language Consistency (LC) denotes the proportion of responses generated in the target language.

Implementation Details. We randomly sample 100 instances from MGSM and MathVerse to construct MRRE vectors for LLMs and LVLMs, respectively. All experiments are conducted on 8 × NVIDIA A100 80GB GPUs. See Appendix B for details.

4.2 Main Results

Based on the experimental results presented in Tables 1 and 2, we can draw the following key conclusions:

(1) Effective multilingual reasoning enhancement
performance. MRRE achieves effective performance across high-resource and low-resource non-English languages on LLMs and LVLMs, leading to an average improvement of 5.48% on four benchmarks, while improving input-output language consistency by 3.78%. Notably, improvements are significantly larger in low-resource languages (Th, Sw), at 7.54%.

(2) Multimodal generalizability. MRRE consistently improves performance across purely textual and vision-language reasoning tasks. Notably, results on MMathVerse demonstrate that MRRE enhances reasoning in all five categories, including Text-Dominant, Text-Lite, Vision-Integrated, Vision-Dominant, and Vision-Only tasks, proving MRRE's effectiveness regardless of modality composition.

(3) Cross-model generalizability. By systematically evaluating diverse LLM and LVLM backbones, we find that MRRE does not rely on a particular model architecture. Instead, MRRE benefits from universal cross-lingual differences in hidden states, ensuring broad applicability across other SOTA open-source models.

(4) Cross-dataset generalizability. Although the construction of MRRE vectors relies on reasoning data from MGSM and MMathVerse, these vectors remain effective on out-of-distribution datasets such as MSVAMP and MMathVista, suggesting that the intervention represents a generalizable direction rather than one merely tailored to a specific dataset. Moreover, the improvement on the MMathVista General subset indicates that MRRE not only enhances mathematical reasoning but also strengthens general reasoning capabilities.

5 Analysis and Discussions

5.1 Analysis of Intervention

To thoroughly visualize the effect of
MRRE on hidden states, we conduct a stepwise analysis along the information flow of token generation.

Figure 4: Kernel Density Estimate (KDE) visualization of cross-lingual hidden states within Qwen2.5-7B-Instruct before and after the two types of intervention. The x-axis represents the SVM-derived signed distance to the mean English representation; the y-axis represents the estimated probability density.

(1) The model first receives non-English input and undergoes the understanding forward pass until the initial reasoning stage. As shown in Figure 4 (a), we observe a significant difference between English and non-English representations in the latent space. Moreover, since Qwen2.5-7B-Instruct is trained on large amounts of English and Chinese data, the hidden states of these two high-resource languages are much more densely clustered than the others. (2) Under the intervention of the cross-lingual enhancement vectors, as shown in Figure 4 (b), non-English hidden states shift closer to English. (3) The English-like states then undergo the reasoning forward pass until the initial generation stage, which is critical for language outputs. At this stage, the model reasons as if it were processing an English query, ultimately producing English-like representations. As illustrated in Figure 4 (c), English and non-English hidden states become highly aligned. (4) Under the intervention of the target-language output anchoring vectors, as shown in Figure 4 (d), the hidden states restore the standard output distribution of the
target language, ensuring input-output language consistency. (5) Finally, the anchored hidden states continue the forward pass until the next token is generated in the target language. This lifetime visualization of token generation demonstrates that MRRE achieves the foundational motivation we set forth.

Model | Zh Ja Es Ru Fr De Th Sw
Qwen2.5-7B as Language Backbone
Qwen2.5-VL-7B | 58.1 52.0 62.2 57.9 59.6 58.2 51.1 16.7
+ MRRE Vanilla | 62.3 61.1 64.2 62.8 61.2 64.1 57.3 28.6
+ MRRE L-B | 58.9 55.7 62.0 59.6 61.4 62.7 54.9 33.1
Qwen3-8B as Language Backbone
InternVL3.5-8B | 59.7 60.2 67.9 68.7 67.9 68.7 62.5 40.6
+ MRRE Vanilla | 63.5 66.9 70.7 70.5 70.1 69.2 66.3 46.6
+ MRRE L-B | 61.2 62.1 68.5 69.0 68.8 69.4 64.5 49.4

Table 3: Performance (%) on the MathVista benchmark across languages. Vanilla represents the standard MRRE method, and L-B represents the MRRE method using vectors from the corresponding LVLM's Language Backbone.

5.2 Cross-modal Generalization of Vectors

In this subsection, we thoroughly explore the cross-modal generalization of MRRE vectors. LVLMs are trained from a Language Backbone (L-B) LLM jointly with a specific visual encoder, which results in high similarity between their output hidden states in the latent space. Furthermore, current language-backbone LLMs exhibit stronger reasoning capabilities than the corresponding LVLMs (Chen et al., 2025). Building on these two observations, we hypothesize that cross-lingual reasoning enhancement vectors derived from language backbones can also enhance the reasoning capability of LVLMs. To validate this hypothesis, we conduct experiments with Qwen2.5-VL-7B-Instruct and InternVL3.5-8B, replacing the cross-lingual reasoning enhancement vectors with vectors derived from their language backbones, Qwen2.5-7B-Instruct and Qwen3-8B. As
shown in Table 3, MRRE L-B consistently improves reasoning performance across all non-English languages. Remarkably, MRRE L-B even surpasses MRRE Vanilla in French and Swahili. These findings indicate that vectors from language backbones generalize well to LVLMs, highlighting MRRE's cross-modal generalization capability. They also suggest that LVLMs inherit multilingual reasoning capabilities from their language-backbone LLMs to some extent, leading to similar multilingual shift directions.

Model | Zh Ja Es Ru Fr De Th Sw
LLM on MGSM
Qwen2.5-7B | 78.4 65.2 81.2 82.0 77.2 74.8 71.6 12.4
+ MRRE Vanilla | 81.6 71.2 81.2 84.7 79.2 78.4 74.8 17.2
+ MRRE Debias | 83.2 68.8 81.2 83.6 77.2 77.2 76.4 19.2
LVLM on MMathVista
Qwen2.5-VL-7B | 58.1 52.0 62.2 57.9 59.6 58.2 51.1 16.7
+ MRRE Vanilla | 62.3 61.1 64.2 62.8 61.2 64.1 57.3 28.6
+ MRRE Debias | 54.3 48.0 56.6 54.4 54.8 56.1 48.5 11.9

Table 4: Performance (%) on the MGSM and MMathVista benchmarks across languages. Vanilla represents the standard MRRE method, and Debias represents the MRRE method using latent English debiasing vectors.

5.3 Mitigating English Bias in Latent Space

In this subsection, we further explore how to enhance multilingual reasoning capabilities during the understanding forward pass. A prior study (Zhao et al., 2024) indicates that hidden states encode low-level semantic representations and exhibit strong linguistic characteristics at this stage. Since a substantial portion of the training data is in English, when the model processes non-English queries, the hidden-state distribution during the forward pass tends to shift toward the default English distribution. We hypothesize these
shifts in early layers may distort the encoded low-level semantic representations, leading to performance drops. To address this issue, we introduce an alternative multilingual reasoning enhancement method, which leverages the cross-lingual enhancement vectors in the reversed direction. We denote these vectors as the latent English debiasing vectors, designed to mitigate English bias in the latent space during the understanding forward pass:

\[ v^{(l)}_{\text{debias}} = -\frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \Delta h^{(l)}_{x}. \tag{7} \]

To validate this hypothesis, we conduct experiments with Qwen2.5-7B-Instruct and Qwen2.5-VL-7B-Instruct, applying the latent English debiasing vectors to each model. As illustrated in Table 4, MRRE Debias consistently improves the cross-lingual performance of Qwen2.5-7B-Instruct across all evaluated languages, and even surpasses MRRE Vanilla on Chinese, Thai, and Swahili. However, when applied to Qwen2.5-VL-7B-Instruct, MRRE Debias exhibits a consistent performance decline. Following a prior study (Ye et al., 2025), we posit that LVLMs rely predominantly on English data during the image-text alignment stage of pre-training, which endows English with stronger multimodal representations. Consequently, injecting latent English debiasing vectors into non-English hidden states disrupts these multimodal representations, leading to performance drops.

Figure 5: Case study of Qwen2.5-VL-7B-Instruct on the MathVerse benchmark.
Figure 6: Analysis of layer selections of MRRE.

5.4 Analysis of Hyperparameters
This subsection systematically examines the influence of the intervention layers layer1 l and layer2 l′ for the two designed vectors,
respectively. As shown in Figure 6, applying the first MRRE vector near layer 20 yields the best improvement in the model's reasoning ability. Given the first intervention at layer 20, applying the second vector near layer 23 achieves the next optimal boost. LVLMs exhibit a similar pattern, with the optimal layers being 22 and 24, respectively.

Model            Zh    Ja    Es    Ru    Fr    Th    Sw    LC
Qwen2.5-7B       78.4  65.2  81.2  82.0  77.2  71.6  12.4  84.7
+ MRRE Vector1   83.2  73.7  82.3  85.2  80.1  75.2  19.2  26.2
+ MRRE Vanilla   81.6  71.2  81.2  84.7  79.2  74.8  17.2  92.7
Qwen2.5-VL-7B    58.1  52.0  62.2  57.9  59.6  51.1  16.7  94.1
+ MRRE Vector1   63.4  62.3  64.9  64.2  63.9  62.5  35.4  30.0
+ MRRE Vanilla   62.3  61.1  64.2  62.8  61.2  57.3  28.6  95.5
Table 5: Ablation studies of MRRE vectors. LC denotes Language Consistency (%).

As shown in Table 5, the second vector acts as a trade-off: it slightly decreases reasoning performance but significantly increases language consistency. Furthermore, we find that MRRE achieves optimal results when α1 = 1 and α2 = 0.75. See Appendix B.3 for further ablation studies of the intervention strengths α1 and α2.

5.5 Case Study
As shown in Figure 5, MRRE enhances non-English reasoning capability while maintaining input-output language consistency. See Appendix C for more examples.

6 Conclusion
We propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE), which applies cross-lingual reasoning enhancement vectors and target-language output anchoring vectors sequentially at specific layers of LLMs and LVLMs during the forward pass. Comprehensive results demonstrate its effectiveness and generalizability.

7 Limitations
One limitation of our work
is that MRRE requires access to the internal representations of the model, making it infeasible for closed-source LLMs or LVLMs. Furthermore, due to constraints of cost and resources, we only conduct experiments on the widely used benchmarks and models described above.

References
Gaurav Arora, Srujana Merugu, Shreya Jain, and Vaibhav Saxena. 2024. Towards robust knowledge representations in multilingual llms for equivalence and inheritance based consistent reasoning. arXiv preprint arXiv:2410.14235.
Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. 2023. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246.
Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. 2025. Bring reason to vision: Understanding perception and reasoning through model merging. arXiv preprint arXiv:2505.05464.
Yuchun Fan, Yongyu Mu, Yilin Wang, Lei Huang, Junhao Ruan, Bei Li, Tong Xiao, Shujian Huang, Xiaocheng Feng, and Jingbo Zhu. 2025. Slam: Towards efficient multilingual reasoning via selective language alignment. arXiv preprint arXiv:2501.03681.
Xiang Geng, Ming Zhu, Jiahuan Li, Zhejian Lai, Wei Zou, Shuaijie She, Jiaxin Guo, Xiaofeng Zhao, Yinglu Li, Yuang Li, et al. 2024. Why not transform chat large language models to non-english? arXiv preprint arXiv:2405.13923.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Zixian Huang, Wenhao Zhu, Gong
Cheng, Lei Li, and Fei Yuan. 2024. Mindmerger: Efficiently boosting llm reasoning in non-english languages. Advances in Neural Information Processing Systems, 37:34161–34187.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276.
Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, and Monojit Choudhury. 2024. Do moral judgment and reasoning capability of llms change with language? A study using the multilingual defining issues test. arXiv preprint arXiv:2402.02135.
Huiyuan Lai and Malvina Nissim. 2024. mcot: Multilingual instruction tuning for reasoning consistency in language models. arXiv preprint arXiv:2406.02301.
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
Chong Li, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2023. Improving in-context learning of multilingual generative language models with cross-lingual alignment. arXiv preprint arXiv:2311.08089.
Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, and Xiachong Feng. 2025a. Causal tracing of object representations in large vision language models: Mechanistic interpretability and hallucination mitigation. arXiv preprint arXiv:2511.05923.
Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, et al. 2025b. Cai: Caption-sensitive attention intervention for mitigating object hallucination in large
vision-language models. arXiv preprint arXiv:2506.23590.
Chaoqun Liu, Wenxuan Zhang, Yiran Zhao, Anh Tuan Luu, and Lidong Bing. 2024. Is translation all you need? A study on solving multilingual tasks with large language models. arXiv preprint arXiv:2403.10258.
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2024. mcsqa: Multilingual commonsense reasoning dataset with unified creation strategy by language models and humans. arXiv preprint arXiv:2406.04215.
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057.
Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. 2025. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. arXiv preprint arXiv:2503.11314.
Qwen Team. 2024. Qwen2.5: A party of foundation models.
Qwen Team. 2025. Qwen2.5-vl.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, and Xilin
Chen. 2024. M4u: Evaluating multilingual understanding and reasoning for large multimodal models. arXiv preprint arXiv:2405.15638.
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025a. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265.
Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, et al. 2025b. Polymath: Evaluating mathematical reasoning in multilingual contexts. arXiv preprint arXiv:2504.18428.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
Zekai Ye, Qiming Li, Xiaocheng Feng, Libo Qin, Yichong Huang, Baohang Li, Kui Jiang, Yang Xiang, Zhirui Zhang, Yunfei Lu, et al. 2025. Claim: Mitigating multilingual object hallucination in large vision-language models with cross-lingual attention intervention. arXiv preprint arXiv:2506.11073.
Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, and Min Zhang. 2024a. Lingualift: An effective two-stage instruction tuning framework for low-resource language tasks. arXiv e-prints, arXiv–2412.
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. 2024b. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer.
Yiran Zhao, Wenxuan Zhang, Guizhen Chen,
Kenji Kawaguchi, and Lidong Bing. 2024. How do large language models handle multilingualism? Advances in Neural Information Processing Systems, 37:15296–15319.

A Translation Quality
To evaluate the translation quality of our constructed MMathVerse and MMathVista benchmarks, we sample 500 translated queries per language from each benchmark and back-translate them into English using Google Translate. The back-translated English queries are then fed to the LVLMs to test whether their predictions align with those generated from the original English queries. High prediction consistency indicates that the translated data maintains high benchmark quality. As shown in Table 6, the results demonstrate the reliability of our constructed multilingual datasets.

Lang.   Zh     Ja     Es     Ru    Fr     De     Th    Sw
APC     100.0  100.0  100.0  99.8  100.0  100.0  99.8  99.7
Table 6: Average Predicted Consistency (APC, %) of MMathVerse and MMathVista across eight languages.

B Experimental Details
In this section, we present the experimental details, including inference settings, the design of experimental prompts, fine-grained results on MathVerse and MathVista, and an ablation of the intervention strengths.

B.1 Inference Settings
To ensure that the decoding parameters are well aligned with reasoning tasks and to guarantee the reproducibility of our results, we carefully design the inference settings for each model, as shown in Table 7.
Model                     Setting
Qwen2.5-7B-Instruct       do_sample=True, temperature=0.7, top_p=0.8, top_k=20
Qwen3-8B                  do_sample=True, temperature=0.7, top_p=0.8, top_k=20, enable_thinking=False
Llama-3.1-8B-Instruct     do_sample=True
LLaVA-Onevision-7B        do_sample=True, temperature=0.6
Qwen2.5-VL-7B-Instruct    do_sample=True, temperature=0.1, top_p=0.001, repetition_penalty=1.1
InternVL3.5-VL-8B-Chat    do_sample=True, temperature=0.1, top_p=0.001, repetition_penalty=1.1
Table 7: Inference settings for each LLM and LVLM.
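For reproducibility, the settings in Table 7 can be kept in one mapping and splatted into Hugging Face's `model.generate(**inputs, **GEN_SETTINGS[name])`. This sketch is ours, not the authors' code; note that for Qwen3 `enable_thinking` is an argument of `tokenizer.apply_chat_template`, not of `generate`, so it is stored separately here.

```python
# Table 7's decoding settings as keyword dicts for Hugging Face `generate`.
# Mapping names are illustrative, not from the paper's released code.
GEN_SETTINGS = {
    "Qwen2.5-7B-Instruct":    dict(do_sample=True, temperature=0.7, top_p=0.8, top_k=20),
    "Qwen3-8B":               dict(do_sample=True, temperature=0.7, top_p=0.8, top_k=20),
    "Llama-3.1-8B-Instruct":  dict(do_sample=True),
    "LLaVA-Onevision-7B":     dict(do_sample=True, temperature=0.6),
    "Qwen2.5-VL-7B-Instruct": dict(do_sample=True, temperature=0.1, top_p=0.001,
                                   repetition_penalty=1.1),
    "InternVL3.5-VL-8B-Chat": dict(do_sample=True, temperature=0.1, top_p=0.001,
                                   repetition_penalty=1.1),
}

# Passed to tokenizer.apply_chat_template(...) rather than generate(...).
CHAT_TEMPLATE_KWARGS = {"Qwen3-8B": dict(enable_thinking=False)}
```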
B.2 Design of Experimental Prompts
In this section, we provide a detailed description of
each type of prompt, along with its intended purpose.
Language Forcing Prompts are designed to en-
force the model to generate outputs in the same
language as the input. Following prior work (Wang
et al., 2025b), we adopt a similar strategy, with the
detailed prompt contents provided in Table 8.
Lang. Prompts
En Use English to think and answer.
Zh 使用中文进行思考和回答。
Ja 日本語を使って考え、回答してください。
Es Usa español para pensar y responder.
Fr Utilisez le français pour penser et répondre.
De Verwende Deutsch, um zu denken und zu antworten.
Sw Tumia Kiswahili kufikiri na kujibu.
Table 8: Language forcing prompts contents.
Prompts for MGSM and MSVAMP These two
datasets primarily adopt free-form formats to eval-
uate the reasoning capabilities of LLMs. As shown
in Table 9, we ask the model to first generate rea-
soning responses and then mark final answers using
the \\boxed{} format.
Lang. Prompts
En Please first reason through the problem, then pro-
vide the final answer, expressed as a number using
the \\boxed{} format.
Zh 请首先进行推理,然后给出最后的答案,
用\\boxed{}的形式表示最后的数字。
Ja まず推理を行ってください。その後、最終的な答えを\\boxed{}の形式で表示してください。
Es Por favor, primero realice el razonamiento y luego
dé la respuesta final, representando el número
final en forma de \\boxed{}.
Fr Veuillez d’abord raisonner, puis donner la
réponse finale sous la forme \\boxed{}.
De Bitte führen Sie zunächst Ihre Überlegungen durch
und geben Sie dann die endgültige Antwort an.
Verwenden Sie zur Darstellung der endgültigen
Zahl die Form \\boxed{}.
Sw Tafadhali fanya hoja kwanza, kisha toa jibu la
mwisho kwa kuandika nambari ya mwisho katika
umbo la \\boxed{}.
Table 9: Reasoning prompts for MGSM and MSVAMP.
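Scoring free-form responses requires pulling the final number back out of the `\boxed{}` marker the prompts above request. A minimal brace-aware extractor (the helper name is ours; the paper does not specify its parsing code):

```python
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a model response,
    tolerating nested braces such as \\boxed{\\frac{1}{2}}."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:          # matching close brace found
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as unparseable
```

Taking the last occurrence matters because chain-of-thought responses may mention `\boxed{}` before stating the final answer.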
Prompts for MathVerse and MathVista These
two benchmarks are employed to evaluate the rea-
soning capability of LVLMs, covering two cate-
gories of tasks: (1) multiple-choice and (2) free-
form. As shown in Table 10, we design tailored
reasoning prompts to guide the model’s response
for multiple-choice queries. The prompts for free-form
queries are the same as those used for the MGSM and
MSVAMP benchmarks.
Lang. Prompts
En Please first conduct reasoning, then answer
the question and put the correct option letter
into \\boxed{}, e.g., \\boxed{A}, \\boxed{B},
\\boxed{C}, \\boxed{D}, at the end.
Zh 请先进行推理,然后回答问题,并在最后将正确的选项字母填入\\boxed{}中,例如:\\boxed{A}, \\boxed{B}, \\boxed{C}, \\boxed{D}.
Ja まず推論を行い、次に質問に答えて、正しいオプション文字を\\boxed{}に入れます。例: \\boxed{A}, \\boxed{B}, \\boxed{C}, \\boxed{D}、最後に。
Es Primero realice el razonamiento, luego responda la
pregunta y coloque la letra de la opción correcta en
\\boxed{}, por ejemplo, \\boxed{A}, \\boxed{B},
\\boxed{C}, \\boxed{D}, al final.
Fr Veuillez d’abord effectuer un raisonnement, puis
répondre à la question et mettre la lettre d’option
correcte dans \\boxed{}, par exemple, \\boxed{A},
\\boxed{B}, \\boxed{C}, \\boxed{D}, à la fin.
De Bitte führen Sie zuerst eine Argumentation durch,
beantworten Sie dann die Frage und setzen Sie
am Ende den richtigen Optionsbuchstaben in
\\boxed{}, z. B. \\boxed{A}, \\boxed{B},
\\boxed{C}, \\boxed{D}.
Sw Tafadhali kwanza elekeza hoja, kisha ujibu swali na
uweke herufi ya chaguo sahihi kwenye \\boxed{},
k.m., \\boxed{A}, \\boxed{B}, \\boxed{C},
\\boxed{D}, mwishoni.
Table 10: Multiple-choice reasoning prompts for MathVerse and MathVista.
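For the multiple-choice prompts above, the option letter must likewise be recovered from the last `\boxed{}` for scoring. A small illustrative helper (ours, not the authors' evaluation code):

```python
import re

def score_choice(response: str, gold: str) -> bool:
    """True iff the last \\boxed{X} in the response holds the gold
    option letter (A-D); comparison is case-insensitive for `gold`."""
    letters = re.findall(r"\\boxed\{([A-D])\}", response)
    return bool(letters) and letters[-1] == gold.upper()
```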
B.3 Ablation of Intervention Strength
In this subsection, we conduct a detailed analysis of
how the intervention strengths α1 and α2 influence
the performance of MRRE.
We first analyze the intervention strength of the
cross-lingual enhancement vectors, α1, which aim
to align non-English reasoning hidden states with
English in the latent space. As shown in Table 11,
the optimal performance is achieved when α1 =
1, while weaker or stronger interventions fail to
minimize the distribution gap in the latent space,
limiting the effectiveness of MRRE.
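The two sequential interventions studied in this ablation can be sketched as PyTorch forward hooks. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an HF-style decoder exposing `model.model.layers`, an additive injection h ← h + α·v, and the LLM layer choices from Section 5.4 (layers 20 and 23); the function name is ours.

```python
import torch

def attach_mrre_hooks(model, v1, v2, layer1=20, layer2=23,
                      alpha1=1.0, alpha2=0.75):
    """Inject the cross-lingual enhancement vector v1 at `layer1` and the
    target-language anchoring vector v2 at `layer2` during the forward pass."""
    def make_hook(vec, alpha):
        def hook(module, inputs, output):
            # Decoder layers often return a tuple; hidden states come first.
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * vec.to(hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    layers = model.model.layers  # assumed HF-style decoder layout
    return [
        layers[layer1].register_forward_hook(make_hook(v1, alpha1)),
        layers[layer2].register_forward_hook(make_hook(v2, alpha2)),
    ]
```

The returned handles should be `.remove()`d after generation so that later queries run unmodified.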
Then we analyze the intervention strength of the target-language output anchoring vectors, α2, which aim to align the English-like generation hidden states with the original non-English distribution in the latent space. As shown in Table 12, the optimal performance is achieved when α2 = 0.75: overly weak interventions tend to yield English responses rather than target-language responses, while overly strong interventions cause performance drops on reasoning tasks. Notably, α2 functions as a trade-off strategy, effectively balancing multilingual reasoning performance with input-output language consistency.
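Language consistency can be approximated without external tools by checking the dominant Unicode script of a response. This stdlib sketch is only a rough proxy for whichever language-identification method underlies the LC numbers here; the keyword table and the 30% threshold are our assumptions, not the paper's metric.

```python
import unicodedata

# Script cues for a few of the evaluated languages (illustrative subset).
SCRIPT_KEYWORDS = {
    "Zh": "CJK UNIFIED",  # Han ideographs
    "Ja": "HIRAGANA",     # kana distinguishes Japanese from Chinese
    "Ru": "CYRILLIC",
    "Th": "THAI",
}

def matches_target_script(text: str, lang: str, threshold: float = 0.3) -> bool:
    """Return True if at least `threshold` of the letters in `text`
    belong to the script associated with `lang`."""
    keyword = SCRIPT_KEYWORDS[lang]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    hits = sum(keyword in unicodedata.name(c, "") for c in letters)
    return hits / len(letters) >= threshold
```

Scripts shared across languages (e.g. Latin for Es/Fr/De/Sw) would need a real language identifier instead of this heuristic.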
α1    0.00  0.25  0.50  0.75  1.00  1.25
Zh    78.4  79.3  80.5  82.3  83.2  82.5
Ja    65.2  67.1  68.3  71.0  73.7  70.8
Th    71.6  72.3  73.1  74.2  75.2  73.9
Table 11: Ablation study of α1 on the MGSM benchmark using Qwen2.5-7B-Instruct. Layer1 is set to 20.
α2    0.00         0.50         0.75         1.00
Zh    83.2 (26.5)  82.5 (78.7)  81.6 (99.2)  79.5 (99.5)
Ja    73.7 (19.8)  71.9 (69.9)  71.2 (99.1)  70.0 (99.6)
Th    75.2 (7.9)   75.0 (54.1)  74.8 (99.2)  72.1 (99.4)
Table 12: Ablation study of α2 on the MGSM benchmark using Qwen2.5-7B-Instruct. The first number in each cell denotes accuracy (%) and the number in parentheses denotes Language Consistency (LC, %). Layer1 is set to 20, α1 is set to 1.0, and Layer2 is set to 23.

B.4 Fine-grained Results of the MathVerse and MathVista Benchmarks
To further elucidate how MRRE improves LVLM reasoning, we perform a fine-grained analysis across eight languages, five visual task families (text-dominant, text-lite, vision-integrated, vision-dominant, and vision-only), and two domains (general and math). We disaggregate performance and consistency metrics by language, task category, and domain to identify where MRRE yields the largest gains; results are reported in Table 13.

C Fine-grained Case Studies
To better illustrate the improvements in reasoning capability and input-output language consistency achieved by MRRE, we provide additional comprehensive case studies.

Model                 T-D   T-L   V-I   V-D   V-O   Overall  General  Math  Overall
                      (MMathVerse)                           (MathVista)
Language-Zh
Qwen2.5-VL-7B         44.4  39.5  35.5  36.0  36.5  38.4     59.3     57.0  58.1
+ Language forcing    41.8  35.3  33.4  34.6  38.7  36.8     60.8     59.8  60.0
+ MRRE                54.2  47.0  43.8  42.9  41.9  45.9     60.7     63.7  62.3
Language-Ja
Qwen2.5-VL-7B         42.8  35.3  33.5  34.5  35.0  36.4     55.0     49.4  52.0
+ Language forcing    44.2  39.2  36.8  36.5  39.6  37.7     54.8     51.5  53.0
+ MRRE                51.6  46.2  42.5  42.4  40.5  44.6     59.3     62.6  61.1
Language-Es
Qwen2.5-VL-7B         49.5  43.5  40.9  38.1  40.4  42.5     63.0     61.5  62.2
+ Language forcing    51.8  43.4  40.5  42.3  40.1  43.6     62.2     60.4  61.2
+ MRRE                55.5  49.5  46.4  45.1  40.7  47.5     64.1     64.3  64.2
Language-Ru
Qwen2.5-VL-7B         45.1  39.0  35.8  35.8  37.6  38.6     60.4     55.7  57.9
+ Language forcing    46.7  41.2  39.6  38.1  37.4  40.6     61.3     55.7  58.3
+ MRRE                52.0  45.3  41.8  42.3  42.4  44.7     62.8     62.8  62.8
Language-Fr
Qwen2.5-VL-7B         43.0  38.6  35.8  36.8  36.4  38.1     61.5     58.0  59.6
+ Language forcing    46.6  40.0  36.9  36.5  37.9  39.6     61.1     58.0  59.4
+ MRRE                51.9  47.5  44.7  42.0  42.9  45.8     62.2     60.2  61.2
Language-De
Qwen2.5-VL-7B         42.5  38.2  32.5  33.4  36.0  36.5     58.9     57.6  58.2
+ Language forcing    50.5  39.0  35.3  36.9  38.2  40.0     56.1     57.6  56.9
+ MRRE                52.9  47.0  42.9  43.8  41.6  45.7     63.5     64.6  64.1
Language-Th
Qwen2.5-VL-7B         36.8  33.1  31.2  31.2  27.5  32.0     52.0     50.4  51.1
+ Language forcing    50.3  43.8  41.8  40.6  41.0  43.5     55.9     51.9  53.7
+ MRRE                56.7  51.0  43.8  43.8  46.9  48.4     58.7     55.8  57.3
Language-Sw
Qwen2.5-VL-7B         15.1  13.8  11.8  14.0  27.3  16.4     17.6     15.9  16.7
+ Language forcing    20.1  16.0  15.0  15.0  22.8  17.8     8.7      11.3  10.1
+ MRRE                34.4  32.5  31.6  32.4  38.2  33.8     20.4     35.6  28.6
Table 13: Fine-grained results on Qwen2.5-VL-7B-Instruct across eight languages.

Para responder esta pregunta, debemos analizar cómo los árboles están conectados en la red trófica mostrada.\n\n1. Los árboles son una fuente de alimento para los animales herbívoros.\n2. En este diagrama, los animales herbívoros son los ciervos (que comen hojas de árbol).\n3. Los caballos también comen hojas de árbol, pero no están directamente conectados a los árboles como lo están los ciervos.\n4. Los gatos y los osos no tienen conexión directa con los árboles en este diagrama.\n\nPor lo
tanto, la pérdida de árboles
tendrá un impacto más directo en los ciervos, ya que son los únicos animales que dependen directamente de los árboles como fuente de
alimentación.\n\nLa respuesta correcta es:\n\n\\boxed{B}.
Qwen2.5-VL-7B w/ Es Query
Input Image
In the diagram of the food web shown what will
most directly be affected by the loss of the trees?
Choices:
(A) horses
(B) cats
(C) nothing
(D) bears
Input Query w/ English
Para determinar qué organismo se verá más directamente afectado por la pérdida de árboles en el diagrama de la red alimentaria dada,
necesitamos analizar las flechas que van desde los árboles hacia otros organismos. Estas flechas indican la dirección del flujo de
energía y quién consume a quién.\n\n1. Los árboles tienen flechas que apuntan hacia ellos desde los ciervos y las plantas, lo que indica
que estos organismos consumen los árboles.\n2. Hay una flecha que va desde los árboles hacia las plantas, lo que sugiere que las
plantas también consumen los árboles.\n3. Las plantas tienen flechas que apuntan hacia los caballos, lo que indica que los caballos
consumen las plantas.\n4. Los caballos tienen una flecha que apunta hacia los osos, lo que indica que los osos consumen a los
caballos.\n5. Los gatos tienen una flecha que apunta hacia los ciervos, lo que indica que los ciervos consumen a los gatos.\n6. Los osos
no tienen ninguna flecha que apunte hacia ellos desde los árboles, por lo que no se ven directamente afectados por la pérdida de
árboles.\n\nA partir de este análisis, está claro que las plantas se ven directamente afectadas por la pérdida de árboles porque hay una
flecha que va desde los árboles hacia las plantas. Sin embargo, dado que la pregunta pide el organismo que se verá más directamente
afectado, y las plantas son consumidas por los caballos, los caballos se verán afectados indirectamente a través de las plantas.\n\nPor lo
tanto, la respuesta es:\\boxed{A}.
Qwen2.5-VL-7B+MRRE w/ Es Query
Для ответа на этот вопрос нужно внимательно рассмотреть схему пищевой цепи:\n\n1. Деревья являются основным
источником питания для травоядных животных (лошадь и олень).\n2. Лошадь и олень, в свою очередь, становятся пищей для
хищников (кошка).\n\nТаким образом, если исчезнут деревья, то это приведет к недостатку пищи для лошадей и оленей, а
следовательно, и для кошек. Медведи не указаны в этой пищевой цепи как потребители растительной пищи.\n\nОтвет:
\\boxed{C}.
Qwen2.5-VL-7B w/ Ru Query
Чтобы определить ответ на этот вопрос, нам нужно проанализировать пищевую сеть, изображённую на рисунке, и понять,
какой организм напрямую пострадает от исчезновения деревьев (продуцентов в основании пищевой сети).\n\n1. Определение
продуцентов: Растения в нижней части диаграммы представляют собой основных продуцентов, таких как деревья и
кустарники.\n\n2. Определение прямых потребителей: Организмы, которые питаются этими продуцентами, — это травоядные,
в частности лошади и, возможно, другие мелкие животные, не показанные на схеме, но подразумеваемые из контекста.\n\n3.
Определение хищников: Хищники на диаграмме включают кошку и медведя, которые охотятся на травоядных, таких как
лошадь.\nУчитывая эту информацию, если деревья исчезнут, прямое воздействие будет на травоядных, которые зависят от
них в питании. В данной диаграмме таким травоядным является лошадь.\n\nТаким образом, правильный ответ: \\boxed{A}.
Qwen2.5-VL-7B+MRRE w/ Ru Query
Figure 7: Case study of a General problem on Qwen2.5-VL-7B-Instruct.
为了判断这个图形是否是一个平行四边形,我们需要利用平行四边形的性质和角度关系。
首先,我们知道平行四边形对角相等且邻角互补(即两个相邻的角度之和等于180度)。
观察图中的角度信息:一个角是 \\(60^\\circ\\),另一个与它相邻的角是 \\(55^\\circ + (x+4)\\) 或者 \\((3x - 6)\\),因为它们是平
行线上的同旁内角,所以这两个角应该互补,即 \\(60^\\circ + [55^\\circ + (x+4)] = 180^\\circ\\) 或者 \\(60^\\circ + [(3x - 6)] =
180^\\circ\\)。
我们来解第一个方程:
\[60^\circ + 55^\circ + (x + 4) = 180^\circ\]
\[119^\circ + x + 4 = 180^\circ\]
\[x + 123 = 180^\circ\]
\[x = 180^\circ - 123^\circ\]
\[x = 57^\circ - 123^\circ\]
显然,上面的计算有误,因为我们没有正确地应用补角的概念。实际上,我们应该直接使用第二个方程来解决这个问题,
因为它更直接反映了一个邻角的关系。
现在考虑第二个方程:
\[60^\circ + (3x - 6) = 180^\circ\]
\[60^\circ + 3x - 6 = 180^\circ\]
\[3x + 54 = 180^\circ\]
\[3x = 180^\circ - 54^\circ\]
\[3x = 126^\circ\]
\[x = 42^\circ\]
但是,根据题目给出的答案选项来看,我们的目标是找到一个合适的 $x$ 值使得图形成为平行四边形。因此,我们需要重
新审视原始条件并确保选择正确的答案。考虑到题目提供的选项,我们可以尝试验证每个选项是否满足条件。但基于上述
分析,最接近的答案应该是通过验证选项得出的。
最终答案应为 \(\boxed{D}\)。
Qwen2.5-VL-7B w/ Chinese Query
Input Image
Find x so that the quadrilateral is a parallelogram.
Choices:
A:3
B:5
C:12
D:18
Input Query w/ English
为了确定使给定四边形成为平行四边形的 \( x \) 的值,我们需要利用平行四边形的性质。具体来说,在平行四边形中,
\textbf{对边长度相等}。
已知边长为 \( x + 4 \)、\( 5y \)、\( 3x - 6 \),以及另一条未标记但可根据问题设置推断的边,我们需要根据对边相等的性质
来建立方程。
从图表中,我们有:
- 一条边是 \( x + 4 \)。
- 其对边是 \( 3x - 6 \)。
将这两个表达式设为相等,得到:
\[ x + 4 = 3x - 6 \]
现在,求解 \( x \):
1. 两边同时减去 \( x \):
\[ 4 = 2x - 6 \]
2. 两边同时加上 6:
\[ 10 = 2x \]
3. 两边同时除以 2:
\[ x = 5 \]
因此,使该四边形成为平行四边形的 \( x \) 的值是 \(\boxed{B}\)。
Qwen2.5-VL-7B+MRRE w/ Chinese Query
Figure 8: Case study of Text Lite problem on Qwen2.5-VL-7B-Instruct.
首先,我们知道在圆中,如果一个角的顶点在圆心上,则这个角所对的弧等于该角的一半。\n\n题目给出的是角OAB =
35.0°,这是一个圆周角,它所对的弧是AB。根据圆周角定理,圆周角等于它所对的圆心角的一半,所以角AOB = 2 * 角
OAB = 2 * 35.0° = 70.0°。\n\n接下来,我们需要找出角ACB的度数。由于角ACB是一个圆周角,它所对的弧也是AB。因此,
角ACB也等于角AOB的一半,即角ACB = 1/2 * 角AOB = 1/2 * 70.0° = 35.0°。\n\n所以,正确答案是 A: 35°。
\n\n\\boxed{A}。
Qwen2.5-VL-7B w/ Zh Query
Input Image
As shown in the figure, triangle ABC is the
inscribed triangle of circle O, angle OAB = 35.0,
then the degree of angle ACB is ()
Choices:
A:35°
B:55°
C:60°
D:70°
Input Query w/ English
To solve this problem, we use the properties of circles and triangles.

1. **Identify the given conditions:**
 - Triangle \( \triangle ABC \) is inscribed in circle \( O \).
 - \( \angle OAB = 35^\circ \).

2. **Understand the relationship between central and inscribed angles:**
 - The measure of an inscribed angle equals half the measure of the central angle it subtends.
 - In this problem, \( \angle ACB \) is an inscribed angle, and it subtends the same arc as the central angle \( \angle AOB \).

3. **Find the measure of \( \angle AOB \):**
 - Since \( \angle OAB \) belongs to triangle \( \triangle OAB \), and \( OA = OB \) (both are radii of the circle), \( \triangle OAB \) is isosceles with \( \angle OAB = \angle OBA = 35^\circ \).
 - The interior angles of \( \triangle OAB \) must sum to \( 180^\circ \):
 \[ \angle AOB + \angle OAB + \angle OBA = 180^\circ \]
 Substituting the known values:
 \[ \angle AOB + 35^\circ + 35^\circ = 180^\circ \]
 Simplifying:
 \[ \angle AOB + 70^\circ = 180^\circ \]
 Solving:
 \[ \angle AOB = 110^\circ \]

4. **Find the measure of \( \angle ACB \):**
 - Since \( \angle ACB \) is an inscribed angle subtending the same arc as the central angle \( \angle AOB \), its measure is half of \( \angle AOB \):
 \[ \angle ACB = \frac{1}{2} \times \angle AOB \]
 Substituting the value of \( \angle AOB \):
 \[ \angle ACB = \frac{1}{2} \times 110^\circ = 55^\circ \]

Therefore, the correct answer is:
\[ \boxed{B} \]
Qwen2.5-VL-7B+MRRE w/ Zh Query
Figure 9: Case study of Text Dominant problem on Qwen2.5-VL-7B-Instruct.
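The two transcripts above differ in a single step: the baseline equates \( \angle AOB \) with \( 2 \times \angle OAB \), while the MRRE output uses the isosceles triangle \( OAB \) formed by the radii. A quick numeric sketch (ours, not the paper's) of the correct chain:

```python
angle_OAB = 35.0
# OA = OB are radii, so triangle OAB is isosceles and its base angles are equal.
angle_OBA = angle_OAB
# Interior angles of triangle OAB sum to 180 degrees.
angle_AOB = 180.0 - angle_OAB - angle_OBA   # 110.0
# Inscribed angle theorem: ACB subtends the same arc AB as central angle AOB.
angle_ACB = angle_AOB / 2                   # 55.0 -> choice B
assert angle_ACB == 55.0
```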
In this problem, we use the inscribed angle theorem and the arc length formula.

1. **Inscribed angle theorem**: An inscribed angle is half of the central angle. That is, \( \angle RPS = \frac{1}{2} \times 130^\circ = 65^\circ \).

2. **Arc length formula**: The arc length \( L \) can be computed as:
 \[ L = r\theta \]
 where \( r \) is the radius and \( \theta \) is the central angle (in radians).

 - Radius \( r = 2 \)
 - Convert the central angle \( \theta = 65^\circ \) to radians. Since \( 1^\circ = \frac{\pi}{180} \) radians,
 \[ \theta = 65^\circ \times \frac{\pi}{180} = \frac{65\pi}{180} = \frac{13\pi}{36} \]

 - Compute the arc length \( L \):
 \[ L = 2 \times \frac{13\pi}{36} = \frac{26\pi}{36} = \frac{13\pi}{18} \]

 Evaluating numerically:
 \[ \frac{13\pi}{18} \approx \frac{13 \times 3.1416}{18} \approx \frac{40.8708}{18} \approx 2.2706 \]

 Rounding to one decimal place gives about 2.3, and the closest choice is A: 2.27.

Therefore, the answer is \boxed{A}.
Qwen2.5-VL-7B w/ Ja Query
Input Image
In \( \odot P \), find the length of \( \widehat{RS} \).
Round to the nearest hundredth.
Choices:
A:2.27
B:3.91
C:4.54
D:7.82
Input Query w/ English
To find the length of arc RS in circle \( \odot P \), we need to use the relationship between the central angle and the arc it intercepts. The measure of an arc is proportional to its corresponding central angle. The arc length formula is:

\[ \text{arc length} = \frac{\theta}{360^\circ} \times 2\pi r \]

where:
- \( \theta \) is the measure of the central angle,
- \( r \) is the radius of the circle.

From the diagram, the central angle \( \angle RPS \) is given as \( 130^\circ \), and the radius \( r \) is given as 2 units. Substituting these values into the formula:

\[ \text{arc length} = \frac{130^\circ}{360^\circ} \times 2\pi \times 2 \]

Simplifying the expression:

\[ \text{arc length} = \frac{130}{360} \times 4\pi \]
\[ \text{arc length} = \frac{13}{36} \times 4\pi \]
\[ \text{arc length} = \frac{52\pi}{36} \]
\[ \text{arc length} = \frac{13\pi}{9} \]

Next, with \( \pi \approx 3.14159 \):

\[ \text{arc length} = \frac{13 \times 3.14159}{9} \]
\[ \text{arc length} \approx \frac{40.84477}{9} \]
\[ \text{arc length} \approx 4.54 \]

Rounded to the nearest hundredth.
Qwen2.5-VL-7B+MRRE w/ Ja Query
Figure 10: Case study of Vision Dominant problem on Qwen2.5-VL-7B-Instruct.
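The two transcripts above disagree on the central angle: the baseline halves 130° (treating \( \angle RPS \) as an inscribed angle) before applying the arc-length formula, while the MRRE output uses 130° directly. A sanity check of both computations (our sketch, not part of the paper):

```python
import math

def arc_length(central_deg, radius):
    # arc length = (theta / 360 degrees) * 2 * pi * r
    return central_deg / 360.0 * 2 * math.pi * radius

correct = arc_length(130, 2)   # MRRE: central angle 130 degrees, radius 2
wrong = arc_length(65, 2)      # baseline halves the central angle first
assert round(correct, 2) == 4.54   # choice C
assert round(wrong, 2) == 2.27     # choice A, the baseline's answer
```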
There are two ways about this:

1. **Looking at using the given numbers**: It can be noted that angle ACB = 25°.

2. **Using the given numbers**:
 - It can be noted that the measure of angle BAO is angle AOB.
 - Because angle AOB is an inscribed angle (known as the angle subtended by the arc AB), which is equal to twice the angle at the circumference that subtends the same arc.

3. **Using the given numbers**:
 - It can be noted that the measure of angle BAO is 2 * angle ACB.
 - Given angle ACB = 25°, the measure of angle BAO is 2 * 25° = 50°.

Therefore, the measure of angle BAO is 50°.

In any case, it can be noted that:

\boxed{A}
Qwen2.5-VL-7B w/ Sw Query
Input Image
As shown in the figure, angle ACB = 25.0,
then the degree of angle BAO is ()
Choices:
A:50°
B:55°
C:60°
D:65°
Input Query w/ English
To solve this problem, we need to use the properties of a circle and the angles in a triangle.

Given:
- \( \angle ACB = 25^\circ \)
- We need to find the measure of \( \angle BAO \)

Step-by-step explanation:

1. Identify the type of triangle: Since \( O \) is the center of the circle, \( OA \) and \( OB \) are radii of the circle. Therefore, \( \triangle AOB \) is an isosceles triangle with \( OA = OB \).

2. Sum of angles in a triangle: The interior angles of any triangle sum to \( 180^\circ \). In \( \triangle AOB \), the base angles are equal because it is isosceles. Call these angles \( \angle OAB \) and \( \angle OBA \). Thus we have:
 \[ \angle OAB + \angle OBA + \angle AOB = 180^\circ \]

3. Relationship between the central angle and the inscribed angle: The central angle \( \angle AOB \) subtends the same arc as the inscribed angle \( \angle ACB \). The relationship is that the central angle is twice the inscribed angle. Therefore:
 \[ \angle AOB = 2 \times \angle ACB \]
 Substituting the given value:
 \[ \angle AOB = 2 \times 25^\circ = 50^\circ \]

4. Sum of angles in \( \triangle AOB \): We now know one of the angles in \( \triangle AOB \):
 \[ \angle AOB = 50^\circ \]
 Since \( \triangle AOB \) is isosceles:
 \[ \angle OAB = \angle OBA \]
 Call each of these angles \( x \). Then:
 \[ x + x + 50^\circ = 180^\circ \]
 Simplifying:
 \[ 2x + 50^\circ = 180^\circ \]
 Solving for \( x \):
 \[ 2x = 130^\circ \]
 \[ x = 65^\circ \]

Therefore, the measure of \( \angle BAO \) is \( 65^\circ \).

The correct answer is:
\[ \boxed{D} \]
Qwen2.5-VL-7B+MRRE w/ Sw Query
Figure 11: Case study of Vision Intensive problem on Qwen2.5-VL-7B-Instruct.
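The MRRE derivation above reduces to two facts: the central angle is twice the inscribed angle subtending the same arc, and the radii make triangle AOB isosceles. A numeric sketch of that chain (ours, not the paper's):

```python
angle_ACB = 25.0
# Central angle subtending the same arc is twice the inscribed angle.
angle_AOB = 2 * angle_ACB                   # 50.0
# OA = OB are radii, so the base angles of triangle AOB are equal.
angle_BAO = (180.0 - angle_AOB) / 2         # 65.0 -> choice D
assert angle_BAO == 65.0
```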
To solve this problem, we must use the properties of the angles formed by two parallel lines cut by a transversal.

1. **Identify the corresponding angles:**
 - Angles 1 and 50° are alternate interior angles because they lie on opposite sides of the transversal line EG and on the same side of the plane with respect to lines AB and CD.

2. **Use the alternate interior angles property:**
 - If two lines are parallel, then alternate interior angles are equal.
 - Therefore, angle 1 = 50°.

3. **Angle bisector:**
 - EG is an angle bisector that divides angle BEF into two equal angles.
 - Therefore, angle BEG = angle FEG = (angle BEF) / 2.

4. **Compute angle BFG:**
 - Angle BFG is the supplementary angle of angle 1 (since they form a right angle with the 50° angle).
 - Therefore, angle BFG = 180° - 50° = 130°.

5. **Angle 2:**
 - Angle 2 is the interior angle adjacent to angle BFG.
 - Therefore, angle 2 = 180° - angle BFG = 180° - 130° = 50°.

So angle 2 equals 50°.

The answer is: \boxed{A}.
Qwen2.5-VL-7B w/ Fr Query
Input Image
No Input Query
To solve the problem, we must use the properties of parallel lines and angle bisectors.

1. **Identify the given information:**
 - \( AB \parallel CD \)
 - \( EG \) is the bisector of \( \angle BEF \)
 - \( \angle 1 = 50^\circ \)

2. **Understand the relationship between the angles:**
 - Since \( AB \parallel CD \), \( \angle BFE \) is an exterior angle for \( \triangle EFG \) and equals the sum of the two non-adjacent interior angles.
 - The bisector theorem states that the bisector of an angle divides the opposite side into segments proportional to the adjacent sides. However, in this context, we can use the fact that the bisector divides the angle into two equal parts.

3. **Compute \( \angle BEF \):**
 - Since \( \angle 1 = 50^\circ \) and \( EG \) is the bisector of \( \angle BEF \), let \( \angle BEG = \angle GEF = x \).
 - Therefore, \( \angle BEF = 2x \).

4. **Use the corresponding angles property:**
 - Since \( AB \parallel CD \), \( \angle BFE = \angle 1 = 50^\circ \).
 - Therefore, \( 2x = 50^\circ \).
 - Solving for \( x \), we get \( x = 25^\circ \).

5. **Find \( \angle 2 \):**
 - \( \angle 2 \) is the same as \( \angle GEF \) because they are corresponding angles formed by the secant \( EG \) cutting the parallel lines \( AB \) and \( CD \).
 - Therefore, \( \angle 2 = 25^\circ \).

However, the given options do not directly match our computed value. Let us re-evaluate the problem with the correct approach:

Since \( \angle BFE = 50^\circ \) and \( EG \) is the bisector of \( \angle BEF \), the correct computation should be:
\[ \angle BEF = 180^\circ - 50^\circ = 130^\circ \]
But since \( \angle 2 \) is the angle on the other side of segment \( EG \), it must equal half of the remaining angle after subtracting the bisected angle from 180 degrees (since the angles on a straight line sum to 180 degrees):
\[ \angle 2 = \frac{180^\circ - 50^\circ}{2} = \frac{130^\circ}{2} = 65^\circ \]

Therefore, the correct answer is:
\boxed{C}.
Qwen2.5-VL-7B+MRRE w/ Fr Query
Figure 12: Case study of Vision Only problem on Qwen2.5-VL-7B-Instruct.
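The final step of the corrected MRRE answer above is plain arithmetic: with \( \angle BEF = 180^\circ - 50^\circ = 130^\circ \) and EG bisecting it, \( \angle 2 = 130^\circ / 2 \). A one-line check (ours, not the paper's):

```python
angle_1 = 50.0
angle_BEF = 180.0 - angle_1   # angles on a straight line sum to 180 degrees
angle_2 = angle_BEF / 2       # EG bisects angle BEF
assert angle_2 == 65.0        # choice C
```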
D Usage of LLMs
We used GPT-4o (Hurst et al., 2024) to assist in
language refinement and readability improvement
of the manuscript. All ideas, experiments, analyses,
and conclusions are developed and verified by the authors.