When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities
1st Sarmistha Das†, 2nd Shreyas Guha†, 3rd Suvrayan Bandyopadhyay†
Department of Computer Science and Engineering
Indian Institute of Technology Patna
Patna, India
{sarmistha_2221cs21, 2201cb58_shreyas, suvrayan_2301cs89}@iitp.ac.in
4th Salisa Phosit, 5th Kitsuchart Pasupa*
School of Information Technology
King Mongkut's Institute of Technology Ladkrabang
Bangkok, Thailand
67076055@kmitl.ac.th, *kitsuchart@it.kmitl.ac.th
6th Sriparna Saha
Department of Computer Science and Engineering
Indian Institute of Technology Patna
Patna, India
sriparna@iitp.ac.in
Abstract: Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. Consider the Bengali idiom angur fol tok ("grapes are sour"): it encodes denial-driven rationalization, yet naïve models latch onto the literal fox-and-grapes imagery. Addressing this oversight, we present "Mediom," a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text-image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose "HIDE," a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems¹.
Index Terms: Idioms, Multimodal, Multilingual, LLMs, VLMs, HIDE, EFL
I. INTRODUCTION
Idioms embody figurative meanings that extend beyond their literal composition, conveying context-dependent semantics that are central to natural discourse. For instance, the Hindi idiom aasmaan se gire, khajoor mein atke (literally, "fallen from the sky, stuck in a date palm") metaphorically denotes escaping one difficulty only to become trapped in another, often equally severe. Metaphorical competence, the human ability to fluidly traverse literal and figurative meanings, underpins idiom comprehension. Yet, as Honeck et al. [1] demonstrate, such interpretations remain inherently context-dependent and subjective. Idioms, as crystallized metaphors, encapsulate cultural knowledge and linguistic economy. Recent advances demonstrate that Large Language Models (LLMs) achieve near-human performance across core linguistic tasks [2], [3], while Vision-Language Models (VLMs) further enhance reasoning by jointly modeling text and vision, yielding substantial gains in multimodal understanding such as visual question answering [4]. However, idioms present a distinct challenge: their meanings cannot be inferred from the literal definitions of individual words but instead depend on cultural context and conventional usage for proper understanding [5]. Idioms crystallize cultural knowledge, so accurate interpretation demands both linguistic and contextual fluency. For example, the Bengali idiom baro mashe tero parbon ("thirteen festivals in twelve months") evokes a uniquely celebratory mindset, while the multilingual counterpart "walls have ears," attested in Hindi, Bengali, and Thai alike, shows that shared metaphors transmit communal wisdom. Yet idiom research remains English-centric [6], leaving Hindi, Bengali, Thai, and vision-grounded understanding largely unexplored. To fill this void, we release the first multimodal, cross-cultural idiom corpus, pairing expert-annotated text with contextual images. Additionally, we provide a hint-embedded idiom reasoning database that supplies interpretive meanings, symbolic meanings, and context-specific cues. This supports a Hinting-based Idiom Explanation (HIDE) loop inspired by Error Feedback Learning (EFL) [7], which helps correct generation errors during idiom interpretation, as demonstrated in Figure 1.

† Equal Contribution. * Corresponding Author.
¹ Resources are available at https://github.com/sarmistha-D/Hide.
The research objectives of the current work are as follows:
(i) Evaluate the extent to which LLMs and VLMs encode
culturally grounded knowledge and its impact on idiomatic
inference accuracy. (ii) Analyze the contribution of the HIDE
paradigm in enhancing fine-grained inferential reasoning over
culturally nuanced idioms. (iii) Assess cross-model generalizability by benchmarking dataset performance across diverse LLM and VLM architectures.
arXiv:2604.10787v1 [cs.CL] 12 Apr 2026
[Figure 1 contents]
Input idiom: angur fol tok ("Grapes are sour"): when someone cannot attain something they desire, they pretend it was not worth having in the first place.
Language Model (initial output): "The <image/idiom> depicts a fox gazing angrily at grapes hanging out of reach, symbolizing frustration and denial."
EFL Database entry: Original idiom: {original}, Incorrect Translation: {model translation}, Correct Translation: {original translation}, Incorrect description: {model description}, Correct description: {original description}.
Generated Hint (issued when the understanding is judged inaccurate): "Try to consider whether the scene might represent a broader idea or emotion rather than describing it literally. Some images may use characters and actions as symbols for common human behaviours or cultural sayings."
Generated Understanding (after the hint): "The <image/idiom> depicts a fox gazing angrily at grapes hanging out of reach, symbolizing frustration and denial. It reflects the moral of 'sour grapes': dismissing what one cannot have by pretending it was undesirable."
Fig. 1. Idiomatic Understanding via HIDE inspired by EFL: The model first interprets the idiom, detects semantic errors, and uses targeted hints to refine and improve idiomatic comprehension.
Our contributions towards the research community are as follows: (i) We introduce "Mediom," the first multimodal idiom benchmark for the low-resource languages Hindi, Bengali, and Thai, comprising 3,533 idioms with fine-grained annotations and rigorously validated human explanations. (ii) We propose "HIDE," a hint-driven idiomatic explanation framework inspired by EFL. (iii) We conduct two complementary evaluations: idiomatic understanding with LLMs and multimodal idiom interpretation with VLMs.
II. BACKGROUND
Recent natural language processing research has extensively explored figurative language comprehension and generation, including simile detection [8], metaphor identification [9], pun recognition [10], and idiom retrieval [11]. Idiomatic expression modeling is closely related to idiom generation, contextual quotation recommendation, and literary text retrieval [12]. Subsequently, Qin et al. [13] developed a BERT-based model that encodes idiomatic expressions within both global and local contexts to improve the explanation of non-compositional meanings. Modern neural text generation models have significantly improved the contextual adaptation of idioms [14], with commonsense reasoning frameworks further enhancing figurative language understanding [15]. HIDE advances this line of work by logging the model's past missteps and feeding them back as corrective context, iteratively tightening reasoning accuracy [7]. Despite growing interest in explanation strategies, their application to idiomatic understanding, particularly regarding the retention of cultural nuance and metaphorical coherence, remains largely unexplored. To address the clear absence of idiom-centric resources for low-resource languages across both LLMs and VLMs, exposed in Table I, our research introduces a retrieval-driven HIDE loop inspired by EFL that maintains a memory of past generation errors, retrieves structurally similar failures at inference time, and injects corrective semantic cues into subsequent prompts.

Table I
COMPARATIVE OVERVIEW OF THE PROPOSED MEDIOM AGAINST LEADING RESOURCES. ABBREVIATIONS: E = ENGLISH; ML = MULTILINGUAL SET (THAI, HINDI, BENGALI); MULTIMODALITY = PAIRED TEXT-IMAGE RESOURCES.

Corpus Name       | Count | Language | Explanations | Multimodality | Idiomatic Part
SemEval-2013 [16] | 4,350 | E        | ✗            | ✗             | ✗
FLUTE [17]        | 8,962 | E        | ✓            | ✗             | ✓ (497)
V-FLUTE [18]      | 6,027 | E        | ✓            | ✓             | ✓ (370)
Mediom (Proposed) | 3,533 | ML       | ✓            | ✓             | 3,533

[Figure 2 contents]
The Bengali idiom gadha gadhai, se jodi rajar-o hoy ("A donkey is a donkey, even if it is the king's") illustrates that someone's fundamental nature cannot be changed by an upgrade in status or wealth; a foolish or inept person remains so, even with privileges or high position.
The Thai idiom khang-nok suk-sai, khang-nai bpen phong ("Outside is shiny, inside is hollow") describes people or things that appear attractive or impressive externally but lack real worth, honesty, or depth inside, highlighting the importance of looking beyond appearances.
A. Dafa hona; literal meaning: to disappear or to be gone; metaphorically, it implies dismissal or rejection.
B. Sone pe suhaga; literally "a wedding ornament on gold"; metaphorically, an added blessing that makes a good situation even better.
C. Aage bagh, pichhe kumir; literal meaning: a tiger in front, a crocodile behind; metaphorically, a situation where one is trapped between two equally dangerous problems, with no easy way out.
D. Nau do gyarah hona; literally "to be at nine and eleven"; refers to quickly escaping or disappearing from a situation, usually to avoid trouble, responsibility, or confrontation.
Fig. 2. (a) Sample instances of our proposed Mediom dataset; (b) Two different candidate images for the same idiom in the Mediom dataset.

III. CORPUS FORMULATION
A. Data Collection
Inspired by [6], we initially compiled a dataset of 3,500 idioms from diverse linguistic and cultural backgrounds, sourcing them from online repositories, literature, and cultural archives, including Hindi idioms from The Simple Help², Bengali idioms from Bangla Probad³, and Thai idioms [19]. Our selection was designed to capture syntactic diversity across idioms in Hindi, Bengali, and Thai, including fixed expressions, such as nam thuam pak ("unable to speak out"); verb-object structures, such as naak ragadna ("to plead intensely"); adjective-noun combinations, such as mishti swapna ("sweet dreams"); prepositional phrases, such as khao hu sai thalu hu khwa ("in one ear and out the other"); and binomial pairs, such as ulta seedha ("all sorts of nonsense"). For syntactically flexible idioms, normalization was performed by unifying verb inflections and replacing person-specific pronouns with neutral counterparts. However, structurally rigid idioms were preserved in their original form to maintain cultural and linguistic authenticity.
Following rigorous curation, 3,533 idioms were retained to ensure diversity and quality; representative samples are shown in Figure 2a.

B. Data Quality Assurance
To ensure rigorous annotation quality, we enlisted a doctoral scholar and two literature professors, native speakers of Thai, Hindi, and Bengali, whose combined academic and linguistic expertise delivers culturally nuanced, high-precision labels.

² https://thesimplehelp.com/hindi-idioms-with-meanings
³ https://archive.org/details/in.ernet.dli.2015.455639

Initially, each expert annotated 100 samples, generating a reference set of 200 idiomatic explanations. Subsequently, we established Information Persistence Rating (IPR) criteria to evaluate the quality of idiom annotations, structured around five core aspects: (i) Literal Translation, where each idiom was directly translated into English while preserving its original structure, metaphorical elements, and imagery; (ii) Contextual Interpretation, in which annotators provided a succinct explanation of the idiom's meaning within its cultural and linguistic context; (iii) Usage Scenarios, which included brief examples illustrating real-world applications of the idiom for enhanced comprehension; (iv) Cultural Significance, where any historical, regional, or societal relevance associated with the idiom was documented; and (v) Coherence Preservation, ensuring the interpretation encapsulated the idiom's intended meaning without fragmentation. To ensure consistency and high annotation quality, linguistic experts established comprehensive guidelines following standard best practices. We then conducted a two-stage annotation competition with 15 participants: (i) a training phase using 100 reference samples, from which seven annotators qualified, and (ii) a testing phase with an additional 100 samples, resulting in the selection of two final annotators. Annotators were compensated at $0.50 per sample. To further guarantee the reliability of the annotations, a stringent two-tiered validation protocol was implemented. (i) Peer Review: each annotation was cross-verified by at least two additional annotators fluent in the source language to enhance accuracy and mitigate subjective biases. (ii) Expert Validation: a panel of linguistic and cultural specialists proficient in reading and writing Thai, Hindi, and Bengali conducted a final review, validating adherence to the IPR criteria for each sample. The rating reflects the retention of individual aspects; for instance, if five aspects are retained, the evaluation score is five. The annotation process yielded a robust inter-annotator agreement of 0.82 (Cohen's kappa), reflecting substantial consistency and reliability across the dataset.

C. Idiomatic Image Creation
We implement a visual-idiom generation module. Since no curated visual idiom corpus exists, we synthesize high-fidelity images with a prompt-driven pipeline: gold-standard idiom explanations feed DALL·E 3 [20] via GPT-4o prompts [21]. Each prompt is crafted to preserve both figurative meaning and cultural nuance.
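For reference, the reported agreement statistic follows the standard two-annotator Cohen's kappa computation. The sketch below shows that calculation on made-up IPR-style ratings (0-5 aspects retained); the data is illustrative only, not the paper's.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators rating the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy IPR-style scores (aspects retained out of five) for two annotators.
a = [5, 4, 5, 3, 5, 4, 2, 5, 4, 5]
b = [5, 4, 5, 3, 4, 4, 2, 5, 4, 5]
print(round(cohens_kappa(a, b), 2))  # prints 0.85
```

Values above roughly 0.8 are conventionally read as "substantial to almost perfect" agreement, which is how the paper characterizes its 0.82.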
Figure 2b illustrates the refinement process: (i) for the Hindi idiom sone pe suhaga ("icing on the cake"), literal visuals were replaced with metaphorically aligned depictions; (ii) for the Bengali idiom aage bagh, pichhe kumir, only frames conveying an inescapable dilemma were retained; and (iii) for the Hindi idiom nau do gyarah hona, arithmetic literalizations (e.g., "9+2=11") were discarded. In such cases, annotators refined prompts with contextual cues and regenerated images to ensure cultural fidelity and semantic alignment.
IV. METHODOLOGY
Given an idiom pair $(x_t, x_i)$, where $x_t$ is the textual representation and $x_i$ is the corresponding image, this study aims to develop a multimodal idiom interpretation framework leveraging LLMs and VLMs. The LLM-based model $P_\theta(y_t \mid x_t)$ is trained via supervised learning to generate textual idiom interpretations, while the VLM-based model $P_\phi(y_t \mid x_i)$ learns idiom meanings from images using supervised learning without preference optimization. The dataset consists of text-meaning pairs $D_t = \{(x_{t_j}, y_j)\}_{j=1}^{N}$ and image-meaning pairs $D_i = \{(x_{i_j}, y_j)\}_{j=1}^{N}$, where $N$ is the number of idioms. To achieve our objective, the first phase centers on idiom explanation, leveraging both LLMs and VLMs to capture nuanced semantic and visual cues. The second phase employs the hinting database embedded in the HIDE framework to iteratively refine model performance and enhance interpretability. Figure 3 depicts the overall methodology.
A. Idiom Explanation Generation
We fine-tuned Gemma2-9B [22] via supervised learning on idiom-explanation pairs $(\mathcal{X}(\tau), \mathcal{Y})$, where $\mathcal{X}(\tau)$ represents a figurative idiom in textual form and $\mathcal{Y}$ denotes a corresponding human-curated explanation. Following fine-tuning, Gemma2-9B underwent human evaluation to assess outputs for semantic precision and sociocultural alignment. In parallel, to enable visual idiom interpretation, we fine-tuned a pre-trained VLM, Paligemma2-10B [23], denoted as $P_\phi(\mathcal{Y} \mid \mathcal{X}(\nu))$, where $\mathcal{X}(\nu)$ is a visual input associated with an idiom (e.g., metaphorical illustrations or idiom-linked scenes) and $\mathcal{Y}$ is the textual explanation predicted by the model. However, post-fine-tuning analysis revealed multiple instances of generative misinterpretation. For instance, the Bengali idiom mishti mukhe jutor bari (literally, "hitting with a shoe while smiling sweetly") was misinterpreted by the system as commentary on conflicting emotional expressions or unusual behavior involving footwear. This interpretation failed to capture the idiom's actual implication: the act of harming or insulting someone with deceptive politeness or a pretence of affection. Among a held-out evaluation set of 600 idiomatic samples, 360 instances (60%) exhibited comparable inaccuracies, often resulting from insufficient grounding in sociocultural nuance or an over-literal mapping from visual or lexical features. A granular inspection of the models' outputs revealed that their inferential reasoning trajectories frequently diverge from the gold standard at the very onset of generation. This early misalignment necessitates controlled nudges that systematically steer the reasoning away from known pitfalls. To address this, we integrated the HIDE mechanism: idiomatic explanations generated by the fine-tuned LLM ($\mathcal{Y}_{LLM}$) are passed into the HIDE framework, whose verifier-guided refinement loop enforces canonical idiom alignment and cultural fidelity, substantially boosting the semantic and contextual accuracy of the language models.
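Supervised fine-tuning of $P_\theta(y_t \mid x_t)$ minimizes the token-level cross-entropy of the gold explanation given the idiom. As a minimal, framework-free sketch of that objective (the four-token vocabulary and logits below are invented for illustration; any training library computes the same quantity):

```python
import math

def token_cross_entropy(logits_per_step, gold_ids):
    """Average negative log-likelihood of the gold explanation tokens.

    logits_per_step: one list of per-vocabulary scores per target token.
    gold_ids: index of the gold token at each step.
    """
    total = 0.0
    for logits, gold in zip(logits_per_step, gold_ids):
        m = max(logits)                      # numerically stable log-softmax
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[gold]        # -log p(gold token | prefix)
    return total / len(gold_ids)

# Tiny 4-token vocabulary, 3-step gold explanation (illustrative only).
logits = [[2.0, 0.1, -1.0, 0.3], [0.2, 3.1, 0.0, -0.5], [1.0, 1.0, 2.5, 0.0]]
gold = [0, 1, 2]
loss = token_cross_entropy(logits, gold)
assert loss > 0  # confident-and-correct predictions drive this toward 0
```

The same objective applies to the VLM case, with the image encoding replacing the textual idiom as conditioning input.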
[Figure 3 contents]
Idiom image generation phase: for the Hindi idiom rai ka parvat ya pahaad hona, GPT-4o first produces the original idiom description: "The Hindi idiom translates to 'making a mountain out of a molehill' in English. It means to exaggerate a small issue or problem, making it seem much larger or more significant than it actually is. For an image representation, you could depict a tiny molehill being magnified into a towering mountain, illustrating the concept of blowing things out of proportion." This is expanded into a detailed briefing prompt for DALL-E 3 (a small molehill in the foreground with tiny ants and grass tufts for scale, a towering snow-capped mountain in the background, and a magnifying glass in the middle ground that visually enlarges the molehill to symbolize exaggeration, under a clear blue sky), which generates the image.

HIDE construction by GPT-4o during training. System prompt: "You are an expert in Hindi idioms. You will be given a Hindi idiom along with an incorrect translation and a correct translation. You will also be given an incorrect description of the same idiom along with a correct description. You are supposed to analyze the difference between the correct and incorrect description, and summarize that into a hint which the model can use in the future for better translation and description. Make sure that the hint does not contain any specific details of the idiom, but rather addresses the model's shortcoming in general in regards to this particular case. Only output the summarized hint." User prompt: "Original idiom: {original}, Incorrect Translation: {model translation}, Correct Translation: {original translation}, Incorrect description: {model description}, Correct description: {original description}." Example generated hint: "Always verify the idiom's core figurative function (e.g., exaggeration vs. actual distress) rather than relying on surface wording."

Inference with hint during testing, M(Prompt(x, h)). System prompt: "You are an expert in idioms and you have to translate this idiom and return a description of what it means and is trying to convey." Prompt: "Use the provided hint to describe this Hindi idiom and based on this description explain what it means. Hint: {hint}. Idiom: {idiom}." Without the hint, the model misreads the idiom as describing someone or something extremely tall or impressive; with the hint, it correctly explains rai ka parvat banana as "making a mountain out of a molehill," i.e., exaggerating a minor issue, as in an image where a small anthill is magnified to look like a mountain while the background mountains highlight the contrast between real and imagined problems.

Fig. 3. Architectural framework for idiom explanation that fuses LLMs and VLMs, augmented by a HIDE module with Hint Generation
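The HIDE-construction request shown in Figure 3 can be assembled mechanically from one logged error record. The sketch below is a hypothetical illustration: the field names (`idiom`, `model_translation`, and so on) are invented, the system prompt is abridged from the figure, and actually sending the messages to GPT-4o is out of scope.

```python
# Hypothetical sketch: build the hint-generation request from one logged
# error record, following the template in Figure 3. Field names are invented.
HINT_SYSTEM_PROMPT = (
    "You are an expert in Hindi idioms. You will be given a Hindi idiom along "
    "with an incorrect translation and a correct translation. ... Only output "
    "the summarized hint."
)

def build_hint_request(error):
    """Format the chat messages from one error record."""
    user = (
        f"Original idiom: {error['idiom']}, "
        f"Incorrect Translation: {error['model_translation']}, "
        f"Correct Translation: {error['gold_translation']}, "
        f"Incorrect description: {error['model_description']}, "
        f"Correct description: {error['gold_description']}"
    )
    return [{"role": "system", "content": HINT_SYSTEM_PROMPT},
            {"role": "user", "content": user}]

record = {
    "idiom": "rai ka parvat ya pahaad hona",
    "model_translation": "to be extremely tall",
    "gold_translation": "making a mountain out of a molehill",
    "model_description": "describes someone very tall or impressive",
    "gold_description": "exaggerating a minor issue out of proportion",
}
messages = build_hint_request(record)
assert messages[1]["content"].startswith("Original idiom:")
```

The returned hint is deliberately generic (no idiom-specific details), so it can be reused for structurally similar failures retrieved later.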
B. HIDE construction inspired by EFL
To construct hint-embedded idiomatic explanations, each incorrectly handled idiom is ingested after inference as the quintuple $\langle x_i, \tilde{T}_i, \tilde{E}_i, T_i, E_i \rangle$, where $x_i$ is the idiom string, $(\tilde{T}_i, \tilde{E}_i)$ are the model-generated translation and explanation, and $(T_i, E_i)$ are their human-annotated gold counterparts. A discriminator compares $\tilde{E}_i$ with $E_i$ and compresses the discrepancy into a high-level corrective hint $h_i = \Psi(\tilde{E}_i, E_i)$.
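The full HIDE loop of this subsection (distill a hint per error, archive it, retrieve the nearest past error at test time, and prepend its hint) can be sketched compactly. The bag-of-words encoder below is a deliberately simple stand-in for the semantic encoder, and all idioms and hints are illustrative; the paper's actual encoder and repository contents are not specified here.

```python
import math

def embed(text, d=32):
    """Stand-in semantic encoder f: X -> R^d (deterministic hashed bag of words).
    Illustrative only; a learned sentence encoder would be used in practice."""
    vec = [0.0] * d
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % d] += 1.0
    return vec

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

class ErrorFeedbackRepository:
    """Simplified Idiomatic Error-Feedback Repository of (embedding, hint) entries."""
    def __init__(self):
        self.entries = []

    def add(self, idiom, hint):
        # Archive the embedded idiom with its corrective hint for reuse.
        self.entries.append((embed(idiom), hint))

    def augmented_prompt(self, idiom):
        """Prompt(x, h_j) = x || h_j, with h_j taken from the most similar past error."""
        z = embed(idiom)
        _, hint = max(self.entries, key=lambda e: cosine(z, e[0]))
        return f"{idiom}\nHint: {hint}"

repo = ErrorFeedbackRepository()
repo.add("rai ka parvat hona", "Check the figurative function, not surface wording.")
repo.add("angur fol tok", "Consider denial-driven rationalization, not literal fruit.")
prompt = repo.augmented_prompt("rai ka parvat banana")
assert "figurative" in prompt
```

The augmented prompt is then passed to the generation model; only the retrieval scaffolding is shown here, since the model call itself is external.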
sted as the quintuple â¨xi, ËTi, ËEi, Ti, EiâŠ, where xi is the idiom string, ( ËTi, ËEi) are the model-generated translation and explanation, and (Ti, Ei) are their human-annotated gold counterparts. A discriminator compares ËEi with Ei and compresses the discrepancy into a high-level corrective hint hi = Ī( ËEi, Ei ). The idiom is then embedded by a semantic encoder f : X â Rd, producing zi = f (xi) where d is the embedding dimensional. The tuple â¨zi, hi, ËTi, ËEi⊠is archived in an Idiomatic Error-Feedback Repository (denoted H) for future reuse. At the secondary inference time, the system treats a test idiom x as a query: it computes z = f (x) and retrieves the repository entry j = arg maxk cos(z, zk ) that is most similar in embedding space. The accompanying hint hj is concatenated with x to form an augmented prompt, Prompt(x, hj ) = x âĨ hj , which is passed to the generation model M to yield the final translationâexplanation pair Ëy = ( ËT , ËE) = M(Prompt(x, hj )). Injecting this retrieved-error context exposes the model to a concrete, previously mis-handled scenario that is semantically close to the current input, steering its reasoning away from known pitfalls while preserving the original flow. V. EXPERIMENTAL RESULTS AND DISCUSSION This section details the experimental protocol, baseline settings, and a direct comparative analysis of LLMs and VLMs, complemented by qualitative error diagnostics that expose interpretive limitations and open research challenges. Our study addresses three focal Research Questions (RQs): RQ1: Idiomatic Competence. How effectively do LLMs and VLMs internalize and interpret culturally grounded, metaphor- rich expressions? RQ2: HIDE+EFL Impact. How effectively does HIDE enhance idiom understanding and contextual reasoning in LLMs without degrading overall performance? RQ3: Generalizability and Implications. How robustly does the dataset generalize across model families, and what broader socio-linguistic insights does it
Chunk 15 ¡ 1,996 chars
ich expressions? RQ2: HIDE+EFL Impact. How effectively does HIDE enhance idiom understanding and contextual reasoning in LLMs without degrading overall performance? RQ3: Generalizability and Implications. How robustly does the dataset generalize across model families, and what broader socio-linguistic insights does it reveal?

Models were fine-tuned for 5 epochs with a learning rate of 1 × 10⁻⁵ on an NVIDIA A100 (80 GB) GPU, using the AdamW optimizer. Training employed a batch size of 8 with two-step gradient accumulation and native Automatic Mixed Precision (AMP) for memory-efficient mixed-precision execution. A linear learning-rate scheduler was applied to ensure stable convergence, while Top-K sampling (K=10) encouraged generative diversity. The dataset was split into 80% training and 20% testing. The model achieved a final training cross-entropy loss of 2.1986.

A comparative evaluation was performed across a range of prominent LLMs and VLMs, including GPT-3.5 [24], Mistral-7B [25], LLaMA2-7B [26], Blip2-7B [27], Qwen2-VL-7B-Instruct [28], SmolVLM-Instruct [29], and Video-LLaVA-7B [30]. Model outputs were benchmarked using an extensive suite of evaluation metrics capturing lexical overlap, semantic alignment, and distributional similarity, as reported in Table II.

Table II. Performance variance across baseline LLM and VLM configurations. R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L; B-1, B-2, B-3, and B-L denote BLEU-1, BLEU-2, BLEU-3, and BLEU-L; BS denotes BERTScore. MS, CD, JSD, L2, L1, PS, and FRS represent METEOR, cosine distance, Jensen–Shannon divergence, L2 (Euclidean) distance, L1 (Manhattan) distance, perplexity score, and Flesch–Kincaid readability score. Boldface indicates the best results; ↑ higher is better, ↓ lower is better.

| Type | Model | R-1↑ | R-2↑ | R-L↑ | B-1↑ | B-2↑ | B-3↑ | B-L↑ | BS↑ | MS↑ | CD↓ | JSD↓ | L2↓ | L1↓ | PS↓ | FRS↑ |
|------|-------|------|------|------|------|------|------|------|-----|-----|-----|------|-----|-----|-----|------|
| LLM | LLaMA2-7B | 0.42 | 0.18 | 0.34 | 0.24 | 0.16 | 0.13 | 0.21 | 0.67 | 0.37 | 0.39 | 0.58 | 12.17 | 98.95 | 74.37 | 58.11 |
| LLM | Mistral-7B | 0.47 | 0.24 | 0.41 | 0.27 | 0.24 | 0.20 | 0.27 | 0.73 | 0.45 | 0.34 | 0.53 | 9.82 | 78.91 | 70.11 | 38.05 |
| LLM | GPT-3.5 | 0.51 | 0.25 | 0.42 | 0.31 | 0.24 | 0.20 | 0.27 | 0.75 | 0.46 | 0.31 | 0.54 | 10.14 | 49.73 | 68.31 | 48.91 |
| LLM | Gemma2-9B | 0.65 | 0.55 | 0.52 | 0.51 | 0.46 | 0.44 | 0.55 | 0.79 | 0.55 | 0.22 | 0.39 | 7.19 | 44.13 | 57.69 | 67.15 |
| LLM | LLaMA2-7B+HIDE | 0.42 | 0.18 | 0.35 | 0.24 | 0.18 | 0.14 | 0.21 | 0.69 | 0.38 | 0.38 | 0.57 | 11.93 | 96.02 | 73.23 | 59.91 |
| LLM | Mistral-7B+HIDE | 0.48 | 0.24 | 0.42 | 0.31 | 0.25 | 0.21 | 0.29 | 0.74 | 0.46 | 0.32 | 0.53 | 9.76 | 76.56 | 68.71 | 39.23 |
| LLM | Gemma2-9B+HIDE | **0.68** | **0.56** | **0.58** | **0.53** | **0.50** | **0.48** | **0.57** | **0.81** | **0.56** | **0.21** | **0.38** | **7.01** | **43.71** | **56.81** | **69.23** |
| VLM | Blip2-7B | 0.15 | 0.03 | 0.12 | 0.18 | 0.08 | 0.06 | 0.02 | 0.47 | 0.05 | 0.58 | 0.70 | 20.34 | **148.98** | 175.83 | **74.54** |
| VLM | Qwen2-VL-7B-Instruct | 0.36 | 0.09 | 0.22 | 0.36 | 0.18 | 0.11 | 0.06 | 0.60 | 0.21 | 0.28 | 0.60 | 18.08 | 185.43 | 109.27 | 51.46 |
| VLM | SmolVLM-Instruct | 0.37 | 0.11 | 0.22 | 0.33 | 0.18 | 0.11 | 0.06 | 0.62 | 0.25 | 0.25 | 0.57 | 29.47 | 249.61 | 114.41 | 51.82 |
| VLM | Video-LLaVA-7B | 0.38 | 0.10 | 0.24 | 0.37 | 0.19 | 0.11 | 0.06 | 0.62 | 0.22 | 0.32 | 0.59 | 18.20 | 175.66 | 131.72 | 50.64 |
| VLM | Paligemma2-10B | **0.43** | **0.14** | **0.27** | **0.41** | 0.24 | 0.15 | 0.09 | **0.66** | **0.28** | **0.24** | **0.56** | 18.97 | 185.90 | 86.09 | 52.30 |

A. Resultant Discussion

This section synthesizes the findings in response to the stated RQs, supported by qualitative insights and error analyses across model generations.

1) Response to RQ1: Idiomatic competence across LLMs and VLMs: Table II highlights a clear performance divide between text-only LLMs and VLMs on culturally dense idioms. Even without error-driven refinement, Gemma2-9B leads across all metrics, substantially outperforming the strongest VLM, Paligemma2-10B. Distance-based measures further confirm this gap, with lower cosine distance and Jensen–Shannon divergence for LLMs, indicating tighter semantic
alignment with figurative meanings. Superior readability and lower perplexity likewise favor LLMs, underscoring that large-scale textual pretraining remains more effective than current multimodal grounding for internalizing culturally encoded idiomatic semantics in Hindi, Bengali, and Thai.

2) Response to RQ2: HIDE+EFL gains in idiom comprehension: Injecting Error-Feedback Learning (via the HIDE retriever) converts past errors into micro-lessons that significantly sharpen idiom handling. For LLMs, Gemma2-9B + HIDE raises ROUGE-1/2/L, trims cosine distance to 0.21, and cuts L2/L1 errors while lifting readability (FRS ≈ 69) and lowering perplexity. Mistral-7B and LLaMA2-7B record smaller yet consistent gains, accompanied by reduced perplexity, showing that even lightweight hints nudge mid-sized models toward the correct figurative space.

3) Response to RQ3: Dataset Impact: The Mediom corpus provides a multimodal benchmark for idiomatic comprehension in Hindi, Bengali, and Thai by coupling high-resolution images with expert-curated explanations. This visual grounding sharpens cross-lingual transfer (e.g., idiom translation) and equips conversational agents with culturally aligned reasoning. In education, image-anchored idioms turn abstract metaphors into concrete, engaging content, boosting learner retention and motivation.

B. Analytical Discussion

1) Human Evaluation: Three native linguists evaluated 300 idiom instances across five dimensions: literal accuracy, contextual fit, usage naturalness, cultural depth, and overall coherence.

Fig. 4. Head-to-head qualitative and error analysis. (a) Qualitative analysis of the Bengali idiom আগে বাঘ, পিছে কুমির (Aage bagh, pichhe kumir). Literal: A tiger in front, a crocodile behind. Metaphorical: Trapped between two equally dangerous situations. (b) Error analysis of the Hindi idiom नाच न जाने आँगन टेढ़ा (Naach na jaane angan tedha). Literal: One who cannot dance blames the courtyard. Metaphorical: Blaming external factors for personal failure.

Fine-tuned LLMs (e.g., GPT-3.5, Gemma2-9B) and VLMs (e.g., Paligemma2-10B, Qwen2-VL-7B-Instruct) excel in literal translation and contextual interpretation, reflecting strong linguistic priors, but initially lag in cultural significance and coherence. Incorporating HIDE yields substantial gains, particularly for LLMs such as Gemma2-9B and LLaMA2-7B, by converting prior reasoning errors into structured semantic cues, thereby improving metaphor comprehension, cultural alignment, and usage coherence. While HIDE-enabled LLMs achieve the most consistent idiomatic reasoning, only fine-tuned VLMs exhibit marked improvements in visual–semantic alignment and contextual depth, narrowing the unimodal–multimodal performance gap.

2) Qualitative Analysis: Figure 4a illustrates qualitative differences in Bengali idiom interpretation. Gemma2-9B augmented with HIDE delivers the most nuanced and context-aware explanation, effectively capturing the idiom's core dilemma of simultaneous threats. Mistral-7B and LLaMA2-7B provide competent but less granular interpretations.
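The distance-based measures reported in Table II can be made concrete. Below is a minimal, self-contained sketch of cosine distance (CD) and Jensen–Shannon divergence (JSD) over toy vectors standing in for a model-explanation embedding and a gold-explanation embedding; the function names and inputs are illustrative, not the paper's evaluation code.

```python
import math

def cosine_distance(a, b):
    # CD in Table II: 1 - cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def js_divergence(p, q):
    # JSD in Table II: symmetrized KL divergence between two probability
    # distributions (natural log; the maximum value is ln 2)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(x, y):
        return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy stand-ins for a predicted-explanation and a gold-explanation embedding.
pred, gold = [0.9, 0.1, 0.4], [1.0, 0.0, 0.5]
cd = cosine_distance(pred, gold)
jsd = js_divergence([0.7, 0.3], [0.5, 0.5])
```

Lower values of both measures indicate that model outputs sit closer to the gold figurative explanations, which is how the gap between LLMs and VLMs is read off in RQ1.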
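To make the hinting mechanism concrete, here is a hypothetical sketch of error-feedback hinting in the spirit of HIDE: past misinterpretations are stored as "micro-lessons" and retrieved to prefix later prompts. All names, the prompt format, and the storage scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical error-feedback store: idiom -> micro-lessons distilled from
# earlier mistakes (illustrates HIDE's hinting idea, not the actual code).
error_memory = {}

def record_error(idiom, wrong_reading, correct_sense):
    # Turn a past mistake into a reusable hint ("micro-lesson").
    hint = (f"Earlier attempts misread this as '{wrong_reading}'; "
            f"the figurative sense is '{correct_sense}'.")
    error_memory.setdefault(idiom, []).append(hint)

def build_hinted_prompt(idiom, context, max_hints=3):
    # Retrieve up to max_hints recent micro-lessons and prepend them.
    hints = error_memory.get(idiom, [])[-max_hints:]
    hint_block = "\n".join(f"Hint: {h}" for h in hints)
    task = f"Explain the idiom '{idiom}' as used here: {context}"
    return f"{hint_block}\n{task}" if hints else task

record_error("angur fol tok", "literal sour grapes",
             "denial-driven rationalization of an unattainable goal")
prompt = build_hinted_prompt("angur fol tok",
                             "He claims he never wanted the award anyway.")
```

The design choice mirrored here is that feedback is injected at the prompt level rather than by retraining, which is why even mid-sized models can benefit from "lightweight hints" without degrading overall performance.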
In contrast, multimodal models such as Video-LLaVA-7B enhance comprehension through vivid visualizations of danger and tension, while BLIP-2, Qwen2-VL-7B-Instruct, Paligemma2-10B, and SmolVLM-Instruct reinforce the concept visually but fall short in figurative depth. Overall, HIDE-equipped Gemma2-9B best captures idiomatic semantics, whereas VLMs primarily excel in visual grounding.

3) Error Analysis: Figure 4b underscores a clear modality gap. LLMs fine-tuned with HIDE (Gemma2-9B, Mistral-7B, LLaMA2-7B) reliably captured the core metaphor (blaming the courtyard for one's own lack of skill), framing it within themes of accountability and self-reflection. GPT-3.5, however, slipped into over-literal narration, anchoring on irrelevant spatial details and losing the abstraction. In sum, HIDE equips LLMs to internalize cultural metaphor, whereas current VLMs struggle to transcend literal scene parsing.

VI. CONCLUSION

In this work, we introduce Mediom, a multimodal dataset containing 3,533 idioms across Hindi, Thai, and Bengali, paired with gold-standard human explanations and aligned visual representations. Our analysis reveals that HIDE, inspired by EFL fine-tuning, slightly enhances lexical and contextual reasoning in LLMs (LLaMA2-7B, Gemma2-9B, Mistral-7B), facilitating nuanced idiomatic interpretations. Moreover, VLMs (Paligemma2-10B) effectively leverage visual cues to capture cultural subtleties. Mediom thus establishes a benchmark for evaluating culturally and figuratively aware AI models, advancing idiomatic comprehension through an integrated textual, visual, and cultural framework.

REFERENCES

[1] Richard P. Honeck, A Proverb in Mind: The Cognitive Science of Proverbial Wit and Wisdom, Lawrence Erlbaum Associates, Mahwah, NJ, USA, 1997.
[2] Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, and Timothy Baldwin, "Large language models are
human-like internally," Trans. Assoc. Comput. Linguist., vol. 13, pp. 1743–1766, 2025.
[3] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, 2022, vol. 35, pp. 27730–27744.
[4] Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin, "WildVision: Evaluating vision-language models in the wild with human preferences," in Advances in Neural Information Processing Systems, 2024, vol. 38.
[5] Francesca De Luca Fornaciari, Begoña Altuna, Itziar Gonzalez-Dios, and Maite Melero, "A hard nut to crack: Idiom detection with conversational large language models," in The Workshop on Figurative Language Processing, 2024, pp. 35–44.
[6] Hessel Haagsma, Johan Bos, and Malvina Nissim, "MAGPIE: A large corpus of potentially idiomatic expressions," in The Language Resources and Evaluation Conference, 2020, pp. 279–287.
[7] Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, et al., "Critic-V: VLM critics help catch VLM errors in multimodal reasoning," in The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[8] Jiali Zeng, Linfeng Song, Jinsong Su, Jun Xie, Wei Song, and Jiebo Luo, "Neural simile recognition with cyclic multitask learning and local attention," in The AAAI Conference on Artificial Intelligence, 2020, pp. 9515–9522.
[9] Tuhin Chakrabarty, Xurui Zhang, Smaranda Muresan, and Nanyun Peng, "MERMAID: Metaphor generation with symbolism and discriminative decoding," in The Conference of the North American Chapter of the ACL, 2021, pp. 4250–4261.
[10] Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme, "Collecting diverse natural language inference problems for sentence representation evaluation," in The Conference on Empirical Methods in Natural Language Processing, 2018, pp. 67–81.
[11] Hanbit Lee, Yeonchan Ahn, Haejun Lee, Seungdo Ha, and Sang-goo Lee, "Quote recommendation in dialogue using deep neural network," in The International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 957–960.
[12] Lingzhi Wang, Jing Li, Xingshan Zeng, Haisong Zhang, and Kam-Fai Wong, "Continuity of topic, interaction, and query: Learning to quote in online conversations," in The Conference on Empirical Methods in Natural Language Processing, 2020, pp. 6640–6650.
[13] Ruiyang Qin, Haozheng Luo, Zheheng Fan, and Ziang Ren, "IBERT: Idiom cloze-style reading comprehension with attention," arXiv preprint arXiv:2112.02994, 2021.
[14] Tosin Adewumi, Foteini Liwicki, and Marcus Liwicki, "Vector representations of idioms in conversational systems," Sci, vol. 4, no. 4, p. 37, 2022.
[15] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi, "COMET: Commonsense transformers for automatic knowledge graph construction," in The Annual Meeting of the ACL, 2019, pp. 4762–4779.
[16] Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann, "SemEval-2013 task 5: Evaluating phrasal semantics," in The International Workshop on Semantic Evaluation, 2013, pp. 39–47.
[17] Tuhin Chakrabarty, Arkadiy Saakyan, Debanjan Ghosh, and Smaranda Muresan, "FLUTE: Figurative language understanding through textual explanations," in The Conference on Empirical Methods in Natural Language Processing, 2022, pp. 7139–7159.
[18] Arkadiy Saakyan, Shreyas Kulkarni, Tuhin Chakrabarty, and Smaranda Muresan, "Understanding figurative meaning through explainable visual entailment," in The Conference of the Nations of the Americas Chapter of the ACL, 2025, pp. 1–23.
[19] Ekarat Udomporn, 5000 Thai Idioms: From the Past Right on up to Now!, P.S. Pattana Publishing, 2014.
[20] OpenAI, "Improving image generation with better captions," Tech. Rep., OpenAI, 2023.
[21] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.
[22] Gemma Team, "Gemma," Kaggle, 2024.
[23] Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, et al., "PaliGemma 2: A family of versatile VLMs for transfer," arXiv preprint arXiv:2412.03555, 2024.
[24] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877–1901.
[25] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al., "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in The International Conference on Machine Learning, 2023, vol. 202, pp. 19730–19742.
[28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, et al., "Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution," arXiv preprint arXiv:2409.12191, 2024.
[29] Hugging Face, "SmolVLM-500M-Base," https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Base, 2025.
[30] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, et al., "LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment," in The International Conference on Learning Representations, 2024.