When Meaning Isn't Literal: Exploring Idiomatic Meaning Across Languages and Modalities
1st Sarmistha Das†, 2nd Shreyas Guha†, 3rd Suvrayan Bandyopadhyay†
Department of Computer Science and Engineering
Indian Institute of Technology Patna
Patna, India
{sarmistha_2221cs21, 2201cb58_shreyas, suvrayan_2301cs89}@iitp.ac.in
4th Salisa Phosit, 5th Kitsuchart Pasupa*
School of Information Technology
King Mongkut's Institute of Technology Ladkrabang
Bangkok, Thailand
67076055@kmitl.ac.th, *kitsuchart@it.kmitl.ac.th
6th Sriparna Saha
Department of Computer Science and Engineering
Indian Institute of Technology Patna
Patna, India
sriparna@iitp.ac.in
Abstract: Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. Consider the Bengali idiom angur fol tok ("grapes are sour"): it encodes denial-driven rationalization, yet naïve models latch onto the literal fox-and-grapes imagery. Addressing this oversight, we present "Mediom," a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text-image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose "HIDE," a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems¹.
Index Terms: Idioms, Multimodal, Multilingual, LLMs, VLMs, HIDE, EFL
I. INTRODUCTION
Idioms embody figurative meanings that extend beyond their literal composition, conveying context-dependent semantics that are central to natural discourse. For instance, the Hindi idiom aasmaan se gire, khajoor mein atke (literally, "fallen from the sky, stuck in a date palm") metaphorically denotes escaping one difficulty only to become trapped in another, often equally severe. Metaphorical competence, the human ability to fluidly traverse literal and figurative meanings, underpins idiom comprehension. Yet, as Honeck et al. [1] demonstrate, such interpretations remain inherently context-dependent and subjective. Idioms, as crystallized metaphors, encapsulate cultural knowledge and linguistic economy. Recent advances demonstrate that Large Language Models (LLMs) achieve near-human performance across core linguistic tasks [2], [3], while Vision-Language Models (VLMs) further enhance reasoning by jointly modeling text and vision, yielding substantial gains in multimodal understanding such as visual question answering [4]. However, idioms present a distinct challenge: their meanings cannot be inferred from the literal definitions of individual words but instead depend on cultural context and conventional usage for proper understanding [5]. Idioms crystallize cultural knowledge, so accurate interpretation demands both linguistic and contextual fluency. For example, the Bengali idiom baro mashe tero parbon ("thirteen festivals in twelve months") evokes a uniquely celebratory mindset, while the multilingual counterpart "walls have ears," attested in Hindi, Bengali, and Thai alike, shows that shared metaphors transmit communal wisdom. Yet idiom research remains English-centric [6], leaving Hindi, Bengali, Thai, and vision-grounded understanding largely unexplored. To fill this void, we release the first multimodal, cross-cultural idiom corpus, pairing expert-annotated text with contextual images. Additionally, we provide a hint-embedded idiom reasoning database that supplies interpretive meanings, symbolic meanings, and context-specific cues. This supports a Hinting-based Idiom Explanation (HIDE) loop inspired by Error Feedback Learning (EFL) [7], which helps correct generation errors during idiom interpretation, as demonstrated in Figure 1.

† Equal Contribution. * Corresponding Author.
¹ Resources are available at https://github.com/sarmistha-D/Hide.
The research objectives of the current work are as follows:
(i) Evaluate the extent to which LLMs and VLMs encode
culturally grounded knowledge and its impact on idiomatic
inference accuracy. (ii) Analyze the contribution of the HIDE
paradigm in enhancing fine-grained inferential reasoning over
culturally nuanced idioms. (iii) Assess cross-model generalizability by benchmarking dataset performance across diverse LLM and VLM architectures.
arXiv:2604.10787v1 [cs.CL] 12 Apr 2026
[Figure 1 contents]
Input idiom: angur fol tok ("Grapes are sour"): when someone cannot attain something they desire, they pretend it was not worth having in the first place.
Language Model (initial output): "The <image/idiom> depicts a fox gazing angrily at grapes hanging out of reach, symbolizing frustration and denial."
EFL Database entry: Original idiom: {original}, Incorrect Translation: {model translation}, Correct Translation: {original translation}, Incorrect description: {model description}, Correct description: {original description}.
Generated Hint (issued when the understanding is judged inaccurate): "Try to consider whether the scene might represent a broader idea or emotion rather than describing it literally. Some images may use characters and actions as symbols for common human behaviours or cultural sayings."
Generated Understanding (after the hint): "The <image/idiom> depicts a fox gazing angrily at grapes hanging out of reach, symbolizing frustration and denial. It reflects the moral of 'sour grapes': dismissing what one cannot have by pretending it was undesirable."
Fig. 1. Idiomatic Understanding via HIDE inspired by EFL: The model first interprets the idiom, detects semantic errors, and uses targeted hints to refine and improve idiomatic comprehension.
Our contributions towards the research community are as follows: (i) We introduce "Mediom," the first multimodal idiom benchmark for the low-resource languages Hindi, Bengali, and Thai, comprising 3,533 idioms with fine-grained annotations and rigorously validated human explanations. (ii) We propose "HIDE," a hint-driven idiomatic explanation framework inspired by EFL. (iii) We conduct two complementary evaluations: idiomatic understanding with LLMs and multimodal idiom interpretation with VLMs.
II. BACKGROUND
Recent natural language processing research has extensively explored figurative language comprehension and generation, including simile detection [8], metaphor identification [9], pun recognition [10], and idiom retrieval [11]. Idiomatic expression modeling is closely related to idiom generation, contextual quotation recommendation, and literary text retrieval [12]. Subsequently, Qin et al. [13] developed a BERT-based model that encodes idiomatic expressions within both global and local contexts to improve the explanation of non-compositional meanings. Modern neural text generation models have significantly improved the contextual adaptation of idioms [14], with commonsense reasoning frameworks further enhancing figurative language understanding [15]. HIDE advances this line of work by logging the model's past missteps and feeding them back as corrective context, iteratively tightening reasoning accuracy [7]. Despite growing interest in explanation strategies, their application to idiomatic understanding, particularly regarding the retention of cultural nuance and metaphorical coherence, remains largely unexplored. To address the clear absence of idiom-centric resources for low-resource languages across both LLMs and VLMs, exposed in Table I, our research introduces a retrieval-driven HIDE loop inspired by EFL that maintains a memory of past generation errors, retrieves structurally similar failures at inference time, and injects corrective semantic cues into subsequent prompts.

Table I
COMPARATIVE OVERVIEW OF THE PROPOSED MEDIOM AGAINST LEADING RESOURCES. ABBREVIATIONS: E = ENGLISH; ML = MULTILINGUAL SET (THAI, HINDI, BENGALI); MULTIMODALITY = PAIRED TEXT-IMAGE RESOURCES.

Corpus Name       | Count | Language | Explanations | Multimodality | Idiomatic Part
SemEval-2013 [16] | 4,350 | E        | ✗            | ✗             | ✗
FLUTE [17]        | 8,962 | E        | ✓            | ✗             | ✓ (497)
V-FLUTE [18]      | 6,027 | E        | ✓            | ✓             | ✓ (370)
Mediom (Proposed) | 3,533 | ML       | ✓            | ✓             | 3,533

[Figure 2 contents]
The Bengali idiom gadha gadhai, se jodi rajar-o hoy ("A donkey is a donkey, even if it is the king's") illustrates that someone's fundamental nature cannot be changed by an upgrade in status or wealth; a foolish or inept person remains so, even with privileges or high position.
The Thai idiom khang-nok suk-sai, khang-nai bpen phong ("Outside is shiny, inside is hollow") describes people or things that appear attractive or impressive externally but lack real worth, honesty, or depth inside, highlighting the importance of looking beyond appearances.
A. Dafa hona; literal meaning: to disappear or to be gone; metaphorically, it implies dismissal or rejection.
B. Sone pe suhaga; literally "a wedding ornament on gold"; metaphorically, an added blessing that makes a good situation even better.
C. Aage bagh, pichhe kumir; literal meaning: a tiger in front, a crocodile behind; metaphorically, a situation where one is trapped between two equally dangerous problems, with no easy way out.
D. Nau do gyarah hona; literally "to be at nine and eleven"; refers to quickly escaping or disappearing from a situation, usually to avoid trouble, responsibility, or confrontation.
Fig. 2. (a) Sample instances of our proposed Mediom dataset; (b) Two different candidate images for the same idiom in the Mediom dataset.

III. CORPUS FORMULATION
A. Data Collection
Inspired by [6], we initially compiled a dataset of 3,500 idioms from diverse linguistic and cultural backgrounds, sourcing them from online repositories, literature, and cultural archives, including Hindi idioms from The Simple Help², Bengali idioms from Bangla Probad³, and Thai idioms [19]. Our selection was designed to capture syntactic diversity across idioms in Hindi, Bengali, and Thai, including fixed expressions, such as nam thuam pak ("unable to speak out"); verb-object structures, such as naak ragadna ("to plead intensely"); adjective-noun combinations, such as mishti swapna ("sweet dreams"); prepositional phrases, such as khao hu sai thalu hu khwa ("in one ear and out the other"); and binomial pairs, such as ulta seedha ("all sorts of nonsense"). For syntactically flexible idioms, normalization was performed by unifying verb inflections and replacing person-specific pronouns with neutral counterparts. However, structurally rigid idioms were preserved in their original form to maintain cultural and linguistic authenticity.
Following rigorous curation, 3,533 idioms were retained to ensure diversity and quality; representative samples are shown in Figure 2a.

B. Data Quality Assurance
To ensure rigorous annotation quality, we enlisted a doctoral scholar and two literature professors, native speakers of Thai, Hindi, and Bengali, whose combined academic and linguistic expertise delivers culturally nuanced, high-precision labels.

² https://thesimplehelp.com/hindi-idioms-with-meanings
³ https://archive.org/details/in.ernet.dli.2015.455639

Initially, each expert annotated 100 samples, generating a reference set of 200 idiomatic explanations. Subsequently, we established Information Persistence Rating (IPR) criteria to evaluate the quality of idiom annotations, structured around five core aspects: (i) Literal Translation, where each idiom was directly translated into English while preserving its original structure, metaphorical elements, and imagery; (ii) Contextual Interpretation, in which annotators provided a succinct explanation of the idiom's meaning within its cultural and linguistic context; (iii) Usage Scenarios, which included brief examples illustrating real-world applications of the idiom for enhanced comprehension; (iv) Cultural Significance, where any historical, regional, or societal relevance associated with the idiom was documented; and (v) Coherence Preservation, ensuring the interpretation encapsulated the idiom's intended meaning without fragmentation. To ensure consistency and high annotation quality, linguistic experts established comprehensive guidelines following standard best practices. We then conducted a two-stage annotation competition with 15 participants: (i) a training phase using 100 reference samples, from which seven annotators qualified, and (ii) a testing phase with an additional 100 samples, resulting in the selection of two final annotators. Annotators were compensated at $0.50 per sample. To further guarantee the reliability of the annotations, a stringent two-tiered validation protocol was implemented. (i) Peer Review: each annotation was cross-verified by at least two additional annotators fluent in the source language to enhance accuracy and mitigate subjective biases. (ii) Expert Validation: a panel of linguistic and cultural specialists proficient in reading and writing Thai, Hindi, and Bengali conducted a final review, validating adherence to the IPR criteria for each sample. The rating reflects the retention of individual aspects; for instance, if five aspects are retained, the evaluation score is five. The annotation process yielded a robust inter-annotator agreement of 0.82 (Cohen's kappa), reflecting substantial consistency and reliability across the dataset.

C. Idiomatic Image Creation
We implement a visual-idiom generation module. Since no curated visual idiom corpus exists, we synthesize high-fidelity images with a prompt-driven pipeline: gold-standard idiom explanations feed DALL·E 3 [20] via GPT-4o prompts [21]. Each prompt is crafted to preserve both figurative meaning and cultural nuance.
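For reference, the reported agreement statistic follows the standard two-annotator Cohen's kappa computation. The sketch below shows that calculation on made-up IPR-style ratings (0-5 aspects retained); the data is illustrative only, not the paper's.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators rating the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy IPR-style scores (aspects retained out of five) for two annotators.
a = [5, 4, 5, 3, 5, 4, 2, 5, 4, 5]
b = [5, 4, 5, 3, 4, 4, 2, 5, 4, 5]
print(round(cohens_kappa(a, b), 2))  # prints 0.85
```

Values above roughly 0.8 are conventionally read as "substantial to almost perfect" agreement, which is how the paper characterizes its 0.82.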
Figure 2b illustrates the refinement process: (i) for the Hindi idiom sone pe suhaga ("icing on the cake"), literal visuals were replaced with metaphorically aligned depictions; (ii) for the Bengali idiom aage bagh, pichhe kumir, only frames conveying an inescapable dilemma were retained; and (iii) for the Hindi idiom nau do gyarah hona, arithmetic literalizations (e.g., "9+2=11") were discarded. In such cases, annotators refined prompts with contextual cues and regenerated images to ensure cultural fidelity and semantic alignment.
IV. METHODOLOGY
Given an idiom pair $(x_t, x_i)$, where $x_t$ is the textual representation and $x_i$ is the corresponding image, this study aims to develop a multimodal idiom interpretation framework leveraging LLMs and VLMs. The LLM-based model $P_\theta(y_t \mid x_t)$ is trained via supervised learning to generate textual idiom interpretations, while the VLM-based model $P_\phi(y_t \mid x_i)$ learns idiom meanings from images using supervised learning without preference optimization. The dataset consists of text-meaning pairs $D_t = \{(x_{t_j}, y_j)\}_{j=1}^{N}$ and image-meaning pairs $D_i = \{(x_{i_j}, y_j)\}_{j=1}^{N}$, where $N$ is the number of idioms. To achieve our objective, the first phase centers on idiom explanation, leveraging both LLMs and VLMs to capture nuanced semantic and visual cues. The second phase employs the hinting database embedded in the HIDE framework to iteratively refine model performance and enhance interpretability. Figure 3 depicts the overall methodology.
A. Idiom Explanation Generation
We fine-tuned Gemma2-9B [22] via supervised learning on idiom-explanation pairs $(\mathcal{X}(\tau), \mathcal{Y})$, where $\mathcal{X}(\tau)$ represents a figurative idiom in textual form and $\mathcal{Y}$ denotes a corresponding human-curated explanation. Following fine-tuning, Gemma2-9B underwent human evaluation to assess outputs for semantic precision and sociocultural alignment. In parallel, to enable visual idiom interpretation, we fine-tuned a pre-trained VLM, Paligemma2-10B [23], denoted as $P_\phi(\mathcal{Y} \mid \mathcal{X}(\nu))$, where $\mathcal{X}(\nu)$ is a visual input associated with an idiom (e.g., metaphorical illustrations or idiom-linked scenes) and $\mathcal{Y}$ is the textual explanation predicted by the model. However, post-fine-tuning analysis revealed multiple instances of generative misinterpretation. For instance, the Bengali idiom mishti mukhe jutor bari (literally, "hitting with a shoe while smiling sweetly") was misinterpreted by the system as commentary on conflicting emotional expressions or unusual behavior involving footwear. This interpretation failed to capture the idiom's actual implication: the act of harming or insulting someone with deceptive politeness or a pretence of affection. Among a held-out evaluation set of 600 idiomatic samples, 360 instances (60%) exhibited comparable inaccuracies, often resulting from insufficient grounding in sociocultural nuance or an over-literal mapping from visual or lexical features. A granular inspection of the models' outputs revealed that their inferential reasoning trajectories frequently diverge from the gold standard at the very onset of generation. This early misalignment necessitates controlled nudges that systematically steer the reasoning away from known pitfalls. To address this, we integrated the HIDE mechanism: idiomatic explanations generated by the fine-tuned LLM ($\mathcal{Y}_{LLM}$) are passed into the HIDE framework, whose verifier-guided refinement loop enforces canonical idiom alignment and cultural fidelity, substantially boosting the semantic and contextual accuracy of the language models.
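Supervised fine-tuning of $P_\theta(y_t \mid x_t)$ minimizes the token-level cross-entropy of the gold explanation given the idiom. As a minimal, framework-free sketch of that objective (the four-token vocabulary and logits below are invented for illustration; any training library computes the same quantity):

```python
import math

def token_cross_entropy(logits_per_step, gold_ids):
    """Average negative log-likelihood of the gold explanation tokens.

    logits_per_step: one list of per-vocabulary scores per target token.
    gold_ids: index of the gold token at each step.
    """
    total = 0.0
    for logits, gold in zip(logits_per_step, gold_ids):
        m = max(logits)                      # numerically stable log-softmax
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[gold]        # -log p(gold token | prefix)
    return total / len(gold_ids)

# Tiny 4-token vocabulary, 3-step gold explanation (illustrative only).
logits = [[2.0, 0.1, -1.0, 0.3], [0.2, 3.1, 0.0, -0.5], [1.0, 1.0, 2.5, 0.0]]
gold = [0, 1, 2]
loss = token_cross_entropy(logits, gold)
assert loss > 0  # confident-and-correct predictions drive this toward 0
```

The same objective applies to the VLM case, with the image encoding replacing the textual idiom as conditioning input.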
[Figure 3 contents]
Idiom image generation phase: for the Hindi idiom rai ka parvat ya pahaad hona, GPT-4o first produces the original idiom description: "The Hindi idiom translates to 'making a mountain out of a molehill' in English. It means to exaggerate a small issue or problem, making it seem much larger or more significant than it actually is. For an image representation, you could depict a tiny molehill being magnified into a towering mountain, illustrating the concept of blowing things out of proportion." This is expanded into a detailed briefing prompt for DALL-E 3 (a small molehill in the foreground with tiny ants and grass tufts for scale, a towering snow-capped mountain in the background, and a magnifying glass in the middle ground that visually enlarges the molehill to symbolize exaggeration, under a clear blue sky), which generates the image.

HIDE construction by GPT-4o during training. System prompt: "You are an expert in Hindi idioms. You will be given a Hindi idiom along with an incorrect translation and a correct translation. You will also be given an incorrect description of the same idiom along with a correct description. You are supposed to analyze the difference between the correct and incorrect description, and summarize that into a hint which the model can use in the future for better translation and description. Make sure that the hint does not contain any specific details of the idiom, but rather addresses the model's shortcoming in general in regards to this particular case. Only output the summarized hint." User prompt: "Original idiom: {original}, Incorrect Translation: {model translation}, Correct Translation: {original translation}, Incorrect description: {model description}, Correct description: {original description}." Example generated hint: "Always verify the idiom's core figurative function (e.g., exaggeration vs. actual distress) rather than relying on surface wording."

Inference with hint during testing, M(Prompt(x, h)). System prompt: "You are an expert in idioms and you have to translate this idiom and return a description of what it means and is trying to convey." Prompt: "Use the provided hint to describe this Hindi idiom and based on this description explain what it means. Hint: {hint}. Idiom: {idiom}." Without the hint, the model misreads the idiom as describing someone or something extremely tall or impressive; with the hint, it correctly explains rai ka parvat banana as "making a mountain out of a molehill," i.e., exaggerating a minor issue, as in an image where a small anthill is magnified to look like a mountain while the background mountains highlight the contrast between real and imagined problems.

Fig. 3. Architectural framework for idiom explanation that fuses LLMs and VLMs, augmented by a HIDE module with Hint Generation
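The HIDE-construction request shown in Figure 3 can be assembled mechanically from one logged error record. The sketch below is a hypothetical illustration: the field names (`idiom`, `model_translation`, and so on) are invented, the system prompt is abridged from the figure, and actually sending the messages to GPT-4o is out of scope.

```python
# Hypothetical sketch: build the hint-generation request from one logged
# error record, following the template in Figure 3. Field names are invented.
HINT_SYSTEM_PROMPT = (
    "You are an expert in Hindi idioms. You will be given a Hindi idiom along "
    "with an incorrect translation and a correct translation. ... Only output "
    "the summarized hint."
)

def build_hint_request(error):
    """Format the chat messages from one error record."""
    user = (
        f"Original idiom: {error['idiom']}, "
        f"Incorrect Translation: {error['model_translation']}, "
        f"Correct Translation: {error['gold_translation']}, "
        f"Incorrect description: {error['model_description']}, "
        f"Correct description: {error['gold_description']}"
    )
    return [{"role": "system", "content": HINT_SYSTEM_PROMPT},
            {"role": "user", "content": user}]

record = {
    "idiom": "rai ka parvat ya pahaad hona",
    "model_translation": "to be extremely tall",
    "gold_translation": "making a mountain out of a molehill",
    "model_description": "describes someone very tall or impressive",
    "gold_description": "exaggerating a minor issue out of proportion",
}
messages = build_hint_request(record)
assert messages[1]["content"].startswith("Original idiom:")
```

The returned hint is deliberately generic (no idiom-specific details), so it can be reused for structurally similar failures retrieved later.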
B. HIDE construction inspired by EFL
To construct hint-embedded idiomatic explanations, each incorrectly handled idiom is ingested after inference as the quintuple $\langle x_i, \tilde{T}_i, \tilde{E}_i, T_i, E_i \rangle$, where $x_i$ is the idiom string, $(\tilde{T}_i, \tilde{E}_i)$ are the model-generated translation and explanation, and $(T_i, E_i)$ are their human-annotated gold counterparts. A discriminator compares $\tilde{E}_i$ with $E_i$ and compresses the discrepancy into a high-level corrective hint $h_i = \Psi(\tilde{E}_i, E_i)$.
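The full HIDE loop of this subsection (distill a hint per error, archive it, retrieve the nearest past error at test time, and prepend its hint) can be sketched compactly. The bag-of-words encoder below is a deliberately simple stand-in for the semantic encoder, and all idioms and hints are illustrative; the paper's actual encoder and repository contents are not specified here.

```python
import math

def embed(text, d=32):
    """Stand-in semantic encoder f: X -> R^d (deterministic hashed bag of words).
    Illustrative only; a learned sentence encoder would be used in practice."""
    vec = [0.0] * d
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % d] += 1.0
    return vec

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

class ErrorFeedbackRepository:
    """Simplified Idiomatic Error-Feedback Repository of (embedding, hint) entries."""
    def __init__(self):
        self.entries = []

    def add(self, idiom, hint):
        # Archive the embedded idiom with its corrective hint for reuse.
        self.entries.append((embed(idiom), hint))

    def augmented_prompt(self, idiom):
        """Prompt(x, h_j) = x || h_j, with h_j taken from the most similar past error."""
        z = embed(idiom)
        _, hint = max(self.entries, key=lambda e: cosine(z, e[0]))
        return f"{idiom}\nHint: {hint}"

repo = ErrorFeedbackRepository()
repo.add("rai ka parvat hona", "Check the figurative function, not surface wording.")
repo.add("angur fol tok", "Consider denial-driven rationalization, not literal fruit.")
prompt = repo.augmented_prompt("rai ka parvat banana")
assert "figurative" in prompt
```

The augmented prompt is then passed to the generation model; only the retrieval scaffolding is shown here, since the model call itself is external.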
sted as the quintuple â¨xi, ËTi, ËEi, Ti, EiâŠ, where xi is the idiom string, ( ËTi, ËEi) are the model-generated translation and explanation, and (Ti, Ei) are their human-annotated gold counterparts. A discriminator compares ËEi with Ei and compresses the discrepancy into a high-level corrective hint hi = Ī( ËEi, Ei ). The idiom is then embedded by a semantic encoder f : X â Rd, producing zi = f (xi) where d is the embedding dimensional. The tuple â¨zi, hi, ËTi, ËEi⊠is archived in an Idiomatic Error-Feedback Repository (denoted H) for future reuse. At the secondary inference time, the system treats a test idiom x as a query: it computes z = f (x) and retrieves the repository entry j = arg maxk cos(z, zk ) that is most similar in embedding space. The accompanying hint hj is concatenated with x to form an augmented prompt, Prompt(x, hj ) = x âĨ hj , which is passed to the generation model M to yield the final translationâexplanation pair Ëy = ( ËT , ËE) = M(Prompt(x, hj )). Injecting this retrieved-error context exposes the model to a concrete, previously mis-handled scenario that is semantically close to the current input, steering its reasoning away from known pitfalls while preserving the original flow. V. EXPERIMENTAL RESULTS AND DISCUSSION This section details the experimental protocol, baseline settings, and a direct comparative analysis of LLMs and VLMs, complemented by qualitative error diagnostics that expose interpretive limitations and open research challenges. Our study addresses three focal Research Questions (RQs): RQ1: Idiomatic Competence. How effectively do LLMs and VLMs internalize and interpret culturally grounded, metaphor- rich expressions? RQ2: HIDE+EFL Impact. How effectively does HIDE enhance idiom understanding and contextual reasoning in LLMs without degrading overall performance? RQ3: Generalizability and Implications. How robustly does the dataset generalize across model families, and what broader socio-linguistic insights does it
Chunk 15 ¡ 1,996 chars
ich expressions? RQ2: HIDE+EFL Impact. How effectively does HIDE enhance idiom understanding and contextual reasoning in LLMs without degrading overall performance? RQ3: Generalizability and Implications. How robustly does the dataset generalize across model families, and what broader socio-linguistic insights does it reveal?

Models were fine-tuned for 5 epochs with a learning rate of 1 × 10⁻⁵ on an NVIDIA A100 (80 GB) GPU, using the AdamW optimizer. Training employed a batch size of 8 with two-step gradient accumulation and native Automatic Mixed Precision (AMP) for memory-efficient mixed-precision execution. A linear learning-rate scheduler was applied to ensure stable convergence, while Top-K sampling (K=10) encouraged generative diversity. The dataset was split into 80% training and 20% testing. The model achieved a final training cross-entropy loss of 2.1986.

A comparative evaluation was performed across a range of prominent LLMs and VLMs, including GPT-3.5 [24], Mistral-7B [25], LLaMA2-7B [26], Blip2-7B [27], Qwen2-VL-7B-Instruct [28], SmolVLM-Instruct [29], and Video-LLaVA-7B [30]. Model outputs were benchmarked using an extensive suite of evaluation metrics capturing lexical overlap, semantic alignment, and distributional similarity, as reported in Table II.

Table II. Performance variance across baseline LLM and VLM configurations. R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L; B-1, B-2, B-3, and B-L denote BLEU-1, BLEU-2, BLEU-3, and BLEU-L; BS denotes BERTScore. MS, CD, JSD, L2, L1, PS, and FRS represent METEOR, cosine distance, Jensen–Shannon divergence, L2 (Euclidean) distance, L1 (Manhattan) distance, perplexity score, and Flesch–Kincaid readability score. Boldface indicates the best results; ↑ higher is better, ↓ lower is better.

| Type | Model | R-1↑ | R-2↑ | R-L↑ | B-1↑ | B-2↑ | B-3↑ | B-L↑ | BS↑ | MS↑ | CD↓ | JSD↓ | L2↓ | L1↓ | PS↓ | FRS↑ |
|------|-------|------|------|------|------|------|------|------|-----|-----|-----|------|-----|-----|-----|------|
| LLM | LLaMA2-7B | 0.42 | 0.18 | 0.34 | 0.24 | 0.16 | 0.13 | 0.21 | 0.67 | 0.37 | 0.39 | 0.58 | 12.17 | 98.95 | 74.37 | 58.11 |
| LLM | Mistral-7B | 0.47 | 0.24 | 0.41 | 0.27 | 0.24 | 0.20 | 0.27 | 0.73 | 0.45 | 0.34 | 0.53 | 9.82 | 78.91 | 70.11 | 38.05 |
| LLM | GPT-3.5 | 0.51 | 0.25 | 0.42 | 0.31 | 0.24 | 0.20 | 0.27 | 0.75 | 0.46 | 0.31 | 0.54 | 10.14 | 49.73 | 68.31 | 48.91 |
| LLM | Gemma2-9B | 0.65 | 0.55 | 0.52 | 0.51 | 0.46 | 0.44 | 0.55 | 0.79 | 0.55 | 0.22 | 0.39 | 7.19 | 44.13 | 57.69 | 67.15 |
| LLM | LLaMA2-7B+HIDE | 0.42 | 0.18 | 0.35 | 0.24 | 0.18 | 0.14 | 0.21 | 0.69 | 0.38 | 0.38 | 0.57 | 11.93 | 96.02 | 73.23 | 59.91 |
| LLM | Mistral-7B+HIDE | 0.48 | 0.24 | 0.42 | 0.31 | 0.25 | 0.21 | 0.29 | 0.74 | 0.46 | 0.32 | 0.53 | 9.76 | 76.56 | 68.71 | 39.23 |
| LLM | Gemma2-9B+HIDE | **0.68** | **0.56** | **0.58** | **0.53** | **0.50** | **0.48** | **0.57** | **0.81** | **0.56** | **0.21** | **0.38** | **7.01** | **43.71** | **56.81** | **69.23** |
| VLM | Blip2-7B | 0.15 | 0.03 | 0.12 | 0.18 | 0.08 | 0.06 | 0.02 | 0.47 | 0.05 | 0.58 | 0.70 | 20.34 | **148.98** | 175.83 | **74.54** |
| VLM | Qwen2-VL-7B-Instruct | 0.36 | 0.09 | 0.22 | 0.36 | 0.18 | 0.11 | 0.06 | 0.60 | 0.21 | 0.28 | 0.60 | 18.08 | 185.43 | 109.27 | 51.46 |
| VLM | SmolVLM-Instruct | 0.37 | 0.11 | 0.22 | 0.33 | 0.18 | 0.11 | 0.06 | 0.62 | 0.25 | 0.25 | 0.57 | 29.47 | 249.61 | 114.41 | 51.82 |
| VLM | Video-LLaVA-7B | 0.38 | 0.10 | 0.24 | 0.37 | 0.19 | 0.11 | 0.06 | 0.62 | 0.22 | 0.32 | 0.59 | 18.20 | 175.66 | 131.72 | 50.64 |
| VLM | Paligemma2-10B | **0.43** | **0.14** | **0.27** | **0.41** | 0.24 | 0.15 | 0.09 | **0.66** | **0.28** | **0.24** | **0.56** | 18.97 | 185.90 | 86.09 | 52.30 |

A. Resultant Discussion

This section synthesizes the findings in response to the stated RQs, supported by qualitative insights and error analyses across model generations.

1) Response to RQ1: Idiomatic competence across LLMs and VLMs: Table II highlights a clear performance divide between text-only LLMs and VLMs on culturally dense idioms. Even without error-driven refinement, Gemma2-9B leads across all metrics, substantially outperforming the strongest VLM, Paligemma2-10B. Distance-based measures further confirm this gap, with lower cosine distance and Jensen–Shannon divergence for LLMs, indicating tighter semantic
alignment with figurative meanings. Superior readability and lower perplexity likewise favor LLMs, underscoring that large-scale textual pretraining remains more effective than current multimodal grounding for internalizing culturally encoded idiomatic semantics in Hindi, Bengali, and Thai.

2) Response to RQ2: HIDE+EFL gains in idiom comprehension: Injecting Error-Feedback Learning (via the HIDE retriever) converts past errors into micro-lessons that significantly sharpen idiom handling. For LLMs, Gemma2-9B + HIDE raises ROUGE-1/2/L, trims cosine distance to 0.21, and cuts L2/L1 errors while lifting readability (FRS ≈ 69) and lowering perplexity. Mistral-7B and LLaMA2-7B record smaller yet consistent gains, accompanied by reduced perplexity, showing that even lightweight hints nudge mid-sized models toward the correct figurative space.

3) Response to RQ3: Dataset Impact: The Mediom corpus provides a multimodal benchmark for idiomatic comprehension in Hindi, Bengali, and Thai by coupling high-resolution images with expert-curated explanations. This visual grounding sharpens cross-lingual transfer (e.g., idiom translation) and equips conversational agents with culturally aligned reasoning. In education, image-anchored idioms turn abstract metaphors into concrete, engaging content, boosting learner retention and motivation.

B. Analytical Discussion

1) Human Evaluation: Three native linguists evaluated 300 idiom instances across five dimensions: literal accuracy, contextual fit, usage naturalness, cultural depth, and overall coherence.

Fig. 4. Head-to-head qualitative and error analysis. (a) Qualitative analysis of the Bengali idiom আগে বাঘ, পিছে কুমির (Aage bagh, pichhe kumir). Literal: A tiger in front, a crocodile behind. Metaphorical: Trapped between two equally dangerous situations. (b) Error analysis of the Hindi idiom नाच न जाने आँगन टेढ़ा (Naach na jaane angan tedha). Literal: One who cannot dance blames the courtyard. Metaphorical: Blaming external factors for personal failure.

Fine-tuned LLMs (e.g., GPT-3.5, Gemma2-9B) and VLMs (e.g., Paligemma2-10B, Qwen2-VL-7B-Instruct) excel in literal translation and contextual interpretation, reflecting strong linguistic priors, but initially lag in cultural significance and coherence. Incorporating HIDE yields substantial gains, particularly for LLMs such as Gemma2-9B and LLaMA2-7B, by converting prior reasoning errors into structured semantic cues, thereby improving metaphor comprehension, cultural alignment, and usage coherence. While HIDE-enabled LLMs achieve the most consistent idiomatic reasoning, only fine-tuned VLMs exhibit marked improvements in visual–semantic alignment and contextual depth, narrowing the unimodal–multimodal performance gap.

2) Qualitative Analysis: Figure 4a illustrates qualitative differences in Bengali idiom interpretation. Gemma2-9B augmented with HIDE delivers the most nuanced and context-aware explanation, effectively capturing the idiom's core dilemma of simultaneous threats. Mistral-7B and LLaMA2-7B provide competent but less granular interpretations.
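The distance-based measures reported in Table II can be made concrete. Below is a minimal, self-contained sketch of cosine distance (CD) and Jensen–Shannon divergence (JSD) over toy vectors standing in for a model-explanation embedding and a gold-explanation embedding; the function names and inputs are illustrative, not the paper's evaluation code.

```python
import math

def cosine_distance(a, b):
    # CD in Table II: 1 - cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def js_divergence(p, q):
    # JSD in Table II: symmetrized KL divergence between two probability
    # distributions (natural log; the maximum value is ln 2)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(x, y):
        return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy stand-ins for a predicted-explanation and a gold-explanation embedding.
pred, gold = [0.9, 0.1, 0.4], [1.0, 0.0, 0.5]
cd = cosine_distance(pred, gold)
jsd = js_divergence([0.7, 0.3], [0.5, 0.5])
```

Lower values of both measures indicate that model outputs sit closer to the gold figurative explanations, which is how the gap between LLMs and VLMs is read off in RQ1.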
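To make the hinting mechanism concrete, here is a hypothetical sketch of error-feedback hinting in the spirit of HIDE: past misinterpretations are stored as "micro-lessons" and retrieved to prefix later prompts. All names, the prompt format, and the storage scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical error-feedback store: idiom -> micro-lessons distilled from
# earlier mistakes (illustrates HIDE's hinting idea, not the actual code).
error_memory = {}

def record_error(idiom, wrong_reading, correct_sense):
    # Turn a past mistake into a reusable hint ("micro-lesson").
    hint = (f"Earlier attempts misread this as '{wrong_reading}'; "
            f"the figurative sense is '{correct_sense}'.")
    error_memory.setdefault(idiom, []).append(hint)

def build_hinted_prompt(idiom, context, max_hints=3):
    # Retrieve up to max_hints recent micro-lessons and prepend them.
    hints = error_memory.get(idiom, [])[-max_hints:]
    hint_block = "\n".join(f"Hint: {h}" for h in hints)
    task = f"Explain the idiom '{idiom}' as used here: {context}"
    return f"{hint_block}\n{task}" if hints else task

record_error("angur fol tok", "literal sour grapes",
             "denial-driven rationalization of an unattainable goal")
prompt = build_hinted_prompt("angur fol tok",
                             "He claims he never wanted the award anyway.")
```

The design choice mirrored here is that feedback is injected at the prompt level rather than by retraining, which is why even mid-sized models can benefit from "lightweight hints" without degrading overall performance.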
In contrast, multimodal models such as Video-LLaVA-7B enhance comprehension through vivid visualizations of danger and tension, while BLIP-2, Qwen2-VL-7B-Instruct, Paligemma2-10B, and SmolVLM-Instruct reinforce the concept visually but fall short in figurative depth. Overall, HIDE-equipped Gemma2-9B best captures idiomatic semantics, whereas VLMs primarily excel in visual grounding.

3) Error Analysis: Figure 4b underscores a clear modality gap. LLMs fine-tuned with HIDE (Gemma2-9B, Mistral-7B, LLaMA2-7B) reliably captured the core metaphor (blaming the courtyard for one's own lack of skill), framing it within themes of accountability and self-reflection. GPT-3.5, however, slipped into over-literal narration, anchoring on irrelevant spatial details and losing the abstraction. In sum, HIDE equips LLMs to internalize cultural metaphor, whereas current VLMs struggle to transcend literal scene parsing.

VI. CONCLUSION

In this work, we introduce Mediom, a multimodal dataset containing 3,533 idioms across Hindi, Thai, and Bengali, paired with gold-standard human explanations and aligned visual representations. Our analysis reveals that HIDE, inspired by EFL fine-tuning, slightly enhances lexical and contextual reasoning in LLMs (LLaMA2-7B, Gemma2-9B, Mistral-7B), facilitating nuanced idiomatic interpretations. Moreover, VLMs (Paligemma2-10B) effectively leverage visual cues to capture cultural subtleties. Mediom thus establishes a benchmark for evaluating culturally and figuratively aware AI models, advancing idiomatic comprehension through an integrated textual, visual, and cultural framework.

REFERENCES

[1] Richard P. Honeck, A Proverb in Mind: The Cognitive Science of Proverbial Wit and Wisdom, Lawrence Erlbaum Associates, Mahwah, NJ, USA, 1997.
[2] Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, and Timothy Baldwin, "Large language models are
human-like internally," Trans. Assoc. Comput. Linguist., vol. 13, pp. 1743–1766, 2025.
[3] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, 2022, vol. 35, pp. 27730–27744.
[4] Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin, "WildVision: Evaluating vision-language models in the wild with human preferences," in Advances in Neural Information Processing Systems, 2024, vol. 38.
[5] Francesca De Luca Fornaciari, Begoña Altuna, Itziar Gonzalez-Dios, and Maite Melero, "A hard nut to crack: Idiom detection with conversational large language models," in The Workshop on Figurative Language Processing, 2024, pp. 35–44.
[6] Hessel Haagsma, Johan Bos, and Malvina Nissim, "MAGPIE: A large corpus of potentially idiomatic expressions," in The Language Resources and Evaluation Conference, 2020, pp. 279–287.
[7] Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, et al., "Critic-V: VLM critics help catch VLM errors in multimodal reasoning," in The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[8] Jiali Zeng, Linfeng Song, Jinsong Su, Jun Xie, Wei Song, and Jiebo Luo, "Neural simile recognition with cyclic multitask learning and local attention," in The AAAI Conference on Artificial Intelligence, 2020, pp. 9515–9522.
[9] Tuhin Chakrabarty, Xurui Zhang, Smaranda Muresan, and Nanyun Peng, "MERMAID: Metaphor generation with symbolism and discriminative decoding," in The Conference of the North American Chapter of the ACL, 2021, pp. 4250–4261.
[10] Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme, "Collecting diverse natural language inference problems for sentence representation evaluation," in The Conference on Empirical Methods in Natural Language Processing, 2018, pp. 67–81.
[11] Hanbit Lee, Yeonchan Ahn, Haejun Lee, Seungdo Ha, and Sang-goo Lee, "Quote recommendation in dialogue using deep neural network," in The International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 957–960.
[12] Lingzhi Wang, Jing Li, Xingshan Zeng, Haisong Zhang, and Kam-Fai Wong, "Continuity of topic, interaction, and query: Learning to quote in online conversations," in The Conference on Empirical Methods in Natural Language Processing, 2020, pp. 6640–6650.
[13] Ruiyang Qin, Haozheng Luo, Zheheng Fan, and Ziang Ren, "IBERT: Idiom cloze-style reading comprehension with attention," arXiv preprint arXiv:2112.02994, 2021.
[14] Tosin Adewumi, Foteini Liwicki, and Marcus Liwicki, "Vector representations of idioms in conversational systems," Sci, vol. 4, no. 4, p. 37, 2022.
[15] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi, "COMET: Commonsense transformers for automatic knowledge graph construction," in The Annual Meeting of the ACL, 2019, pp. 4762–4779.
[16] Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann, "SemEval-2013 task 5: Evaluating phrasal semantics," in The International Workshop on Semantic Evaluation, 2013, pp. 39–47.
[17] Tuhin Chakrabarty, Arkadiy Saakyan, Debanjan Ghosh, and Smaranda Muresan, "FLUTE: Figurative language understanding through textual explanations," in The Conference on Empirical Methods in Natural Language Processing, 2022, pp. 7139–7159.
[18] Arkadiy Saakyan, Shreyas Kulkarni, Tuhin Chakrabarty, and Smaranda Muresan, "Understanding figurative meaning through explainable visual entailment," in The Conference of the Nations of the Americas Chapter of the ACL, 2025, pp. 1–23.
[19] Ekarat Udomporn, 5000 Thai Idioms: From the Past Right on up to Now!, P.S. Pattana Publishing, 2014.
[20] OpenAI, "Improving image generation with better captions," Tech. Rep., OpenAI, 2023.
[21] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.
[22] Gemma Team, "Gemma," Kaggle, 2024.
[23] Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, et al., "PaliGemma 2: A family of versatile VLMs for transfer," arXiv preprint arXiv:2412.03555, 2024.
[24] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877–1901.
[25] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, et al., "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in The International Conference on Machine Learning, 2023, vol. 202, pp. 19730–19742.
[28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, et al., "Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution," arXiv preprint arXiv:2409.12191, 2024.
[29] Hugging Face, "SmolVLM-500M-Base," https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Base, 2025.
[30] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, et al., "LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment," in The International Conference on Learning Representations, 2024.