Large Multimodal Models for Low-Resource Languages: A Survey
Marian Lupașcu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
Department of Computer Science, University of Bucharest, Romania

Abstract

In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.

1. Introduction

Recent advancements in large multimodal models (LMMs) showcased remarkable capabilities in processing and understanding diverse types of data, including text, images, audio and video. Models like GPT-4V, KOSMOS-1 [1] and PaLM-E [2] achieved impressive performance levels across various multimodal tasks through their ability to simultaneously process and reason about multiple modalities.
However, these developments have primarily focused on high-resource languages, particularly English, leaving a significant gap in supporting the world's many low-resource languages. The distinction between high-resource (HR) and low-resource (LR) languages is primarily determined by the availability of digital resources and training data. High-resource languages, such as English, Mandarin, and Spanish, benefit from extensive digital corpora, parallel texts, and annotated datasets. In contrast, low-resource or understudied languages, which constitute the majority of the world's languages, lack sufficient digital resources, standardized datasets, and computational tools. This disparity is particularly pronounced in multimodal contexts, where the scarcity of paired data across modalities (e.g. image-text pairs, audio-text alignments) poses additional challenges. A recent analysis [3] identified 27% of languages as “Invisible Giants”, i.e. demographically robust yet digitally absent, highlighting that resource scarcity is institutionally constructed rather than inherent. This distinction has practical implications, e.g. LMM development that treats data scarcity merely as a technical risk can perpetuate the structural inequalities it ostensibly addresses. We therefore situate our analysis within the UNESCO International Decade of Indigenous Languages (2022-2032) and the CARE principles for Indigenous data governance [4], which emphasize community authority over linguistic data. Indeed, the very terminology “low-resource” has been critiqued as colonial and Eurocentric, obscuring the political decisions that produced linguistic marginalization [5].
The motivation for developing multimodal capabilities for LR languages is compelling. First, multimodal processing better reflects how humans naturally communicate and understand information through multiple sensory channels. Second, visual and audio cues can provide crucial contextual information that helps to overcome the limitations of scarce textual data. Third, many LR languages are primarily spoken rather than written, making multimodal approaches particularly relevant for their digital preservation and processing. However, developing multimodal systems for LR languages faces several significant challenges, including: (1) the scarcity of high-quality multimodal datasets in these languages, (2) the lack of standardized evaluation benchmarks, (3) the computational cost of training large-scale models with limited resources, and (4) the complexity of handling different writing systems, dialects, and cultural contexts. Moreover, the problem of catastrophic forgetting [6] when adapting pre-trained models to new languages and the challenge of maintaining performance across different modalities pose significant technical hurdles.

Literature selection process. We survey research articles from 2018 to 2025 that specifically study LMMs for LR languages. We begin our analysis with 2018 because one of the first large language models (LLMs), BERT [7], was introduced that year, marking a significant turning point in the development of modern language modeling techniques.
Figure 1: A Venn diagram with the distribution of papers across different modality combinations used by LMMs for low-resource languages. Text+image is the dominant modality pair, while more complex video-inclusive combinations are less common. A selection of representative papers is included for each modality combination.

We focus on works that go beyond simple cross-lingual transfer or translation, examining techniques that leverage multiple modalities to improve model performance. We queried a broad set of digital libraries to ensure representative coverage: ACM Digital Library, IEEE Xplore, ACL Anthology, arXiv, SpringerLink, ScienceDirect, and Google Scholar. During this search, we specifically targeted venues known for frequent LR or multimodal contributions (e.g. ACL, EMNLP, NAACL, COLING, LREC, EACL, WMT, CVPR/ICCV/ECCV workshops, and INTERSPEECH) to ensure that relevant conference and workshop publications are captured. For each of these digital libraries, we formulated several keyword combinations capturing (i) multimodality, (ii) low-resource aspects, and (iii) language or task types.
Figure 2: Distribution of papers across 96 low-resource languages, representing 117 papers. Hindi leads with 31 studies, followed by Arabic (23), Bengali (21), Malayalam (19), Tamil (14), Korean and Yoruba (with 10 papers each). The remaining languages have fewer than 10 papers each. Languages with only one paper (42 languages) are listed using ISO 639-1 codes. The data highlights the disparity in research focus among LR languages, with a few languages receiving more focus, while many others remain understudied in the context of multimodal learning. Some papers simultaneously address multiple languages, contributing to the individual language counts. HR languages such as English, Chinese, Mandarin and Spanish are excluded from this chart.

To further improve coverage, we performed a backward and forward search to identify additional relevant work from the reference lists of included papers, and we used citation links to identify more recent follow-up studies. Finally, we merged all retrieved records and applied a manual two-stage screening process.
We began by reviewing titles and abstracts to remove clearly irrelevant work (e.g. single modality studies or studies exclusively targeting high-resource languages such as English, Mandarin, or Spanish). We then examined the main contributions of the remaining papers to assess whether they were a suitable fit for our survey. Ultimately, a study was included in our survey if it matched all of the following criteria:
(a) is multimodal (at least two input modalities), (b) focuses on low-resource languages (at least one of the targeted languages was an underrepresented language), and (c) proposes, adapts or evaluates a multimodal model.

Research focus distribution across LR languages. Our survey reveals several interesting patterns in how researchers approached multimodal learning for LR languages. As shown in Figure 1, text-image combinations dominate the research landscape, appearing in 76 papers (65% of surveyed works), while more complex combinations incorporating audio and video remain less explored. In addition, the distribution of research focus across languages is notably uneven, as illustrated in Figure 2, with Hindi (31 papers), Arabic (23 papers) and Bengali (21 papers) receiving significant focus, whereas 42 other languages are each represented by a single study. On the one hand, this striking disparity highlights the need for a broader coverage of understudied languages in multimodal research. On the other hand, it also warrants critical examination of the factors influencing the distribution of research studies across languages. We identify six interacting factors that explain this disparity (see Table 1).

Table 1: Factors explaining research disparity across low-resource languages in multimodal NLP. We categorize languages from our survey by paper count (high coverage: 10+ papers; medium coverage: 2-9 papers; low coverage: 1 paper) and analyze contributing factors.
Institutional capacity. High: strong local NLP communities (India, Middle East). Medium: emerging research groups. Low: minimal local infrastructure.
Speaker population. High: variable (38M-600M). Medium: variable (2M-200M). Low: often under 1M, but not always.
Existing resources. High: multiple benchmarks, pre-trained models. Medium: some datasets available. Low: little to no digital presence.
Script accessibility. High: shared with HR languages or well supported. Medium: moderate tool support. Low: often unique or unsupported scripts.
Geographic location. High: regions with NLP venues (Asia, Middle East). Medium: mixed. Low: often Sub-Saharan Africa, Oceania.
Geopolitical interest. High: strategic priority (Arabic post-9/11, Hindi for US-India relations, Korean for East Asia security). Medium: emerging strategic relevance (Ukrainian post-2022, Turkish for NATO relations). Low: no perceived strategic value; excluded from defense and intelligence funding.
Example languages. High: Hindi, Arabic, Bengali, Malayalam. Medium: Swahili, Romanian, Turkish, Yoruba. Low: Luo, Xhosa, Occitan, Maori.

First, institutional research capacity plays a dominant role: Rungta et al. [8] demonstrated that NLP publications are heavily concentrated in North America, Western Europe, and China, with minimal representation from Africa and South America. Languages spoken in regions with established NLP research communities (e.g. Hindi in India, Arabic in the Middle East) benefit from existing infrastructure, funding, and researcher networks.
Second, speaker population shows surprising variability: while one might expect larger speaker populations to attract more research, this correlation is weak. For instance, Swahili has approximately 200 million speakers, yet remains underrepresented compared with Malayalam (38 million speakers, 19 papers). Third, digital resource availability creates a self-reinforcing cycle: languages with existing datasets attract more research, which produces more datasets [9]. Ranathunga et al. [10] showed that even within the same resource class [9], languages from higher-GDP regions receive disproportionately more research attention. Fourth, script and typological proximity to high-resource languages facilitates transfer learning research, e.g. Hindi benefits from shared Devanagari script resources, while languages with unique scripts (e.g. Ge’ez for Amharic) face additional barriers. Fifth, geographic location determines access to NLP venues and research networks: languages spoken in regions hosting major conferences (Asia, Middle East, Europe) receive more attention than those in Sub-Saharan Africa or Oceania. Sixth, geopolitical interest drives strategic investment: Arabic NLP surged post-9/11, Ukrainian gained attention after 2022, and US-China AI competition benefits research on Mandarin Chinese, but not on minority languages within China.

These factors have critical implications for researchers working on truly underrepresented languages. The 42 languages with single studies in our survey face a “cold start” problem: without existing benchmarks, baselines, or community momentum, new contributions are harder to contextualize and evaluate [11].
We observe that 88.4% of the world's languages (Class 0, as defined by Joshi et al. [9]) have no representation in standard NLP resources whatsoever [9]. For researchers targeting these languages, we recommend: (1) prioritizing dataset creation with community involvement over model development, (2) leveraging typologically similar languages for transfer rather than defaulting to English, and (3) publishing in venues with explicit low-resource tracks (e.g. AfricaNLP, AmericasNLP, etc.) to build critical mass within language-specific research communities.

Research investment in specific languages is strongly influenced by geopolitical events and national security priorities. The clearest documented case is Arabic NLP, which experienced a dramatic surge in funding following September 11, 2001. Darwish et al. [12] documented that “Arabic NLP gained increasing importance in the Western world especially after September 11. The USA funded large projects for companies and research centers to develop NLP tools for Arabic and its dialects”. This investment wave (2001-2010) produced fundamental resources for machine translation, speech recognition, named entity recognition, and information extraction that continue to underpin Arabic multimodal research today.

A similar pattern is emerging for Ukrainian. Prior to February 2022, Ukrainian was a moderately resourced Slavic language, receiving limited attention in NLP research.
The Russian invasion triggered rapid mobilization: the CLARIN Knowledge Centre for Ukrainian NLP (UkrNLP-Corpora) was established in 2023 [13], the Ukrainian Natural Language Processing Workshop (UNLP) expanded to four editions by 2025, and researchers developed numerous datasets for disinformation detection, sentiment analysis, and propaganda identification on Ukrainian social media [14]. This research is explicitly framed around information warfare: detecting “manipulative narratives” and “rhetorical manipulation techniques used to influence Ukrainian Telegram users” [15]. The geopolitical urgency has attracted Western funding and research attention that Ukrainian might not otherwise have received.

The US-China technology competition further illustrates how strategic rivalries shape NLP investment trajectories. China invested $125 billion in AI in 2025, representing 38% of global AI investment, with NLP receiving 11% of this allocation [16]. Chinese companies, including Baidu, Alibaba, and Tencent, are developing competitive large language models (Qwen, Yi, DeepSeek) trained on massive Chinese corpora, partly driven by the US export controls on advanced semiconductors [17]. This competition benefits Mandarin Chinese language resources, but does not extend to minority languages within China (Tibetan, Uyghur, Mongolian), which remain severely underrepresented despite large speaker populations. This indicates that geopolitical attention flows to languages of strategic interest to major powers, not necessarily to the most linguistically marginalized communities.
The patterns identified above reveal a troubling dynamic for truly underrepresented languages: research investment follows geopolitical salience rather than linguistic need. Languages become “high-resource” when powerful states perceive strategic value in processing them for intelligence gathering, countering disinformation, or economic competition. The 42 languages in our survey with only a single study lack this geopolitical visibility. For researchers working on such languages, this suggests that framing research around emerging strategic concerns (e.g. climate migration, regional stability, pandemic communication) may attract funding that purely linguistic motivations cannot.
Relation to other surveys and academic contributions. Some recent surveys have explored various aspects of multimodal language models. Zhao et al. [21] provided a comprehensive overview of LMM architectures, training strategies, and applications, while Wang et al. [24] focused on pre-training techniques and model architectures. Additional surveys have examined related areas, including LLMs [9, 18–20, 23, 25, 27, 28, 30], but they did not specifically address the unique challenges and solutions for LR languages in multimodal contexts. Alam et al. [29] explored LLMs for low-resource languages in multilingual, multimodal and dialectal settings, but they focused primarily on the capabilities of LLMs rather than presenting a comprehensive survey of techniques. To the best of our knowledge, our survey is the first to focus on multimodal learning for understudied languages.

As shown in Table 2, our work differs from previous surveys by specifically focusing on the intersection of multimodality and low-resource languages, while addressing all major techniques relevant to this domain. While several prior surveys have separately explored multimodality or low-resource languages, none has comprehensively examined both aspects across such a diverse range of languages and approaches.

Table 2: Comparison of our survey with related work on LLMs and LMMs across different focuses, modalities, languages, techniques (data creation, fusion, visual enhancement, transfer, adaptation), and additional coverage.
Gu et al. [18]: LLM-as-a-judge; high-resource; evaluation, reliability, applications.
Joshi et al. [9]: language resources; low-resource; digital divide.
Paullada et al. [19]: dataset development; low-resource; data challenges.
Ruder et al. [20]: cross-lingual NLP; low-resource; benchmarking.
Zhao et al. [21]: LLMs; both; pre-training, adaptation, utilization.
Zhu et al. [22]: multilingual LLMs; both; cross-lingual transfer.
Gan et al. [23]: vision-language models (multimodal); high-resource; pre-training objectives.
Wang et al. [24]: multimodal pre-training; high-resource; model architectures.
Li et al. [25]: large VLMs (multimodal); high-resource; benchmark evaluations, challenges.
Yin et al. [26]: LMM architectures (multimodal); high-resource; training strategies.
Xie et al. [27]: large multimodal agents; high-resource; agentic AI, evaluation methods.
Xu et al. [28]: resource-efficient models (multimodal); both; efficient algorithms, system designs.
Alam et al. [29]: LLMs for LR contexts (multimodal); low-resource; capabilities, prompting, evaluation.
Mu et al. [30]: Mixture-of-Experts; both; algorithms, theory, applications.
Our survey: LMMs for LR languages; low-resource; covers data creation, fusion, visual enhancement, transfer, and adaptation; systematic taxonomy, 96 languages, 117 studies.
Figure 3: High-level taxonomy of LMMs for low-resource languages. We depict six main categories, which are further divided into subcategories and exemplified via a few representative studies: multimodal data creation (creation from scratch, extension), synthetic data generation (back-translation, image-based generation), multimodal fusion techniques (early fusion, late fusion, architectural fusion), visual enhancement techniques (image-guided translation, visual disambiguation), cross-modal transfer learning (modality transfer, language transfer), and architectural innovations.

In summary, our contribution is fourfold:
• We provide the first comprehensive analysis of LMMs specifically focused on LR languages, examining 117 studies across 96 languages.
• We develop a novel taxonomy (see Figure 3) that categorizes existing approaches into six main categories: multimodal data creation, synthetic data generation, multimodal fusion techniques, visual enhancement techniques, cross-modal transfer learning and architectural innovations.
• We systematically organize the literature to enable a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR languages.
• We provide an open-source repository that includes implementation details, datasets, and benchmarks to facilitate future research in this emerging field.

Organization. The remainder of this survey is organized as follows. In Section 2, we present an overview of the constructed taxonomy and discuss research trends between 2018 and 2025. Sections 3 and 4 are dedicated to resource-oriented contributions. In Section 3, we identify and categorize common approaches for dataset creation for LR languages. In Section 4, we discuss automated data generation techniques. Sections 5, 6, 7 and 8 are dedicated to method-oriented contributions. In Section 5, we categorize and compare strategies used to fuse multiple modalities. In Section 6, we discuss techniques used to enhance machine translation by using visual information. In Section 7, we present methods that perform cross-modal transfer learning. In Section 8, we analyze architectural contributions.
In Section 9, we discuss current evaluation challenges and propose several ways to address the identified challenges. In Section 10, we draw our conclusions, point out current research gaps, and propose ways to mitigate them in future work.

2. Taxonomy

To organize the diverse approaches in the rapidly evolving field of LMMs for low-resource languages, we develop a comprehensive taxonomy through a systematic analysis of the 117 papers in our survey. Our methodology involves initial coding of each paper's primary contributions and techniques, iterative refinement through thematic analysis to identify recurring patterns, and hierarchical organization of approaches based on their functional relationships and chronological development in the field. Our analysis reveals that researchers addressing the challenges of LMMs for low-resource languages typically follow a progression from resource development to architectural refinement. This progression is reflected in our taxonomy, which organizes approaches into six main categories that represent both the current state of the field and the primary research strategies for addressing challenges in the context of underrepresented languages.

In Figure 3, we systematically organize LMMs for LR languages into six main categories. The first two categories focus on constructing high-quality resources. While the first category discusses multimodal data creation either from scratch or via extending existing datasets, the second approach centers on synthetic data generation, which automatically expands available resources via back-translation and image-based generation.
Building upon this work, we present several multimodal fusion techniques and provide various strategies for effectively combining this information, ranging from early and late fusion to more complex hybrid approaches. In the fourth category, we illustrate visual enhancement techniques that harness visual information through image-guided translation and visual disambiguation methods, highlighting their importance for improving translation quality and resolving ambiguities. Expanding from the single-modality solutions, the next category focuses on cross-modal transfer learning approaches that can facilitate knowledge sharing based on both modality transfer and language transfer. Finally, our last category comprises architectural innovations specifically tailored for multimodal tasks in the context of LR languages.

It is important to note that several studies naturally span multiple areas. For instance, some studies that fuse visual and textual features could also be viewed as cross-modal transfer when they leverage pre-trained vision-language models. Similarly, papers that introduce new datasets may incorporate synthetic augmentation, and architecture-focused contributions may sometimes rely on transfer-learning mechanisms. In such cases, we assign each study to the category that reflects its primary technical contribution. This principle helps maintain clear boundaries, while acknowledging the natural overlap across multimodal methods for low-resource languages.
Furthermore, to understand how these categories have evolved over time, we analyze the publication trends from 2018 to 2025. Figure 4 shows the number of papers per year in each category, illustrating how early work focused primarily on data creation and synthetic augmentation, while recent years saw an increase in fusion strategies, cross-modal transfer, and architectural innovations, reflecting the shift toward foundation-model-based approaches.

Figure 4: Number of LMM papers for LR languages published per year (2018-2025), categorized by technique: multimodal data creation, synthetic data generation, multimodal fusion techniques, visual enhancement techniques, cross-modal transfer learning, and architectural innovations.

Our taxonomic structure not only organizes existing research, but also highlights the inter-dependencies between different approaches and reveals gaps in current research, particularly in the exploration of complex multimodal combinations involving video and speech for low-resource languages. We structure the remainder of this article according to our novel taxonomy shown in Figure 3.

3. Multimodal Data Creation

There are two main approaches to create multimodal datasets for LR languages. The first is based on multimodal dataset creation from scratch, while the second is based on using an existing resource as a starting point. We next discuss papers introducing novel datasets based on the two alternatives.

Dataset creation from scratch. Dataset creation from scratch has emerged as a crucial approach for enabling multimodal research in LR languages, particularly for sentiment analysis and specific language tasks. Multiple research teams have focused on creating specialized datasets through direct data collection and annotation, such as collecting Arabic videos with multimodal features for sentiment analysis [31], building comprehensive Tamil and Malayalam video review datasets [32], developing new corpora for languages such as Malay [33], creating speech translation resources for Fongbe [34], and compiling Arabic multimodal sentiment collections [35].
A significant trend has been the creation of meme-based datasets, with efforts focused on Bengali, through MemoSen and MUTE [36, 37], and Romanian, through RoMemes [38], all incorporating multiple levels of annotation.

These dataset creation efforts have expanded beyond sentiment analysis to encompass other crucial applications, such as sign language recognition with ArabSign [39], and multi-purpose datasets like BIG-C for Bemba [40]. Additionally, the creation of a Manipuri-English parallel corpus with accompanying audio recordings for speech-to-text translation [41] provides an important resource for research in low-resource languages. More recently, Farsi et al. [42] introduced a comprehensive suite of multimodal datasets for Persian, covering tasks such as VQA, OCR, visual abstraction reasoning, and cultural knowledge grounding. These projects typically involve careful quality control by using multiple annotators, standardized recording environments, and expert validation, demonstrating a shift toward building comprehensive resources specifically designed for LR languages, rather than relying on translation or transfer from high-resource languages.
Dataset extension. In addition to building data from scratch in the context of LR language understanding, there have been several efforts for leveraging existing datasets of rich-resource languages and building upon them. Sen et al. [43] introduced the Bengali Visual Genome (BVG) dataset, which extends the Visual Genome dataset [44] with Bengali translations and annotations, enabling the development and evaluation of multimodal models for Bengali-English machine translation (MT) and image captioning. Similarly, Abdulmumin et al. [45] created the Hausa Visual Genome (HaVG) dataset by translating a subset of the Visual Genome dataset into Hausa, providing a valuable resource for English-to-Hausa multimodal MT. Building upon prior work and continuing the focus on the Hausa language, Parida et al. [46] introduced the Hausa Visual Question Answering (HaVQA) dataset, which adapts question-answer pairs from the Visual Genome dataset to the Hausa language through manual translation, creating the first visual question-answering (VQA) dataset for Hausa. Extending this trend to Indian languages, Parida et al. [47] introduced OVQA, a multimodal dataset for the Odia language, by translating over 6,000 question-answer pairs and associated captions from the Visual Genome dataset into Odia. Similarly, Anwar et al. [48] introduced MuAViC, a multilingual audio-visual corpus providing 1,200 hours of audio-visual speech across 9 languages, establishing the first open benchmark for audio-visual speech-to-text translation.
Apart from the focus on African languages, Saichyshyna et al. [49] extended the Multi30K dataset [50] to include Ukrainian translations and captions, facilitating integrated vision and language research in Ukrainian. More recently, Lovenia et al. [51] presented SEACrowd, a comprehensive multilingual and multimodal data hub and benchmark suite for Southeast Asian languages, which covers 13 tasks across three modalities (text, image, and audio) and 38 Southeast Asian indigenous languages, while Lent et al. [52] introduced CreoleVal, an extensive collection of benchmarks for 28 Creole languages, addressing the significant resource gap for these historically marginalized language varieties.

4. Synthetic Data Generation

An alternative approach to efficiently create multimodal datasets for LR languages relies on synthetic data generation. Unlike traditional dataset creation, which typically involves intensive manual data collection, human annotation, and domain-specific curation, synthetic data generation leverages existing resources and automated techniques to produce new multimodal content with minimal human input. This distinction is critical, as synthetic methods offer a scalable alternative for low-resource settings, where manual annotation is often costly or infeasible.

Back-translation. A common approach for synthetic data generation relies on back-translation, which has proven to be an effective technique to enhance the data for multimodal MT (MMT) in LR language pairs. This technique works by translating text from an HR language into an LR language, and then back again, helping to generate additional aligned examples without requiring human involvement.
Dutta Chowdhury et al. [53] demonstrated the effectiveness of this technique for training a neural MMT system in the context of LR language pairs by leveraging the Flickr30k dataset [54] and translating the source-language (English) captions to the target LR language (Hindi). Meetei et al. [55] extended this approach for low-resource multimodal neural machine translation in the news domain for English-Hindi. In the WMT24 English-to-Low-Resource Multi-Modal Translation task, Haq et al. [56] showcased the effectiveness of back-translation for Hindi. Another use case of back-translation was shown by Alwajih et al. [57], who, starting from English-based image-text pairs, employed translation to Arabic, as well as back-translation. This was necessary for evaluating the quality of the translation, before passing the data to humans for Arabic dialect translation and training a dialect-aware LMM, named Dallah. However, the consistency of back-translated synthetic data has been a concern. To address this issue, Wang et al. [58] proposed a framework to improve the robustness of models when adapting grounded VQA models to LR languages, aiming to improve the performance without relying on machine-translated data. Wang et al. [59] further explored this challenge by introducing noise-robust learning for cross-lingual cross-modal retrieval to handle translation noise in machine-translated sentences.
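To make the round-trip idea concrete, the following minimal sketch pairs an HR-to-LR translation step with its reverse direction. It assumes the publicly available Helsinki-NLP MarianMT English-Hindi checkpoints as an example; the surveyed systems rely on their own translation models, filtering steps, and human checks, so this only illustrates the general technique.

```python
# Minimal back-translation sketch, assuming the Helsinki-NLP MarianMT
# English<->Hindi checkpoints from the Hugging Face Hub; it illustrates the
# general round-trip technique, not the pipeline of any specific surveyed paper.
from transformers import pipeline

en_to_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
hi_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")

def back_translate(english_captions, batch_size=16):
    """Translate HR-language (English) captions into the LR language (Hindi) and
    back to English. The Hindi output becomes a synthetic aligned caption, while
    the round-trip English copy can be compared against the source as a rough
    quality check before any human post-editing."""
    augmented = []
    for i in range(0, len(english_captions), batch_size):
        batch = english_captions[i:i + batch_size]
        hindi = [out["translation_text"] for out in en_to_hi(batch)]
        round_trip = [out["translation_text"] for out in hi_to_en(hindi)]
        for src, lr, rt in zip(batch, hindi, round_trip):
            augmented.append({"en_source": src, "hi_synthetic": lr, "en_round_trip": rt})
    return augmented

# Example: augment image captions taken from an existing English multimodal dataset.
pairs = back_translate(["A dog jumps over a wooden fence."])
print(pairs[0]["hi_synthetic"], "|", pairs[0]["en_round_trip"])
```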
Image-based generation. Another mainstream approach for synthetic data generation uses images as a starting point. In the case of Indic language multimodal MT [60], synthetic images generated by diffusion models were deemed beneficial, their main goal being to capture the complexity of the target domain and to augment the existing image dataset. Similarly, Haq et al. [56] created exhaustive image descriptions in addition to the already existing short region-based descriptions. Doan et al. [61] utilized image-based generation for several purposes, such as description generation and relevant information extraction, to develop Vintern-1B, an efficient LMM for Vietnamese. Nath et al. [62] applied this approach for image caption generation in the low-resource Assamese language using an encoder-decoder framework that combines CNNs and RNNs to generate descriptions from images. Jiang et al. [63] expanded these approaches with multimodal seed data augmentation for the low-resource audio Latin Cuengh language, demonstrating how seed data can enhance intelligent recognition and comprehension of low-resource dialects. Collectively, these studies demonstrate the versatility and effectiveness of synthetic data for tackling a diverse set of multimodal tasks in the context of LR languages.
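As an illustration of the image-based route, the sketch below generates synthetic images from caption pairs with an off-the-shelf text-to-image diffusion model. The checkpoint name, prompting scheme, and pairing logic are assumptions made for demonstration only and do not reproduce the pipeline of any study cited above.

```python
# A minimal sketch of image-based synthetic data generation, assuming the
# Hugging Face diffusers library and the "stabilityai/stable-diffusion-2-1"
# checkpoint as an example generator. Images are synthesized from the English
# caption (the generator's text encoder is English-centric) and then paired
# with the LR-language translation of that caption.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

def synthesize_pairs(caption_pairs, out_dir="synthetic_images"):
    """caption_pairs: list of {"en": ..., "lr": ...} dicts; returns new
    (image, LR caption) examples for multimodal MT or captioning."""
    os.makedirs(out_dir, exist_ok=True)
    records = []
    for idx, pair in enumerate(caption_pairs):
        image = pipe(pair["en"], num_inference_steps=30).images[0]
        path = os.path.join(out_dir, f"{idx:05d}.png")
        image.save(path)
        records.append({"image": path, "caption_lr": pair["lr"], "caption_en": pair["en"]})
    return records

# Example: the LR caption could come from manual translation or from the
# back-translation step sketched in the previous subsection.
synthesize_pairs([{"en": "A child flying a kite in the park",
                   "lr": "एक बच्चा पार्क में पतंग उड़ा रहा है"}])
```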
An emerging approach that avoids both traditional back-translation and image-based methods consists in leveraging the outputs of Vision-Language Models (VLMs). Qu et al. [64] generated multilingual responses for image-text inputs, translated them into English, and compared them with trusted references to detect hallucinations. These automatically mined hallucination-aware pairs are then used for direct preference optimization [65], enabling scalable fine-tuning without manual annotations, which is especially useful for low-resource languages.

Another innovative approach focuses on optimizing the composition of training data itself. Shukor et al. [66] developed systematic methods to determine optimal domain weights for multimodal pre-training using scaling laws, validating their approach across Large Language Models, Native Multimodal Models, and Large Vision Models. This methodology provides principled alternatives to costly trial-and-error approaches for data mixture optimization, which is particularly valuable in the resource-constrained settings typical of low-resource language development.

Data sovereignty concerns. Synthetic data generation introduces risks beyond technical quality. Back-translation and LLM-based augmentation propagate source-language biases to target languages, potentially encoding cultural assumptions misaligned with target communities [67]. More critically, the CARE principles [4] assert that Indigenous communities must retain authority over their linguistic data, a requirement that synthetic generation pipelines rarely accommodate. Evidence from Sámi language technology demonstrates the consequences: LLMs trained on available corpora without community oversight produce outputs that appear valid to non-speakers, but constitute nonsense to native speakers [68]. We thus recommend that synthetic data pipelines incorporate community validation protocols and explicit data governance agreements prior to deployment.

5. Multimodal Fusion Techniques

Multimodal fusion refers to the process of combining information from different modalities (such as text, images, audio) to make more informed predictions or generate better outputs. Fusion can be seen as the “meeting point” where information from separate sensory channels comes together, similar to how humans integrate what they see, hear and smell to understand their environment.
Figure 5: An overview of various fusion strategies employed in LMMs, categorized into early fusion, late fusion, and architectural fusion approaches. Early fusion combines features from different modalities (text, audio, and visual) using feature extractors and fusion techniques, before passing them to a classifier for the final output. Concatenation fusion directly concatenates features from different modalities, while gated fusion employs a gate controller network to regulate information flow between modalities.
Late fusion processes each modality using separate models, then combines their predictions using decision-level fusion methods, such as majority voting or weighted averaging. Architectural fusion approaches, such as attention fusion and encoder-decoder fusion, provide more sophisticated methods for multimodal integration. Attention fusion leverages self-attention layers and learned attention weights to selectively focus on relevant features across modalities.

The choice of fusion strategy significantly impacts performance, especially in low-resource settings, where each modality might provide crucial complementary information that others lack. Below, we describe the primary approaches to fusion that represent different philosophies about when and how this integration should occur. We identify three distinct types of fusion approaches employed in multimodal learning, categorized into early fusion, late fusion, and architectural fusion approaches. An overview of the different fusion strategies is provided in Figure 5. The diagram depicts the various ways in which textual, visual and auditory features can be combined at different stages to enable effective integration of multimodal information. In Table 3, we provide a summary of computational requirements and efficiency trade-offs for a series of representative fusion approaches. We further discuss each fusion strategy independently, referring to the computational requirements and trade-offs along the way.
Early fusion. Early fusion, also known as feature-level fusion, involves combining features from different modalities at the input level before passing them through a unified model [69, 70]. Early fusion can be conceptualized as “combining ingredients before cooking”, i.e. all modalities are mixed at the beginning of the processing pipeline. This allows the model to learn cross-modal interactions from the start, potentially capturing subtle relationships between modalities.

In Persian sentiment analysis, Dashtipour et al. [71] demonstrated the effectiveness of early fusion by combining acoustic, visual, and textual features through a context-aware framework, achieving 91.39% accuracy with A+V+T concatenation. Similarly, Al-Azani et al. [72] showed that early fusion of textual, auditory, and visual modalities achieved over 94% accuracy for Arabic sentiment analysis. The shared task on Tamil and Malayalam multimodal sentiment analysis [73] also revealed that early fusion techniques are particularly effective for handling code-mixed content and cultural nuances specific to these languages [74].

For Amharic hate speech detection in memes, Jigar et al. [75] employed concatenation, directly combining visual features from memes with textual features, achieving 75% accuracy and demonstrating the effectiveness of this straightforward approach for LR languages [76].
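For readers new to the area, the following minimal sketch shows concatenation-based early fusion over pre-extracted feature vectors; the feature dimensions and classifier head are illustrative assumptions and are not taken from any of the systems cited above.

```python
# A minimal sketch of early (feature-level) fusion by concatenation, assuming
# pre-extracted text, audio and visual feature vectors.
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, visual_dim=512,
                 hidden_dim=256, num_classes=3):
        super().__init__()
        # Concatenated features from all modalities feed one joint classifier,
        # so cross-modal interactions are learned from the first layer on.
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.net(fused)

# Example forward pass with a batch of 4 items.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 3])
```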
The integration of multimodal features through gating mechanisms has shown particular promise in LR scenarios, as demonstrated in English-to-Low-Resource translation tasks for Hindi, Malayalam, Bengali, and Hausa, where Hatami et al. [77] used gated fusion to selectively combine visual and textual information. This approach was further validated by Alalem et al. [78] in their Audio-Text Fusion model for English and Egyptian Arabic, where they employed Group Gated Fusion to dynamically control the flow of information between modalities, achieving superior performance over traditional fusion methods.
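A minimal sketch of the gating idea is given below, assuming two pre-extracted feature vectors (textual and visual); the learned sigmoid gate decides, per dimension, how much of each modality to pass on. This illustrates the general mechanism only and is not the exact gate of Hatami et al. [77] or the Group Gated Fusion of Alalem et al. [78].

```python
# A minimal sketch of gated fusion over two pre-extracted feature vectors.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, text_dim=768, visual_dim=512, fused_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        # The gate is computed from both modalities jointly.
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.Sigmoid())

    def forward(self, text_feat, visual_feat):
        t = self.text_proj(text_feat)
        v = self.visual_proj(visual_feat)
        g = self.gate(torch.cat([t, v], dim=-1))
        return g * t + (1.0 - g) * v  # convex, per-dimension mixture of modalities

fusion = GatedFusion()
fused = fusion(torch.randn(4, 768), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 256])
```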
From a computational perspective, early fusion approaches such as Multi-Representative Fusion (MRF) [79] demonstrate that competitive results can be achieved on consumer-grade hardware (GTX 1080Ti with 11 GB VRAM), reaching 84.1% accuracy on the ICT-MMMO dataset within 100 epochs. However, early fusion typically requires 2-3× more memory during training due to joint feature processing, and demands strict temporal alignment between modalities.

Late fusion. Late fusion, also known as decision-level fusion, combines predictions from separate modality-specific models at the decision stage rather than fusing features early in the pipeline [84, 87]. Late fusion can be conceptualized as “requesting multiple expert opinions and then voting on a final decision” [83, 86]. In this approach, each modality is processed by its own specialized model, which becomes an expert in that particular type of data. Only after these individual experts have made their predictions are the results combined. This is particularly valuable when certain modalities might be missing or corrupted in real-world applications [88].

Table 3: Computational requirements and efficiency trade-offs for multimodal fusion techniques in low-resource settings. A dash indicates that the respective information is not specified in the original publication. Fields per row: fusion type; GPU requirement; training; parameters; performance; key trade-off.
Early fusion approaches:
MRF [79]: early; 1080Ti 11GB; 100 ep.; ≈50M; 84.1% Acc; noise-robust, needs multiple representations.
ViT + mBERT [80]: early; -; 40 ep.; ≈200M; 72.4% Acc; high #params, moderate accuracy.
Swin + XLM-RoBERTa [80]: early; -; 40 ep.; ≈280M; 75.8% Acc; best early fusion, heavier.
A+V+T Concat [71]: early; -; -; ≈30M; 91.4% Acc; simple, sync-sensitive.
BiLSTM Multimodal [75]: early (concat); -; 32-64 ep.; ≈10M; 75.0% Acc; low #params, limited complexity.
Late fusion approaches:
XLM-R + DenseNet [81]: late; GTX 1050; 5-fold CV; ≈400M; 83.0% F1; best multimodal, high memory.
MARBERTv2 + Ensemble [82]: late; -; 100 ep.; ≈180M; 85.6% Acc; robust to missing modalities.
Intermediate / architectural fusion:
SentimentFormer [80]: intermediate; -; 30 ep.; ≈220M; 79.0% Acc; best overall, balanced cost.
AVTF-TBN [83]: attention; RTX 3090 24GB; 300 ep.; ≈100M; 78.0% F1; high compute, medium accuracy.
CNN-LSTM Tagalog [84]: intermediate; -; 12 h; ≈20M; 89.5% Acc; 25% faster than A+V.
Encoder-decoder and advanced fusion:
URSA (3D-CNN + BLSTM) [85]: feature-level; -; -; 128+64 cells; 95.4% Acc; feature fusion outperforms decision fusion.
Feature-Extract [86]: Sep.+Merge; T4 16GB; -; 8.48M; 93.3% Acc; low #params, specialized pipeline.

Two popular late fusion strategies are weighted averaging and majority voting. In weighted averaging, the predictions from different modalities are combined using a weighted sum, with weights determining the contribution of each modality to the final decision [81, 89]. The weights can be uniform or learned to optimize performance. In majority voting, each modality-specific model casts a vote and the most frequent class is selected.
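The two decision-level schemes can be written in a few lines; the per-modality probabilities and weights in the sketch below are made up purely for illustration.

```python
# A minimal sketch of decision-level (late) fusion with weighted averaging and
# majority voting over modality-specific predictions.
import numpy as np

def weighted_average_fusion(probs_per_modality, weights):
    """Combine per-class probability vectors from each modality-specific model
    with a weighted sum; weights can be uniform or tuned on validation data."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(probs_per_modality)            # (num_modalities, num_classes)
    fused = (weights[:, None] * stacked).sum(axis=0)  # (num_classes,)
    return int(fused.argmax()), fused

def majority_vote_fusion(predictions):
    """Pick the class predicted by the largest number of modality-specific models."""
    values, counts = np.unique(np.asarray(predictions), return_counts=True)
    return int(values[counts.argmax()])

# Example with text, audio and visual experts for a binary sentiment task.
text_p, audio_p, visual_p = [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]
print(weighted_average_fusion([text_p, audio_p, visual_p], weights=[0.5, 0.3, 0.2]))
print(majority_vote_fusion([1, 1, 0]))  # two of three experts vote for class 1
```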
Late fusion strategies are especially suitable for resource-constrained environments due to their flexibility and lower memory requirements. Since each modality is processed by independent models, the system can continue functioning when one modality is unavailable, enabling graceful degradation with missing inputs [87, 88]. For Arabic rumor detection, Albalawi et al. [82] achieved 83.83% accuracy with late fusion (MARBERTv2 + VGG-19 ensemble), compared with 85.57% for early fusion, demonstrating that the 1.7% performance gap is often smaller than the computational cost savings. Late fusion also enables parallel training of modality-specific models, reducing wall-clock time by 25-40% compared with end-to-end early fusion training [84].

Architectural fusion. Architectural fusion comprises more sophisticated integration methods that go beyond simple concatenation or averaging of features. Encoder-decoder fusion can be understood as a “translation system” between modalities, i.e. information from each modality is first converted into a common “language” (shared representation space) by encoders, before being decoded into the final output [70, 80]. This allows the model to find complex mappings between very different data types. For example, Chakravarthi et al. [91] employed an encoder-decoder framework with phonetic transcription to improve machine translation between Dravidian languages, while Sehar et al. [85] utilized an encoder-decoder architecture to fuse audio, video and text features for Urdu sentiment analysis. Similarly, Meetei et al. [92] showed that encoder-decoder fusion of correlated modalities can enhance translation quality for LR languages. The key advantage of encoder-decoder architectures is their ability to first encode input features from different modalities into a shared representation space before decoding them into the target output.
Attention-based fusion has also proven to be highly effective for multimodal integration [93]. As shown by Haputhanthri et al. [94] for Sinhala sign language recognition, attention mechanisms allow the model to dynamically focus on the most relevant features across modalities. Yang et al. [95] successfully employed attention fusion for Mongolian sentiment analysis by combining features from audio, text and visual inputs. Zhang et al. [83] demonstrated that attention-based fusion of multimodal data improves depression risk detection by allowing the model to attend to salient information across audio, video and text modalities. The ability of attention mechanisms to learn dynamic weights between modalities makes them particularly suitable for tasks requiring adaptive integration of complementary sources.
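As a minimal illustration of attention-based fusion (a simplified sketch, not the architectures of [83, 94, 95]), the snippet below projects each modality into a shared space, scores it against a learned query, and takes the attention-weighted sum; the per-example weights are the “dynamic weights between modalities” referred to above.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Toy attention-based fusion: modality features are projected into a
    shared space, scored against a learned query vector, and combined by an
    attention-weighted sum recomputed for every example."""
    def __init__(self, dims=(768, 512, 128), hidden=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.query = nn.Parameter(torch.randn(hidden))

    def forward(self, feats):                      # list of (batch, dim_i) tensors
        h = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, M, H)
        scores = h @ self.query                    # (B, M)
        alpha = torch.softmax(scores, dim=1)       # per-example modality weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1), alpha

fusion = AttentionFusion()
fused, alpha = fusion([torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 128)])
print(fused.shape, alpha.shape)  # torch.Size([2, 256]) torch.Size([2, 3])
```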
Intermediate and architectural fusion approaches offer a balance between performance and accessibility. SentimentFormer [80] achieves the highest Bangla meme accuracy (79.04%) with only 30 epochs, outperforming both early (75.83%) and late fusion (74.80%) on the same dataset. At the high-resource end, attention-based models such as AVTF-TBN [83] require an RTX 3090 (24 GB) and 300 epochs for clinical-grade depression detection accuracy.

Comparative analysis of fusion techniques. Each fusion approach presents distinct advantages and challenges in the context of LR languages [69, 84]. Early fusion enables deep interaction between modalities from the start, but can be computationally expensive and may suffer when one modality is noisy. Late fusion offers flexibility and robustness when modalities are missing, but may miss important cross-modal interactions [87]. Architectural fusion approaches show promise in capturing complex relationships between modalities, but require careful tuning and substantial computational resources. A notable innovation in this space is the Multi-Representative Fusion (MRF) mechanism [79], which generates diverse representations for each modality and selectively chooses the best fusion via attention. This approach has shown particular promise in handling noisy inputs, achieving state-of-the-art performance on several LR sentiment analysis benchmarks.
Handling noisy modalities. A critical consideration for real-world deployment is robustness to corrupted modalities. The MRF mechanism [79] addresses this by generating multiple diverse representations for each modality and using attention to select the most informative fusion. When acoustic features are corrupted, MRF automatically reduces their contribution (from approximately 15% to <5% of the final prediction), maintaining robust performance. However, MRF fails when all three modalities are simultaneously noisy for utterances critical to prediction. For Javanese emotion recognition [86], separate modality processing achieves an accuracy of 93.32% compared with 71.15% for joint processing, specifically because independent processing minimizes interference when one channel contains noise.

Architectural complexity considerations. Our analysis suggests that the additional complexity of architectural fusion is justified in three cases: (1) when cross-modal interactions are semantically rich and task-critical, as in Sinhala sign language recognition [94] and Mongolian sentiment analysis [95], where temporal alignment between visual gestures and linguistic features requires learned attention weights; (2) when modalities have different noise characteristics or information densities, as demonstrated for Urdu sentiment analysis, where feature-level fusion (95.35%) substantially outperformed decision-level fusion (91.23%) [85]; and (3) for clinical or safety-critical applications where prediction errors have serious consequences [83]. Conversely, for rapid prototyping or tasks where the text modality dominates (e.g. Bengali hate speech detection, where a text-only XLM-RoBERTa achieves F1 = 0.82 vs. F1 = 0.83 for the multimodal pipeline [81]), simpler approaches may be preferable.
In Table 4, we present the performance of fusion strategies across different low-resource languages and tasks, while in Table 5, we provide a decision guide based on specific constraints. Early fusion generally achieves the highest accuracy (e.g. 95.35% for Urdu, 91.39% for Persian video analytics), but the performance gap between strategies is often smaller than the computational cost difference. For researchers with limited computational resources (single GPU, <16GB VRAM), we recommend starting with lightweight early fusion models such as BiLSTM (≈10M parameters) to establish baselines, before progressing to intermediate fusion with efficient architectures for improved performance.

Table 4: Performance comparison of early, late and intermediate fusion for low-resource languages. Best score on each row is highlighted in bold.

Language/Task | Early | Late | Intermediate | Best Strategy
Bangla Memes [80] | 75.83% | 74.80% | 79.04% | Intermediate
Arabic Rumors [82] | 85.57% | 83.83% | – | Early
Urdu Sentiment [85] | 95.35% | 91.23% | – | Early
Javanese Emotion [86] | 71.15% | – | 93.32% | Separate Processing
Amharic Memes [75] | 75.00% | – | – | Early
Persian Video [71] | 91.39% | 90.32% | – | Early

Table 5: Decision guide for selecting the fusion strategy based on constraints and requirements. ✓✓ = Strongly recommended, ✓ = Suitable, ∼ = Acceptable, ✗ = Not recommended. Based on empirical findings from [79, 80, 83, 87].

Requirement/Constraint | Early | Late | Architectural
Missing modality robustness | ✗ | ✓✓ | ✓
Noisy input handling | ✗ | ✓ | ✓✓ (MRF)
Low memory (<8GB VRAM) | ✗ | ✓✓ | ∼
Fast training (<50 epochs) | ∼ | ✓✓ | ∼
Maximum accuracy | ✓✓ | ∼ | ✓✓
Cross-modal interactions | ✓✓ | ✗ | ✓✓
Rapid prototyping | ∼ | ✓✓ | ✗
Clinical/safety-critical | ∼ | ✗ | ✓✓
6. Visual Enhancement Techniques

Visual enhancement techniques aim to improve MT quality by leveraging visual information to provide additional context and resolve ambiguities in the source text. These techniques broadly fall into two main categories: image-guided translation, which uses visual features to enhance the overall translation process, and visual disambiguation, which specifically focuses on resolving ambiguous words/phrases via visual context.

Image-guided translation. A promising direction for improving translation quality for LR languages is the use of image-guided translation approaches. Dutta Chowdhury et al. [53] showed that augmenting neural MT systems with visual features extracted from a pre-trained CNN and integrated into an encoder-decoder architecture can improve translation quality, achieving a bilingual evaluation understudy (BLEU) score of 24.2 for Hindi to English translation. Building upon these ideas, Laskar et al. [96, 97] developed a multimodal neural MT system with a bidirectional RNN encoder and a doubly-attentive decoder for English-Hindi translation. Their system, which combines visual and textual features, and employs pre-trained word embeddings from monolingual data, outperforms a text-only baseline, achieving a BLEU score of 33.57 versus 27.75 on the test set.
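The general recipe behind these systems can be sketched as follows: a global image feature from a pre-trained CNN is projected into the text representation space and exposed to the decoder’s cross-attention alongside the source tokens. This is a hedged, generic illustration of image-guided MT, not the doubly-attentive decoder of [96, 97]; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VisuallyGroundedEncoder(nn.Module):
    """Toy image-guided MT encoder: a global CNN image feature is projected
    into the text embedding space and prepended as an extra "token", so a
    downstream decoder's cross-attention can consult visual context when the
    source text is ambiguous."""
    def __init__(self, vocab=8000, d_model=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, src_ids, img_feat):
        tok = self.embed(src_ids)                          # (B, T, D) source token embeddings
        vis = self.img_proj(img_feat).unsqueeze(1)         # (B, 1, D) projected image feature
        return self.encoder(torch.cat([vis, tok], dim=1))  # (B, T+1, D) visually grounded memory

enc = VisuallyGroundedEncoder()
memory = enc(torch.randint(0, 8000, (2, 12)), torch.randn(2, 2048))
print(memory.shape)  # torch.Size([2, 13, 256])
```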
Subsequent studies [56, 98–101] have demonstrated the effective use of visual information for improving MT in LR settings, particularly for the English-Hindi language pair. Meetei et al. [100] proposed a video-guided multimodal MT framework that incorporates spatio-temporal video features, showing improvements of up to +4.2 BLEU over text-only baselines for English to Hindi translation, while Meetei et al. [101] explored multimodal translation for news domain data, showing that ResNet-based image features outperform VGG-based features and improve BLEU scores by +1.8 points. Additionally, Shi et al. [99] explored different approaches for extracting and integrating image features using VGG and ResNet models, achieving a +3 BLEU improvement over text-only translation. Another contribution is presented by Gain et al. [98], who showed how visual context enhances translation robustness under noisy conditions (e.g. OCR errors), even when image relevance is reduced. Extending this line of work, Tayir et al. [102] demonstrated that visual context can bridge structural gaps in distant language pairs, such as English-Uyghur, by introducing a visual masked language modeling approach for unsupervised multimodal MT. Similarly, Tayir et al. [103] improved translation for the same language pair by harnessing varying-granularity image features in low-resource settings.

More recently, Haq et al. [56] presented a context-aware transformer model that integrates visual features via BERT encoding, demonstrating consistent improvements over text-only baselines. In a related direction, Lekshmy et al. [104] developed an English-Malayalam vision-aided translation system for visually impaired users, employing multimodal machine learning techniques to perform object recognition and generate translated descriptions in real-time.

Across all studies, qualitative analyses confirmed that visual cues are particularly beneficial for handling rare words and domain-specific terms, with both image and video modalities helping to resolve ambiguity and improve translation quality in LR scenarios.
Visual disambiguation. While image-guided translation aims to enhance overall translation quality by integrating visual context, visual disambiguation techniques focus on task-specific ambiguities by grounding them in visual information. In this regard, studies revolving around the creation of Visual Genome datasets for LR languages, such as Hindi [105], Bengali [43] and Hausa [45], have played a pivotal role in advancing visual disambiguation techniques. Building upon previous work, Parida et al. [106] explored this line of research by developing a multimodal NMT system for English-Bengali using object tags extracted from images as auxiliary input, while Nortje et al. [107] introduced an innovative few-shot learning approach for visually-prompted keyword localization in Yoruba.

Several studies have investigated the use of visual features for disambiguation [96, 108–112]. For example, Jain et al. [108] highlighted the benefits of using visual features for disambiguation. Their model, called MURAL, shows strong performance on text-to-image retrieval tasks, where it manages to retrieve relevant images for ambiguous queries. This finding is also supported by qualitative examples, where MURAL successfully disambiguates word senses based on visual context. In addition, Kovath et al. [110] proposed a co-attention mechanism for Malayalam VQA that allows the model to jointly learn attention over both textual and visual inputs, demonstrating improved performance over baselines using only textual features.
Comparative analysis of visual enhancement techniques. Image-guided translation consistently demonstrates performance improvements over text-only baselines for LR languages, though effectiveness varies with dataset size and translation direction. These approaches excel at handling semantic ambiguities and culturally-specific concepts, but their success depends heavily on the quality of extracted visual features. A key limitation is the reliance on high-quality image-text pairs, which are often scarce for LR languages. While these techniques improve translation quality, they also introduce computational overheads. Future work should focus on developing more efficient visual feature extraction methods and better approaches for leveraging visual information with limited paired data.

7. Cross-Modal Transfer Learning

Cross-modal transfer learning represents a critical approach for LR languages, allowing models to harness knowledge from data-rich modalities or languages to improve performance in resource-constrained settings. Unlike traditional transfer learning, which operates within a single modality, cross-modal transfer must bridge the significant gap between different types of data representations. This is conceptually similar to how a person might use their understanding of written language to help learn a sign language, or how knowledge of one spoken language can facilitate learning another. In the context of low-resource languages, two primary transfer directions have emerged: modality transfer, which moves knowledge between different data types (e.g. from text to speech), and language transfer, which leverages high-resource languages to improve performance in low-resource ones.
Modality transfer. Modality transfer addresses the challenge of transferring knowledge between different modalities to improve performance on low-resource tasks. This approach is particularly valuable when certain modalities have more abundant data than others. For example, text data is often easier to collect than paired speech data for many languages. The fundamental challenge lies in bridging the representational gap between modalities, since text operates in a discrete symbolic space, while speech and vision exist in continuous signal spaces with very different statistical properties. Successful modality transfer requires finding meaningful mappings between these different representational spaces.

A diversity of approaches has been used to achieve modality transfer. Chen et al. [113] proposed a progressive transfer learning strategy that leverages both general pre-training (Kinetics-400 for visual and CC25 for language) and domain-specific pre-training (sign-to-gloss translation) to bridge modalities for sign language translation. Amalas et al. [114] introduced a data-driven approach to select source languages and demonstrated that multilingual pre-training outperforms monolingual pre-training for text-to-speech systems. Wu et al. [115] developed a captioning approach via multi-objective optimization that addresses the challenge of utilizing both triplet datasets (image, HR language, LR language) and large-scale paired datasets during training. Yeo et al. [116] tackled LR visual speech recognition by using Whisper-based automatic transcriptions to generate training labels from unlabeled multilingual audio-visual data.
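The pseudo-labeling idea behind the approach of Yeo et al. [116] can be sketched as follows: an off-the-shelf ASR model transcribes unlabeled audio, and the resulting text becomes a weak training target for a visual speech model. This is only an illustrative sketch assuming the open-source `whisper` package; the file paths, language code, and downstream use are placeholders, not the authors' pipeline.

```python
# Sketch of modality transfer via pseudo-labelling: an ASR model supplies text
# targets for otherwise unlabeled audio-visual clips. Assumes the open-source
# `whisper` package; paths and the language code are placeholders.
import whisper

asr = whisper.load_model("small")  # multilingual checkpoint

def pseudo_label(clip_paths, language="sw"):
    """Return (path, transcription) pairs usable as weak training labels."""
    labels = []
    for path in clip_paths:
        result = asr.transcribe(path, language=language)
        labels.append((path, result["text"].strip()))
    return labels

# These pairs could then supervise a lip-reading / visual speech model,
# transferring knowledge from the audio-text modality to the video modality.
print(pseudo_label(["clip_0001.wav"]))
```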
For Arabic handwriting recognition, Bhatia et al. [117] employed modality transfer through an architecture combining SwinV2 for visual encoding and RoBERTa for text decoding, while Tran et al. [118] demonstrated successful modality transfer for Vietnamese through extensive pre-training of both vision and language components, combined with automated data curation methods. Notably, Onuoha et al. [119] challenged the assumptions about multimodal integration through their study of Igbo minimal pairs. Their findings show that native Igbo speakers can accurately distinguish minimal pairs through audio alone, suggesting that the benefits of cross-modal integration may be more relevant for non-native speakers.
Table 6: Overview of architectural innovations for low-resource multimodal learning. V = Vision, T = Text, A = Audio. Although Cycle-Attn is evaluated on EN and DE (high-resource), it is included as a key methodological reference. The authors simulated a low-resource scenario using the limited Multi30K dataset to demonstrate the efficacy of knowledge transfer from a rich monolingual corpus (EN) via cycle consistency constraints.

Model/Method | Year | Languages | Modalities | Task | Approach Category | Key Innovation
Cycle-Attn [120] | 2019 | EN, DE | V, T | Image Captioning | Translation+Alignment | Cycle consistency constraint for cross-lingual alignment
Multi-task Adversarial [121] | 2022 | EN, HI | T, A | Sentiment Analysis | Adversarial Learning | Cross-lingual transfer via shared embeddings
FEWVLM [122] | 2022 | EN, HI | V, T | VL Understanding | Prompt Engineering | Few-shot prompting with moderate-size VLM
Amharic Captioning [123] | 2023 | Amharic | V, T | Image Captioning | Attention-based DNN | Visual attention + Bi-GRU decoder
Auxiliary CTC [113] | 2023 | 102 langs | A, T | Multilingual ASR | CTC Conditioning | LID-conditioned auxiliary objectives
Sanskrit-Malayalam NMT [124] | 2022 | SA, ML | T, A | Machine Translation | Multimodal NMT | Morphology + WSD embedding fusion
XtremeCLIP [125] | 2023 | EN, HI | V, T | VL Understanding | Parameter-efficient | Prototype affinity matching (5-7K params)
LowCLIP [126] | 2024 | Azerbaijani | V, T | Image Retrieval | Efficiency-first | mBERT + lightweight image encoders
Yoruba ASR [127] | 2024 | EN, YO | T, A | Speech Recognition | Transfer Learning | MFCC-based acoustic modeling
Llama 3 [128] | 2024 | 200 langs | V, T | General Multimodal | Foundation Model | Native multimodal MoE architecture
DeepSeek-V3 [129] | 2024 | 14 langs | V, T | General Multimodal | MoE Architecture | MLA + FP8 training efficiency
Claude 4 [130] | 2025 | 15 langs | V, T | General Multimodal | Foundation Model | Dual-mode reasoning operation
Apple AFM [131] | 2024 | 16 langs | V, T | On-device/Server | Distillation+QAT | 2-bit quantization for edge deployment
MMaDA [132] | 2025 | 60+ langs | V, T | Multimodal Diffusion | Unified Diffusion | Discrete diffusion language modeling
MixLoRA [133] | 2024 | 15 langs | V, T | Instruction Tuning | Dynamic PEFT | Conditional mixture routing for adaptation

Language transfer. Language transfer is an approach to harness knowledge from HR languages to improve model performance on LR languages. Recent work demonstrated several effective strategies. For instance, Wang et al. [58] adapted MDETR to new languages by using adapters and code-switching without relying on MT data. Cheema et al. [134] presented ViLanOCR, a novel approach that adapts multilingual vision-language transformers for low-resource Urdu optical character recognition by leveraging the Swin encoder and mBART-50 decoder.
Kim et al. [135] focused on learning general speech knowledge from English for lip reading, and combining it with language-specific audio features. Aruna Gladys et al. [136] proposed a multimodal representation learning framework that uses cross-lingual transfer learning to analyze sentiment in LR language datasets, demonstrating significant performance improvements for Tamil language sentiment analysis. Chen et al. [137] improved multilingual ASR by conditioning models on language identity predictions from early layers to enhance performance across numerous languages. dos Santos et al. [138] proposed to use data augmentation and contrastive learning to improve multilingual contrastive language-image pre-training (CLIP) models for LR languages. Nortje et al. [139] showed that initializing a Yoruba few-shot word learning model with weights from an English speech-image model substantially improves performance. These approaches share the common theme of transferring learned representations and knowledge from HR languages (typically English), while developing techniques to adapt and fine-tune models for target LR languages.

Table 7: Computational requirements for architectural innovations. A dash indicates that the respective information was not reported in the original paper.

Model | Trainable Params | Total Params | Training Duration | Hardware (per paper) | Training Data
Parameter-Efficient Vision-Language Methods
XtremeCLIP [125] | 5-7K | 149M | 20-60 min | 1× A100 | 2K-10K samples
LowCLIP [126] | 192M | 192M | 37 hours | 1× T4 | 500K+ captions
FEWVLMbase [122] | 224M | 224M | 30 epochs | – | Few-shot (16 ex.)
FEWVLMlarge [122] | 740M | 740M | 30 epochs | – | Few-shot (16 ex.)
Language-Specific Architectures
Amharic Caption [123] | – | – | 35 epochs | – | 8K images
Cycle-Attn [120] | – | – | 50 epochs | – | 30K pairs
Foundation Models (for reference)
Llama 3 405B [128] | 405B | 405B | 3.8 × 10^25 FLOPs | 16K× H100 | 15.6T tokens
DeepSeek-V3 [129] | 37B active | 671B | 2.788M H800 hours | 2048× H800 | 14.8T tokens
The effectiveness of language transfer methods varies significantly based on linguistic similarity, writing systems, and cultural context. For instance, transfer between closely related languages (such as Spanish to Portuguese) typically outperforms transfer between distant language families (such as English to Tamil). The methods described above demonstrate different approaches to this challenge: Wang et al. [58] focused on architecture adaptation through adapters, while Kim et al. [135] emphasized feature-level knowledge transfer. Meanwhile, Nortje et al. [139] showed that even initialization from a different language can provide substantial benefits. For practitioners working with specific low-resource languages, the choice between these approaches should consider both linguistic factors and computational constraints.
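The initialization-based transfer reported by Nortje et al. [139] reduces, in its simplest form, to loading source-language weights into an identically shaped target-language model before fine-tuning. The sketch below illustrates only this mechanism; the tiny speech-image network and all dimensions are hypothetical, and in practice the source weights would come from a trained checkpoint rather than a freshly constructed model.

```python
import torch
import torch.nn as nn

class SpeechImageNet(nn.Module):
    """Hypothetical speech-image matching network; source and target models
    must share this architecture for weight transfer to be possible."""
    def __init__(self, audio_dim=40, image_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.image_enc = nn.Linear(image_dim, hidden)

    def forward(self, audio, image):
        _, h = self.audio_enc(audio)               # final hidden state, (1, B, H)
        return nn.functional.cosine_similarity(h.squeeze(0), self.image_enc(image))

source_model = SpeechImageNet()     # stands in for the English-trained model
# (in practice: source_model.load_state_dict(torch.load(<trained checkpoint>)))
target_model = SpeechImageNet()
target_model.load_state_dict(source_model.state_dict())   # language transfer by initialization
optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)  # small LR for few-shot fine-tuning
```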
8. Architectural Innovations

Architectural innovations for low-resource multimodal learning focus on designing model structures that can effectively leverage limited data while maintaining reasonable computational requirements. The fundamental challenge lies in balancing model capacity (the ability to learn complex patterns) with sample efficiency (the ability to learn from limited examples). While simply scaling down large models designed for high-resource settings is one approach, the most successful innovations in this space incorporate architectural elements specifically designed to address the constraints of low-resource scenarios. These innovations generally fall into three categories: (1) efficiency-focused adaptations of existing architectures, (2) parameter-efficient fine-tuning methods, and (3) novel architectures designed specifically for low-resource multimodal learning.
In Table 6, we provide a systematic overview of these architectural innovations, categorized by approach type, supported modalities, and target tasks. Tables 7 and 8 complement this overview with quantitative analyses of computational requirements and empirical performance, enabling direct comparison across methods with varying resource constraints.

Table 8: Performance metrics for low-resource multimodal architectures. All values are extracted directly from source papers. Baseline methods and improvement calculations are specified for reproducibility. Full FT = full fine-tuning; Aug. = augmentation; – = not applicable or not reported.

Model | Task | Dataset | Metric | Score | Baseline | ∆
Parameter-Efficient Vision-Language Methods
XtremeCLIP [125] | Visual Entailment | SNLI-VE (10K samples) | Accuracy | 62.06% | 51.10% (Full FT) | +21.4%
XtremeCLIP [125] | Visual QA | VQA v2 (10K samples) | Accuracy | 59.21% | 54.10% (Full FT) | +9.4%
XtremeCLIP [125] | Image Classification | FGVC (16-shot) | Accuracy | 48.30% | 28.14% (Full FT) | +71.6%
LowCLIP [126] | Image Retrieval | MSCOCO (AZ) | mAP | 0.80 | 0.70 (Base Loss) | +14.3%
LowCLIP [126] | Image Retrieval | Flickr30k (AZ) | mAP | 0.87 | 0.84 (No Aug.) | +3.6%
FEWVLMlarge [122] | Visual QA | VQAv2 (few-shot) | Accuracy | 51.1% | 38.2% (Frozen 7B) | +33.8%
FEWVLMlarge [122] | Visual QA | OK-VQA (few-shot) | Accuracy | 23.1% | 12.6% (Frozen 7B) | +83.3%
Language-Specific Architectures
Amharic Captioning [123] | Image Captioning | Flickr8k (AM) | 4-gram BLEU | 38.8 | 28.5 (CNN-GRU) | +36.1%
Amharic Captioning [123] | Image Captioning | BNATURE (AM) | 4-gram BLEU | 42.7 | 16.4 (CNN-GRU) | +160.4%
Cycle-Attn [120] | Image Captioning | Multi30K (DE) | CIDEr | 43.78 | 42.91 (Dual-Attn+) | +2.0%
Cycle-Attn [120] | Image Captioning | Multi30K (DE) | BLEU-4 | 5.71 | 5.54 (Dual-Attn+) | +3.1%
Foundation Models (for reference)
Llama 3 405B [128] | General | MMLU (5-shot) | Accuracy | 87.3% | – | –
DeepSeek-V3 [129] | General | MMLU-Pro (5-shot CoT) | Accuracy | 75.9% | – | –

Table 9: Image encoder performance comparison for low-resource image retrieval. Results are taken from LowCLIP [126].

Image Encoder | Params | GFLOPs | Size (MB) | mAP (COCO) | mAP (Flickr8k) | mAP (Flickr30k)
ResNet-50 | 25.6M | 4.09 | 97.8 | 0.80 | 0.76 | 0.73
EfficientNet-B0 | 5.29M | 0.39 | 20.5 | 0.81 | 0.85 | 0.87
ViT-Base | 86.6M | 17.56 | 330.3 | 0.71 | 0.80 | 0.70
Swin-Tiny | 28.3M | 4.49 | 108.2 | 0.80 | 0.84 | 0.79

Some recent architectural innovations in the context of LR languages have focused on adapting the CLIP architecture [140]. One such example is the LowCLIP model [126], which replaces the original text encoder, trained primarily on English text, with a multilingual BERT (mBERT). The authors evaluated various lightweight image encoders, such as EfficientNet-B0 and Tiny Swin Transformer, for a more computationally efficient approach, while also targeting LR languages like Azerbaijani. To compensate for the lighter architecture and the scarcity of image-text pairs in Azerbaijani, LowCLIP leveraged synthetic data generation via MT for text features, and image augmentation techniques, such as crop and rotation, for image features.
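The core training objective shared by CLIP-style dual encoders of this kind is a symmetric contrastive loss over matched image-caption pairs. The sketch below shows projection heads and the standard symmetric InfoNCE loss; it assumes the text features come from a multilingual BERT and the image features from a lightweight backbone such as EfficientNet-B0, and it is not necessarily the exact loss or dimensions used in [126].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderHead(nn.Module):
    """Projection heads for a CLIP-style dual encoder. Text features are
    assumed to come from a multilingual BERT (dim 768) and image features
    from a lightweight backbone such as EfficientNet-B0 (dim 1280)."""
    def __init__(self, text_dim=768, image_dim=1280, dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.image_proj = nn.Linear(image_dim, dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # ~log(1/0.07), as in CLIP

    def forward(self, text_feat, image_feat):
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.image_proj(image_feat), dim=-1)
        return self.logit_scale.exp() * t @ v.t()               # (B, B) similarity matrix

def contrastive_loss(logits):
    """Symmetric InfoNCE: matched caption-image pairs sit on the diagonal."""
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

head = DualEncoderHead()
logits = head(torch.randn(8, 768), torch.randn(8, 1280))
print(contrastive_loss(logits))
```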
In contrast, XtremeCLIP [125] took a different approach, in which the authors introduced a parameter-efficient method that only tunes a small prototype matrix, while keeping the visual and text encoders frozen. Their model also employs contrastive learning to provide additional supervisory signals in LR settings. Collectively, these efforts extend the applicability of CLIP to multimodal image retrieval tasks.

Approaches to adapting CLIP for LR settings illustrate different design philosophies. LowCLIP takes an efficiency-first approach, focusing on reducing both the model size and data requirements through lighter architectures and extensive data augmentation. In contrast, XtremeCLIP maintains most of the original model capacity, but introduces parameter-efficient tuning to learn a small set of adaptable weights. This trade-off between model capacity and training efficiency represents a key consideration for researchers working in low-resource settings, where both data and computational resources may be constrained. The choice between these approaches depends on the specific constraints of the application scenario, e.g. LowCLIP may be more suitable for deployment on edge devices or in settings with extremely limited data, while XtremeCLIP might be preferred when maintaining representation power for complex tasks is crucial.
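The prototype-affinity idea can be read, in simplified form, as training only a small matrix of class prototypes on top of frozen encoders and classifying by similarity of the fused embedding to each prototype. The sketch below illustrates this reading and is not the exact formulation of XtremeCLIP [125]; the embedding dimension and class count are placeholder assumptions chosen so the trainable-parameter count lands in the few-thousand range.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHead(nn.Module):
    """Only the (num_classes x dim) prototype matrix is trainable; the frozen
    CLIP image/text encoders are assumed to supply the input embedding.
    With 16 classes and dim=384 this is ~6K trainable parameters."""
    def __init__(self, dim=384, num_classes=16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def forward(self, fused_embedding):            # (B, dim), from the frozen encoders
        z = F.normalize(fused_embedding, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        return z @ p.t()                           # (B, num_classes) prototype affinities

head = PrototypeHead()
print(sum(p.numel() for p in head.parameters() if p.requires_grad))  # 6144 trainable params
print(head(torch.randn(4, 384)).shape)                               # torch.Size([4, 16])
```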
As shown in Table 9, EfficientNet-B0 achieves competitive retrieval performance (an mAP of 0.87 on Flickr30k), while requiring 16× fewer parameters and 45× fewer FLOPs than ViT-Base. The choice between these approaches depends on deployment constraints: LowCLIP suits scenarios requiring end-to-end retraining with domain-specific data, while XtremeCLIP is preferable when rapid adaptation with minimal computational overhead is essential.

Another approach for multimodality in the context of LR languages is introduced by Wu et al. [120]. The approach combines two existing methods, a translation-based one and an alignment-based one, into a unified architecture to improve image captioning. The framework employs a model that first generates high-quality English captions, which are then used together with the images to produce captions in the LR language. The model achieves a fine-grained alignment between visual elements and captions in both languages via a cycle-consistency constraint, outperforming existing methods on standard metrics.
Jin et al. [122] introduced FEWVLM, showing that careful prompt engineering and efficient architectural design can achieve strong performance in the context of LMM usage with either little data or limited computational resources. They managed to develop a moderate-size VLM that combines a sequence-to-sequence transformer with prefix language modeling and masked language modeling, introducing effective prompt engineering approaches for visual-language tasks in the LR setting. Notably, FEWVLM outperforms Frozen [141], a model which is 31× larger. In turn, Frozen achieves comparable results with PICa [142], which is 246× larger. These results demonstrate that an effective design can compensate for model size.

Building on parameter-efficient approaches, Shen et al. [133] introduced Conditional Mixture of LoRA (MixLoRA) for multimodal instruction tuning, which dynamically constructs adaptation matrices tailored to each input instance, addressing task interference challenges in multimodal scenarios. For specific language pairs, Laskar et al. [143] proposed a transliteration-based phrase augmentation approach for English-Assamese translation, which allows their model to share sub-word level information, and provides better word alignment through phrase pairs.

In Table 10, we quantify the size vs. performance trade-off across these methods. XtremeCLIP achieves the highest average accuracy (57.73%) across visual entailment, VQA, and image classification benchmarks, while training only 5-7K parameters, compared with 149M for full fine-tuning. This demonstrates that task reformulation as prototype affinity matching can outperform conventional fine-tuning, while using over 21,000× fewer trainable parameters. FEWVLMlarge (740M parameters) achieves 51.1% on VQAv2, surpassing the 7B-parameter Frozen model (38.2%) by 33.8%, validating the hypothesis that architectural efficiency can compensate for raw model scale. MixLoRA further improves upon standard LoRA by 8.3% on the MME benchmark through its conditional mixture routing mechanism, which dynamically selects expert combinations based on input characteristics.

Table 10: Comparison of parameter-efficient fine-tuning methods for low-resource vision-language understanding. VE = Visual Entailment, VQA = Visual Question Answering, IC = Image Classification. Results are taken from XtremeCLIP [125].

Method | Trainable Params | VE | VQA | IC | Avg. | Training Time
Zero-shot | 0 | 33.74 | 52.03 | 39.17 | 42.89 | –
Full fine-tuning | 149M | 51.10 | 54.10 | 28.14 | 51.12 | hours
LLRD | 149M | 57.23 | 53.88 | 31.36 | 53.60 | hours
BitFit | 176-178K | 59.56 | 54.72 | 41.61 | 55.66 | minutes
BiNor | 208-210K | 59.54 | 54.75 | 41.73 | 55.67 | minutes
CLIP-Adapter | 131-262K | 59.21 | 54.21 | 44.88 | 55.45 | minutes
Tip-Adapter | 5-10M | 59.67 | 54.70 | 45.12 | 55.62 | minutes
XtremeCLIP | 5-7K | 62.06 | 59.21 | 48.30 | 57.73 | 20 minutes
LoRA (r = 4) | ≈4K | – | – | – | 65.39 | minutes
LoRA (r = 16) | ≈16K | – | – | – | 65.50 | minutes
MixLoRA (E = 16) | ≈8K/layer | – | – | – | 67.17 | hours
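To illustrate the family of low-rank methods in Table 10, the sketch below adds trainable low-rank (A, B) expert pairs to a frozen linear layer and mixes them with a simple input-conditioned gate, loosely echoing MixLoRA's conditional routing; with a single expert it reduces to standard LoRA. This is a hedged illustration, not the formulation of [133], and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class MixtureLoRALinear(nn.Module):
    """Frozen base linear layer plus E low-rank (A, B) expert pairs whose
    contributions are mixed by an input-conditioned gate. With E=1 this
    reduces to standard LoRA."""
    def __init__(self, base: nn.Linear, r=4, num_experts=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # backbone stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, r, d_out))
        self.gate = nn.Linear(d_in, num_experts)               # input-dependent routing

    def forward(self, x):                                      # (B, d_in)
        w = torch.softmax(self.gate(x), dim=-1)                # (B, E) routing weights
        delta = torch.einsum("bi,eir,ero->beo", x, self.A, self.B)  # per-expert low-rank updates
        update = (w.unsqueeze(-1) * delta).sum(dim=1)          # (B, d_out) mixed update
        return self.base(x) + update

layer = MixtureLoRALinear(nn.Linear(768, 768), r=4, num_experts=4)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```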
Foundation models represent a qualitatively different design point, prioritizing broad capability over resource efficiency. We include them here to contextualize the computational differences that shape research accessibility. Dubey et al. [128] introduced the Llama 3 series with models ranging from 8B to 405B parameters, officially supporting 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai), with experimental multilingual capabilities on a broader set via the speech interface (34 languages). Llama 3 multimodal extensions for image, video, and speech understanding were described in the technical report, but remain under development and have not been publicly released along with the paper. In a similar endeavor, Liu et al. [129] presented DeepSeek-V3, a 671B-parameter MoE language model (37B active parameters per token) with multi-head latent attention (MLA) and FP8 mixed-precision training. It is important to note that DeepSeek-V3 is a text-only language model without native vision or audio capabilities. However, we include it to put efficient training strategies into perspective (DeepSeek-V3 requires 2.788M H800 GPU-hours, costing approximately $5.6M) and to better inform future multimodal model development.
Table 11: Design strategies and trade-offs in multimodal architectures for low-resource settings. Core strategies are grouped by methodological approach.

Model | Core Strategy | Advantages | Constraints
Parameter-Efficient Adaptation
XtremeCLIP [125] | Prototype affinity matching with frozen CLIP encoders; contrastive learning for supervision | 21,000× fewer parameters vs. full fine-tuning; 20 min training on one A100; edge-deployable | Task performance bounded by frozen backbone capacity; requires labeled prototype examples
LowCLIP [126] | Lightweight image encoders (EfficientNet-B0) with mBERT; synthetic data via MT | Trainable on consumer GPU (T4); open-source; 37 hours total training | Performance depends on MT quality; cross-domain generalization gap observed
FEWVLM [122] | Seq2seq with PrefixLM + MaskedLM; prompt-based few-shot learning | Outperforms 31× larger Frozen model; comparable with 246× larger PICa | Zero-shot performance sensitive to prompt wording; task-specific prompt engineering required
MixLoRA [133] | Conditional mixture of LoRA experts; input-dependent routing | Reduces task interference in multi-task settings; 8.3% gain over standard LoRA on MME | Routing computation overhead; requires careful expert initialization
Cross-Lingual Transfer
Cycle-Attn [120] | Translation + alignment hybrid with cycle consistency constraint | Fine-grained visual-textual alignment; leverages English captioning supervision | Requires pre-trained English captioner; limited to language pairs with English pivot
Amharic Captioning [123] | Inception-v3 encoder + Bi-GRU decoder with visual attention | End-to-end trainable; interpretable attention weights; significant BLEU increase on BNATURE | Requires translated Flickr8k data; architecture not tested on other LR languages
Multilingual Speech
Auxiliary CTC [113] | LID-conditioned auxiliary objectives on Whisper encoder | Scales to 102 languages; 28% relative CER reduction on FLEURS | Requires pre-extracted Whisper features; multi-stage training pipeline
Foundation Models (for reference)
Llama 3 [128] | Dense Transformer (405B params); multimodal extensions under development | Strong zero-shot; 8 officially supported languages; open weights | 3.8×10^25 FLOPs pre-training; multimodal capabilities not yet released
DeepSeek-V3 [129] | MoE with MLA (671B total, 37B active); FP8 mixed-precision training | 2.788M H800 GPU hours ($5.6M); competitive with GPT-4o on benchmarks | Text-only model; no native vision/audio; requires 2048× H800 cluster
Apple AFM [131] | On-device (≈3B) with 2-bit QAT; server PT-MoE architecture | Edge-deployable; 16 languages; image understanding capability | Proprietary; Apple ecosystem only; version-specific adapters
For edge deployment scenarios, Gunter et al. [131] introduced Apple Intelligence Foundation Models with a novel Parallel-Track MoE architecture optimized for on-device processing, supporting 16 languages with 2-bit quantization-aware training. The prevalence of MoE architectures in these recent developments demonstrates the effectiveness of expert-based scaling for multimodal tasks, as also observed by Mu et al. [30]. Additionally, Yang et al. [132] proposed a unified diffusion architecture that combines multimodal understanding with generation capabilities, offering new perspectives on architectural design for LR contexts.
For specific language families and modality combinations, several innovative architectures have been proposed. Solomon et al. [123] developed a hybridized attention-based deep neural network for Amharic language image captioning, combining a CNN encoder with visual attention mechanisms and a bidirectional GRU decoder, achieving significant improvements in terms of BLEU. Rahul et al. [124] introduced a multimodal neural machine translation system between Sanskrit and Malayalam, which embeds morphology and word sense disambiguation awareness. It utilizes both textual and speech modalities via a two-level fusion approach over transformer-based feature vectors. For African languages, Rahmon et al. [127] presented a speech recognition model for Yoruba that employs acoustic and language modeling with sequential MFCC features, achieving 83% accuracy in speech-to-text conversion. For sentiment analysis, Mamta et al. [121] explored multilingual, multi-task and adversarial learning approaches to transfer knowledge from HR languages to LR scenarios, leveraging shared semantic spaces through cross-lingual word embeddings. For Arabic, Alwajih et al. [144] introduced Peacock, a comprehensive family of LMMs with strong vision and language capabilities, alongside Henna, a benchmark for evaluating culturally-aware Arabic LMMs, further helping to bridge the gap between high-resource and low-resource languages, while addressing unique linguistic and cultural characteristics.
In Table 11, we synthesize the design trade-offs across architectural approaches, organized by methodological strategy. Three principal patterns emerge from our analysis. First, parameter-efficient adaptation methods (XtremeCLIP, LowCLIP, FEWVLM, MixLoRA) achieve competitive performance, while reducing trainable parameters by 3-5 orders of magnitude compared with full fine-tuning, making the corresponding models accessible to researchers with limited computational resources. Second, cross-lingual transfer approaches (Cycle-Attn, Amharic Captioning) effectively leverage high-resource language supervision, typically English, to bootstrap performance in target languages. However, this creates structural dependency on pivot language quality and availability. Third, foundation models occupy a distinct design regime. Since Llama 3 requires 3.8×10^25 FLOPs and DeepSeek-V3 consumes 2.788M H800 GPU-hours for pre-training, these models remain inaccessible to most research groups focused on low-resource languages. The practical implication is that parameter-efficient methods currently offer the most viable path for researchers operating under resource constraints, while foundation models may serve as upstream components for transfer learning when API access or pre-trained weights are available.

The computational requirements documented in Table 7 reveal a structural divide with sociolinguistic implications. While parameter-efficient methods like XtremeCLIP (5-7K parameters, 20 minutes on one GPU) remain accessible, foundation models require resource-intensive infrastructure (Llama 3 consumes 3.8 × 10^25 FLOPs across 16K H100 GPUs; DeepSeek-V3 requires 2.79M H800 GPU hours with an estimated cost of $5.6M). This asymmetry matters because LLMs exhibit systematic bias in knowledge acquisition. Indeed, new knowledge is learned less efficiently in LR languages, transfers less effectively to them, and is overwritten more easily by HR language information [145]. The implication is that scaling alone is not sufficient to achieve equity. Therefore, architectural innovations must explicitly counteract these biases.
Federated learning offers a technical framework aligned with data sovereignty principles, enabling collaborative training without data centralization [146]. Recent work demonstrates feasibility for multilingual LR settings. For example, federated prompt tuning achieves competitive performance while preserving data locality [147], and differential privacy integration protects against gradient inversion attacks [148]. For multimodal LR applications, federated approaches could enable geographically-distributed language communities to collaboratively improve models without ceding control over culturally-sensitive audiovisual data.
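The mechanism behind such collaboration can be sketched with a FedAvg-style aggregation step: each community fine-tunes a shared model locally and only parameters, never raw data, are exchanged and averaged. The snippet below is a minimal illustration under these assumptions; it omits the privacy and secure-aggregation machinery of [147, 148].

```python
import copy
import torch.nn as nn

def federated_average(global_model: nn.Module, client_states, client_sizes):
    """FedAvg-style aggregation: average client weights, weighted by the
    number of local examples. Only parameters leave each community's site;
    the raw audiovisual data never does."""
    total = float(sum(client_sizes))
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = sum(
            state[key] * (n / total) for state, n in zip(client_states, client_sizes)
        )
    global_model.load_state_dict(avg_state)
    return global_model

# Toy round with three "communities" holding copies of the same small model.
global_model = nn.Linear(16, 2)
clients = [copy.deepcopy(global_model) for _ in range(3)]
# ... each client would fine-tune on its own local multimodal data here ...
global_model = federated_average(
    global_model, [c.state_dict() for c in clients], client_sizes=[120, 300, 80]
)
```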
9. Evaluation Challenges

Evaluation remains one of the most underdeveloped aspects of research on LMMs for LR languages. While the field has made significant strides in dataset creation, fusion strategies, and architectural innovations, the ways of measuring success have not kept pace. The lack of consistent and culturally-grounded evaluation protocols severely hampers the ability of researchers to compare models, reproduce findings, or interpret results in real-world contexts.

Limitations of standard metrics across cultural contexts. Most evaluation pipelines for LR multimodal models rely on automatic metrics originally designed for high-resource and predominantly Western-centric settings. Metrics such as BLEU, ROUGE, accuracy, and F1 implicitly assume that reference annotations reflect shared cultural, visual, and linguistic grounding. This assumption frequently fails in low-resource contexts.

One such case can be observed in multimodal tasks such as visual question answering, image captioning, and meme understanding, where the visual content itself often encodes culturally-specific assumptions regarding object salience, social roles, or everyday activities. For instance, a model trained primarily on Western image datasets may fail to recognize culturally significant objects (e.g. traditional clothing, local foods, religious symbols) that are common in LR language contexts. When benchmarks are translated or minimally adapted from high-resource languages, models may achieve high lexical overlap with reference answers while still producing outputs that are culturally inappropriate, semantically misleading, or pragmatically implausible for native speakers. These issues are further exacerbated in knowledge-intensive evaluations derived from English-centric benchmarks. For example, questions about local festivals, historical events, or social customs require cultural context that translation alone cannot provide. As a result, standard metrics may overestimate progress or mask systematic failures that are only visible through culturally-grounded evaluation.
Dataset heterogeneity and comparability issues. A second major challenge concerns dataset heterogeneity, as existing studies evaluate multimodal models on datasets with widely distinct characteristics and assumptions. Many studies rely on translated versions of high-resource benchmarks, such as extensions of Multi30K for Ukrainian [49] or Visual Genome variants for Bengali [43], Hausa [45], and Hindi [105]. While translation-based approaches enable rapid benchmark construction, they often introduce Western cultural biases and may fail to reflect authentic language use or visual grounding in target communities. In contrast, newly introduced language-specific datasets, such as DravidianMultiModality [32], RoMemes [38], and ArabSign [39], better capture genuine linguistic and cultural phenomena, but typically suffer from limited coverage, non-standardized annotation protocols, and heterogeneous quality control practices, making cross-study comparison difficult. As a result, performance improvements reported across such heterogeneous evaluation settings are often not directly comparable.

Recommendations for fair evaluation practices. Based on our analysis, we propose the following recommendations for evaluation in LR multimodal research:

• Report multiple metrics. Studies should report multiple complementary metrics that capture different aspects of performance. In the context of machine translation tasks, researchers should report BLEU alongside COMET or human evaluation scores. Similarly, for VQA tasks, exact-match accuracy should be accompanied by relaxed matching that accounts for morphological variants (a minimal sketch of such relaxed matching is given after this list) and, when possible, human judgment of answer correctness.

• Perform culturally-grounded human evaluation. In addition to a diverse set of automated metrics, we believe that human evaluation conducted by native speakers from the target language community also plays a crucial role. Evaluators should assess whether outputs sound natural to native speakers, whether they are culturally appropriate, whether they convey the intended meaning accurately, and (for VQA) whether answers are semantically correct, even if worded differently from the reference.
• Develop and use standardized benchmarks. The field needs publicly available test sets for LR multimodal evaluation, following examples like SEACrowd [51] for Southeast Asian languages and CreoleVal [52] for Creole languages. Such benchmarks should cover different task types (VQA, captioning, translation, classification), include culturally-accurate content created together with language communities, provide multiple correct answers to account for natural variation, and document how data was labeled.

• Compare to sensible baselines. Rather than reporting absolute performance in isolation, studies should contextualize results relative to unimodal baselines (text-only or vision-only) to demonstrate the benefits of multimodal approaches, random and majority-class baselines to establish task difficulty, prior work on the same dataset when available, and performance on related HR languages to quantify the LR gap.
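As referenced in the first recommendation above, relaxed matching for VQA can be as simple as normalizing answers and accepting any of several references. The sketch below illustrates this; the normalization rules (and any language-specific stemming that would handle morphological variants) are illustrative assumptions rather than a standardized protocol.

```python
import re
import unicodedata

def normalize(answer: str) -> str:
    """Lowercase, Unicode-normalize, and strip punctuation before matching.
    Language-specific stemming for morphological variants could be added here."""
    answer = unicodedata.normalize("NFKC", answer).lower().strip()
    return re.sub(r"[^\w\s]", "", answer)

def relaxed_accuracy(predictions, references):
    """references[i] is a list of acceptable answers for example i
    (multiple correct answers account for natural variation)."""
    hits = 0
    for pred, refs in zip(predictions, references):
        if normalize(pred) in {normalize(r) for r in refs}:
            hits += 1
    return hits / len(predictions)

print(relaxed_accuracy(["Asante sana!"], [["asante sana", "thank you"]]))  # 1.0
```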
As shown above, evaluation challenges remain a major problem in LR multimodal research, but some steps have already been taken towards closing this gap. Although standard metrics represent a useful starting point for evaluation, they are designed for English and often miss what matters for LR languages and their cultural contexts. Solving these challenges requires creating culturally-appropriate benchmarks, using multiple and diverse evaluation metrics, as well as involving language communities in the evaluation process.

10. Conclusion and Future Work

Conclusion. Our survey has provided a comprehensive analysis of LMM-based approaches for LR languages, comprising 117 studies across 96 languages. Vision-language combinations dominate the current research landscape (65% of surveyed works), with an increasing trend toward incorporating video and speech in recent works. We observed a concentration of research in South Asian languages (including Hindi, Bengali, Malayalam), Southeast Asian languages (Vietnamese, Javanese, Malay), Middle Eastern languages (Persian, Arabic) and African languages (Hausa, Amharic), while 42 other languages appear in only one study each.

The landscape of LMMs for LR languages has shown remarkable progress across multiple dimensions, from data creation to fusion techniques and architectural innovations. Projects like HVG, SEACrowd, and BVG highlight growing attention to creating high-quality multimodal resources for understudied languages. Recent successes with models such as Qalam, LaVy, and Amharic LLaVA [149] demonstrate that carefully designed multimodal strategies can effectively leverage limited resources, while adapting large-scale architectures for low-resource contexts.
essential linguistic information; yet only 32% of studies incorporate audio, despite its crucial importance for predominantly oral languages, and only 8.5% of studies incorporate a video modality. We also identified persistent dataset scarcity and uneven language representation, with just three languages (Hindi, Arabic, Bengali) accounting for a disproportionate share of research attention. Technical limitations further constrain progress, as computational constraints limit the application of advanced fusion techniques in the resource-constrained environments typical of LR contexts. Current cross-modal transfer methods struggle with catastrophic forgetting and inefficient knowledge transfer, particularly for languages that are structurally distant from high-resource counterparts. The field also lacks standardized evaluation frameworks for meaningful comparison across approaches, while recent work by Shen et al. [150] highlights significant safety challenges when deploying LLMs in multilingual contexts. Finally, sociolinguistic dimensions remain underexplored, including cultural representation, algorithmic bias, and potential impacts on language endangerment and revitalization efforts. These concerns are particularly acute given the power imbalances between communities speaking low-resource languages and the primarily Western institutions developing these technologies.
Our study identifies three mechanisms through which LMMs may perpetuate digital inequalities. First, language model training inherently favors languages with larger training representation [67, 145], introducing a bias towards modeling HR languages. Second, benchmarks
derived from English (e.g. translated MMLU) embed Western cultural assumptions that disadvantage LR language speakers even when linguistic accuracy is achieved [151], introducing cultural biases in the evaluation. Third, computational requirements exclude researchers in LR language regions from model development, creating dependency on external institutions and biasing resource access. Addressing these biases requires community-centered approaches that prioritize local capacity building, alongside technical performance metrics.
Future work. Based on the challenges identified above, we propose several key directions for future research. For short-term development, we propose the following actionable research directions for benchmark and dataset creation: (1) extend Visual Genome-style multimodal datasets to at least 20 additional LR languages, prioritizing the 42 languages currently represented by only a single study; (2) develop speech-image paired corpora for tonal languages (e.g. Yoruba, Igbo, Fongbe), where the audio modality carries critical semantic distinctions absent in text; and (3) establish a standardized “LR-MMBench” evaluation suite with culturally-adapted visual question answering tasks, following SEACrowd’s multilingual methodology, but incorporating non-Western visual contexts and evaluation protocols validated by native speakers. A sketch of what such a benchmark entry could record is given below.
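The following is a minimal, hypothetical sketch (in Python; the class and field names are our own illustration, not a published schema) of the kind of record a culturally-adapted VQA benchmark in the spirit of the proposed “LR-MMBench” could store, including multiple acceptable answers and community-validation metadata:

```python
from dataclasses import dataclass

@dataclass
class LRVQAItem:
    """One culturally-grounded VQA item; all field names are illustrative."""
    item_id: str
    language: str                        # ISO 639-3 code of the target LR language
    image_path: str
    question: str                        # question written in the target language
    answers: list[str]                   # several acceptable answers, reflecting natural variation
    cultural_context: str                # why the image/question is locally relevant
    validated_by_native_speaker: bool    # community validation flag
    annotation_notes: str = ""           # provenance / labeling documentation

def is_correct(prediction: str, item: LRVQAItem) -> bool:
    """Accept a prediction if it matches any reference answer (case-insensitive)."""
    normalized = prediction.strip().lower()
    return any(normalized == answer.strip().lower() for answer in item.answers)

# Hypothetical entry with placeholder content.
example = LRVQAItem(
    item_id="xx-0001",
    language="hau",
    image_path="images/0001.jpg",
    question="<question in the target language>",
    answers=["<answer variant 1>", "<answer variant 2>"],
    cultural_context="Locally relevant scene described by community annotators",
    validated_by_native_speaker=True,
)
print(is_correct("<answer variant 2>", example))  # True
```

Storing several reference answers and the validation flag directly in each record keeps automatic scoring simple while making community involvement auditable.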
To develop and improve LMMs for LR languages, several concrete directions emerge from our analysis: (1) develop catastrophic forgetting mitigation strategies maintaining over 95% source-language performance, while achieving over 80% target-language performance for language pairs with fewer than 1,000 parallel sentences; (2) create language-agnostic visual encoders pre-trained on culturally-diverse image collections sourced from non-Western contexts, reducing the documented Western bias in current visual representations; and (3) establish explicit source-language selection guidelines based on typological similarity metrics (syntactic distance, shared writing systems, WALS features) to maximize positive transfer for specific target languages.
Several other research gaps require attention in future work. Regarding the observed modality imbalance, researchers should prioritize incorporating audio and video for LR languages with limited writing traditions, enabling more robust applications that better reflect natural communication patterns, particularly for tonal languages and those where non-verbal communication is significant. For resource development, future work should advance synthetic data generation techniques (building on HVG, ELAICHI, Vintern-1B) and improve cross-lingual transfer methodologies (extending XtremeCLIP, LowCLIP) to accommodate greater linguistic diversity, while minimizing catastrophic forgetting. To overcome limitations in resource-constrained settings, researchers should investigate efficient fusion approaches, including stacking-based late fusion, tensor fusion for complex interactions, and graphical fusion leveraging graph-based representations, all adapted for computational efficiency. Advancing adaptive integration through mechanisms that dynamically adjust the contribution of each modality based on input quality and task requirements will be crucial; a minimal sketch of such a gating mechanism is shown below.
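As one possible, purely illustrative instantiation of such adaptive integration (a minimal PyTorch sketch of our own, not the MRF mechanism of [79] or any specific surveyed architecture), per-modality weights can be predicted from the inputs themselves so that a noisy or uninformative modality is down-weighted:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal adaptive fusion: learn per-modality gates from the inputs themselves."""
    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # One weight per modality, conditioned on both projected modalities.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        t = torch.tanh(self.text_proj(text_feat))
        v = torch.tanh(self.image_proj(image_feat))
        weights = self.gate(torch.cat([t, v], dim=-1))   # shape: (batch, 2)
        fused = weights[:, :1] * t + weights[:, 1:] * v  # weighted sum of modalities
        return fused

# Hypothetical feature sizes; in practice these come from the chosen text and vision encoders.
fusion = GatedFusion(text_dim=768, image_dim=512, hidden_dim=256)
text_feat = torch.randn(4, 768)
image_feat = torch.randn(4, 512)
print(fusion(text_feat, image_feat).shape)  # torch.Size([4, 256])
```

Because the gate is only a small linear head, the overhead over plain concatenation is negligible, which matters in the resource-constrained settings discussed above.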
Building on MRF [79], future work should explore hybrid approaches that combine the strengths of different fusion strategies, while maintaining computational efficiency. Finally, adopting community-centered design approaches that address sociolinguistic dimensions alongside technical advances will ensure that developments benefit the intended language communities themselves.
Beyond these directions, we advocate for mandatory community engagement through: (i) participatory design frameworks requiring documented language community involvement in dataset creation, with explicit data governance and benefit-sharing agreements; (ii) open-source, mobile-first data collection libraries suitable for field conditions, where many LR languages are spoken; and (iii) standardized model cards for LR multimodal systems, documenting limitations, cultural biases, and appropriate use cases, ensuring transparent communication with end-user communities.
Acknowledgments
This research is supported by the project “Romanian Hub for Artificial Intelligence - HRIA”, Smart Growth, Digitization and Financial Instruments Program, 2021-2027, MySMIS no. 351416. The authors thank the reviewers for their constructive feedback.
References
[1] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, et al., Language is not all you need: aligning perception with language models, in: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS), 2023, pp. 72096–72109. URL https://dl.acm.org/doi/10.5555/3666122.3669277
[2] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., PaLM-E: an embodied multimodal language model, in: Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 8469–8488. URL https://dl.acm.org/doi/10.5555/3618408.3618748
[3] S. Khanna, X. Li, Invisible languages of the LLM universe, arXiv preprint arXiv:2510.11557 (2025). URL https://arxiv.org/abs/2510.11557
[4] S. R. Carroll, I. Garba, O. L. Figueroa-Rodríguez, J. C. Holbrook, R. Lovett, S. Materechera, M. A. Parsons, K. Raseroka, D. Rodriguez-Lonebear, R. Rowe, et al., The CARE principles for indigenous data governance, Data Science Journal 19 (2020) 43. doi:10.5334/dsj-2020-043. URL https://datascience.codata.org/articles/dsj-2020-043
[5] S. Bird, Local languages, third spaces, and other high-resource scenarios, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 7817–7829. doi:10.18653/v1/2022.acl-long.539. URL https://aclanthology.org/2022.acl-long.539/
[6] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (13) (2017) 3521–3526. doi:10.1073/pnas.1611835114. URL https://www.pnas.org/doi/full/10.1073/pnas.1611835114
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423/
[8] M. Rungta, J. Singh, S. M. Mohammad, D. Yang, Geographic citation gaps in NLP research, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022, pp. 1371–1383. doi:10.18653/v1/2022.emnlp-main.89. URL https://aclanthology.org/2022.emnlp-main.89/
[9] P. M. Joshi, S. Santy, A. Budhiraja, K. Bali, M. Choudhury, The state and fate of linguistic diversity and inclusion in the NLP world, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 6282–6293. doi:10.18653/v1/2020.acl-main.560. URL https://aclanthology.org/2020.acl-main.560/
[10] S. Ranathunga, N. de Silva, Some languages are more equal than others: Probing deeper into the linguistic disparity in the NLP world, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP), 2022, pp. 823–848. doi:10.18653/v1/2022.aacl-main.62. URL https://aclanthology.org/2022.aacl-main.62/
[11] A. Caines, M. Rei, The geographic diversity of NLP conferences, https://www.marekrei.com/blog/geographic-diversity-of-nlp-conferences/, accessed: December 2025 (2019).
[12] K. Darwish, N. Habash, M. Abbas, H. S. Al-Khalifa, H. T. Al-Natsheh, S. R. El-Beltagy, H. Bouamor, K. Bouzoubaa, V. Cavalli-Sforza, W. El-Hajj, M. Jarrar, H. Mubarak, A panoramic survey of natural language processing in the Arab world, Communications of the ACM 64 (2020) 72–81. doi:10.1145/3447735. URL https://dl.acm.org/doi/10.1145/3447735
[13] O. Kanishcheva, CLARIN knowledge centre for Ukrainian NLP and corpora (UkrNLP-Corpora), https://www.clarin.eu/blog/introduction-clarin-knowledge-centre-ukrainian-nlp-and-corpora-ukrnlp-corpora, accessed: December 2025 (2025).
[14] M. Romanyshyn (Ed.), Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP), Association for Computational Linguistics, 2025. doi:10.18653/v1/2025.unlp-1.0. URL https://aclanthology.org/2025.unlp-1.0/
[15] K. Akhynko, O. Kosovan, M. Trokhymovych, Hidden Persuasion: Detecting Manipulative Narratives on Social Media During the 2022 Russian Invasion of Ukraine, in: Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP), 2025, pp. 194–202. doi:10.18653/v1/2025.unlp-1.19. URL https://aclanthology.org/2025.unlp-1.19/
[16] M. Li, Top 50+ Chinese AI investment statistics [2025], https://www.secondtalent.com/resources/chinese-ai-investment-statistics/, accessed: December 2025 (2025).
[17] D. Normile, Chinese firm’s faster, cheaper AI language model makes a splash, Science 387 (6731) (2025) 238–238. URL https://www.science.org/content/article/chinese-firm-s-faster-cheaper-ai-language-model-makes-splash
[18] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, Y. Wang, J. Guo, A Survey on LLM-as-a-Judge, arXiv preprint arXiv:2411.15594 (2024). URL https://arxiv.org/abs/2411.15594
[19] A. Paullada, I. D. Raji, E. M. Bender, E. Denton, A. Hanna, Data and its (dis)contents: A survey of dataset development and use in machine learning research, Patterns 2 (11) (2021) 100336. doi:10.1016/j.patter.2021.100336. URL https://www.sciencedirect.com/science/article/pii/S2666389921001847
[20] S. Ruder, N. Constant, J. Botha, A. Siddhant, O. Firat, J. Fu, P. Liu, J. Hu, D. Garrette, G. Neubig, M. Johnson, XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021, pp. 10215–10245. doi:10.18653/v1/2021.emnlp-main.802. URL https://aclanthology.org/2021.emnlp-main.802/
[21] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A Survey of Large Language Models, arXiv preprint arXiv:2303.18223 (2023). URL https://arxiv.org/abs/2303.18223
[22] W. Zhu, Y. Lv, Q. Dong, F. Yuan, J. Xu, S. Huang, L. Kong, J. Chen, L. Li, Extrapolating Large Language Models to Non-English by Aligning Languages, arXiv preprint arXiv:2308.04948 (2023). URL https://arxiv.org/abs/2308.04948
[23] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, J. Gao, Vision-language pre-training: Basics, recent advances, and future trends, Foundations and Trends in Computer Graphics and Vision 14 (3–4) (2022) 163–352. doi:10.1561/0600000105. URL https://www.nowpublishers.com/article/Details/CGV-105
[24] X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, W. Gao, Large-scale multi-modal pre-trained models: A comprehensive survey, Machine Intelligence Research 20 (2023) 447–482. doi:10.1007/s11633-022-1410-8. URL https://link.springer.com/article/10.1007/s11633-022-1410-8
[25] Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, G. Shi, A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2025, pp. 1587–1606. URL https://openaccess.thecvf.com/content/CVPR2025W/TMM-OpenWorld/html/Li_A_Survey_of_State_of_the_Art_Large_Vision_Language_CVPRW_2025_paper.html
[26] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multimodal large language models, National Science Review 11 (12) (2024) nwae403. doi:10.1093/nsr/nwae403. URL https://academic.oup.com/nsr/article/11/12/nwae403/7896414
[27] J. Xie, Z. Chen, R. Zhang, X. Wan, G. Li, Large Multimodal Agents: A Survey, arXiv preprint arXiv:2402.15116 (2024). URL https://arxiv.org/abs/2402.15116
[28] M. Xu, W. Yin, D. Cai, R. Yi, D. Xu, Q. Wang, B. Wu, Y. Zhao, C. Yang, S. Wang, et al., A Survey of Resource-efficient LLM and Multimodal Foundation Models, arXiv preprint arXiv:2401.08092 (2024). URL https://arxiv.org/abs/2401.08092
[29] F. Alam, S. A. Chowdhury, S. Boughorbel, M. Hasanain, LLMs for low resource languages in multilingual, multimodal and dialectal settings, in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts (EACL), 2024, pp. 27–33. URL https://aclanthology.org/2024.eacl-tutorials.5/
[30] S. Mu, S. Lin, A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications, arXiv preprint arXiv:2503.07137 (2025). doi:10.48550/ARXIV.2503.07137. URL https://doi.org/10.48550/arXiv.2503.07137
[31] H. Najadat, F. Abushaqra, Multimodal sentiment analysis of Arabic videos, Journal of Image and Graphics 6 (1) (2018) 39–43. URL https://www.joig.net/index.php?m=content&c=index&a=show&catid=47&id=173
[32] B. R. Chakravarthi, J. Parameswaran P.K., Premjith B, K. Soman, R. Ponnusamy, P. K. Kumaresan, K. P. Thamburaj, J. P. McCrae, DravidianMultiModality: A Dataset for Multi-modal Sentiment Analysis in Tamil and Malayalam, arXiv preprint arXiv:2106.04853 (2021). URL https://arxiv.org/abs/2106.04853
[33] S. Taylor, F. Fauzi, Multimodal Sentiment Analysis for the Malay Language: New Corpus using CNN-based Framework, ACM Transactions on Asian and Low-Resource Language Information Processing 24 (2024) 1–26. doi:10.1145/3703445. URL https://dl.acm.org/doi/10.1145/3703445
[34] D. F. Kponou, F. A. Laleye, E. C. Ezin, FFSTC: Fongbe to French speech translation corpus, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024, pp. 7270–7276. URL https://aclanthology.org/2024.lrec-main.638/
[35] A. Haouhat, S. Bellaouar, A. Nehar, H. Cherroun, Towards Arabic multimodal dataset for sentiment analysis, in: Proceedings of Fourth International Conference on Intelligent Data Science Technologies and Applications (IDSTA), 2023, pp. 126–133. doi:10.1109/IDSTA58916.2023.10317847. URL https://ieeexplore.ieee.org/document/10317847
[36] E. Hossain, O. Sharif, M. M. Hoque, MemoSen: A multimodal dataset for sentiment analysis of memes, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), 2022, pp. 1542–1554. URL https://aclanthology.org/2022.lrec-1.165/
[37] E. Hossain, O. Sharif, M. M. Hoque, MUTE: A multimodal dataset for detecting hateful memes, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop (AACL-IJCNLP), 2022, pp. 32–39. doi:10.18653/v1/2022.aacl-srw.5. URL https://aclanthology.org/2022.aacl-srw.5/
[38] V. Păiș, S. Niță, A.-I. Jerpelea, L. Pană, E. Curea, RoMemes: A multimodal meme corpus for the Romanian language, arXiv preprint arXiv:2410.15497 (2024). URL https://arxiv.org/abs/2410.15497
[39] H. Luqman, ArabSign: A Multi-modality Dataset and Benchmark for Continuous Arabic Sign Language Recognition, in: Proceedings of IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), 2023, pp. 1–8. doi:10.1109/FG57933.2023.10042720. URL https://ieeexplore.ieee.org/document/10042720
[40] C. Sikasote, E. Mukonde, M. M. I. Alam, A. Anastasopoulos, BIG-C: a multimodal multi-purpose dataset for Bemba, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 2062–2078. doi:10.18653/v1/2023.acl-long.115. URL https://aclanthology.org/2023.acl-long.115/
[41] L. Sanayai Meetei, L. Rahul, A. Singh, S. M. Singh, T. D. Singh, S. Bandyopadhyay, An experiment on speech-to-text translation systems for Manipuri to English on low resource setting, in: Proceedings of the 18th International Conference on Natural Language Processing (ICON), 2021, pp. 54–63. URL https://aclanthology.org/2021.icon-main.8/
[42] F. Farsi, S. Shariati Motlagh, S. Bali, S. Sabouri, S. Momtazi, Persian in a court: Benchmarking VLMs in Persian multi-modal tasks, in: Proceedings of the First Workshop of Evaluation of Multi-Modal Generation (EvalMG), 2025, pp. 52–56. URL https://aclanthology.org/2025.evalmg-1.5/
[43] A. Sen, S. Parida, K. Kotwal, S. Panda, O. Bojar, S. R. Dash, Bengali Visual Genome: A multimodal dataset for machine translation and image captioning, in: Proceedings of the 9th International Conference on Frontiers in Intelligent Computing: Theory and Applications (FICTA), 2021, pp. 63–70. doi:10.1007/978-981-16-6624-7_7. URL https://link.springer.com/chapter/10.1007/978-981-16-6624-7_7
[44] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision 123 (1) (2017) 32–73. doi:10.1007/s11263-016-0981-7. URL https://link.springer.com/article/10.1007/s11263-016-0981-7
[45] I. Abdulmumin, S. R. Dash, M. A. Dawud, S. Parida, S. Muhammad, I. S. Ahmad, S. Panda, O. Bojar, B. S. Galadanci, B. S. Bello, Hausa visual genome: A dataset for multi-modal English to Hausa machine translation, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), 2022, pp. 6471–6479. URL https://aclanthology.org/2022.lrec-1.694/
[46] S. Parida, I. Abdulmumin, S. H. Muhammad, A. Bose, G. S. Kohli, I. S. Ahmad, K. Kotwal, S. Deb Sarkar, O. Bojar, H. Kakudi, HaVQA: A dataset for visual question answering and multimodal research in Hausa language, in: Findings of the Association for Computational Linguistics (ACL), 2023, pp. 10162–10183. doi:10.18653/v1/2023.findings-acl.646. URL https://aclanthology.org/2023.findings-acl.646/
[47] S. Parida, S. Sahoo, S. Sekhar, K. Sahoo, K. Kotwal, S. Khosla, S. R. Dash, A. Bose, G. S. Kohli, S. S. Lenka, O. Bojar, OVQA: A Dataset for Visual Question Answering and Multimodal Research in Odia Language, in: Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages (IndoNLP), 2025, pp. 58–66. URL https://aclanthology.org/2025.indonlp-1.7/
[48] M. Anwar, B. Shi, V. Goswami, W.-N. Hsu, J. Pino, C. Wang, MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation, in: Proceedings of Conference of the International Speech Communication Association (INTERSPEECH), 2023, pp. 4064–4068. doi:10.21437/Interspeech.2023-2279. URL https://www.isca-archive.org/interspeech_2023/anwar23_interspeech.html
[49] N. Saichyshyna, D. Maksymenko, O. Turuta, A. Yerokhin, A. Babii, O. Turuta, Extension Multi30K: Multimodal dataset for integrated vision and language research in Ukrainian, in: Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 2023, pp. 54–61. doi:10.18653/v1/2023.unlp-1.7. URL https://aclanthology.org/2023.unlp-1.7/
[50] D. Elliott, S. Frank, K. Sima’an, L. Specia, Multi30K: Multilingual English-German image descriptions, in: Proceedings of the 5th Workshop on Vision and Language (VL’16), 2016, pp. 70–74. doi:10.18653/v1/W16-3210. URL https://aclanthology.org/W16-3210/
[51] H. Lovenia, R. Mahendra, S. M. Akbar, L. J. V. Miranda, J. Santoso, E. Aco, A. Fadhilah, J. Mansurov, J. M. Imperial, O. P. Kampman, et al., SEACrowd: A multilingual multimodal data hub and benchmark suite for Southeast Asian languages, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 5155–5203. doi:10.18653/v1/2024.emnlp-main.296. URL https://aclanthology.org/2024.emnlp-main.296/
[52] H. Lent, K. Tatariya, R. Dabre, Y. Chen, M. Fekete, E. Ploeger, L. Zhou, R.-A. Armstrong, A. Eijansantos, C. Malau, et al., CreoleVal: Multilingual multitask benchmarks for creoles, Transactions of the Association for Computational Linguistics 12 (2024) 950–978. doi:10.1162/tacl_a_00682. URL https://aclanthology.org/2024.tacl-1.53/
[53] K. Dutta Chowdhury, M. Hasanuzzaman, Q. Liu, Multimodal neural machine translation for low-resource language pairs using synthetic data, in: Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo), 2018, pp. 33–42. doi:10.18653/v1/W18-3405. URL https://aclanthology.org/W18-3405/
[54] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics 2 (2014) 67–78. doi:10.1162/tacl_a_00166. URL https://aclanthology.org/Q14-1006/
[55] L. S. Meetei, T. D. Singh, S. Bandyopadhyay, Low resource multimodal neural machine translation of English-Hindi in news domain, in: Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL), 2021, pp. 20–29. URL https://aclanthology.org/2021.mmtlrl-1.4/
[56] S. Haq, R. Huidrom, S. Castilho, DCU ADAPT at WMT24: English to low-resource multi-modal translation task, in: Proceedings of the Ninth Conference on Machine Translation (WMT), 2024, pp. 810–814. doi:10.18653/v1/2024.wmt-1.75. URL https://aclanthology.org/2024.wmt-1.75/
[57] F. Alwajih, G. Bhatia, M. Abdul-Mageed, Dallah: A dialect-aware multimodal large language model for Arabic, in: Proceedings of the Second Arabic Natural Language Processing Conference (ArabicNLP), 2024, pp. 320–336. doi:10.18653/v1/2024.arabicnlp-1.27. URL https://aclanthology.org/2024.arabicnlp-1.27/
[58] Y. Wang, J. Pfeiffer, N. Carion, Y. LeCun, A. Kamath, Adapting grounded visual question answering models to low resource languages, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 2596–2605. doi:10.1109/CVPRW59228.2023.00258. URL https://ieeexplore.ieee.org/document/10208296/
[59] Y. Wang, J. Dong, T. Liang, M. Zhang, R. Cai, X. Wang, Cross-lingual cross-modal retrieval with noise-robust learning, in: Proceedings of the 30th ACM International Conference on Multimedia (ACMMM), 2022, pp. 422–433. doi:10.1145/3503161.3548003. URL https://dl.acm.org/doi/10.1145/3503161.3548003
[60] A. Dash, H. R. Gupta, Y. Sharma, BITS-P at WAT 2023: Improving Indic language multimodal translation by image augmentation using diffusion models, in: Proceedings of the 10th Workshop on Asian Translation (WAT), 2023, pp. 41–45. URL https://aclanthology.org/2023.wat-1.3/
[61] K. T. Doan, B. G. Huynh, D. T. Hoang, T. D. Pham, N. H. Pham, Q. Nguyen, B. Q. Vo, S. N. Hoang, Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese, arXiv preprint arXiv:2408.12480 (2024). URL https://arxiv.org/abs/2408.12480
[62] P. Nath, P. K. Adhikary, P. Dadure, P. Pakray, R. Manna, S. Bandyopadhyay, Image caption generation for low-resource Assamese language, in: Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), 2022, pp. 263–272. URL https://aclanthology.org/2022.rocling-1.33/
[63] L. Jiang, J. Li, J. Zhang, Y. Shen, Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language, Applied Sciences 14 (20) (2024) 9533. doi:10.3390/app14209533. URL https://www.mdpi.com/2076-3417/14/20/9533
[64] X. Qu, M. Song, W. Wei, J. Dong, Y. Cheng, Mitigating multilingual hallucination in large vision-language models, arXiv preprint arXiv:2408.00550 (2024). URL https://arxiv.org/abs/2408.00550
[65] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, C. Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, in: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS), Vol. 36, 2023, pp. 53728–53741. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf
[66] M. Shukor, L. Bethune, D. Busbridge, D. Grangier, E. Fini, A. El-Nouby, P. Ablin, Scaling laws for optimal data mixtures, arXiv preprint arXiv:2507.09404 (2025). URL https://arxiv.org/abs/2507.09404
[67] R. Navigli, S. Conia, B. Ross, Biases in large language models: Origins, inventory, and discussion, ACM Journal of Data and Information Quality 15 (2023) 1–21. doi:10.1145/3597307. URL https://dl.acm.org/doi/10.1145/3597307
[68] L. Wiechetek, F. A. Pirinen, B. Gaup, T. Trosterud, M. Kappfjell, S. N. Moshagen, The ethical question – use of indigenous corpora for large language models, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024, pp. 15922–15931. URL https://aclanthology.org/2024.lrec-main.1383/
[69] F. Z. Youcef, F. Barigou, Arabic language investigation in the context of unimodal and multimodal sentiment analysis, in: Proceedings of 22nd International Arab Conference on Information Technology (ACIT), 2021, pp. 1–7. doi:10.1109/ACIT53391.2021.9677274. URL https://ieeexplore.ieee.org/document/9677274
[70] N. Al Roken, G. Barlas, Multimodal Arabic emotion recognition using deep learning, Speech Communication 155 (C) (2023) 103005. doi:10.1016/j.specom.2023.103005. URL https://doi.org/10.1016/j.specom.2023.103005
[71] K. Dashtipour, M. Gogate, E. Cambria, A. Hussain, A novel context-aware multimodal framework for Persian sentiment analysis, Neurocomputing 457 (C) (2021) 377–388. doi:10.1016/j.neucom.2021.02.020. URL https://www.sciencedirect.com/science/article/abs/pii/S0925231221002666
[72] S. Al-Azani, E.-S. M. El-Alfy, Enhanced video analytics for sentiment analysis based on fusing textual, auditory and visual information, IEEE Access 8 (2020) 136843–136857. doi:10.1109/ACCESS.2020.3011977. URL https://ieeexplore.ieee.org/document/9148603
[73] B. Premjith, G. Jyothish Lal, V. Sowmya, B. R. Chakravarthi, R. Natarajan, K. Nandhini, A. Murugappan, B. Bharathi, M. Kaushik, S. Prasanth, R. Aswin Raj, S. Vijai Simmon, Findings of the shared task on multimodal abusive language detection and sentiment analysis in Tamil and Malayalam, in: Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages (DravidianLangTech), 2023, pp. 72–79. URL https://aclanthology.org/2023.dravidianlangtech-1.10/
[74] R. G. Kodali, D. P. Manukonda, M. Pannakkaran, byteSizedLLM@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media Using XLM-RoBERTa and Attention-BiLSTM, in: Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech), 2025, pp. 80–85. URL https://aclanthology.org/2025.dravidianlangtech-1.14/
[75] M. A. Jigar, A. A. Ayele, S. M. Yimam, C. Biemann, Detecting hate speech in Amharic using multimodal analysis of social media memes, in: Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying (TRAC), 2024, pp. 85–95. URL https://aclanthology.org/2024.trac-1.10/
[76] A. G. Debele, M. M. Woldeyohannis, Multimodal Amharic hate speech detection using deep learning, in: Proceedings of International Conference on Information and Communication Technology for Development for Africa (ICT4DA), 2022, pp. 102–107. doi:10.1109/ICT4DA56482.2022.9971436. URL https://ieeexplore.ieee.org/document/9971436/
[77] A. Hatami, S. Banerjee, M. Arcan, P. Buitelaar, J. Philip McCrae, English-to-low-resource translation: A multimodal approach for Hindi, Malayalam, Bengali, and Hausa, in: Proceedings of the Ninth Conference on Machine Translation (WMT), 2024, pp. 815–822. doi:10.18653/v1/2024.wmt-1.76. URL https://aclanthology.org/2024.wmt-1.76/
[78] S. Alalem, M. S. Zaghloul, O. Badawy, A Novel Deep Learning Multi-Modal Sentiment Analysis Model for English and Egyptian Arabic Dialects Using Audio and Text, in: Proceedings of 24th International Arab Conference on Information Technology (ACIT), 2023, pp. 1–5. doi:10.1109/ACIT58888.2023.10453875. URL https://ieeexplore.ieee.org/document/10453875
[79] D. S. Chauhan, A. Ekbal, P. Bhattacharyya, An efficient fusion mechanism for multimodal low-resource setting, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022, pp. 2583–2588. doi:10.1145/3477495.3531900. URL https://doi.org/10.1145/3477495.3531900
[80] F. T. J. Faria, L. H. Baniata, M. H. Baniata, M. A. Khair, A. I. Bani Ata, C. Bunterngchit, S. Kang, SentimentFormer: A Transformer-Based Multimodal Fusion Framework for Enhanced Sentiment Analysis of Memes in Under-Resourced Bangla Language, Electronics 14 (4) (2025) 799. doi:10.3390/electronics14040799. URL https://www.mdpi.com/2079-9292/14/4/799
[81] M. R. Karim, S. K. Dey, T. Islam, M. Shajalal, B. R. Chakravarthi, Multimodal hate speech detection from Bengali memes and texts, in: Proceedings of the International Conference on Speech and Language Technologies for Low-Resource Languages (SPELLL), 2022, pp. 293–308. URL https://link.springer.com/chapter/10.1007/978-3-031-33231-9_21
[82] R. M. Albalawi, A. T. Jamal, A. O. Khadidos, A. M. Alhothali, Multimodal Arabic rumors detection, IEEE Access 11 (2023) 9716–9730. doi:10.1109/ACCESS.2023.3240373. URL https://ieeexplore.ieee.org/document/10026837
[83] Z. Zhang, S. Zhang, D. Ni, Z. Wei, K. Yang, S. Jin, G. Huang, Z. Liang, L. Zhang, L. Li, et al., Multimodal sensing for depression risk detection: Integrating audio, video, and text data, Sensors 24 (12) (2024) 3714. doi:10.3390/s24123714. URL https://www.mdpi.com/1424-8220/24/12/3714
[84] N. J. Deocampo, M. Villarica, A. Vinluan, A Lip-Reading Model for Tagalog Using Multimodal Deep Learning Approach, International Journal of Computing Sciences Research 8 (2024) 2796–2808. URL https://stepacademic.net/ijcsr/article/view/511
[85] U. Sehar, S. Kanwal, K. Dashtipur, U. Mir, U. Abbasi, F. Khan, Urdu sentiment analysis via multimodal data mining based on deep learning algorithms, IEEE Access 9 (2021) 153072–153082. doi:10.1109/ACCESS.2021.3122025. URL https://ieeexplore.ieee.org/document/9583225
[86] F. Arifin, A. Nasuha, A. Priambodo, A. Winursito, T. Gunawan, Advanced Multimodal Emotion Recognition for Javanese Language Using Deep Learning, Indonesian Journal of Electrical Engineering and Informatics 12 (3) (2024) 503–515. doi:10.52549/ijeei.v12i3.5662. URL https://section.iaesonline.com/index.php/IJEEI/article/view/5662
[87] O. Z. Mamyrbayev, K. Alimhan, B. Amirgaliyev, B. Zhumazhanov, D. Mussayeva, F. Gusmanova, Multimodal systems for speech recognition, International Journal of Mobile Communications 18 (3) (2020) 314–326. doi:10.1504/ijmc.2020.107097. URL https://doi.org/10.1504/ijmc.2020.107097
[88] K. T. Elahi, T. B. Rahman, S. Shahriar, S. Sarker, S. K. S. Joy, F. M. Shah, Explainable multimodal sentiment analysis on Bengali memes, in: Proceedings of 26th International Conference on Computer and Information Technology (ICCIT), 2023, pp. 1–6. doi:10.1109/ICCIT60459.2023.10441342. URL https://ieeexplore.ieee.org/document/10441342
[89] M. Rahman, A. Raihan, T. Rahman, S. Ahsan, J. Hossain, A. Das, M. M. Hoque, Binary_Beasts@DravidianLangTech-EACL 2024: Multimodal abusive language detection in Tamil based on integrated approach of machine learning and deep learning techniques, in: Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech), 2024, pp. 212–217. URL https://aclanthology.org/2024.dravidianlangtech-1.35/
[90] R. Das, T. D. Singh, A multi-stage multimodal framework for sentiment analysis of Assamese in low resource setting, Expert Systems with Applications 204 (C) (2022) 117575. doi:10.1016/j.eswa.2022.117575. URL https://www.sciencedirect.com/science/article/abs/pii/S0957417422008879
[91] B. R. Chakravarthi, R. Priyadharshini, B. Stearns, A. Jayapal, S. Sridevy, M. Arcan, M. Zarrouk, J. P. McCrae, Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages (LoResMT), 2019, pp. 56–63. URL https://aclanthology.org/W19-6809/
[92] L. Meetei, T. D. Singh, S. Bandyopadhyay, Exploiting multiple correlated modalities can enhance low-resource machine translation quality, Multimedia Tools and Applications 83 (2024) 13137–13157. doi:10.1007/s11042-023-15721-2. URL https://link.springer.com/article/10.1007/s11042-023-15721-2
[93] N.-C. Ristea, R. T. Ionescu, Cascaded cross-modal transformer for request and complaint detection, in: Proceedings of the 31st ACM International Conference on Multimedia (ACMMM), 2023, pp. 9467–9471. doi:10.1145/3581783.3612846. URL https://dl.acm.org/doi/10.1145/3581783.3612846
[94] H. Haputhanthri, H. Tennakoon, M. Wijesekara, B. Pushpananda, H. Thilini, Multi-modal Deep Learning Approach to Improve Sentence level Sinhala Sign Language Recognition, International Journal on Advances in ICT for Emerging Regions 16 (2) (2023) 21–30. doi:10.4038/icter.v16i2.7264. URL https://icter.sljol.info/articles/10.4038/icter.v16i2.7264
[95] Y. Yang, Q.-D.-E.-J. Ren, R.-F. He, Multi-modal Sentiment Analysis of Mongolian Language based on Pre-trained Models and High-resolution Networks, in: Proceedings of International Conference on Asian Language Processing (IALP), 2024, pp. 291–296. doi:10.1109/IALP63756.2024.10661161. URL https://ieeexplore.ieee.org/document/10661161/
[96] S. R. Laskar, A. F. U. R. Khilji, P. Pakray, S. Bandyopadhyay, Multimodal neural machine translation for English to Hindi, in: Proceedings of the 7th Workshop on Asian Translation (WAT), 2020, pp. 109–113. doi:10.18653/v1/2020.wat-1.11. URL https://aclanthology.org/2020.wat-1.11/
[97] S. R. Laskar, A. F. U. R. Khilji, D. Kaushik, P. Pakray, S. Bandyopadhyay, Improved English to Hindi multimodal neural machine translation, in: Proceedings of the 8th Workshop on Asian Translation (WAT), 2021, pp. 155–160. doi:10.18653/v1/2021.wat-1.17. URL https://aclanthology.org/2021.wat-1.17/
[98] B. Gain, D. Bandyopadhyay, S. Mukherjee, C. Adak, A. Ekbal, Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages, arXiv preprint arXiv:2308.16075 (2023). URL https://arxiv.org/abs/2308.16075
[99] X. Shi, Z. Yu, Adding visual information to improve multimodal machine translation for low-resource language, Mathematical Problems in Engineering 2022 (1) (2022) 5483535. doi:10.1155/2022/5483535. URL https://onlinelibrary.wiley.com/doi/10.1155/2022/5483535
[100] L. S. Meetei, A. Singh, T. D. Singh, S. Bandyopadhyay, Do cues in a video help in handling rare words in a machine translation system under a low-resource setting?, Natural Language Processing Journal 3 (2023) 100016. doi:10.1016/j.nlp.2023.100016. URL https://www.sciencedirect.com/science/article/pii/S2949719123000134
[101] L. S. Meetei, S. M. Singh, A. Singh, R. Das, T. D. Singh, S. Bandyopadhyay, Hindi to English Multimodal Machine Translation on News Dataset in Low Resource Setting, in: Proceedings of International Conference on Machine Learning and Data Engineering (ICMLDE), Vol. 218, 2023, pp. 2102–2109. doi:10.1016/j.procs.2023.01.186. URL https://www.sciencedirect.com/science/article/pii/S1877050923001862
[102] T. Tayir, L. Li, Unsupervised multimodal machine translation for low-resource distant language pairs, ACM Transactions on Asian and Low-Resource Language Information Processing 23 (4) (2024) 1–22. doi:10.1145/3652161. URL https://dl.acm.org/doi/10.1145/3652161
[103] T. Tayir, L. Li, M. Maimaiti, Y. Muhtar, Low-resource machine translation with different granularity image features, in: Proceedings of Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2025, pp. 260–273. URL https://link.springer.com/chapter/10.1007/978-981-97-8620-6_18
[104] H. Lekshmy, S. Jayaraman, English-Malayalam Vision aid with Multi Modal Machine Learning Technologies, in: Proceedings of 6th International Conference on Intelligent Computing and Control Systems (ICICCS), 2022, pp. 1469–1476. doi:10.1109/ICICCS53718.2022.9788187. URL https://ieeexplore.ieee.org/document/9788187
[105] S. Parida, O. Bojar, S. R. Dash, Hindi visual genome: A dataset for multi-modal English to Hindi machine translation, Computación y Sistemas 23 (4) (2019) 1499–1505. doi:10.13053/cys-23-4-3294. URL https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/3294
[106] S. Parida, S. Panda, S. P. Biswal, K. Kotwal, A. Sen, S. R. Dash, P. Motlicek, Multimodal neural machine translation system for English to Bengali, in: Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL), 2021, pp. 31–39. URL https://aclanthology.org/2021.mmtlrl-1.6/
[107] L. Nortje, D. Oneață, G. Pirlogeanu, H. Kamper, Improved visually prompted keyword localisation in real low-resource settings, arXiv preprint arXiv:2409.06013 (2024). URL https://arxiv.org/abs/2409.06013
[108] A. Jain, M. Guo, K. Srinivasan, T. Chen, S. Kudugunta, C. Jia, Y. Yang, J. Baldridge, MURAL: Multimodal, multitask representations across languages, in: Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing (EMNLP), 2021, pp. 3449–3463. doi:10.18653/v1/2021.findings-emnlp.293. URL https://aclanthology.org/2021.findings-emnlp.293/
[109] W. Jian, H. Hou, N. Wu, S. Sun, Z. Yang, Y. Wang, P. Wang, Multimodal Neural Machine Translation for Mongolian to Chinese, in: Proceedings of 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1–8. doi:10.1109/IJCNN55064.2022.9892831. URL https://ieeexplore.ieee.org/document/9892831
[110] A. G. Kovath, A. Nayyar, O. K. Sikha, Multimodal attention-driven visual question answering for Malayalam, Neural Computing and Applications 36 (24) (2024) 14691–14708. doi:10.1007/s00521-024-09818-4. URL https://link.springer.com/article/10.1007/s00521-024-09818-4
[111] S. R. Laskar, R. Singh, M. F. Karim, R. Manna, P. Pakray, S. Bandyopadhyay, Investigation of English to Hindi multimodal neural machine translation using transliteration-based phrase pairs augmentation, in: Proceedings of the 9th Workshop on Asian Translation (WAT), 2022, pp. 117–122. URL https://aclanthology.org/2022.wat-1.15/
[112] S. R. Laskar, B. Paul, S. Paudwal, P. Gautam, N. Biswas, P. Pakray, Multimodal Neural Machine Translation for English–Assamese Pair, in: Proceedings of International Conference on Computational Performance Evaluation (ComPE), 2021, pp. 387–392. doi:10.1109/ComPE53109.2021.9752181. URL https://ieeexplore.ieee.org/document/9752181
[113] Y. Chen, F. Wei, X. Sun, Z. Wu, S. Lin, A simple multi-modality transfer learning baseline for sign language translation, in: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5110–5120. doi:10.1109/CVPR52688.2022.00506. URL https://ieeexplore.ieee.org/document/9879103/
[114] A. Amalas, M. Ghogho, M. Chetouani, R. O. H. Thami, A multilingual training strategy for low resource text to speech, arXiv preprint arXiv:2409.01217 (2024). URL https://arxiv.org/abs/2409.01217
[115] Y. Wu, S. Zhao, Y. Zhang, X. Yuan, Z. Su, When pairs meet triplets: Improving low-resource captioning via multi-objective optimization, ACM Transactions on Multimedia Computing, Communications, and Applications 18 (3) (2022) 1–20. doi:10.1145/3492325. URL https://dl.acm.org/doi/10.1145/3492325
[116] J. Yeo, M. Kim, S. Watanabe, Y. Ro, Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10471–10475. doi:10.1109/ICASSP48485.2024.10446720. URL https://ieeexplore.ieee.org/document/10446720
[117] G. Bhatia, E. M. B. Nagoudi, F. Alwajih, M. Abdul-Mageed, Qalam: A multimodal LLM for Arabic optical character and handwriting recognition, in: Proceedings of the Second Arabic Natural Language Processing Conference (ArabicNLP), 2024, pp. 210–224. doi:10.18653/v1/2024.arabicnlp-1.19. URL https://aclanthology.org/2024.arabicnlp-1.19/
[118] C. Tran, H. L. Thanh, LaVy: Vietnamese Multimodal Large Language Model, arXiv preprint arXiv:2404.07922 (2024). URL https://arxiv.org/abs/2404.07922
[119] C. Onuoha, E. Uba, An analysis of minimal pairs in Igbo using a multimodal approach to speech perception, Unizik Journal of Arts and Humanities 25 (2024) 31–50. doi:10.4314/ujah.v25i1.2. URL https://www.ajol.info/index.php/ujah/article/view/272304
[120] Y. Wu, S. Zhao, J. Chen, Y. Zhang, X. Yuan, Z. Su, Improving captioning for low-resource languages by cycle consistency, in: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), 2019, pp. 362–367. doi:10.1109/ICME.2019.00070. URL https://ieeexplore.ieee.org/document/8784910
[121] Mamta, A. Ekbal, P. Bhattacharyya, Exploring multi-lingual, multi-task, and adversarial learning for low-resource sentiment analysis, ACM Transactions on Asian and Low-Resource Language Information Processing 21 (5) (2022) 104. doi:10.1145/3514498. URL https://dl.acm.org/doi/10.1145/3514498
[122] W. Jin, Y. Cheng, Y. Shen, W. Chen, X. Ren, A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 2763–2775. doi:10.18653/v1/2022.acl-long.197. URL https://aclanthology.org/2022.acl-long.197/
[123] R. Solomon, M. Abebe, Amharic language image captions generation using hybridized attention-based deep neural networks, Applied Computational Intelligence and Soft Computing 2023 (1) (2023) 9397325. doi:10.1155/2023/9397325. URL https://onlinelibrary.wiley.com/doi/10.1155/2023/9397325
[124] C. Rahul, T. Arathi, L. S. Panicker, R. Gopikakumari, Morphology & word sense disambiguation embedded multimodal neural machine translation system between Sanskrit and Malayalam, Biomedical Signal Processing and Control 85 (2023) 105051. doi:10.1016/j.bspc.2023.105051. URL https://www.sciencedirect.com/science/article/pii/S1746809423004846
[125] M. Tang, C. Wang, J. Wang, C. Tan, S. Huang, C. Chen, W. Qian, XtremeCLIP: Extremely parameter-efficient tuning for low-resource vision language understanding, in: Findings of the Association for Computational Linguistics (ACL), 2023, pp. 6368–6376. doi:10.18653/v1/2023.findings-acl.397. URL https://aclanthology.org/2023.findings-acl.397/
[126] A. Asgarov, S. Rustamov, LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task, arXiv preprint arXiv:2408.13909 (2024). doi:10.48550/arXiv.2408.13909. URL https://arxiv.org/abs/2408.13909
[127] H. A. Rahmon, T. G. Jimoh, F. O. Madaiyese, Speech Recognition Model in Yoruba Language, Smartify: Journal of Smart Education and Pedagogy 1 (1) (2024) 28–46. URL https://researchvision.us/index.php/smartify/article/view/5
[128] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024). doi:10.48550/ARXIV.2407.21783. URL https://arxiv.org/abs/2407.21783
[129] DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, et al., Deepseek-v3 technical report, arXiv preprint arXiv:2412.19437 (2024). doi:10.48550/ARXIV.2412.19437. URL https://arxiv.org/abs/2412.19437
[130] Anthropic, Claude Opus 4 & Claude Sonnet 4 system card, accessed: December 2025 (2025). URL https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5ae987bc64dd.pdf
[131] T. Gunter, Z. Wang, C. Wang, R. Pang, A. Narayanan, A. Zhang, B. Zhang, C. Chen, C. Chiu, D. Qiu, et al., Apple Intelligence Foundation Language Models, arXiv preprint arXiv:2407.21075 (2024). doi:10.48550/ARXIV.2407.21075. URL https://arxiv.org/abs/2407.21075
[132] L. Yang, Y. Tian, B. Li, X. Zhang, K. Shen, Y. Tong, M. Wang, MMaDA: Multimodal Large Diffusion Language Models, arXiv preprint arXiv:2505.15809 (2025). doi:10.48550/ARXIV.2505.15809. URL https://arxiv.org/abs/2505.15809
[133] Y. Shen, Z. Xu, Q. Wang, Y. Cheng, W. Yin, L. Huang, Multimodal Instruction Tuning with Conditional Mixture of LoRA, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 637–648. doi:10.18653/v1/2024.acl-long.38. URL https://aclanthology.org/2024.acl-long.38/
[134] M. D. A. Cheema, M. D. Shaiq, F. Mirza, A. Kamal, M. A. Naeem, Adapting multilingual vision language transformers for low-resource Urdu optical character recognition (OCR), PeerJ Computer Science 10 (2024) e1964. URL https://peerj.com/articles/cs-1964/
[135] M. Kim, J. H. Yeo, J. Choi, Y. M. Ro, Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15359–15371. doi:10.1109/ICCV51070.2023.01409. URL https://ieeexplore.ieee.org/document/10377080/
[136] A. Aruna Gladys, V. Vetriselvi, Sentiment analysis on a low-resource language dataset using multimodal representation learning and cross-lingual transfer learning, Applied Soft Computing 157 (C) (2024) 111553. doi:10.1016/j.asoc.2024.111553. URL https://www.sciencedirect.com/science/article/abs/pii/S1568494624003272
[137] W. Chen, B. Yan, J. Shi, Y. Peng, S. Maiti, S. Watanabe, Improving Massively Multilingual ASR with Auxiliary CTC Objectives, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10095326. URL https://ieeexplore.ieee.org/document/10095326
[138] G. O. dos Santos, D. A. Braga Moreira, A. I. Ferreira, J. Silva, L. Pereira, P. Bueno, T. Sousa, H. Maia, N. Da Silva, E. Colombini, H. Pedrini, S. Avila, CAPIVARA: Cost-efficient approach for improving multilingual CLIP performance on low-resource languages, in: Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), 2023, pp. 184–207. doi:10.18653/v1/2023.mrl-1.15. URL https://aclanthology.org/2023.mrl-1.15/
[139] L. Nortje, D. Oneață, H. Kamper, Visually grounded few-shot word learning in low-resource settings, IEEE/ACM Transactions on Audio, Speech and Language Processing 32 (2024) 2544–2554. doi:10.1109/TASLP.2024.3393772. URL https://ieeexplore.ieee.org/document/10508479/
[140] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning (ICML), Vol. 139, 2021, pp. 8748–8763. URL http://proceedings.mlr.press/v139/radford21a.html
[141] M. Tsimpoukelli, J. Menick, S. Cabi, S. M. A. Eslami, O. Vinyals, F. Hill, Multimodal few-shot learning with frozen language models, in: Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS), 2021, pp. 200–212. URL https://proceedings.neurips.cc/paper/2021/file/01b7575c38dac42f3cfb7d500438b875-Paper.pdf
[142] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, L. Wang, An empirical study of GPT-3 for few-shot knowledge-based VQA, in: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 36, 2022, pp. 3081–3089. doi:10.1609/aaai.v36i3.20215. URL https://ojs.aaai.org/index.php/AAAI/article/view/20215
[143] S. R. Laskar, B. Paul, P. Pakray, S. Bandyopadhyay, English-Assamese Multimodal Neural Machine Translation using Transliteration-based Phrase Augmentation Approach, in: Proceedings of the International Conference on Machine Learning and Data Engineering (ICMLDE), Vol. 218, 2023, pp. 979–988. doi:10.1016/j.procs.2023.01.078. URL https://www.sciencedirect.com/science/article/pii/S1877050923000789
[144] F. Alwajih, E. M. B. Nagoudi, G. Bhatia, A. Mohamed, M. Abdul-Mageed, Peacock: A family of Arabic multimodal large language models and benchmarks, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 12753–12776. doi:10.18653/v1/2024.acl-long.689. URL https://aclanthology.org/2024.acl-long.689/
[145] C. Wang, H. Tang, X. Yang, Y. Xie, J. Suh, S. Sitaram, J. Huang, Y. Xie, Z. Gong, X. Xie, F. Wu, Uncovering inequalities in new knowledge learning by large language models across different languages, arXiv preprint arXiv:2503.04064 (2025). URL https://arxiv.org/abs/2503.04064
[146] B. Y. Lin, C. He, Z. Ze, H. Wang, Y. Hua, C. Dupuy, R. Gupta, M. Soltanolkotabi, X. Ren, S. Avestimehr, FedNLP: Benchmarking federated learning methods for natural language processing tasks, in: Findings of the Association for Computational Linguistics (NAACL), 2022, pp. 157–175. doi:10.18653/v1/2022.findings-naacl.13. URL https://aclanthology.org/2022.findings-naacl.13/
[147] W. Zhao, Y. Chen, R. Lee, X. Qiu, Y. Gao, H. Fan, N. D. Lane, Breaking physical and linguistic borders: Multilingual federated prompt tuning for low-resource languages, arXiv preprint arXiv:2507.03003 (2025). doi:10.48550/arXiv.2507.03003. URL https://arxiv.org/abs/2507.03003
[148] L. Tran, W. Sun, S. Patterson, A. Milanova, Privacy-preserving personalized federated prompt learning for multimodal large language models, arXiv preprint arXiv:2501.13904 (2025). doi:10.48550/arXiv.2501.13904. URL https://arxiv.org/abs/2501.13904
[149] M. Andersland, Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages, arXiv preprint arXiv:2403.06354 (2024). doi:10.48550/arXiv.2403.06354. URL https://arxiv.org/abs/2403.06354
[150] L. Shen, W. Tan, S. Chen, Y. Chen, J. Zhang, H. Xu, B. Zheng, P. Koehn, D. Khashabi, The language barrier: Dissecting safety challenges of LLMs in multilingual contexts, in: Findings of the Association for Computational Linguistics (ACL), 2024, pp. 2668–2680. doi:10.18653/v1/2024.findings-acl.156. URL https://aclanthology.org/2024.findings-acl.156/
[151] S. Singh, A. Romanou, C. Fourrier, D. I. Adelani, J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat, K. Marchisio, W. Q. Leong, Y. Susanto, et al., Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025, pp. 18761–18799. doi:10.18653/v1/2025.acl-long.919. URL https://aclanthology.org/2025.acl-long.919/